The Wayback Machine - https://web.archive.org/web/20201230201625/https://github.com/gliderlabs/registrator/issues/621

Race condition in resync when a deregister occurs? #621

Open
CpuID opened this issue Jun 12, 2018 · 9 comments

  • What version of docker are you running?
root@ip-10-202-10-253:~# docker version
Client:
 Version:      18.05.0-ce
...
Server:
 Engine:
  Version:      18.05.0-ce
  • What version of registrator are you running?
2018/05/18 06:33:06 Starting registrator v7 ...
2018/05/18 06:33:06 Forcing host IP to 10.202.10.253
2018/05/18 06:33:06 Using consul adapter: consul://172.17.0.1:8500
  • Did you build a custom version of registrator? If so, what is that image?

Using gliderlabs/registrator:v7 with a bash wrapper to start it, the relevant parts at the bottom being:

# Pass on SIGTERM to registrator when received. Sleep 2 seconds after the SIGTERM to allow time
# to perform any cleanups.
trap 'echo "SIGTERM received, initiate graceful shutdown."; kill -TERM $PID; sleep 2' SIGTERM

full_command="/bin/registrator -retry-attempts 5 -retry-interval 1000 -resync 30 -ip ${get_private_ip} -tags 'az-${az_suffix}' consul://${consul_ip}:8500"
echo "$(date) : Registrator Start Command - ${full_command}"
eval "${full_command}" &  # run in the background so $! captures registrator's PID
PID=$!
wait $PID
  • What is the exact command you are running registrator with?

See above

  • What is the exact command you are running your container with?

/registrator_start.sh, the bash wrapper.

  • A log capture of all the docker events before, during, and after the issue

Tried to get them but they seem to be rotated out unfortunately :(

  • If relevant, Dockerfile for application that is having issues.

Description of the problem:

  • It seems like the resync behaviour is racy, and possibly needs a mutex/lock between deregister events and resync events?
  • Looking at the consul logs, it looks as though a deregister occurred, but then a resync occurred within the same second and re-added the recently deregistered service instance.
  • This then leaves a stale/orphaned entry with a zero ServicePort which is never cleaned up, even by subsequent resyncs.
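A mutex along these lines might be what's needed. This is a minimal sketch, not registrator's actual internals: the `Bridge` type and method names here are hypothetical stand-ins, and the point is only that deregister and resync must serialize on the same lock so a resync can never act on a snapshot taken before a deregister completed.

```go
package main

import (
	"fmt"
	"sync"
)

// Bridge is a hypothetical stand-in for registrator's bridge type; the
// names are illustrative, not registrator's real API.
type Bridge struct {
	mu       sync.Mutex
	services map[string]bool // serviceID -> registered
}

// Deregister removes a service while holding the lock, so a concurrent
// Resync cannot interleave between snapshotting and writing back.
func (b *Bridge) Deregister(id string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.services, id)
}

// Resync re-registers services under the same lock, so it always acts
// on the current service list, never a stale pre-deregister snapshot.
func (b *Bridge) Resync() {
	b.mu.Lock()
	defer b.mu.Unlock()
	for id := range b.services {
		b.services[id] = true
	}
}

func main() {
	b := &Bridge{services: map[string]bool{"svc-a088d8": true}}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); b.Deregister("svc-a088d8") }()
	go func() { defer wg.Done(); b.Resync() }()
	wg.Wait()
	// Regardless of which goroutine takes the lock first, the
	// deregistered service cannot be resurrected by the resync.
	fmt.Println(b.services["svc-a088d8"]) // prints "false"
}
```

Either ordering is safe: if resync wins the lock, the service is re-registered and then deregistered; if deregister wins, resync iterates a map that no longer contains it.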

How reproducible:

  • Rare, feels racy. Hit it ~4 times in the past week (that we know of), with services in the low hundreds and containers in the high hundreds, across 4 clusters.

Steps to Reproduce:

  • Run a large number of services with resync enabled; a shorter interval means more frequent resyncs, and presumably more occurrences?

Actual Results:

Note the Synced response immediately after the Deregistered in this log excerpt (grepped for a088d8cbf79df4f9d001):

    2018/06/12 13:04:06 [INFO] agent: Synced service "0f3bbd5258ff:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-a088d8cbf79df4f9d001:80"
    2018/06/12 13:04:36 [INFO] agent: Synced service "0f3bbd5258ff:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-a088d8cbf79df4f9d001:80"
    2018/06/12 13:05:06 [INFO] agent: Synced service "0f3bbd5258ff:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-a088d8cbf79df4f9d001:80"
    2018/06/12 13:05:36 [INFO] agent: Deregistered service "0f3bbd5258ff:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-a088d8cbf79df4f9d001:80"
    2018/06/12 13:05:36 [INFO] agent: Synced service "0f3bbd5258ff:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-a088d8cbf79df4f9d001:80"

And here is the consul service list for the IP in question. The first result is the stale entry (note the zero ServicePort), the second is a valid replacement service that was spawned by ECS during a rolling deploy and is correctly registered:

  {
    "ID": "1fee1507-19a7-0d06-e41c-31d32167b999",
    "Node": "a68f43f934db",
    "Address": "10.202.10.253",
    "Datacenter": "aws-us-e1-production",
    "TaggedAddresses": {
      "lan": "10.202.10.253",
      "wan": "10.202.10.253"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceID": "0f3bbd5258ff:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-a088d8cbf79df4f9d001:80",
    "ServiceName": "recommendation-engine-api-node",
    "ServiceTags": [
      "web-backend",
      "az-1d"
    ],
    "ServiceAddress": "10.202.10.253",
    "ServicePort": 0,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 31357420,
    "ModifyIndex": 31357420
  },
  {
    "ID": "1fee1507-19a7-0d06-e41c-31d32167b999",
    "Node": "a68f43f934db",
    "Address": "10.202.10.253",
    "Datacenter": "aws-us-e1-production",
    "TaggedAddresses": {
      "lan": "10.202.10.253",
      "wan": "10.202.10.253"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceID": "0f3bbd5258ff:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-f6e7fbc9dacee7db0200:80",
    "ServiceName": "recommendation-engine-api-node",
    "ServiceTags": [
      "web-backend",
      "az-1d"
    ],
    "ServiceAddress": "10.202.10.253",
    "ServicePort": 32929,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 31429652,
    "ModifyIndex": 31429652
  },

Expected Results:

  • Let the service deregister properly.
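(As a stopgap, a stale entry like this can be removed by hand through Consul's agent HTTP API, `PUT /v1/agent/service/deregister/:service_id`. The agent address and ServiceID below are taken from the example above; substitute your own.)

```shell
# Manually deregister the stale entry via the local Consul agent.
consul_agent="http://172.17.0.1:8500"
service_id="0f3bbd5258ff:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-a088d8cbf79df4f9d001:80"
curl -X PUT "${consul_agent}/v1/agent/service/deregister/${service_id}"
```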
@CpuID (Author) commented Jun 12, 2018

Confirmed same consul log output on a second instance (a day earlier):

    2018/06/11 18:58:34 [INFO] agent: Synced service "38b2f9af21ea:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-acc5bbcebdbaeadad301:80"
    2018/06/11 18:59:04 [INFO] agent: Synced service "38b2f9af21ea:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-acc5bbcebdbaeadad301:80"
    2018/06/11 18:59:34 [INFO] agent: Synced service "38b2f9af21ea:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-acc5bbcebdbaeadad301:80"
    2018/06/11 19:00:04 [INFO] agent: Deregistered service "38b2f9af21ea:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-acc5bbcebdbaeadad301:80"
    2018/06/11 19:00:04 [INFO] agent: Synced service "38b2f9af21ea:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-acc5bbcebdbaeadad301:80"
@CpuID (Author) commented Jun 12, 2018

Third instance, an hour ago:

    2018/06/12 20:07:31 [INFO] agent: Synced service "b371b03b211c:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-8ac7b7cde09ef3fe2800:80"
    2018/06/12 20:08:01 [INFO] agent: Synced service "b371b03b211c:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-8ac7b7cde09ef3fe2800:80"
    2018/06/12 20:08:31 [INFO] agent: Deregistered service "b371b03b211c:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-8ac7b7cde09ef3fe2800:80"
    2018/06/12 20:08:31 [INFO] agent: Synced service "b371b03b211c:ecs-recommendation-engine-api-node-production-16-recommendation-engine-api-node-nginx-8ac7b7cde09ef3fe2800:80"
@CpuID (Author) commented Jun 12, 2018

Pattern: all 3 Deregister + Synced operations land within the same second, backing up my theory re the race condition. The deregister and resync paths are most likely running in parallel goroutines?
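If that theory holds, the interleaving would look something like the following. This is a deterministic replay of the suspected ordering, not registrator's actual code; the map stands in for the Consul agent's catalog.

```go
package main

import "fmt"

// simulateRace replays the suspected interleaving: the resync goroutine
// snapshots the service list, a deregister lands in between, then resync
// writes its stale snapshot back. Returns whether the deregistered
// service survives (i.e. whether the stale entry is resurrected).
func simulateRace() bool {
	agent := map[string]bool{"svc-a088d8": true} // consul agent's catalog

	// 1. Resync goroutine snapshots the services it believes exist.
	snapshot := []string{"svc-a088d8"}

	// 2. Container stops; the event goroutine deregisters the service.
	delete(agent, "svc-a088d8")

	// 3. Resync finishes, re-registering everything in its stale snapshot.
	for _, id := range snapshot {
		agent[id] = true
	}
	return agent["svc-a088d8"]
}

func main() {
	fmt.Println(simulateRace()) // prints "true": the stale entry is resurrected
}
```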

@CpuID (Author) commented Jun 19, 2018

Hit this again today...

@CpuID (Author) commented Jun 19, 2018

Note: we now use a 300-second resync interval, and still managed to hit it.

@josegonzalez (Member) commented Jun 19, 2018

Do you happen to have a PR to fix this? I don't have a bunch of time to investigate/write a fix, but would be happy to review/merge something :)

@CpuID (Author) commented Jun 19, 2018

Do you happen to have a PR to fix this? I don't have a bunch of time to investigate/write a fix, but would be happy to review/merge something :)

I don't as yet; it's on my list to find time to a) write a test that reproduces it and b) fix the underlying issue.

@CpuID (Author) commented Jan 15, 2019

Hit this again today, still haven't had time to PR :(

@CpuID (Author) commented Apr 19, 2019

Note: doing further testing on master, as I realised v7 is quite old and a lot of new fixes have been applied since. Will report back once I have any definitive results.
