Recover from RabbitMQ lost connection #291
Comments
The heartbeat capability of RabbitMQ can be used by the client (metadig-worker) to detect when the connection to RabbitMQ has been lost, so that it can be re-established. Any operation from the client to the server (RabbitMQ) can also serve this purpose, so errors should be caught when metadig-worker sends the 'ack' to RabbitMQ, and the connection re-established if needed. RabbitMQ heartbeats are described here.
After further investigation, it's clear that the lost connection is due to a worker running for too long and exceeding the RabbitMQ heartbeat timeout (see updated worker log above). The long-running workers appear to occur when many workers are running and all of them are busy. In every worker log reviewed, the worker finishes creating the assessment and indexing it, but an error is then generated when the worker sends the 'ack' to tell RabbitMQ it's ready to receive another queue entry.
Issue #291 Recover from lost RabbitMQ connection
metadig-worker has been updated to re-establish the connection to RabbitMQ if it has been lost, and to resend the 'completed' message back to the controller if necessary, in commit 8092612.
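For readers who want to see the general shape of this kind of recovery, here is a minimal sketch of "reconnect, then resend". It is not the code from the commit above. `ReconnectingPublisher`, the `Broker` interface, and the `maxAttempts` parameter are hypothetical names introduced for illustration; in a real worker, `Broker` would presumably wrap the RabbitMQ Java client's connection and channel, whose publish and ack calls throw `IOException` on failure.

```java
import java.io.IOException;

/** Sketch: retry a publish after re-establishing a lost broker connection. */
public class ReconnectingPublisher {

    /** Hypothetical stand-in for a RabbitMQ connection/channel pair. */
    public interface Broker {
        /** (Re)open the connection and channel. */
        void connect() throws IOException;

        /** Publish a message to the named queue. */
        void publish(String queue, String message) throws IOException;
    }

    private final Broker broker;
    private final int maxAttempts;

    public ReconnectingPublisher(Broker broker, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.broker = broker;
        this.maxAttempts = maxAttempts;
    }

    /**
     * Publish the message, reconnecting and retrying when an attempt fails.
     * Returns the number of the attempt that succeeded; rethrows the last
     * error once all attempts are exhausted.
     */
    public int publish(String queue, String message) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                broker.publish(queue, message);
                return attempt;
            } catch (IOException e) {
                // Treat any I/O failure as a possibly lost connection.
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        broker.connect();
                    } catch (IOException reconnectError) {
                        // Reconnect failed; the next attempt will fail and retry.
                        last = reconnectError;
                    }
                }
            }
        }
        throw last;
    }
}
```

Note that the sketch retries a publish (such as the 'completed' message) rather than the 'ack'. RabbitMQ delivery tags are scoped to the channel they were delivered on, so a delivery that was not acknowledged before the connection dropped cannot be acked on a new channel; the broker requeues it instead.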
The k8s restarts mentioned in this k8s issue appear to be affecting the communication between RabbitMQ and metadig-worker. I'm not sure if this is the direct cause, but in any case, metadig-worker needs to be able to re-establish its connection to RabbitMQ when it is lost. Metadig-engine prints out the following message to its log, then it becomes unresponsive and will not handle new requests, as it appears RabbitMQ isn't talking to it any longer:
TODO: look into how to re-establish connection with RabbitMQ from a client and why the 'ack' is timing out.