Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover from RabbitMQ lost connection #291

Closed
gothub opened this issue Sep 20, 2021 · 3 comments
Closed

Recover from RabbitMQ lost connection #291

gothub opened this issue Sep 20, 2021 · 3 comments
Milestone

Comments

@gothub
Copy link
Contributor

gothub commented Sep 20, 2021

The k8s restarts mentioned in this k8s issue appear to be affecting the communication between RabbitMQ and metadig-worker. I'm not sure if this is the direct cause, but in any case, metadig-worker needs to be able to re-establish connection to RabbitMQ when it is lost. Metadig-engine prints out the following message to it's log, then it becomes unresponsive and will not handle new requests, as it appears RabbitMQ isn't talking to it any longer:

20210920-22:49:18: [INFO]: Total elapsed time (seconds): 1876 for metadataPid: urn:uuid:566a5ac2-4a01-4e6a-866e-986e13dc633b, suiteId: FAIR-suite-0.3.1
 [edu.ucsb.nceas.mdqengine.Worker:346]
20210920-22:49:18: [INFO]:  [x] Done [edu.ucsb.nceas.mdqengine.Worker:356]
20210920-22:49:18: [INFO]:  [x] Sent completed report for pid: 'urn:uuid:566a5ac2-4a01-4e6a-866e-986e13dc633b' [edu.ucsb.nceas.mdqengine.Worker:358]
ERROR ForgivingExceptionHandler Consumer edu.ucsb.nceas.mdqengine.Worker$1@53487557 (amq.ctag-R4BxQVhnclRPEOOGHtvDJw) method handleDelivery for channel AMQChannel(amqp://[email protected]:5672/,1) threw an exception for channel AMQChannel(amqp://[email protected]:5672/,1)
 com.rabbitmq.client.AlreadyClosedException: channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - consumer ack timed out on channel 1, class-id=0, method-id=0)
        at com.rabbitmq.client.impl.AMQChannel.ensureIsOpen(AMQChannel.java:258)
        at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:427)
        at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:421)
        at com.rabbitmq.client.impl.recovery.RecoveryAwareChannelN.basicAck(RecoveryAwareChannelN.java:93)
        at com.rabbitmq.client.impl.recovery.AutorecoveringChannel.basicAck(AutorecoveringChannel.java:428)
        at edu.ucsb.nceas.mdqengine.Worker$1.handleDelivery(Worker.java:315)
        at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
        at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:104)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)


TODO: look into how to re-establish connection with RabbitMQ from a client and why the 'ack' is timing out.

@gothub
Copy link
Contributor Author

gothub commented Sep 23, 2021

The heartbeat capability of RabbitMQ can be used by the client (metadig-worker) to determine when the connection to RabbitMQ has been lost, so that it can be reestablished. Note that any operation from the client to the server (RabbitMQ) can be used for this purpose, so errors should be caught when sending the 'ack' from metadig-worker to RabbitMQ and the connection re-established if needed.

RabbitMQ heartbeats are described here.

@gothub
Copy link
Contributor Author

gothub commented Sep 24, 2021

After further investigation, it's clear that the lost connection is due to a worker running for too long and exceeding the RabbitMQ heartbeat timeout (see updated worker log above).

The long running workers appear to happen when many workers are running and all workers are busy. After viewing several worker logs, in all cases the worker completes creating the assessment and indexing it, but then an error is generated when the worker sends the 'ack' to tell RabbitMQ it's ready to receive another queue entry.

gothub added a commit that referenced this issue Sep 24, 2021
Issue #291 Recover from lost RabbitMQ connection
@gothub
Copy link
Contributor Author

gothub commented Sep 24, 2021

metadig-worker has been updated to reestablish the connection to RabbitMQ if it has been lost, and resend the 'completed' message back to the controller if necessary, in commit 8092612

@gothub gothub closed this as completed Sep 24, 2021
@gothub gothub added this to the 2.4.0 milestone Jan 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant