Fix coordinator process exiting due to heartbeat race #1
Conversation
The brod group coordinator periodically sends heartbeats to the Kafka broker. If it does not receive a response to a request within the configured timeout, it exits with `hb_timeout` reason.

There was a race condition where the connection to the Kafka broker was closed after a heartbeat was sent out, but before a heartbeat response was received. When this happened, brod still expected a response to the heartbeat. But since the connection had closed, that response never came and the process exited with `hb_timeout`.

This error consistently happens once an hour in all our Elixir deployments that use brod. It looks like Amazon MSK closes the Kafka connection from the broker side every hour, and for some reason always right after the client sends a heartbeat request. I do not know why this happens, but regardless, the server has the right to close the connection and the application should be able to handle that without producing error noise.

This commit fixes the race condition. Now, when the connection goes down, we remove the reference to the heartbeat request that was last sent out. With the reference removed, the coordinator no longer expects a response to that heartbeat. Should the connection be re-established, the coordinator starts sending out new heartbeat requests as usual. (See the sketch after the log excerpts below.)

I tested the solution on my own computer by adding a custom TCP proxy in front of Kafka that let me terminate connections and introduce additional latency. With this setup, I was able to verify that the previous version produced the same errors we saw in production, and that with the changes they no longer showed up.

These are the errors that showed up in our logs:

```
Process #PID<0.19777.11> terminating
** (exit) :hb_timeout
    (stdlib 4.2) gen_server.erl:1241: :gen_server.handle_common_reply/8
    (stdlib 4.2) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Initial Call: :brod_group_coordinator.init/1
Ancestors: [#PID<0.19775.11>, CallRouter.Supervisor, #PID<0.4065.0>]
Neighbours:
    #PID<0.6845.12>
    Initial Call: :kpro_connection.init/4
    Current Call: :kpro_connection.loop/2
    Ancestors: [#PID<0.19777.11>, #PID<0.19775.11>, CallRouter.Supervisor, #PID<0.4065.0>]
```

```
GenServer #PID<0.1262.11> terminating
** (stop) :hb_timeout
Last message: :lo_cmd_send_heartbeat
```

XT-19
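A rough Erlang sketch of the idea described above. The module name, record fields, and message shapes are assumptions for illustration, not the actual patch; the point is only that the `'DOWN'` handler for the connection also clears the outstanding heartbeat reference, so the timeout check never fires for a request that can no longer be answered.

```erlang
%% Sketch only: names and record fields are assumed, not the exact brod code.
-module(hb_race_sketch).
-export([handle_info/2]).

-record(state, { connection = undefined :: pid() | undefined
               , hb_ref     = undefined :: reference() | undefined
               }).

%% The coordinator monitors its Kafka connection. When the connection goes
%% down, drop the reference to the in-flight heartbeat as well: no response
%% can arrive on a closed connection, so keeping the reference would make the
%% periodic timeout check (driven by lo_cmd_send_heartbeat ticks) exit the
%% process with hb_timeout even though nothing is actually wrong.
handle_info({'DOWN', _Mref, process, Conn, _Reason},
            #state{connection = Conn} = State) ->
    {noreply, State#state{connection = undefined, hb_ref = undefined}};
%% Everything else is left to the real coordinator logic in this sketch.
handle_info(_Msg, State) ->
    {noreply, State}.
```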
I also opened a PR in the source repository: kafka4beam#578. But I do not know how long the review will take, so I would like to try this out with our own fork.
Excellent detective work, bravo.
The scenario deserves its own test case (similar to https://github.com/salemove/brod/blob/master/test/brod_group_coordinator_SUITE.erl#L137), but I am happy to merge in our own fork and let the coverage be handled in the parent repo.
It is a bit difficult to test this case, since the existing test is not in control of what the server responds and when; it seems to just use a locally running Kafka. To write a reproducible test, we would also need some middleman that closes the connection when a heartbeat is sent, but I'm not sure how to do that.
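One possible shape for such a middleman, a minimal sketch only: the module name and API are hypothetical and it is not wired into the suite. It is a plain gen_tcp relay between the client and the locally running Kafka port, handling a single connection, that can be told to drop the connection on demand (e.g. right after the test has triggered a heartbeat).

```erlang
%% Hypothetical test helper, not part of brod: a one-connection TCP relay
%% that can simulate the broker closing the socket from its side.
-module(flaky_proxy).
-export([start/2, kill/1]).

%% Listen on ListenPort and forward all traffic to Kafka on UpstreamPort.
%% Returns the relay pid.
start(ListenPort, UpstreamPort) ->
    spawn(fun() -> init(ListenPort, UpstreamPort) end).

%% Ask the relay to close both sockets, simulating a broker-side disconnect.
kill(Proxy) ->
    Proxy ! kill,
    ok.

init(ListenPort, UpstreamPort) ->
    {ok, LSock} = gen_tcp:listen(ListenPort,
                                 [binary, {active, true}, {reuseaddr, true}]),
    {ok, Client} = gen_tcp:accept(LSock),
    {ok, Upstream} = gen_tcp:connect("localhost", UpstreamPort,
                                     [binary, {active, true}]),
    relay(Client, Upstream).

%% Shuttle bytes in both directions until told to kill the connection or
%% either side closes on its own.
relay(Client, Upstream) ->
    receive
        {tcp, Client, Data} ->
            ok = gen_tcp:send(Upstream, Data),
            relay(Client, Upstream);
        {tcp, Upstream, Data} ->
            ok = gen_tcp:send(Client, Data),
            relay(Client, Upstream);
        kill ->
            gen_tcp:close(Client),
            gen_tcp:close(Upstream);
        {tcp_closed, _Sock} ->
            gen_tcp:close(Client),
            gen_tcp:close(Upstream)
    end.
```

The test would point the brod client at the proxy's listen port instead of the real broker and call `flaky_proxy:kill/1` at the right moment; parsing the forwarded bytes to detect a heartbeat frame and close automatically would be a further refinement.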