Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved handling of errors while sending TCP responses. #309

Merged
merged 15 commits into from
Oct 4, 2024

Conversation

ximon18
Copy link
Member

@ximon18 ximon18 commented May 1, 2024

This PR contains a couple of fixes for TCP connection error handling:

  • Abort a TCP connection if a fatal I/O error is detected when writing to the client, such as the client disconnected.
  • Retry response writes that time out up to a maximum number of retries. Removed from this PR after review feedback
  • Extends the unit tests to test TCP disconnection handling.

…ting to the client, such as the client disconnected.

FIX: Stop processing a response stream if there is no space left in the response queue.
…received and checks that the aborted write counter increases.
@ximon18 ximon18 requested a review from a team May 1, 2024 10:43
@ximon18 ximon18 mentioned this pull request May 1, 2024
@partim
Copy link
Member

partim commented May 6, 2024

I’m not sure if considering non-fatal errors with TCP is necessary. Both write_all and read_exact will already cover ErrorKind::Interrupted. ErrorKind::TimedOut should probably treated as fatal – it means the socket has tried re-sending the data a couple times and then has waited.

Retrying when the configured response_write_timeout feels a bit odd, given that this means the actual timeout is response_write_timeout * (response_write_retries + 1), which is a bit unexpected.

Looking up ErrorKind::TimedOut, I now even wonder if having your own write timeout makes sense or whether it is better to just rely on the TCP stack and its configuration.

@ximon18 ximon18 force-pushed the improve-tcp-connection-error-handling branch from bbd2d14 to d4deae7 Compare October 3, 2024 11:18
///
/// The value has to be between 0 and 255 with a default of 2. A value of
/// zero means that one try will be attempted.
const RESPONSE_WRITE_RETRIES: DefMinMax<u8> =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. What kind of error do we expect that should be retried. Is there something that is broken in Tokio?

Copy link
Member Author

@ximon18 ximon18 Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I'm wondering myself now. I pulled this out of the xfr branch where I thought it was added to do with problems I experienced in manual testing of an AXFR transfer to a client. In principle Tokio can return an I/O error on writing to the response stream, and this is about retrying in the case that that error is an interrupted or timeout error, but what would cause it that would be resolvable by retrying I don't know. The code was naively added in the thinking that a timeout is a problem that can be overcome by retrying to work around temporary problems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the response write retries config setting. The PR still has a useful additional test that I think is worth keeping.

@Philip-NLnetLabs
Copy link
Member

I think we should a clear description what happens and how to reproduce. I don't think a tcp write is supposed to fail. So we have to know why it fails.

@ximon18
Copy link
Member Author

ximon18 commented Oct 3, 2024

I’m not sure if considering non-fatal errors with TCP is necessary. Both write_all and read_exact will already cover ErrorKind::Interrupted. ErrorKind::TimedOut should probably treated as fatal – it means the socket has tried re-sending the data a couple times and then has waited.

Retrying when the configured response_write_timeout feels a bit odd, given that this means the actual timeout is response_write_timeout * (response_write_retries + 1), which is a bit unexpected.

Looking up ErrorKind::TimedOut, I now even wonder if having your own write timeout makes sense or whether it is better to just rely on the TCP stack and its configuration.

Perhaps the response write timeout code should be removed, I don't know. It wasn't added in this PR however, so for now I leave that to think about, I haven't removed it.

@ximon18
Copy link
Member Author

ximon18 commented Oct 3, 2024

I need to review this PR as I want to be sure that I didn't just accidentally throw out this fix:

Abort a TCP connection if a fatal I/O error is detected when writing to the client, such as the client disconnected.

Update: Yes that was accidentally removed. Naively adding it back in helps when CTRL-C'ing a large dig AXFR transfer preventing a large number of Write error: Broken pipe (os error 32) errors in the log, but instead there are then lots of Unable to queue message for sending: server is shutting down.` errors. That error should probably only be shown once, and I suspect the server isn't shutting down but the connection handler is so the message is also incorrect.

@ximon18 ximon18 marked this pull request as draft October 3, 2024 17:12
Otherwise we keep trying to send to a gone client if we have more responses to send, such as for a large XFR transfer.
…s shutting down or the response queue is full.
@ximon18
Copy link
Member Author

ximon18 commented Oct 3, 2024

The connection is now terminated on error, and the logs of an aborted large XFR transfer look much better:

2024-10-03T19:12:46.301579Z ERROR ThreadId(10) domain::net::server::connection: Write error: Connection reset by peer (os error 104)
2024-10-03T19:12:46.301595Z  WARN ThreadId(10) domain::net::server::connection: Error while shutting down the write stream: Transport endpoint is not connected (os error 107)
2024-10-03T19:12:46.301767Z ERROR ThreadId(30) domain::net::server::connection: Unable to queue message for sending: connection is shutting down.
2024-10-03T19:12:46.302240Z ERROR ThreadId(13) domain::net::server::middleware::xfr::util: Failed to send DNS message to the internal response stream
2024-10-03T19:12:54.977167Z ERROR ThreadId(35) domain::net::server::middleware::xfr::axfr: Internal error: Failed to send final AXFR SOA to batcher: channel closed

…meaning that the message header was wrongly perceived as having the QR bit set so requests were rejected as replies, never being handled and thus never having responses sent, which couldn't fail because the connection was closed... hence breaking the test.
@ximon18 ximon18 marked this pull request as ready for review October 3, 2024 22:45
@ximon18 ximon18 merged commit 146ad36 into main Oct 4, 2024
26 checks passed
@ximon18 ximon18 deleted the improve-tcp-connection-error-handling branch October 4, 2024 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants