Improved handling of errors while sending TCP responses. #309

ximon18 · 2024-05-01T10:43:36Z

This PR contains a couple of fixes for TCP connection error handling:

- Abort a TCP connection if a fatal I/O error is detected when writing to the client, for example because the client disconnected.
- Retry response writes that time out, up to a maximum number of retries. (Removed from this PR after review feedback.)
- Extend the unit tests to cover TCP disconnection handling.
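
To illustrate the first and last points, here is a minimal sketch of aborting on the first failed write and counting the abort. It is not the code from this PR: `Metrics`, `write_responses` and `DisconnectingClient` are hypothetical names, and it uses the synchronous `std::io::Write` trait rather than Tokio so that it stays self-contained.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the connection metrics; not the PR's type.
#[derive(Default)]
struct Metrics {
    aborted_writes: u64,
}

/// Writes each response in turn and gives up on the first I/O error.
/// Returns how many responses were written successfully.
fn write_responses<W: Write>(
    stream: &mut W,
    responses: &[Vec<u8>],
    metrics: &mut Metrics,
) -> usize {
    for (written, response) in responses.iter().enumerate() {
        if let Err(err) = stream.write_all(response) {
            // Treat the error as fatal: record it and stop writing, so the
            // caller can tear the connection down.
            eprintln!("Write error: {err}");
            metrics.aborted_writes += 1;
            return written;
        }
    }
    responses.len()
}

/// Test double: accepts `remaining` writes, then reports a broken pipe,
/// as if the client had disconnected.
struct DisconnectingClient {
    remaining: usize,
}

impl Write for DisconnectingClient {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if self.remaining == 0 {
            return Err(io::Error::from(io::ErrorKind::BrokenPipe));
        }
        self.remaining -= 1;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

A test in the spirit of the one added here would queue three responses against a `DisconnectingClient { remaining: 2 }` and check that only two are written and that `aborted_writes` increases by one.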


partim · 2024-05-06T13:24:06Z

I’m not sure if considering non-fatal errors with TCP is necessary. Both write_all and read_exact will already cover ErrorKind::Interrupted. ErrorKind::TimedOut should probably be treated as fatal: it means the socket has tried re-sending the data a couple of times and then has waited.

Retrying when the configured response_write_timeout expires feels a bit odd, given that this means the actual timeout is response_write_timeout * (response_write_retries + 1), which is a bit unexpected.

Looking up ErrorKind::TimedOut, I now even wonder if having your own write timeout makes sense or whether it is better to just rely on the TCP stack and its configuration.
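
The arithmetic behind the second point can be sketched as follows. The function name `worst_case_write_wait` and the 5 second timeout in the usage note are illustrative assumptions; only the formula and the default of 2 retries come from this PR.

```rust
use std::time::Duration;

/// Worst-case time spent on one response when every attempt times out:
/// the first try plus `response_write_retries` retries.
/// (Hypothetical helper, not part of the library.)
fn worst_case_write_wait(
    response_write_timeout: Duration,
    response_write_retries: u8,
) -> Duration {
    response_write_timeout * (u32::from(response_write_retries) + 1)
}
```

With an assumed timeout of 5 seconds and the default of 2 retries, a client that never drains its socket would hold the writer for 15 seconds rather than the 5 seconds the setting name suggests.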

Philip-NLnetLabs · 2024-10-03T14:12:38Z

src/net/server/connection.rs

+///
+/// The value has to be between 0 and 255 with a default of 2. A value of
+/// zero means that one try will be attempted.
+const RESPONSE_WRITE_RETRIES: DefMinMax<u8> =


I don't understand this. What kind of error do we expect that should be retried? Is there something that is broken in Tokio?

Good question. I'm wondering myself now. I pulled this out of the xfr branch, where I think it was added because of problems I experienced while manually testing an AXFR transfer to a client. In principle Tokio can return an I/O error when writing to the response stream, and this code retries when that error is an interrupted or timeout error, but I don't know what would cause such an error that retrying would resolve. The code was added naively, on the assumption that a timeout is a temporary problem that can be overcome by retrying.

I have removed the response write retries config setting. The PR still has a useful additional test that I think is worth keeping.
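
As background on why a manual retry for interrupted writes adds little: the standard library's synchronous `Write::write_all` already retries on `ErrorKind::Interrupted`. The sketch below demonstrates that with a hypothetical test double, `InterruptedOnce`. It uses `std::io` only and makes no claim about how Tokio's asynchronous `write_all` is implemented.

```rust
use std::io::{self, ErrorKind, Write};

/// Test double: fails the first write with `Interrupted`, then accepts data.
struct InterruptedOnce {
    interrupted: bool,
    received: Vec<u8>,
}

impl Write for InterruptedOnce {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if !self.interrupted {
            self.interrupted = true;
            return Err(io::Error::from(ErrorKind::Interrupted));
        }
        self.received.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Sends `data` with `write_all`, which silently retries interrupted writes,
/// so the caller never sees `ErrorKind::Interrupted`.
fn send(writer: &mut InterruptedOnce, data: &[u8]) -> io::Result<()> {
    writer.write_all(data)
}
```

Calling `send` on a fresh `InterruptedOnce` succeeds and delivers all of the data even though the first underlying write was interrupted.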

Philip-NLnetLabs · 2024-10-03T14:42:28Z

I think we should have a clear description of what happens and how to reproduce it. I don't think a TCP write is supposed to fail, so we have to know why it fails.

ximon18 · 2024-10-03T17:11:48Z

> I’m not sure if considering non-fatal errors with TCP is necessary. Both write_all and read_exact will already cover ErrorKind::Interrupted. ErrorKind::TimedOut should probably be treated as fatal: it means the socket has tried re-sending the data a couple of times and then has waited.
>
> Retrying when the configured response_write_timeout expires feels a bit odd, given that this means the actual timeout is response_write_timeout * (response_write_retries + 1), which is a bit unexpected.
>
> Looking up ErrorKind::TimedOut, I now even wonder if having your own write timeout makes sense or whether it is better to just rely on the TCP stack and its configuration.

Perhaps the response write timeout code should be removed; I don't know. It wasn't added in this PR, however, so I have left it in place for now as something to think about.

ximon18 · 2024-10-03T17:12:27Z

I need to review this PR as I want to be sure that I didn't just accidentally throw out this fix:

Abort a TCP connection if a fatal I/O error is detected when writing to the client, such as the client disconnected.

Update: Yes, that was accidentally removed. Naively adding it back in helps when CTRL-C'ing a large dig AXFR transfer: it prevents a large number of `Write error: Broken pipe (os error 32)` errors in the log, but instead there are then lots of `Unable to queue message for sending: server is shutting down.` errors. That error should probably only be shown once, and I suspect it is the connection handler that is shutting down rather than the server, so the message is also incorrect.

Otherwise we keep trying to send to a gone client if we have more responses to send, such as for a large XFR transfer.
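
The queue side of this problem can be sketched as follows: stop feeding a response stream into the connection's queue as soon as the queue is full or its receiving end has gone away, and report that once instead of once per message. This is only an illustration under stated assumptions: it uses a bounded `std::sync::mpsc` channel as a stand-in for the response queue, and `QueueOutcome` and `queue_responses` are hypothetical names.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Why queuing stopped; lets the caller log a single message.
#[derive(Debug, PartialEq)]
enum QueueOutcome {
    AllQueued,
    QueueFull { queued: usize },
    ConnectionClosed { queued: usize },
}

/// Queues responses until one cannot be queued, then stops immediately
/// rather than attempting (and logging) every remaining response.
fn queue_responses(
    queue: &SyncSender<Vec<u8>>,
    responses: Vec<Vec<u8>>,
) -> QueueOutcome {
    let mut queued = 0;
    for response in responses {
        match queue.try_send(response) {
            Ok(()) => queued += 1,
            // No space left in the response queue.
            Err(TrySendError::Full(_)) => {
                return QueueOutcome::QueueFull { queued };
            }
            // The receiving side (the connection handler) has gone away.
            Err(TrySendError::Disconnected(_)) => {
                return QueueOutcome::ConnectionClosed { queued };
            }
        }
    }
    QueueOutcome::AllQueued
}
```

For a large XFR transfer, this means the first failure ends the stream, which is the behaviour that avoids the repeated error lines described above.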


ximon18 · 2024-10-03T19:13:58Z

The connection is now terminated on error, and the logs of an aborted large XFR transfer look much better:

2024-10-03T19:12:46.301579Z ERROR ThreadId(10) domain::net::server::connection: Write error: Connection reset by peer (os error 104)
2024-10-03T19:12:46.301595Z  WARN ThreadId(10) domain::net::server::connection: Error while shutting down the write stream: Transport endpoint is not connected (os error 107)
2024-10-03T19:12:46.301767Z ERROR ThreadId(30) domain::net::server::connection: Unable to queue message for sending: connection is shutting down.
2024-10-03T19:12:46.302240Z ERROR ThreadId(13) domain::net::server::middleware::xfr::util: Failed to send DNS message to the internal response stream
2024-10-03T19:12:54.977167Z ERROR ThreadId(35) domain::net::server::middleware::xfr::axfr: Internal error: Failed to send final AXFR SOA to batcher: channel closed

…meaning that the message header was wrongly perceived as having the QR bit set so requests were rejected as replies, never being handled and thus never having responses sent, which couldn't fail because the connection was closed... hence breaking the test.
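
For readers unfamiliar with the QR bit mentioned above: in the DNS header defined by RFC 1035 it is the most significant bit of the third octet, with 0 meaning query and 1 meaning response. The helper below is a hypothetical sketch of that check, not the library's message parser, and it assumes the slice starts at the DNS header, without the two-octet length prefix used on TCP.

```rust
/// Returns `Some(true)` if the QR bit is set (the message is a response),
/// `Some(false)` for a query, and `None` if the slice is too short to hold
/// the 12-octet DNS header. (Hypothetical helper.)
fn is_response(message: &[u8]) -> Option<bool> {
    const HEADER_LEN: usize = 12;
    const QR_MASK: u8 = 0x80;

    if message.len() < HEADER_LEN {
        return None;
    }
    // Octets 0 and 1 are the message ID; the QR flag is the top bit of octet 2.
    Some(message[2] & QR_MASK != 0)
}
```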

ximon18 added 4 commits May 1, 2024 12:16

FIX: Abort a TCP connection if a fatal I/O error is detected when wri…

7656177

…ting to the client, such as the client disconnected. FIX: Stop processing a response stream if there is no space left in the response queue.

FIX: Don't bypass metric handling.

5bb9ed3

Add a metric counting the number of aborted writes.

6fa34e0

Add a test that deliberately disconnects before the last response is …

aa9cb0f

…received and checks that the aborted write counter increases.

ximon18 requested a review from a team May 1, 2024 10:43

ximon18 mentioned this pull request May 1, 2024

Service layering #307

Merged

ximon18 added 2 commits October 3, 2024 13:12

Merge branch 'main' into improve-tcp-connection-error-handling

c5aac30

Post merge compilation fixes.

d4deae7

ximon18 force-pushed the improve-tcp-connection-error-handling branch from bbd2d14 to d4deae7 Compare October 3, 2024 11:18

ximon18 added 3 commits October 3, 2024 13:30

Removed commented out code and fix a comment typo.

949ef45

Remove unrelated changes.

2f7deec

Revert incorrect marker comment line length changes.

8985de4

Philip-NLnetLabs reviewed Oct 3, 2024


Review feedback: Remove response write retries configuration setting.

9bc9c9e

ximon18 marked this pull request as draft October 3, 2024 17:12

ximon18 added 4 commits October 3, 2024 20:19

Remove unnecessary ServiceError error variant.

e59a945

Disconnect TCP client if sending a response fails.

e0f261f

Otherwise we keep trying to send to a gone client if we have more responses to send, such as for a large XFR transfer.

Merge branch 'main' into improve-tcp-connection-error-handling

b29bbe1

Don't attempt to send more responses to a request if the connection i…

97d63de

…s shutting down or the response queue is full.

ximon18 marked this pull request as ready for review October 3, 2024 22:45

Philip-NLnetLabs approved these changes Oct 4, 2024


ximon18 merged commit 146ad36 into main Oct 4, 2024
26 checks passed

ximon18 deleted the improve-tcp-connection-error-handling branch October 4, 2024 11:50
