Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved handling of errors while sending TCP responses. #309

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ximon18
Copy link
Member

@ximon18 ximon18 commented May 1, 2024

This PR contains a couple of fixes for TCP connection error handling:

  • Abort a TCP connection if a fatal I/O error is detected when writing to the client, such as the client disconnected.
  • Stop processing a response stream if there is no space left in the response queue.

It also adds a metric for the number of aborted writes that occur.

These issues were discovered when dig disconnected because a large AXFR transfer was slow to start and the default dig 5 second timeout expired.

Interestingly these issues were easier to fix in the WIP xfr branch that I found them on as that is based on the service-layering branch which simplifies the response handling.

…ting to the client, such as the client disconnected.

FIX: Stop processing a response stream if there is no space left in the response queue.
…received and checks that the aborted write counter increases.
@ximon18 ximon18 requested a review from a team May 1, 2024 10:43
@ximon18 ximon18 mentioned this pull request May 1, 2024
@partim
Copy link
Member

partim commented May 6, 2024

I’m not sure if considering non-fatal errors with TCP is necessary. Both write_all and read_exact will already cover ErrorKind::Interrupted. ErrorKind::TimedOut should probably treated as fatal – it means the socket has tried re-sending the data a couple times and then has waited.

Retrying when the configured response_write_timeout feels a bit odd, given that this means the actual timeout is response_write_timeout * (response_write_retries + 1), which is a bit unexpected.

Looking up ErrorKind::TimedOut, I now even wonder if having your own write timeout makes sense or whether it is better to just rely on the TCP stack and its configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants