hostcache: Cache negative name resolves #12406
Conversation
This is a first draft, requesting feedback. The change modifies previous behaviour a bit: in master, a failed resolve means the connection has no …
1.9 Cache negative name resolves:
Well, yes, but shouldn't we leave it up to the local DNS server's caching? Another thing: occasionally my browser fails to resolve something and it works after a few seconds. I'm pretty sure that's not due to a DNS cache issue in Windows but to a network change. Should we really be deciding this instead of leaving it to the DNS server?
I think a separate list for failed resolves would be easier to manage.
Curl has been caching positive resolve results for 20+ years. This patch just adds negative results to the cache.
I think the timescale is an important difference here. I'd say manually reloading the browser after a few seconds is equivalent to running a curl command again, at which point curl will ask the resolver for a new result. We only cache hosts during one session, where the resolves typically occur within a few milliseconds of each other. For long-running sessions, entries in the host cache expire after 60 seconds.
We already have a cache for storing host lookup results; I don't quite see how adding a second one would make the code simpler?
This seems to be an argument against DNS caching in general and not particularly against the negative caching. DNS entries have a TTL field for how long they are expected to remain valid, but unfortunately that is not accessible using the normal getaddrinfo() API calls. Our DoH code gets the TTLs and can therefore cache names for a more exact period.

We have used 60-second default DNS caching for decades and we have very few reports of problems with that, in spite of the fact that TTL values below 60 are actually somewhat common these days. (I suspect this is because Internet infra everywhere does not really work well with switching DNS at single-second intervals, so in reality things work a little slower than advertised.)

We could of course discuss the ability to set a separate timeout value for negative caches, but it's one of those things that would be very hard for users to know when to use or what to set it to.
Of course I'm aware of the caching of results. In my opinion, caching a negative result is different because I would think it's more likely to turn positive in the interim. My experience of negative results has been that they are momentary and therefore not something that should be cached. However, I don't know of any facts to back that up as it pertains to users overall. It is really the difference between a failed lookup and a lookup that gets an affirmative "name does not exist" reply.
BTW, RFC 2308 is "Negative Caching of DNS Queries (DNS NCACHE)", but I don't think it helps us much.
Follow-up from what @jay said above:

A network change can already affect a "positive" cache entry as well, so yes, we should ideally flush the cache on network change. Although figuring out whether one has occurred is not a straightforward operation, and it is highly system specific. That's another reason to keep the cache timeouts short.

The DNS server is likely many milliseconds away, making that decision many orders of magnitude slower. We have had a DNS cache since forever for exactly that reason: to avoid relying on the possibility that the stub resolver has a(n effective) cache. We know, for example, that c-ares builds do not have a local cache (even though one is coming in a future release).
Sure, I'm not against DNS caching of positive results; I think it is very helpful. Negative results in my own experience in the browser have just been so temporary that I hit reload on a failed domain and voilà, the page loads. Is it Chrome saying that it cannot resolve the host, or is it the DNS server saying the name was not found? I don't know, but it's got a lifetime of about 0 because it usually works again pretty quickly. Also, it's interesting that in some cases Chrome will automatically reload the page for me when it happens, and sometimes not.
What do browsers do when a html page wants to load 100 images from a host that can't be resolved? Do they send out 100 resolve requests? |
Firefox caches negative name resolves for 60 seconds: https://searchfox.org/mozilla-central/rev/996f5bf0b9a32aa0620a841abb9531625fe916aa/netwerk/dns/nsHostResolver.cpp#59

Maybe we can find an in-between solution here by setting a lower timeout for negative responses. Maybe half the positive time? The main point of the negative caching is to speed up bursts when an application wants to do N transfers against the same host name in a short period of time and that host name does not resolve (for the moment).
@zagor are you up to modifying this to use a separate (shorter) timeout for the negative entries?
Since the timeout is customizable, the negative time could perhaps be half the "normal" one. |
Sure, I'll fix that.
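For illustration, a minimal sketch of how that halving could look; the helper and its arguments are hypothetical, though the 60-second fallback mirrors curl's documented CURLOPT_DNS_CACHE_TIMEOUT default:

```c
#include <stdbool.h>

/* Hypothetical helper (not from the patch): pick the lifetime in
   seconds for a new host cache entry. Positive entries use the
   configured timeout; negative entries get half of it, as suggested
   above. */
static long entry_lifetime(long dns_cache_timeout, bool negative)
{
  if(dns_cache_timeout < 0)
    dns_cache_timeout = 60; /* fall back to the documented default */
  return negative ? dns_cache_timeout / 2 : dns_cache_timeout;
}
```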
Force-pushed from 216801d to 25d0422.
Here's a brand new version of this. This version refactors the hostip API so that Curl_resolv_check() also checks the cache, removing the need for Curl_fetch_addr(). This way hostip callers only need one function and don't care whether the result came from the cache or not. This avoids having to add negative cache awareness in a bunch of places. It also sets the lifetime of negative entries to half that of positive entries.
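Roughly, the caller-side simplification could be pictured like this. A sketch under the assumption that the function names match the description above; the exact signatures in the patch may differ:

```c
/* Old shape (sketch): callers probed the cache and the async resolver
   separately. */
static CURLcode lookup_before(struct Curl_easy *data,
                              const char *hostname, int port,
                              struct Curl_dns_entry **dns)
{
  *dns = Curl_fetch_addr(data, hostname, port); /* cache only */
  if(*dns)
    return CURLE_OK;
  return Curl_resolv_check(data, dns);          /* async resolver */
}

/* New shape (sketch): one cache-aware call, with a cached negative
   entry surfacing the same way a fresh failed resolve does. */
static CURLcode lookup_after(struct Curl_easy *data,
                             struct Curl_dns_entry **dns)
{
  return Curl_resolv_check(data, dns);
}
```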
Ah, look what scan-build found:

Yes, it's a missed …
I suspect my merging 907eea0 made things a little more complicated for you with the merge conflicts. Sorry for that.
Force-pushed from 1e8aae0 to f044527.
I am unable to reproduce the https://github.com/curl/curl/actions/runs/7219542780/job/19674131381 failure locally. |
I tried it as well on Linux and I could not make it fail locally. If this remains, I can give it a go on my mac and see if it reproduces.
ok, lemme check. macOS is such an icky platform... |
I don't get the same crash, but a torture test with this config on mac points out a problem. |
lib/hostip.c

```c
    }
  }
  if(data->share)
    Curl_share_lock(data, CURL_LOCK_DATA_DNS, CURL_LOCK_ACCESS_SINGLE);
```
Should this unlock?
Yes. Good find!
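For context, the usual pairing around shared DNS cache accesses in lib/hostip.c looks like this; a sketch of the idiom, not the exact patch:

```c
if(data->share)
  Curl_share_lock(data, CURL_LOCK_DATA_DNS, CURL_LOCK_ACCESS_SINGLE);

/* ... look up or modify the shared DNS cache ... */

if(data->share)
  Curl_share_unlock(data, CURL_LOCK_DATA_DNS); /* the missing half */
```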
I can reproduce it. I can make it stop by adding printf() lines, so there's some timing/race involved. macOS has no valgrind and gdb does not work, so I'm struggling. Will continue later.
I would like to write or extend some test cases to verify the negative caching, but I'm unsure how to do this in the test case syntax.
A "classic" way we sometimes use is to make the test require For example, you can add a line somewhere that says "cached negative, skipping" then make a test case that gets the same host name twice (without connection reuse) to verify that the second resolve gets that output. |
I finally caught the crash in lldb:
... now we just need to figure out why and stop it =) (This is your branch rebased on top of master, with master at ad50d81.)
It can be caught in an assert like this:

```diff
diff --git a/lib/connect.c b/lib/connect.c
index 4cb6eda11..f6c762040 100644
--- a/lib/connect.c
+++ b/lib/connect.c
@@ -886,10 +886,11 @@ static CURLcode cf_he_connect(struct Curl_cfilter *cf,
   switch(ctx->state) {
   case SCFST_INIT:
     DEBUGASSERT(CURL_SOCKET_BAD == Curl_conn_cf_get_socket(cf, data));
     DEBUGASSERT(!cf->connected);
+    DEBUGASSERT(ctx->remotehost);
     result = start_connect(cf, data, ctx->remotehost);
     if(result)
       return result;
     ctx->state = SCFST_WAITING;
     FALLTHROUGH();
```

Worth noticing maybe: this test case uses DoH, so curl wants to send two requests to the DoH server (to resolve a host name to both IPv4 and IPv6). But the DoH server is identified by an IP address. To send its DoH request(s), curl will first resolve IP addresses (two, one for each transfer) to use when sending the requests to the DoH server. On macOS, curl uses the name resolver functions even for plain IP addresses (for address conversion reasons), which is not done on other systems.

This crash happens already when setting up the two DoH requests. It appears to be the second of them (which thus resolves the same host name) that crashes. If I limit the DoH requests to just one (by asking for IPv4 only), it avoids the crash.
In Curl_once_resolved(), conn->dns_entry is assigned from the async response. But if the result was read from the cache, there is no async response, so there is no assignment. Then the …

Here are two commits that could be applied to help track this down better: a range of asserts and two (temporary) infof() calls.

0002-hostip-infof-debug-ouputs.patch
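That analysis suggests a fix along these lines inside Curl_once_resolved(); a hypothetical sketch, with 'dns' standing for the result of the cache lookup:

```c
/* Make sure the connection gets a dns_entry regardless of whether the
   answer came from the async resolver or straight from the cache, so
   that start_connect() never sees a NULL remotehost. */
if(!conn->dns_entry)
  conn->dns_entry = dns;
DEBUGASSERT(conn->dns_entry);
```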
TODO 1.9: Add failed name resolves to the host cache, so that repeated connection attempts do not immediately repeat the resolve request. The lifetime of negative entries is set to half that of positive entries. Refactor the hostip API so that Curl_resolv_check() also checks the cache, removing the need for Curl_fetch_addr(). This way hostip callers only need one function and don't care whether the result came from the cache or not.
Force-pushed from 53d877c to 2f0ce53.
Fixed that bug, hopefully, and rebased on master.
Add failed name resolves to the host cache, so that repeated connection attempts do not immediately repeat the resolve request.
This fixes TODO 1.9.