
hostcache: Cache negative name resolves #12406

Open. Wants to merge 1 commit into base: master
Conversation

@zagor (Member) commented Nov 26, 2023

Add failed name resolves to the host cache, so that repeated connection attempts do not immediately repeat the resolve request.

This fixes TODO 1.9.
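Conceptually, the patch means a failed resolve still produces a cache entry, just one without addresses. A minimal sketch of that idea (illustrative names, not curl's actual internals):

```c
/* Illustrative sketch of the idea behind this patch, NOT curl's actual
 * internals: a failed resolve still gets a cache entry, with a NULL
 * address list marking it as negative. */
#include <stddef.h>
#include <time.h>

struct addr_list;                  /* stands in for Curl_addrinfo */

struct dns_entry {
  struct addr_list *addr;          /* NULL means a negative (failed) entry */
  time_t timestamp;                /* when the entry was stored */
};

/* A repeated lookup can then short-circuit on a cached failure
 * instead of immediately issuing a new resolve request. */
static int is_negative(const struct dns_entry *e)
{
  return e != NULL && e->addr == NULL;
}
```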

@zagor (Member Author) commented Nov 26, 2023

This is a first draft, requesting feedback. The change modifies previous behaviour a bit: in master, a failed resolve means the connection has no dns_entry. With this patch the dns_entry is still saved, but without an addr pointer if the resolve failed.

@bagder bagder added the name lookup DNS and related tech label Nov 26, 2023
@jay (Member) commented Nov 26, 2023

1.9 Cache negative name resolves:

"A name resolve that has failed is likely to fail when made again within a short period of time. Currently we only cache positive responses."

Well, yes, but shouldn't we leave it up to the local DNS server's caching? Another thing: occasionally my browser fails to resolve something and it works after a few seconds. I'm pretty sure that's not due to a DNS cache issue in Windows but a network change. Should we really be deciding this instead of leaving it to the DNS server?

With this patch the dns_entry is still saved, but without an addr pointer if the resolve failed.

I think a separate list for failed resolves would be easier to manage.

@zagor (Member Author) commented Nov 27, 2023

shouldn't we leave it up to the local DNS server caching?

Curl has been caching positive resolve results for 20+ years. This patch just adds negative results to the cache.

Another thing, occasionally my browser fails to resolve something and it works after a few seconds.

I think the timescale is an important difference here. I'd say manually reloading the browser after a few seconds is equivalent to running a curl command again, at which point curl will ask the resolver for a new result. We only cache hosts during one session, where the resolves typically occur within a few milliseconds of each other.

For long-running sessions, entries in the host cache expire after 60 seconds.
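The expiry rule can be sketched like this (the helper name is illustrative; 60 seconds matches curl's default DNS cache timeout, which is configurable via CURLOPT_DNS_CACHE_TIMEOUT):

```c
/* Sketch of the host cache expiry described above; illustrative code,
 * not curl's actual internals. 60 seconds is curl's default DNS cache
 * timeout. */
#include <time.h>

#define DNS_CACHE_TIMEOUT 60  /* seconds */

/* an entry is stale once it has been in the cache for the full timeout */
static int entry_is_stale(time_t stored_at, time_t now)
{
  return (now - stored_at) >= DNS_CACHE_TIMEOUT;
}
```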

I think a separate list for failed resolves would be easier to manage.

We already have a cache for storing host lookup results; I don't quite see how adding a second one would make the code simpler.

@bagder (Member) commented Nov 27, 2023

Well, yes, but shouldn't we leave it up to the local DNS server caching? Another thing, occasionally my browser fails to resolve something and it works after a few seconds

This seems to be an argument against DNS caching in general, not against negative caching in particular. DNS entries have a TTL field for how long they are expected to remain valid, but unfortunately that is not accessible via the normal getaddrinfo() API calls. Our DoH code gets the TTLs and can therefore cache names for a more exact period.

We have used a 60-second default DNS cache timeout for decades and we have very few reports of problems with that, in spite of the fact that TTL values below 60 seconds are actually somewhat common these days. (I suspect this is because Internet infrastructure everywhere does not really cope with switching DNS at single-second intervals, so in reality things change a little slower than advertised.)

We could of course discuss the ability to set a separate timeout value for negative caches, but it's one of those things that would be very hard for users to know when to use or what to set it to.

@jay (Member) commented Nov 27, 2023

Of course I'm aware of the caching of results. In my opinion, caching a negative result is different because I would think it's more likely to turn positive in the interim. My experience of negative results is that they are momentary and therefore not something that should be cached. However, I don't know of any facts to back that up as it would pertain to users overall. It is really the difference between a failed lookup and a lookup that gets an affirmative "name does not exist" reply.

@bagder (Member) commented Nov 27, 2023

BTW, RFC 2308 is "Negative Caching of DNS Queries (DNS NCACHE)", but I don't think it helps us much.

@bagder (Member) commented Nov 28, 2023

Follow up from what @jay said above:

I'm pretty sure that's not due to a DNS cache issue in Windows but a network change.

A network change can already affect a "positive" cache entry too, so yes, we should ideally flush the cache on network change. Although figuring out whether one has occurred is not a straightforward operation and is highly system specific. That's another reason to keep the cache timeouts short.

Should we really be deciding this instead of leaving it to the DNS server?

The DNS server is likely many milliseconds away, making that decision many orders of magnitude slower. We have had a DNS cache since forever for exactly that reason: to avoid relying on the possibility that the stub resolver has an (effective) cache. We know, for example, that c-ares builds do not have a local cache (even though one is coming in a future release).

@jay (Member) commented Nov 28, 2023

Sure, I'm not against DNS caching of positive results; I think it is very helpful. Negative results, in my own experience in the browser, have just been so temporary that I hit reload on a failed domain and voila, the page loads. Is it Chrome saying that it cannot resolve the host, or is it the DNS server saying the name was not found? I don't know, but it has a lifetime of about zero because it usually works again pretty quickly. Also, it's interesting that in some cases Chrome will automatically reload the page for me when it happens, and sometimes not.

@zagor (Member Author) commented Nov 29, 2023

What do browsers do when an HTML page wants to load 100 images from a host that can't be resolved? Do they send out 100 resolve requests?

@bagder (Member) commented Nov 29, 2023

Firefox caches negative name resolves for 60 seconds: https://searchfox.org/mozilla-central/rev/996f5bf0b9a32aa0620a841abb9531625fe916aa/netwerk/dns/nsHostResolver.cpp#59

Maybe we can find an in-between solution here by setting a lower timeout for negative responses. Maybe half the positive time? The main point for the negative caching is to speed up bursts when an application wants to do N transfers against the same host name in a short period of time and that host name does not resolve (for the moment).

@bagder (Member) commented Dec 7, 2023

@zagor you up to modifying this to use a separate (shorter) timeout for the negative entries?

@bagder (Member) commented Dec 7, 2023

Since the timeout is customizable, the negative time could perhaps be half the "normal" one.

@zagor (Member Author) commented Dec 7, 2023

Sure, I'll fix that.

@zagor (Member Author) commented Dec 14, 2023

Here's a brand new version of this.

This version refactors the hostip api so that Curl_resolv_check() also checks the cache, removing the need for Curl_fetch_addr(). This way hostip callers only need one function, and don't care if the result came from cache or not. This avoids having to add negative cache awareness in a bunch of places.

It also sets lifetime of negative entries to half the time of positive entries.
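The halved lifetime could look roughly like this (a sketch under assumed names, not the actual patch code):

```c
/* Sketch of the lifetime rule in this version: negative entries live
 * for half the configured lifetime of positive ones. Hypothetical
 * helper, not the actual curl code. */
static long entry_lifetime(long cache_timeout, int resolve_failed)
{
  return resolve_failed ? cache_timeout / 2 : cache_timeout;
}
```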

@bagder (Member) commented Dec 15, 2023

Ah look what scanbuild found:

hostip.c:1386:11: warning: Value stored to 'result' is never read [deadcode.DeadStores]
          result = CURLE_COULDNT_RESOLVE_PROXY;
          ^        ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.

@zagor (Member Author) commented Dec 15, 2023

Yes, it's a missed else. I'll commit the fix as soon as this CI run completes. 😄

@bagder (Member) commented Dec 15, 2023

I suspect me merging 907eea0 made things a little more complicated for you with the merge conflicts. Sorry for that.

@zagor zagor force-pushed the cache-negative-name-resolves branch from 1e8aae0 to f044527 on December 15, 2023 at 16:20
@zagor (Member Author) commented Dec 15, 2023

I am unable to reproduce the https://github.com/curl/curl/actions/runs/7219542780/job/19674131381 failure locally.
Any ideas what might be wrong?

@bagder (Member) commented Dec 15, 2023

I am unable to reproduce the https://github.com/curl/curl/actions/runs/7219542780/job/19674131381 failure locally.

I tried it as well on Linux and could not make it fail locally. If this remains, I can give it a go on my Mac and see if it reproduces.

@bagder (Member) commented Dec 15, 2023

ok, lemme check. macOS is such an icky platform...

@bagder (Member) commented Dec 15, 2023

I don't get the same crash, but a torture test with this config on mac points out a problem.

lib/hostip.c Outdated
}
}
if(data->share)
Curl_share_lock(data, CURL_LOCK_DATA_DNS, CURL_LOCK_ACCESS_SINGLE);
Member commented:

Should this unlock?

Member Author replied:

Yes. Good find!

@bagder (Member) commented Dec 15, 2023

I can reproduce it. I can make it stop by adding printf() lines, so there's some timing/race involved.

macOS has no valgrind and gdb does not work, so I'm struggling. Will continue later.

@zagor (Member Author) commented Dec 17, 2023

I would like to write or extend some test cases to verify the negative caching but I'm unsure how to do this in the test case syntax.

@bagder (Member) commented Dec 18, 2023

A "classic" way we sometimes use is to make the test require a debug build, because then you can add #ifdef DEBUGBUILD code that can do basically anything, since it won't be there in a release build: for example, outputting something with an infof() call that you can verify in the test case.

Concretely, you could add a line somewhere that says "cached negative, skipping", then write a test case that fetches the same host name twice (without connection reuse) and verifies that the second resolve produces that output.
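A minimal sketch of that technique. The message text, the struct, and the infof() stub here are all illustrative; real curl's infof() takes a Curl_easy handle and the negative-entry check would use curl's own types:

```c
#include <stdio.h>

#define DEBUGBUILD                 /* pretend this is a debug build */

/* stand-in for curl's infof(); the real one takes a Curl_easy pointer */
#define infof(data, msg) fprintf(stderr, "* %s\n", msg)

struct cached_entry { void *addr; };  /* NULL addr marks a negative entry */

/* returns 1 when a cached negative entry was reported, 0 otherwise */
static int note_negative(void *data, const struct cached_entry *dns)
{
#ifdef DEBUGBUILD
  if(dns && !dns->addr) {
    infof(data, "cached negative, skipping");
    return 1;
  }
#endif
  (void)data;
  (void)dns;
  return 0;
}
```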

@bagder (Member) commented Dec 18, 2023

I finally caught the crash in lldb:

(lldb) target create "../src/curl"
Current executable set to '/Users/danielstenberg/src/curl/src/curl' (arm64).
(lldb) settings set -- target.run-args  "--output" "log/curl2100.out" "--include" "--trace-ascii" "log/trace2100" "--trace-time" "http://foo.example.com:49634/2100" "--doh-url" "http://127.0.0.1:49634/21000001"
(lldb) run
Process 83218 launched: '/Users/danielstenberg/src/curl/src/curl' (arm64)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
Process 83218 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00000001000440a4 curl`start_connect(cf=0x0000600001700a88, data=0x0000000100816c08, remotehost=0x0000000000000000) at connect.c:773:30
   770     */
   771    if(conn->ip_version == CURL_IPRESOLVE_WHATEVER) {
   772      /* any IP version is allowed */
-> 773      ai_family0 = remotehost->addr?
   774        remotehost->addr->ai_family : 0;
   775  #ifdef ENABLE_IPV6
   776      ai_family1 = ai_family0 == AF_INET6 ?
Target 0: (curl) stopped.
(lldb) p remotehost
(const Curl_dns_entry *) $0 = NULL
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000001000440a4 curl`start_connect(cf=0x0000600001700a88, data=0x0000000100816c08, remotehost=0x0000000000000000) at connect.c:773:30
    frame #1: 0x0000000100042828 curl`cf_he_connect(cf=0x0000600001700a88, data=0x0000000100816c08, blocking=false, done=0x000000016fdfef77) at connect.c:891:16
    frame #2: 0x000000010003e634 curl`Curl_conn_cf_connect(cf=0x0000600001700a88, data=0x0000000100816c08, blocking=false, done=0x000000016fdfef77) at cfilters.c:299:12
    frame #3: 0x0000000100043374 curl`cf_setup_connect(cf=0x0000600001700c08, data=0x0000000100816c08, blocking=false, done=0x000000016fdfef77) at connect.c:1192:14
    frame #4: 0x000000010003e940 curl`Curl_conn_connect(data=0x0000000100816c08, sockindex=0, blocking=false, done=0x000000016fdfef77) at cfilters.c:345:14
    frame #5: 0x0000000100081478 curl`multi_runsingle(multi=0x0000000100204f48, nowp=0x000000016fdfefc0, data=0x0000000100816c08) at multi.c:2036:16
    frame #6: 0x0000000100080b80 curl`curl_multi_perform(multi=0x0000000100204f48, running_handles=0x000000016fdff028) at multi.c:2675:16
    frame #7: 0x0000000100054298 curl`easy_transfer(multi=0x0000000100204f48) at easy.c:675:15
    frame #8: 0x0000000100053070 curl`easy_perform(data=0x000000010080c808, events=false) at easy.c:765:42
    frame #9: 0x0000000100052e88 curl`curl_easy_perform(data=0x000000010080c808) at easy.c:784:10
    frame #10: 0x000000010001795c curl`serial_transfers(global=0x000000016fdff228, share=0x00006000039040d8) at tool_operate.c:2500:16
    frame #11: 0x0000000100017170 curl`run_all_transfers(global=0x000000016fdff228, share=0x00006000039040d8, result=CURLE_OK) at tool_operate.c:2691:16
    frame #12: 0x0000000100017080 curl`operate(global=0x000000016fdff228, argc=10, argv=0x000000016fdff548) at tool_operate.c:2807:20
    frame #13: 0x00000001000161e4 curl`main(argc=10, argv=0x000000016fdff548) at tool_main.c:273:14
    frame #14: 0x00000001a9b57f28 dyld`start + 2236

... now we just need to figure out why and stop it =)

(This is your branch rebased on top of master, when master is ad50d81)

@bagder (Member) commented Dec 18, 2023

It can be caught in an assert like this

diff --git a/lib/connect.c b/lib/connect.c
index 4cb6eda11..f6c762040 100644
--- a/lib/connect.c
+++ b/lib/connect.c
@@ -886,10 +886,11 @@ static CURLcode cf_he_connect(struct Curl_cfilter *cf,
 
   switch(ctx->state) {
     case SCFST_INIT:
       DEBUGASSERT(CURL_SOCKET_BAD == Curl_conn_cf_get_socket(cf, data));
       DEBUGASSERT(!cf->connected);
+      DEBUGASSERT(ctx->remotehost);
       result = start_connect(cf, data, ctx->remotehost);
       if(result)
         return result;
       ctx->state = SCFST_WAITING;
       FALLTHROUGH();

Worth noticing maybe: this test case uses DoH, so curl wants to send two requests to the DoH server (to resolve a host name to both IPv4 and IPv6). But the DoH server is identified by an IP address.

To send its DoH request(s), curl will first resolve the IP addresses (two, one for each transfer) to use when sending the requests to the DoH server.

On macOS, curl uses the name resolver functions even for plain IP addresses (for address conversion reasons), which is not done on other systems.

This crash happens already when setting up the two DoH requests. It appears to be the second of them (which thus resolves the same host name) that crashes. If I limit the DoH requests to just one (by asking for IPv4 only), it avoids the crash.

@bagder (Member) commented Dec 22, 2023

In Curl_once_resolved(), conn->dns_entry is assigned from the async response. But if the result was read from the cache, there is no async response, so there is no assignment. The conn->dns_entry pointer then remains NULL and is later dereferenced.
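In sketch form (illustrative structs and names, not the actual curl code or the actual fix):

```c
/* Illustrative sketch of the failure mode and the obvious repair, NOT
 * the actual curl code: conn->dns_entry must be assigned on the
 * cache-hit path too, not only from the async response. */
#include <stddef.h>

struct dnsentry_sketch { int dummy; };

struct conn_sketch {
  struct dnsentry_sketch *dns_entry;
};

static void once_resolved_sketch(struct conn_sketch *conn,
                                 struct dnsentry_sketch *async_result,
                                 struct dnsentry_sketch *cache_hit)
{
  if(async_result)
    conn->dns_entry = async_result;   /* the pre-existing path */
  else
    conn->dns_entry = cache_hit;      /* the previously missing path */
}
```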

Here are two commits that could be applied to help track this down better: a range of asserts and two (temporary) infof() calls.

0002-hostip-infof-debug-ouputs.patch
0001-hostip-add-several-DEBUGASSERT-s.patch

TODO 1.9: Add failed name resolves to the host cache, so that repeated
connection attempts do not immediately repeat the resolve request. Lifetime
of negative entries is set to half of the positive ones.

Refactor the hostip api so that Curl_resolv_check() also checks the cache,
removing the need for Curl_fetch_addr(). This way hostip callers only need
one function, and don't care if the result came from cache or not.
@zagor zagor force-pushed the cache-negative-name-resolves branch from 53d877c to 2f0ce53 on December 22, 2023 at 16:21
@zagor (Member Author) commented Dec 22, 2023

Fixed that bug, hopefully, and rebased on master.

@bagder (Member) commented Dec 24, 2023

docs/libcurl/opts/CURLOPT_DNS_CACHE_TIMEOUT.3 probably needs an update to describe this new behavior, right?
