
hostcache: Cache negative name resolves #12406

Open. Wants to merge 1 commit into base: master
Conversation

@zagor (Member) commented Nov 26, 2023

Add failed name resolves to the host cache, so that repeated connection attempts do not immediately repeat the resolve request.

This fixes TODO 1.9.
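Conceptually, the patch means a failed resolve still produces a cache entry, just one without addresses. A minimal sketch of that idea (illustrative names, not curl's actual internals):

```c
/* Illustrative sketch of the idea behind this patch, NOT curl's actual
 * internals: a failed resolve still gets a cache entry, with a NULL
 * address list marking it as negative. */
#include <stddef.h>
#include <time.h>

struct addr_list;                  /* stands in for Curl_addrinfo */

struct dns_entry {
  struct addr_list *addr;          /* NULL means a negative (failed) entry */
  time_t timestamp;                /* when the entry was stored */
};

/* A repeated lookup can then short-circuit on a cached failure
 * instead of immediately issuing a new resolve request. */
static int is_negative(const struct dns_entry *e)
{
  return e != NULL && e->addr == NULL;
}
```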

@zagor (Member Author) commented Nov 26, 2023

This is a first draft, requesting feedback. The change modifies previous behaviour a bit: in master, a failed resolve means the connection has no dns_entry. With this patch the dns_entry is still saved, but without an addr pointer if the resolve failed.

@bagder bagder added the name lookup DNS and related tech label Nov 26, 2023
@jay (Member) commented Nov 26, 2023

1.9 Cache negative name resolves:

"A name resolve that has failed is likely to fail when made again within a short period of time. Currently we only cache positive responses."

Well, yes, but shouldn't we leave it up to the local DNS server's caching? Another thing: occasionally my browser fails to resolve something and it works after a few seconds. I'm pretty sure that's not due to a DNS cache issue in Windows but a network change. Should we really be deciding this instead of leaving it to the DNS server?

With this patch the dns_entry is still saved, but without an addr pointer if the resolve failed.

I think a separate list for failed resolves would be easier to manage.

@zagor (Member Author) commented Nov 27, 2023

shouldn't we leave it up to the local DNS server caching?

Curl has been caching positive resolve results for 20+ years. This patch just adds negative results to the cache.

Another thing, occasionally my browser fails to resolve something and it works after a few seconds.

I think the timescale is an important difference here. I'd say manually reloading the browser after a few seconds is equivalent to running a curl command again, at which point curl will ask the resolver for a new result. We only cache hosts during one session, where the resolves typically occur within a few milliseconds of each other.

For long-running sessions, entries in the host cache expire after 60 seconds.
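The expiry rule can be sketched like this (the helper name is illustrative; 60 seconds matches curl's default DNS cache timeout, which is configurable via CURLOPT_DNS_CACHE_TIMEOUT):

```c
/* Sketch of the host cache expiry described above; illustrative code,
 * not curl's actual internals. 60 seconds is curl's default DNS cache
 * timeout. */
#include <time.h>

#define DNS_CACHE_TIMEOUT 60  /* seconds */

/* an entry is stale once it has been in the cache for the full timeout */
static int entry_is_stale(time_t stored_at, time_t now)
{
  return (now - stored_at) >= DNS_CACHE_TIMEOUT;
}
```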

I think a separate list for failed resolves would be easier to manage.

We already have a cache for storing host lookup results; I don't quite see how adding a second one would make the code simpler.

@bagder (Member) commented Nov 27, 2023

Well, yes, but shouldn't we leave it up to the local DNS server caching? Another thing, occasionally my browser fails to resolve something and it works after a few seconds

This seems to be an argument against DNS caching in general, not against negative caching in particular. DNS entries have a TTL field for how long they are expected to remain valid, but unfortunately that is not accessible via the normal getaddrinfo() API calls. Our DoH code gets the TTLs and can therefore cache names for a more exact period.

We have used a 60-second default DNS cache timeout for decades and we have very few reports of problems with that, in spite of the fact that TTL values below 60 seconds are actually somewhat common these days. (I suspect this is because Internet infrastructure everywhere does not really cope with switching DNS at single-second intervals, so in reality things change a little slower than advertised.)

We could of course discuss the ability to set a separate timeout value for negative caches, but it's one of those things that would be very hard for users to know when to use or what to set it to.

@jay (Member) commented Nov 27, 2023

Of course I'm aware of the caching of results. In my opinion, caching a negative result is different because I would think it's more likely to turn positive in the interim. My experience of negative results is that they are momentary and therefore not something that should be cached. However, I don't know of any facts to back that up as it would pertain to users overall. It is really the difference between a failed lookup and a lookup that gets an affirmative "name does not exist" reply.

@bagder (Member) commented Nov 27, 2023

BTW, RFC 2308 is "Negative Caching of DNS Queries (DNS NCACHE)", but I don't think it helps us much.

@bagder (Member) commented Nov 28, 2023

Follow up from what @jay said above:

I'm pretty sure that's not due to a DNS cache issue in Windows but a network change.

A network change can already affect a "positive" cache entry too, so yes, we should ideally flush the cache on network change. Although figuring out whether one has occurred is not a straightforward operation and is highly system specific. That's another reason to keep the cache timeouts short.

Should we really be deciding this instead of leaving it to the DNS server?

The DNS server is likely many milliseconds away, making that decision many orders of magnitude slower. We have had a DNS cache since forever for exactly that reason: to avoid relying on the possibility that the stub resolver has an (effective) cache. We know, for example, that c-ares builds do not have a local cache (even though one is coming in a future release).

@jay (Member) commented Nov 28, 2023

Sure, I'm not against DNS caching of positive results; I think it is very helpful. Negative results, in my own experience in the browser, have just been so temporary that I hit reload on a failed domain and voila, the page loads. Is it Chrome saying that it cannot resolve the host, or is it the DNS server saying the name was not found? I don't know, but it has a lifetime of about zero because it usually works again pretty quickly. Also, it's interesting that in some cases Chrome will automatically reload the page for me when it happens, and sometimes not.

@zagor (Member Author) commented Nov 29, 2023

What do browsers do when an HTML page wants to load 100 images from a host that can't be resolved? Do they send out 100 resolve requests?

@bagder (Member) commented Nov 29, 2023

Firefox caches negative name resolves for 60 seconds: https://searchfox.org/mozilla-central/rev/996f5bf0b9a32aa0620a841abb9531625fe916aa/netwerk/dns/nsHostResolver.cpp#59

Maybe we can find an in-between solution here by setting a lower timeout for negative responses. Maybe half the positive time? The main point for the negative caching is to speed up bursts when an application wants to do N transfers against the same host name in a short period of time and that host name does not resolve (for the moment).

@bagder (Member) commented Dec 7, 2023

@zagor you up to modifying this to use a separate (shorter) timeout for the negative entries?

@bagder (Member) commented Dec 7, 2023

Since the timeout is customizable, the negative time could perhaps be half the "normal" one.

@zagor (Member Author) commented Dec 7, 2023

Sure, I'll fix that.

@zagor (Member Author) commented Dec 14, 2023

Here's a brand new version of this.

This version refactors the hostip api so that Curl_resolv_check() also checks the cache, removing the need for Curl_fetch_addr(). This way hostip callers only need one function, and don't care if the result came from cache or not. This avoids having to add negative cache awareness in a bunch of places.

It also sets lifetime of negative entries to half the time of positive entries.
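The halved lifetime could look roughly like this (a sketch under assumed names, not the actual patch code):

```c
/* Sketch of the lifetime rule in this version: negative entries live
 * for half the configured lifetime of positive ones. Hypothetical
 * helper, not the actual curl code. */
static long entry_lifetime(long cache_timeout, int resolve_failed)
{
  return resolve_failed ? cache_timeout / 2 : cache_timeout;
}
```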

@bagder (Member) commented Dec 15, 2023

Ah look what scanbuild found:

hostip.c:1386:11: warning: Value stored to 'result' is never read [deadcode.DeadStores]
          result = CURLE_COULDNT_RESOLVE_PROXY;
          ^        ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.

@zagor (Member Author) commented Dec 15, 2023

Yes, it's a missed else. I'll commit the fix as soon as this CI run completes. 😄

@bagder (Member) commented Dec 15, 2023

I suspect me merging 907eea0 made things a little more complicated for you with the merge conflicts. Sorry for that.

@zagor zagor force-pushed the cache-negative-name-resolves branch from 1e8aae0 to f044527 on December 15, 2023 at 16:20
@zagor (Member Author) commented Dec 15, 2023

I am unable to reproduce the https://github.com/curl/curl/actions/runs/7219542780/job/19674131381 failure locally.
Any ideas what might be wrong?

@bagder (Member) commented Dec 15, 2023

I am unable to reproduce the https://github.com/curl/curl/actions/runs/7219542780/job/19674131381 failure locally.

I tried it as well on Linux and could not make it fail locally. If this remains, I can give it a go on my Mac and see if it reproduces.

@bagder (Member) commented Dec 15, 2023

ok, lemme check. macOS is such an icky platform...

@bagder (Member) commented Dec 15, 2023

I don't get the same crash, but a torture test with this config on mac points out a problem.

lib/hostip.c Outdated
}
}
if(data->share)
Curl_share_lock(data, CURL_LOCK_DATA_DNS, CURL_LOCK_ACCESS_SINGLE);
Member commented:

Should this unlock?

Member Author replied:

Yes. Good find!

@bagder (Member) commented Dec 15, 2023

I can reproduce it. I can make it stop by adding printf() lines, so there's some timing/race involved.

macOS has no valgrind and gdb does not work, so I'm struggling. Will continue later.

@zagor (Member Author) commented Dec 17, 2023

I would like to write or extend some test cases to verify the negative caching but I'm unsure how to do this in the test case syntax.

@bagder (Member) commented Dec 18, 2023

A "classic" way we sometimes use is to make the test require a debug build, because then you can add #ifdef DEBUGBUILD code that can do basically anything, since it won't be there in a release build: for example, outputting something with an infof() call that you can verify in the test case.

Concretely, you could add a line somewhere that says "cached negative, skipping", then write a test case that fetches the same host name twice (without connection reuse) and verifies that the second resolve produces that output.
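A minimal sketch of that technique. The message text, the struct, and the infof() stub here are all illustrative; real curl's infof() takes a Curl_easy handle and the negative-entry check would use curl's own types:

```c
#include <stdio.h>

#define DEBUGBUILD                 /* pretend this is a debug build */

/* stand-in for curl's infof(); the real one takes a Curl_easy pointer */
#define infof(data, msg) fprintf(stderr, "* %s\n", msg)

struct cached_entry { void *addr; };  /* NULL addr marks a negative entry */

/* returns 1 when a cached negative entry was reported, 0 otherwise */
static int note_negative(void *data, const struct cached_entry *dns)
{
#ifdef DEBUGBUILD
  if(dns && !dns->addr) {
    infof(data, "cached negative, skipping");
    return 1;
  }
#endif
  (void)data;
  (void)dns;
  return 0;
}
```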

@bagder (Member) commented Dec 18, 2023

I finally caught the crash in lldb:

(lldb) target create "../src/curl"
Current executable set to '/Users/danielstenberg/src/curl/src/curl' (arm64).
(lldb) settings set -- target.run-args  "--output" "log/curl2100.out" "--include" "--trace-ascii" "log/trace2100" "--trace-time" "http://foo.example.com:49634/2100" "--doh-url" "http://127.0.0.1:49634/21000001"
(lldb) run
Process 83218 launched: '/Users/danielstenberg/src/curl/src/curl' (arm64)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
Process 83218 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00000001000440a4 curl`start_connect(cf=0x0000600001700a88, data=0x0000000100816c08, remotehost=0x0000000000000000) at connect.c:773:30
   770     */
   771    if(conn->ip_version == CURL_IPRESOLVE_WHATEVER) {
   772      /* any IP version is allowed */
-> 773      ai_family0 = remotehost->addr?
   774        remotehost->addr->ai_family : 0;
   775  #ifdef ENABLE_IPV6
   776      ai_family1 = ai_family0 == AF_INET6 ?
Target 0: (curl) stopped.
(lldb) p remotehost
(const Curl_dns_entry *) $0 = NULL
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000001000440a4 curl`start_connect(cf=0x0000600001700a88, data=0x0000000100816c08, remotehost=0x0000000000000000) at connect.c:773:30
    frame #1: 0x0000000100042828 curl`cf_he_connect(cf=0x0000600001700a88, data=0x0000000100816c08, blocking=false, done=0x000000016fdfef77) at connect.c:891:16
    frame #2: 0x000000010003e634 curl`Curl_conn_cf_connect(cf=0x0000600001700a88, data=0x0000000100816c08, blocking=false, done=0x000000016fdfef77) at cfilters.c:299:12
    frame #3: 0x0000000100043374 curl`cf_setup_connect(cf=0x0000600001700c08, data=0x0000000100816c08, blocking=false, done=0x000000016fdfef77) at connect.c:1192:14
    frame #4: 0x000000010003e940 curl`Curl_conn_connect(data=0x0000000100816c08, sockindex=0, blocking=false, done=0x000000016fdfef77) at cfilters.c:345:14
    frame #5: 0x0000000100081478 curl`multi_runsingle(multi=0x0000000100204f48, nowp=0x000000016fdfefc0, data=0x0000000100816c08) at multi.c:2036:16
    frame #6: 0x0000000100080b80 curl`curl_multi_perform(multi=0x0000000100204f48, running_handles=0x000000016fdff028) at multi.c:2675:16
    frame #7: 0x0000000100054298 curl`easy_transfer(multi=0x0000000100204f48) at easy.c:675:15
    frame #8: 0x0000000100053070 curl`easy_perform(data=0x000000010080c808, events=false) at easy.c:765:42
    frame #9: 0x0000000100052e88 curl`curl_easy_perform(data=0x000000010080c808) at easy.c:784:10
    frame #10: 0x000000010001795c curl`serial_transfers(global=0x000000016fdff228, share=0x00006000039040d8) at tool_operate.c:2500:16
    frame #11: 0x0000000100017170 curl`run_all_transfers(global=0x000000016fdff228, share=0x00006000039040d8, result=CURLE_OK) at tool_operate.c:2691:16
    frame #12: 0x0000000100017080 curl`operate(global=0x000000016fdff228, argc=10, argv=0x000000016fdff548) at tool_operate.c:2807:20
    frame #13: 0x00000001000161e4 curl`main(argc=10, argv=0x000000016fdff548) at tool_main.c:273:14
    frame #14: 0x00000001a9b57f28 dyld`start + 2236

... now we just need to figure out why and stop it =)

(This is your branch rebased on top of master, when master is ad50d81)

@bagder (Member) commented Dec 18, 2023

It can be caught in an assert like this

diff --git a/lib/connect.c b/lib/connect.c
index 4cb6eda11..f6c762040 100644
--- a/lib/connect.c
+++ b/lib/connect.c
@@ -886,10 +886,11 @@ static CURLcode cf_he_connect(struct Curl_cfilter *cf,
 
   switch(ctx->state) {
     case SCFST_INIT:
       DEBUGASSERT(CURL_SOCKET_BAD == Curl_conn_cf_get_socket(cf, data));
       DEBUGASSERT(!cf->connected);
+      DEBUGASSERT(ctx->remotehost);
       result = start_connect(cf, data, ctx->remotehost);
       if(result)
         return result;
       ctx->state = SCFST_WAITING;
       FALLTHROUGH();

Worth noticing maybe: this test case uses DoH, so curl wants to send two requests to the DoH server (to resolve a host name to both IPv4 and IPv6). But the DoH server is identified by an IP address.

To send its DoH request(s), curl will first resolve the IP addresses (two, one for each transfer) to use when sending the requests to the DoH server.

On macOS, curl uses the name resolver functions even for plain IP addresses (for address conversion reasons), which is not done on other systems.

This crash happens already when setting up the two DoH requests. It appears to be the second of them (which thus resolves the same host name) that crashes. If I limit the DoH requests to just one (by asking for IPv4 only), it avoids the crash.

@bagder (Member) commented Dec 22, 2023

In Curl_once_resolved(), conn->dns_entry is assigned from the async response. But if the result was read from the cache, there is no async response, so there is no assignment. The conn->dns_entry pointer then remains NULL and is later dereferenced.
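In sketch form (illustrative structs and names, not the actual curl code or the actual fix):

```c
/* Illustrative sketch of the failure mode and the obvious repair, NOT
 * the actual curl code: conn->dns_entry must be assigned on the
 * cache-hit path too, not only from the async response. */
#include <stddef.h>

struct dnsentry_sketch { int dummy; };

struct conn_sketch {
  struct dnsentry_sketch *dns_entry;
};

static void once_resolved_sketch(struct conn_sketch *conn,
                                 struct dnsentry_sketch *async_result,
                                 struct dnsentry_sketch *cache_hit)
{
  if(async_result)
    conn->dns_entry = async_result;   /* the pre-existing path */
  else
    conn->dns_entry = cache_hit;      /* the previously missing path */
}
```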

Here are two commits that could be applied to help track this down better: a range of asserts and two (temporary) infof() calls.

0002-hostip-infof-debug-ouputs.patch
0001-hostip-add-several-DEBUGASSERT-s.patch

TODO 1.9: Add failed name resolves to the host cache, so that repeated
connection attempts do not immediately repeat the resolve request. Lifetime
of negative entries is set to half of the positive ones.

Refactor the hostip api so that Curl_resolv_check() also checks the cache,
removing the need for Curl_fetch_addr(). This way hostip callers only need
one function, and don't care if the result came from cache or not.
@zagor zagor force-pushed the cache-negative-name-resolves branch from 53d877c to 2f0ce53 on December 22, 2023 at 16:21
@zagor (Member Author) commented Dec 22, 2023

Fixed that bug, hopefully, and rebased on master.

@bagder (Member) commented Dec 24, 2023

docs/libcurl/opts/CURLOPT_DNS_CACHE_TIMEOUT.3 probably needs an update to describe this new behavior, right?
