Unbound 1.20 Cachedb broken? #1064

Closed
whyisthisbroken opened this issue May 10, 2024 · 34 comments
@whyisthisbroken commented May 10, 2024

Is anyone else running the current Unbound 1.20.0 and seeing that cachedb no longer works properly?
My average response time goes up from 4-5 ms to 50-60 ms.

My setup runs on a Raspberry Pi 4:
AGH (cache disabled) - Unbound (with Redis cachedb) - DNSCrypt
AdGuardHome resolves via 127.0.0.1:53 / [::1]:53 to Unbound ->
Unbound answers from the cachedb or forwards to DNSCrypt via forward-addr: 127.0.0.1@6053 / forward-addr: ::1@6053

Since the update from 1.19.3 to 1.20.0 something is broken, even though it was compiled exactly the same way.

All queries take 25-30 ms, although the cache obviously responds to the "get".
The DB size is also growing, and the Unbound log shows no errors.
At verbosity level 4 there are entries with info: redis ;; ->>HEADER<<-, so I assume Unbound answers from the cachedb, but AdGuardHome shows 30-50 ms for example resolves.

The setup is unchanged, yet something no longer works correctly... It seems that Unbound does not respect the cache entries and resolves again...

I'm back on 1.19.3 and everything works fine again... see the screenshots below.

Configure command:
./configure --build=aarch64-linux-gnu --prefix=/usr --includedir=\${prefix}/include --infodir=\${prefix}/share/info --libdir=\${prefix}/lib/aarch64-linux-gnu --mandir=\${prefix}/share/man --localstatedir=/var --runstatedir=/run --sysconfdir=/etc --with-chroot-dir= --with-dnstap-socket-path=/run/dnstap.sock --with-libevent --with-libhiredis --with-libnghttp2 --with-pidfile=/run/unbound.pid --with-pythonmodule --with-pyunbound --with-rootkey-file=/var/lib/unbound/root.key --disable-dependency-tracking --disable-flto --disable-maintainer-mode --disable-option-checking --disable-rpath --disable-silent-rules --enable-cachedb --enable-dnstap --enable-subnet --enable-systemd --enable-tfo-client --enable-tfo-server CFLAGS="-O2"

With 1.19.3: [screenshot]

With 1.20.0: [screenshot]

@wcawijngaards (Member)

What could be happening is that the setup includes options that now behave differently for cachedb since the release. The release fixes problems that cachedb had with these options, and that may be the cause of the change; or perhaps something is going wrong because of the change.

A new option was introduced in 1.20.0, cachedb-check-when-serve-expired: yes, and it defaults to on. If that option is disabled, the old behaviour should perhaps reappear. But I think it is better left enabled, because then the serve-expired options work properly with cachedb as well.

In particular, if cachedb is configured together with serve-expired and a serve-expired-client-timeout, then the serve-expired-client-timeout is now respected. That means an expired response is not sent immediately from the cachedb; instead Unbound attempts to fully resolve the query within that time interval. If that succeeds, in the case where queries take 17-25 ms, that answer is used. For this case the increase of a couple of milliseconds makes it look like that could be what is going on. If the resolve fails or takes longer, the expired cachedb response is used instead.

So it could be that the serve-expired options have started working on this server because of the new expired and cachedb fixes in 1.20.0, and for this reason it spends time looking things up instead of immediately responding with expired answers from cachedb. If EDNS client subnet is also involved, there are further fixes for cachedb together with serve-expired and that as well.
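For reference, a minimal sketch of the options being discussed, with illustrative values (not taken from the reporter's configuration):

server:
	serve-expired: yes
	# If set above 0 (milliseconds), Unbound first tries a fresh resolve for
	# this long before falling back to an expired answer.
	serve-expired-client-timeout: 1800

cachedb:
	backend: redis
	# New in 1.20.0, defaults to yes: consult the cachedb before replying with
	# an expired record, so expired cachedb data is treated as expired and the
	# serve-expired options apply to it.
	cachedb-check-when-serve-expired: yes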

@whyisthisbroken (Author) commented May 10, 2024

So here is my Unbound configuration -- maybe I am missing something that should be changed with the new build.
The setting serve-expired-client-timeout is not set, so it is at its default of 0.

server:
verbosity: 1
statistics-interval: 0
statistics-cumulative: no
extended-statistics: yes

#Modul Configuration
module-config: "validator cachedb iterator"

# |Root|
auto-trust-anchor-file: "/var/lib/unbound/root.key"
root-hints: "/var/lib/unbound/root.hints"

# Minimize logs
# Do not print one line per query to the log
log-queries: no
# Do not print one line per reply to the log
log-replies: no
# Do not print log lines that say why queries return SERVFAIL to clients
log-servfail: no
# Do not print log lines to inform about local zone actions
log-local-actions: no
# Do not print log lines that say why queries return SERVFAIL to clients
#logfile: /dev/null    
#LogFile
logfile: "/var/log/unbound.log"

interface: 0.0.0.0
interface: ::0
do-ip4: yes
do-udp: yes
do-tcp: yes
do-ip6: yes
prefer-ip6: yes

# Only give access to recursion clients from LAN IPs
#access-control: 127.0.0.1/32 allow
#access-control: 192.168.0.0/16 allow
#access-control: fc00::/7 allow
#access-control: fd80::/7 allow
#access-control: fe80::/7 allow
#access-control: ::1/128 allow

# Ensure privacy of local IP ranges
private-address: 192.168.0.0/16
private-address: 169.254.0.0/16
private-address: 172.16.0.0/12
private-address: 10.0.0.0/8
private-address: fd00::/8
private-address: fe80::/10

# Unbound local queries needs to be off if using stubby or dnscrypt
do-not-query-localhost: no


use-caps-for-id: yes
harden-glue: yes
harden-large-queries: yes
harden-dnssec-stripped: yes
harden-below-nxdomain: yes
harden-algo-downgrade: yes
harden-short-bufsize: yes
harden-referral-path: no
aggressive-nsec: yes

target-fetch-policy: "-1 -1 -1 -1 -1"    

edns-buffer-size: 1232

rrset-roundrobin: yes
val-clean-additional: yes

cache-min-ttl: 0
cache-max-ttl: 86400

# Prefetch
prefetch: yes
prefetch-key: yes
serve-expired: yes
serve-expired-reply-ttl: 0
so-reuseport: yes

hide-identity: yes
hide-version: yes
http-user-agent: "DNS"

do-daemonize: no
qname-minimisation: yes
deny-any: no
minimal-responses: yes

# Performance
num-threads: 4

# Cache/Slabs Settings
neg-cache-size: 4m
msg-cache-size: 128m
msg-cache-slabs: 8
rrset-cache-size: 256m
rrset-cache-slabs: 8
key-cache-size: 8m
key-cache-slabs: 8
infra-cache-slabs: 8
num-queries-per-thread: 4096
outgoing-range: 8192

incoming-num-tcp: 1000
outgoing-num-tcp: 1000

so-rcvbuf: 8m
so-sndbuf: 8m

unwanted-reply-threshold: 100000

server:
forward-zone:
	name: "."

# DNScrypt proxy
forward-addr: 127.0.0.1@6053
forward-addr: ::1@6053

cachedb:
backend: redis
redis-server-path: "/var/run/redis/redis.sock"
redis-timeout: 100
redis-expire-records: no

If I understand it correctly, changing "cachedb-check-when-serve-expired: yes" to NO would not be the solution, because the explanation is:
If enabled, the cachedb is checked before an expired response is returned. When serve-expired is enabled, without serve-expired-client-timeout, it then does not immediately respond with an expired response from cache, but instead first checks the cachedb for valid contents, and if so returns them.

So this is the behavior I want... or have I misunderstood something?

I think the following fixes do something that needs a bit more explanation from the devs. I don't know why the entire cachedb function is completely different with this build, because what exactly has been fixed, if the previous build worked better?!

- Fix cachedb for serve-expired with serve-expired-reply-ttl.
- Fix to not reply serve expired unless enabled for cachedb.
- Fix cachedb for serve-expired with serve-expired-client-timeout.
- Fixup unit test for cachedb serve expired client timeout with a check if the response is from upstream or from cachedb.
- Fixup cachedb to not refetch when serve-expired-client-timeout is used.

@wcawijngaards (Member)

Yes, those are the changes that make things different, and the new config option controls them; if that option is set to a different value it should behave in a different manner, perhaps in the old manner, but I cannot be sure because I do not know exactly what it did before and what it does now. It may be possible to find out what it did before and does now, for one query, with verbosity: 5, and then look at the output.

The cachedb function was changed to make this possible: previously it would not actually pick up expired data from cachedb correctly when serve-expired is enabled, but instead picked it up as briefly valid data. The change makes it pick up the data and mark it as expired.

Another change is that it can ignore the expired contents of the in-memory cache in order to check the cachedb. Perhaps Redis is the slow component: now that it incorporates the expired data as expired, it ignores it on the next lookup, because it is expired, and checks the cachedb again for possibly fresher data.

@whyisthisbroken (Author)

Hmm... maybe I'll wait for another build and see where it goes. Otherwise Unbound stays on version 1.19.3. That build at least runs smoothly, or at least I think so 😅

@kkkgo commented May 10, 2024

I'm a bit confused about the cachedb-check-when-serve-expired option. From my understanding, this option is intended to check whether there are unexpired entries in the cachedb. If I set serve-expired-ttl: 0, which disables that limit, does this option become meaningless, merely adding an extra check that delays the processing flow? Should I turn it off?

@jcbvm commented May 11, 2024

I'm encountering the same behavior with my Docker setup for Unbound in combination with Redis. When cachedb-check-when-serve-expired is set to yes (the default), queries get a resolve time of around 20 ms once some time has passed since the last query (I have serve-expired enabled with 24h as the max TTL). When cachedb-check-when-serve-expired is set to no, the expected behavior occurs: queries are instantly served from cache. Would this indicate that looking up the cachedb/Redis takes up to 20 ms?

@whyisthisbroken (Author) commented May 11, 2024

I don't know, but that would not make sense. The cachedb via socket is really fast, and with build 1.19.3 you can see in the logs that queries are answered from Redis, and that within 3-5 ms. Unbound's own cache is no longer active when you work with Redis, or did I overlook something?
Fact is, in build 1.20 something is wrong... and the cachedb is definitely broken. The new settings are redundant and very confusing...

@jcbvm commented May 11, 2024

@whyisthisbroken I always thought Redis was only used as a second-level cache, so Unbound still uses the in-memory cache, with a fallback to Redis if no record is found in memory. But I indeed see lookups to Redis even when the in-memory cache should have the record. I'm unsure what the new option really does…

@whyisthisbroken (Author)

I hope a developer can give some clarification and explain in an understandable way what has been changed and why the behavior is completely different...
Until then I'll stick with 1.19.3. It just works.

@jcbvm commented May 12, 2024

I experimented a little today. When I set cachedb-check-when-serve-expired to no, queries are served immediately, even when expired. This is the expected behavior. However, when I restart Unbound, the same query is not immediately answered from Redis; from the logs it looks like a normal resolve takes place, which is strange in my opinion. From my understanding it should just serve the query from Redis as a stale record. This way, using Redis is only useful for non-expired queries.

@whyisthisbroken (Author) commented May 12, 2024

Yeah, with build 1.20 the entire cachedb is completely useless.

This is the result with 1.20 after running it for 12 hours...
That's not how it should be; look at all the cache misses.

[screenshot: Screenshot_20240512-203012~3.png]

I'll go back to 1.19.3. I don't know what's wrong, but as long as no dev reports back or a new version comes out, I will not test it any further.

I'll post a screenshot tomorrow with build 1.19.3 and the same settings, without the new options added in 1.20 of course.

@wcawijngaards (Member)

So, I asked for detailed logs and they were not provided, so it is not certain what is actually happening, apart from speculation. Just looking at the timing does not actually answer the questions. Running with a verbosity level of 4 or 5, you can see what actually happens that makes queries take longer.

Since the queries are actually resolved, I do not see this as a problem, because the resolution works.

The option can switch between the old behaviour and the new behaviour.

The correct setting is the new one, because expired information needs to be categorized as expired. Despite what was commented earlier, the data shown does not seem to show actual bugs, apart from curiosity about the timing.

The serve-expired-ttl setting is not the same as serve-expired-client-timeout. Setting it to 0 changes the answer to the clients, but does not change how Unbound looks up records, regardless of the change in the new option.

Of course there could be bugs, and it would be nice to fix them; I just do not see any bugs in the 1.20.0 implementation here.

Of course, if you get fresh lookups because of a serve-expired-client-timeout, those responses are classified as cache misses.

With the new option enabled, it can in fact pick up expired responses from the cachedb and use them. Previously it would briefly mark them as current, which caused trouble; making expired messages really expired makes sure that domains do not swap between older and newer versions of themselves. So that is not really something that is wrong here.

If there is a reload, the in-memory cache is of course lost, and no expired messages are present in memory. The cachedb can then give an expired response. With the new option this is perhaps counted as a cache miss now, whereas it may have been counted as a cache hit before. Regardless, the user gets the expired answer, if there is one, and resolution continues. With the new option, the user may then, after the brief lookup, receive the freshly resolved answer if it can be fetched within the serve-expired-client-timeout timeframe.

@jcbvm commented May 13, 2024

Hi, thanks for the response.

The last part of your explanation is where I'm having trouble. When you set serve-expired-client-timeout to 0 (disabling it), serve-expired to yes, and cachedb-check-when-serve-expired to no, I would expect the existing expired record from the cachedb to be returned immediately after a reload (when the memory cache is gone). It should not wait for a new resolve, because serve-expired-client-timeout is set to 0.

But instead this response takes a couple of tens of milliseconds, indicating it is waiting for a resolve. I'm not sure if this is a bug or whether I'm misunderstanding something.

@wcawijngaards (Member)

Well, with the new option set to no, it follows the previous behaviour, so that should be the same as what 1.19 does.

Perhaps Redis is that slow; that could be visible from the logs, to see which part is slow. It would be useful to have an option that logs milliseconds for that, but log-time-ascii does not do it. Or it is resolving before sending the expired response, but I do not know why it would; see below for how I intend, if that path is broken, to remove the option altogether rather than fix it.

Regardless of these issues, the fix behind the new option really is a mandatory fix. Without it the serve-expired-client-timeout does not work. Also without it, the information about a domain can fluctuate between different versions of the zone once it has been updated and the old data has expired. The fix improves consistency and makes the serve-expired options actually work, so having the option turned on is the nicety. Realistically I think the option should be removed and the current default made the only choice; the 1.19 behaviour, regardless of whether it was nice for you, has to go, because the bug has to be fixed. That said, it should also work in this case, once we figure out what the problem, if any, is here.
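One way to check whether the Redis hop itself is slow, as a hedged suggestion (assuming redis-cli is installed; the socket path is the one from the posted configuration):

# continuously sample round-trip latency to Redis over its unix socket
redis-cli -s /var/run/redis/redis.sock --latency

# or keep a rolling history, one line per 15-second window
redis-cli -s /var/run/redis/redis.sock --latency-history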

@jcbvm commented May 13, 2024

I was not specifically referring to 1.19; I do not have the option to test that version at the moment. I was just wondering how the current setting works. I will look through the logging to see if I can find something about the slowness.

The problem I had with setting cachedb-check-when-serve-expired to yes is that now and then some expired entries took additional time to resolve, which could indeed indicate a problem with Redis. I will have to investigate this further.

@wcawijngaards (Member)

Well, if the in-memory cache has an expired message, then with the new option enabled it will actually check the cachedb for a fresh message. That is perhaps the change in behaviour that is causing the trouble.

@whyisthisbroken (Author) commented May 13, 2024

So if I understood correctly: if I set the cachedb check to "No", then after a restart of Unbound, when the cache is gone, Unbound no longer looks in the Redis database? Then the whole Redis story would be pointless, since it is no longer checked whether data is in the DB. The result is that Redis is completely unnecessary, because a Redis lookup is sometimes slower than letting Unbound resolve again...
Well... I think I'll go without Redis and use Unbound without a persistent cache... maybe the best and fastest solution.

@jcbvm commented May 13, 2024

I've checked the logs again; from what I can see, the time is spent on cached entries that need validation before being served:

[1715628891] unbound[1:1] debug: Serve expired: Trying to reply with expired data
[1715628891] unbound[1:1] debug: Serve expired: unchecked entry needs validation

This causes a resolve to happen sometimes, which explains the time elapsed. But I wonder what the purpose of having a cachedb is if a resolve is eventually needed anyway.

@kkkgo commented May 14, 2024

Based on the conversation, I wonder if the cachedb-check-when-serve-expired option is unnecessary for a single-instance setup of Unbound. Here is my understanding:

  1. No Cached Result in Unbound: If there's no cached result in Unbound, it should check Redis for a cache. If Redis has a valid cache, it would provide the result unquestionably. If the Redis cache is expired, whether to use this expired result depends on the serve-expired-ttl setting. This behavior seems unrelated to the cachedb-check-when-serve-expired option.

  2. Valid Cached Result in Unbound: If Unbound has a valid cached result, it will return this result directly, which again, has no relation to the cachedb-check-when-serve-expired option.

  3. Expired Cached Result in Unbound: Here's where it gets interesting. In my understanding, the cachedb-check-when-serve-expired option matters in this case. If set to no, it should behave like the first scenario, where the use of the expired result depends on the serve-expired-ttl setting, without checking Redis. If set to yes, it might check Redis for a non-expired result. This seems ideal; however, theoretically, if I'm running a single instance of Unbound with Redis, and the Redis data always lags behind Unbound, then checking seems unnecessary: if there were fresher data, it would exist in Unbound's cache rather than in Redis's cache. Therefore, any data expired in Unbound's cache would likely also be expired in Redis's cache, if it exists there at all.

Based on this analysis, it seems to me that this option would only add unnecessary Redis checks for a single instance of Unbound. I'm not sure if my understanding is correct; it's just a summary based on the conversation and the developers' explanations. Please feel free to correct any errors.
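For concreteness, the single-instance arrangement described above would look roughly like this (a sketch with assumed values, not a developer recommendation):

cachedb:
	backend: redis
	redis-server-path: "/var/run/redis/redis.sock"
	# If Redis never holds fresher data than the in-memory cache, skipping the
	# extra cachedb check on expired entries avoids one Redis round trip per
	# expired lookup.
	cachedb-check-when-serve-expired: no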

@jcbvm commented May 14, 2024

@kkkgo You are mostly right indeed. That's why the cachedb-check-when-serve-expired option is needed in this case, and in your situation it would probably be better to set it to no.

In step 3 it is theoretically possible to have fresher records in Redis if you restart Unbound, which deletes the in-memory cache. But today's TTLs are so low that this chance is also low.

So basically the cachedb-check-when-serve-expired option is useful in the end, and I don't think it should be left out. The only thing I wonder is why a resolve happens after getting the item out of the cachedb; in my opinion it should be served immediately if serve-expired is enabled and the age is within serve-expired-ttl.

@kkkgo commented May 14, 2024

@jcbvm When you restart Unbound, it is essentially equivalent to scenario 1 described above. In that case, I believe it is unrelated to the cachedb-check-when-serve-expired option.

@whyisthisbroken (Author)

So, for me the issue is resolved, but I'll leave it open for others.
I completely removed Redis, so I no longer have these high resolution times...

For me, with the new options, Redis is no longer worthwhile as a secondary cache. It's not worth looking up an entry via Redis if resolving it again is faster.

@5kft commented May 17, 2024

Yes, these changes appear to completely break the use of cachedb. Prior to 1.20.0 I used cachedb + Redis to provide a cache following a restart of the Unbound service or a reboot of the DNS server, but now Unbound essentially doesn't use Redis at all.

Instead of returning an entry from the Redis cache following a reboot, Unbound does a lookup instead (?), even if the lookup takes a long time (e.g., 200 ms on some test domains). This all happens even though "serve-expired-client-timeout" is set to zero (and I have "serve-expired-ttl" set to "259200" as well). Maybe there is some millisecond comparison going on in Unbound, but I have to believe that a 200 ms DNS lookup is a couple of orders of magnitude slower than looking up an entry in Redis...
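The two values stated in this report, collected in one sketch (serve-expired: yes is assumed; everything else is left at its default):

server:
	serve-expired: yes
	# serve stale records for up to 3 days (259200 seconds) after expiry
	serve-expired-ttl: 259200
	# 0 = answer with the expired record immediately, do not wait for a resolve
	serve-expired-client-timeout: 0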

@wcawijngaards (Member)

There seems to be a separate problem, perhaps, where it resolves instead of looking in the cachedb, unlike the other cases, where it consults the cachedb before returning a reply. With the client timeout disabled it should respond immediately with a cache response or an expired response, possibly after a cachedb lookup of course, but not wait for recursion. Are there logs, say at verbosity 4 or 5, of this going wrong? It should check the cachedb if that is configured.

@jcbvm commented May 17, 2024

I have a log file here with verbosity 4. It looks in the cachedb and finds an item, but instead of serving it right away it starts resolving. I have serve-expired-client-timeout at 0 and serve-expired-ttl at 1 day.
Example.log

@5kft commented May 17, 2024

What I am seeing is that if the cachedb entry hasn't expired, it returns the cachedb entry, doesn't resolve, and is done; this situation works correctly. However, if the cachedb entry has expired, 1.20.0 now ignores the cachedb entry and resolves instead, blocking the return of the record until the resolve completes, and then returns the newly resolved records.

It shouldn't do this if "serve-expired-client-timeout" is zero and "serve-expired-ttl" is set to 1-3 days, right? Why doesn't it return the expired record with TTL 0 and then resolve in the background, like it did prior to 1.20.0? This new behavior makes cachedb useless except for cases where the Unbound cache happens to be cleared within the record's TTL (?), which seems very unlikely to occur normally...?

[1715966761] unbound[2363:3] info: ::1 www.st.com. A IN
[1715966761] unbound[2363:3] debug: mesh_run: start
[1715966761] unbound[2363:3] debug: validator[module 0] operate: extstate:module_state_initial event:module_event_new
[1715966761] unbound[2363:3] info: validator operate: query www.st.com. A IN
[1715966761] unbound[2363:3] debug: validator: pass to next module
[1715966761] unbound[2363:3] debug: mesh_run: validator module exit state is module_wait_module
[1715966761] unbound[2363:3] debug: cachedb[module 1] operate: extstate:module_state_initial event:module_event_pass
[1715966761] unbound[2363:3] debug: redis_lookup of 25B7D0C0AE8DC78B407DBAD8372071F1E2B8D5C31FB47D8DBA65D2C05FAC8F22
[1715966761] unbound[2363:3] debug: redis_lookup found 213 bytes
[1715966761] unbound[2363:3] debug: cachedb msg expired
[1715966761] unbound[2363:3] debug: cachedb msg adjusted down by -1
[1715966761] unbound[2363:3] info: redis ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr rd ra ; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
www.st.com.     IN      A

;; ANSWER SECTION:
www.st.com.     0       IN      CNAME   www.st.com.edgekey.net.
www.st.com.edgekey.net. 0       IN      CNAME   www.st.com.edgekey.net.globalredir.akadns.net.
www.st.com.edgekey.net.globalredir.akadns.net.  0       IN      CNAME   e40218.b.akamaiedge.net.
e40218.b.akamaiedge.net.        0       IN      A       23.46.216.209
e40218.b.akamaiedge.net.        0       IN      A       23.46.216.205

;; AUTHORITY SECTION:

;; ADDITIONAL SECTION:
;; MSG SIZE  rcvd: 186

[1715966761] unbound[2363:3] debug: Serve expired: Trying to reply with expired data
[1715966761] unbound[2363:3] debug: Serve expired: unchecked entry needs validation
[1715966761] unbound[2363:3] debug: mesh_run: cachedb module exit state is module_wait_module
[1715966761] unbound[2363:3] debug: iterator[module 2] operate: extstate:module_state_initial event:module_event_pass
[1715966761] unbound[2363:3] debug: process_request: new external request event
[1715966761] unbound[2363:3] debug: iter_handle processing q with state INIT REQUEST STATE
[1715966761] unbound[2363:3] info: resolving www.st.com. A IN
[1715966761] unbound[2363:3] debug: request has dependency depth of 0
[1715966761] unbound[2363:3] info: priming . IN NS
[1715966761] unbound[2363:3] debug: mesh_run: iterator module exit state is module_wait_subquery
[1715966761] unbound[2363:3] debug: iterator[module 2] operate: extstate:module_state_initial event:module_event_pass
[1715966761] unbound[2363:3] info: iterator operate: query . NS IN
[1715966761] unbound[2363:3] debug: iter_handle processing q with state QUERY TARGETS STATE
[1715966761] unbound[2363:3] info: processQueryTargets: . NS IN 
[1715966761] unbound[2363:3] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 0
[1715966761] unbound[2363:3] info: DelegationPoint<.>: 13 names (0 missing), 26 addrs (0 result, 26 avail) parentNS
[1715966761] unbound[2363:3] info:   A.ROOT-SERVERS.NET. * A AAAA
[1715966761] unbound[2363:3] info:   B.ROOT-SERVERS.NET. * A AAAA
[1715966761] unbound[2363:3] info:   C.ROOT-SERVERS.NET. * A AAAA
[1715966761] unbound[2363:3] info:   D.ROOT-SERVERS.NET. * A AAAA
[1715966761] unbound[2363:3] info:   E.ROOT-SERVERS.NET. * A AAAA
[1715966761] unbound[2363:3] info:   F.ROOT-SERVERS.NET. * A AAAA
              ...

@5kft commented May 19, 2024

For what it's worth (which is almost certainly not very much), undoing one of the changes made in 7c5e765 appears to work around these new issues with cachedb being broken for the "serve-expired-client-timeout: 0" case:

--- cachedb.c   2024-05-19 10:13:35.595468320 -0700
+++ cachedb.c   2024-05-19 10:20:36.734117348 -0700
@@ -827,7 +827,8 @@
                 * TODO: this needs revisit. The expired data stored from cachedb has
                 * 0 TTL which is picked up by iterator later when looking in the cache.
                 */
-               if(qstate->env->cfg->serve_expired && msg_expired) {
+               if(qstate->env->cfg->serve_expired && msg_expired &&
+                       qstate->env->cfg->serve_expired_client_timeout) {
                        qstate->return_msg = NULL;
                        qstate->ext_state[id] = module_wait_module;
                        /* The expired reply is sent with

With this change everything works again, just like it used to (i.e., prior to 1.20.0).

wcawijngaards added a commit that referenced this issue on May 21, 2024: "…insecure, and thus can be served to clients when dnssec is enabled."
@wcawijngaards (Member)

Commit fbdc06e contains a fix for the example log and for the excerpt posted. It sets the DNSSEC validation status of the expired cachedb contents to insecure, which allows them to be returned to the client when DNSSEC is enabled. Otherwise the message would need DNSSEC validation, but it is expired and the validator does not work on that; it works on freshly looked-up data. With the fix, I think, the two examples will return an expired reply straight away and then also continue to resolve. The fixup suggested above points in the right direction, but is not the fix to use; it would return a full reply but not resolve a fresh entry.

@5kft commented May 21, 2024

Hi @wcawijngaards, thanks for the change! I have a couple of questions: will this change work with DNSSEC validation disabled (as I am not using DNSSEC)? It does seem to be okay for this case. Also, this change requires "cachedb-check-when-serve-expired" to be enabled, correct?

I did some quick tests and the new change appears to address the issue, although the behavior is now a bit different than in 1.19.x (?). I just want to confirm: instead of returning a zero (0) TTL from the cachedb result, it now returns the value of "serve-expired-reply-ttl" (which appears to make more sense than zero anyway).

Thank you!

@kkkgo commented May 21, 2024

I have the same question. Does this fix address the case where cachedb-check-when-serve-expired is set to yes? If I set cachedb-check-when-serve-expired to no, will it have the same effect? In a scenario where the in-memory cache always has fresher DNS records than Redis, wouldn't it be more appropriate to set it to no?

@wcawijngaards (Member)

Yes, it works with validation disabled; in fact, that case should already work. It does require cachedb-check-when-serve-expired. It could apply when that option is disabled too, but I wanted to preserve the old behaviour that you are asking for.

That serve-expired-reply-ttl is honored is a new feature of the cachedb-check-when-serve-expired option; serve-expired-client-timeout works as well. It is nice that it works.

When cachedb-check-when-serve-expired is disabled, it should already work like previous versions, and no change is needed; hence the fix does not change that.
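A sketch of the combination described in this reply, with illustrative values: with the default cachedb-check-when-serve-expired: yes, an expired answer taken from cachedb is sent with the serve-expired-reply-ttl value instead of TTL 0, and serve-expired-client-timeout is honoured.

server:
	serve-expired: yes
	# TTL attached to expired answers sent to clients
	serve-expired-reply-ttl: 30
cachedb:
	backend: redis
	cachedb-check-when-serve-expired: yes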

@wcawijngaards (Member)

I have closed the issue with the commit, which adds a unit test for the previous commit that fixes the validation status. This makes the new 1.20.0 cachedb changes work with validation. Without the validation status fixup, the unit test fails with debug output similar to the logs shown; with the fixup, the unit test passes and produces proper serve-expired responses from cachedb. It shows that the fix works and makes this kind of situation work with the new cachedb code, which is nice because that code supports options like serve-expired-client-timeout and serve-expired-reply-ttl.

@jcbvm commented May 24, 2024

Thank you for taking the time for this issue and for the provided fix.

@5kft commented May 24, 2024

+1 @wcawijngaards - the new behavior does work nicely, thank you very much for looking into this and addressing it!
