
memory corruption related core dumps #1017

Open
kerolasa opened this issue Feb 20, 2024 · 3 comments

@kerolasa

Describe the bug
The bug has been discussed on the unbound-users mailing list:

https://lists.nlnetlabs.nl/pipermail/unbound-users/2024-February/008257.html

Short summary of the circumstances: the core dumps happen at locations that have bad
network connectivity. I suspect cache handling has something to do with the issue.
Assuming that is correct, it is worth knowing that the following settings are in use:

serve-expired: yes
serve-expired-ttl: 3600
serve-expired-client-timeout: 500
infra-keep-probing: yes
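
In unbound.conf these are server-clause options, i.e.:

        server:
                serve-expired: yes
                serve-expired-ttl: 3600
                serve-expired-client-timeout: 500
                infra-keep-probing: yes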

To reproduce
Steps to reproduce the behavior:

  1. Not sure. Install many Unbound instances at a poorly connected location and wait long enough?

Expected behavior
No coredumps.

System:

  • Unbound version: first noticed with 1.17.1, still happening with 1.19.1.
  • OS: Linux
  • unbound -V output:
Version 1.19.1

Configure line: --prefix=/usr --sysconfdir=/etc --disable-rpath --enable-dnscrypt --enable-dnstap --enable-pie --enable-relro-now --enable-systemd --enable-tfo-client --enable-tfo-server --with-libevent --with-libnghttp2 --with-pidfile=/run/unbound.pid --with-pythonmodule --with-pyunbound
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.0.11 19 Sep 2023
Linked modules: dns64 python respip validator iterator
DNSCrypt feature available
TCP Fastopen feature available

Additional information

Collection of 117 gdb backtraces: some-backtraces.tar.gz

@wcawijngaards
Member

Can I ask about the compilation? In particular the core dump where key_cache_insert is called from the process_dnskey_response function, and that fails. That looks like a miscompilation to me. It happens to me when I switch a git checkout to a different version and the dependency tracking is not good, so stale object files are not recompiled, and the linker pulls in code that refers to a different data layout. This kind of error looks like that. Is the code from a git checkout, and after a code change was there no 'make clean' before the new compile? Or is there some other dependency-tracking issue, where files are copied or modified and their dates change, and a partial or previous compilation is reused? In any case, a clean working directory, or 'make clean' before 'make', would remove the problem if dependency tracking is the issue. I have also seen similar miscompilation issues from using experimental, e.g. buggy, compiler options, like new optimizations. It could be good to use --disable-flto for that reason; the '-flto' option is implicated in a lot of failure reports, and one of the coredumps has the error that the code section in the core is the wrong size.
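
For illustration only (a sketch, not commands from the report), such a clean rebuild might be:

        # rule out stale objects and LTO in a git checkout; options illustrative
        make clean                      # or 'git clean -dxf' for a pristine tree
        ./configure --disable-flto      # plus the usual configure options
        make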

@kerolasa
Author

When we compile Unbound it is packaged, and the same package is used across
systems here and there. There are many hundreds of active Unbound instances
that show no symptoms of anything being wrong in the binary.

Oh, one more thing: we never run 'make clean' when building a production release.
A Docker container performing the build does it from a clean slate, starting
with downloading the release package and checking its checksum, then compiling
and packaging. Only the package is kept; the rest goes to bit heaven.

To be transparent, here are our configure options:

        ./configure \
                --prefix=/usr \
                --sysconfdir=/etc \
                --disable-rpath \
                --enable-dnscrypt \
                --enable-dnstap \
                --enable-pie \
                --enable-relro-now \
                --enable-systemd \
                --enable-tfo-client \
                --enable-tfo-server \
                --with-libevent \
                --with-libnghttp2 \
                --with-pidfile=/run/unbound.pid \
                --with-pythonmodule \
                --with-pyunbound
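
A sketch of those container build steps (the URL, version, and checksum file here are illustrative, not the actual build script):

        # throwaway build inside the container
        curl -LO https://nlnetlabs.nl/downloads/unbound/unbound-1.19.1.tar.gz
        sha256sum -c unbound-1.19.1.tar.gz.sha256   # verify the download first
        tar xzf unbound-1.19.1.tar.gz && cd unbound-1.19.1
        ./configure ...                             # the options listed above
        make && make install DESTDIR=/tmp/pkg       # keep only the resulting package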

@wcawijngaards
Member

Could the program be executed with address debugging? Perhaps that can catch the offending activity. There are two options: one is valgrind, i.e. run the program under valgrind. The other is to compile with libasan, the address sanitizer, using a configure line like CFLAGS="-fsanitize=address -g -O2 -DVALGRIND" CXXFLAGS=$CFLAGS ./configure ... I would then also add --disable-flto to keep that optimization from interfering.
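
Concretely, that could look like the following (the binary and config paths are examples):

        # option 1: run unbound in the foreground under valgrind
        valgrind --tool=memcheck unbound -d -c /etc/unbound/unbound.conf

        # option 2: rebuild with the address sanitizer enabled
        CFLAGS="-fsanitize=address -g -O2 -DVALGRIND" CXXFLAGS="$CFLAGS" \
                ./configure --disable-flto ...
        make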

The error can then perhaps be caught at the moment of the wrong write, rather than later, when the data is already corrupted and a failure happens. The bad write could occur much more often than the core dumps suggest, for example when it overwrites something harmlessly. Seeing this kind of error at the time and place where it happens is a good way to find it; otherwise there are no clues as to where the issue is in the program code. Take care when starting the program: the debugging may make it sluggish.

The asan configure line also defines VALGRIND; that makes unbound use a hash function that does not cause false positives in the memory detector. Even though the address sanitizer is not valgrind, the false-positive removal is convenient.
