DNS resolver review + test plan #1468

Closed
ermo opened this issue Jan 29, 2024 · 9 comments

@ermo
Contributor

ermo commented Jan 29, 2024

I recently fixed a race condition with multicast DNS resolution where systemd-resolved and avahi-daemon were fighting over who would be the authoritative mDNS resolver.

I wonder if we have other such bugs lurking?

We originally introduced systemd-resolved because it is, in some ways, the least bad alternative.

However, it might pay to develop a test plan with known steps which can verify that all enabled DNS-related facilities in Solus work as intended for general testing and release purposes. Could we put it up on the Help Center so we have a step-by-step guide that users can follow when troubleshooting DNS, similar to how we have an ISO testing guide?

In no particular order:

  • nsswitch.conf -- do we have things set in the correct/desired order?
  • Does per interface DNS resolution work correctly (useful for VPNs)?
  • DNSSEC?
  • DNS over TLS?
  • LLMNR -- how do we test this? What uses it?
  • mDNS -- avahi-daemon vs. systemd-resolved? IPv4 or IPv4 + IPv6?
  • WSDD (so that Windows 10 devices can see Solus boxen with samba running when browsing the network from Windows Explorer)
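
As a quick baseline before working through the items above, comparing the NSS path with a direct resolved query helps localize where a failure sits (a sketch; somehost.local is a placeholder for an actual mDNS host on the network):

$ resolvectl status                 # per-link DNS servers plus the LLMNR/mDNS/DNSSEC/DNSOverTLS settings
$ getent hosts somehost.local       # exercises the full nsswitch.conf chain
$ resolvectl query somehost.local   # asks systemd-resolved directly
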
@ermo ermo added the Topic: Platform Integration and Topic: Plumbing labels Jan 29, 2024
@ermo ermo added this to the Solus 4.6 milestone Jan 29, 2024
@ermo ermo added this to Solus Jan 29, 2024
@github-project-automation github-project-automation bot moved this to Triage in Solus Jan 29, 2024
@ermo ermo added the Priority: High label Jan 29, 2024
@ermo ermo moved this from Triage to Needs More Info in Solus Jan 29, 2024
@silkeh silkeh moved this from Needs More Info to In Progress in Solus Mar 23, 2024
@silkeh
Member

silkeh commented Mar 23, 2024

nsswitch.conf -- do we have things set in the correct/desired order?

Yes.
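
For reference, the current order can be inspected with (trivial, but included for completeness):

$ grep '^hosts:' /etc/nsswitch.conf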

Does per interface DNS resolution work correctly (useful for VPNs)?

Yes. Test plan with multiple links:

$ resolvectl query getsol.us
getsol.us: 2606:50c0:8001::153                 -- link: enp7s0f1
           2606:50c0:8000::153                 -- link: enp7s0f1
           2606:50c0:8003::153                 -- link: enp7s0f1
           2606:50c0:8002::153                 -- link: enp7s0f1
           185.199.109.153                     -- link: enp7s0f1
           185.199.108.153                     -- link: enp7s0f1
           185.199.111.153                     -- link: enp7s0f1
           185.199.110.153                     -- link: enp7s0f1

-- Information acquired via protocol DNS in 1.1ms.
-- Data is authenticated: no; Data was acquired via local or encrypted transport: no
-- Data from: cache network
$ resolvectl domain enp7s0f1.2 getsol.us
$ resolvectl query getsol.us
getsol.us: 185.199.108.153                     -- link: enp7s0f1.2
           185.199.111.153                     -- link: enp7s0f1.2
           185.199.109.153                     -- link: enp7s0f1.2
           185.199.110.153                     -- link: enp7s0f1.2
           2606:50c0:8002::153                 -- link: enp7s0f1.2
           2606:50c0:8003::153                 -- link: enp7s0f1.2
           2606:50c0:8000::153                 -- link: enp7s0f1.2
           2606:50c0:8001::153                 -- link: enp7s0f1.2

-- Information acquired via protocol DNS in 1.4ms.
-- Data is authenticated: no; Data was acquired via local or encrypted transport: no
-- Data from: cache network
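
To undo the per-link test domain afterwards, the runtime settings can be reverted (a sketch, using the link from the example above):

$ resolvectl revert enp7s0f1.2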

DNSSEC?

Having it disabled by default seems the sane option for now, given the number of upstream issues.
It would probably be useful to have some users test this for us.
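
A possible recipe for volunteer testers (a sketch; the drop-in file name is just an example, and dnssec-failed.org is a third-party test domain that intentionally fails validation):

#/etc/systemd/resolved.conf.d/enable-dnssec.conf:
[Resolve]
DNSSEC=yes

$ sudo systemctl restart systemd-resolved
$ resolvectl query getsol.us          # should show "Data is authenticated: yes" with a validating upstream
$ resolvectl query dnssec-failed.org  # should fail with a DNSSEC validation error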

DNS over TLS?

Adoption is lacking. The only setting that would work is opportunistic, which doesn't really add that much value imo.
Additionally, some of the issues with DNSSEC seem worse with DNS over TLS enabled 😕.
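
For anyone who wants to try it anyway, a minimal opportunistic setup (a sketch; the drop-in file name is just an example):

#/etc/systemd/resolved.conf.d/dns-over-tls.conf:
[Resolve]
DNSOverTLS=opportunistic

$ sudo systemctl restart systemd-resolved
$ resolvectl status    # verify the DNSOverTLS setting is reflected in the output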

LLMNR -- how do we test this? What uses it?

According to Wikipedia it "is included in Windows Vista, Windows Server 2008, Windows 7, Windows 8, Windows 10" but "as of April 2022, Microsoft has begun the process of phasing out both LLMNR and NetBIOS name resolution in favour of mDNS."

Test plan is simple: have a device that does it and query it:

$ resolvectl query -p llmnr <snip>
<snip>: fe80::<snip>%5               -- link: enp7s0f1
     fd14:<snip>                     -- link: enp7s0f1
     2a02:<snip>                     -- link: enp7s0f1

-- Information acquired via protocol LLMNR/IPv6 in 178.5ms.
-- Data is authenticated: no; Data was acquired via local or encrypted transport: no
-- Data from: network

mDNS -- avahi-daemon vs. systemd-resolved? IPv4 or IPv4 + IPv6?

Avahi seems to be the most commonly used implementation in the wild. Switching might reduce stack complexity, but I don't think systemd-resolved's mDNS is commonly used by other distributions. Note that switching also requires migrating service definitions to systemd.dnssd, although I think Samba is the only package affected.

However, the entry in nsswitch.conf should be dual stack in my opinion. See #1736.
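
For the avahi/nss-mdns side, dual stack means using the protocol-agnostic module instead of the IPv4-only one (a sketch; the surrounding entries depend on what Solus actually ships):

# IPv4-only:
hosts: ... mdns4_minimal [NOTFOUND=return] ... dns
# dual stack:
hosts: ... mdns_minimal [NOTFOUND=return] ... dns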

WSDD (so that Windows 10 devices can see Solus boxen with samba running when browsing the network from Windows Explorer)

Seeing as Windows is moving to mDNS, I'm not sure if we want to enable one of the 'legacy' protocols. This needs some testing with Windows 11 and mDNS.

@ermo ermo assigned ermo and unassigned silkeh Apr 19, 2024
@ermo
Contributor Author

ermo commented Apr 19, 2024

Just a note that I need to test samba-4.19.6 with systemd.dnssd, hence taking over assignment.

@ermo
Contributor Author

ermo commented May 1, 2024

@silkeh / @ReillyBrogan :

After setting up my local system with the following:

#/etc/nsswitch.conf:
# turn off avahi mDNS resolution so mDNS becomes available to systemd-resolved
hosts: mymachines resolve [!UNAVAIL=return] files myhostname dns

#/etc/systemd/resolved.conf.d/enable-mdns-full.conf:
[Resolve]
MulticastDNS=true

#/etc/systemd/dnssd/smb.dnssd:
[Service]
Name=%H
Type=_smb._tcp
Port=445

#/etc/systemd/dnssd/smb-device-info.dnssd:
[Service]
Name=%H
Type=_device-info._tcp
TxtText=model=RackMac  

And doing a:

sudo systemctl mask avahi-daemon.socket
sudo systemctl disable --now avahi-daemon.socket
sudo systemctl stop avahi-daemon
sudo systemctl daemon-reload
sudo systemctl restart systemd-resolved
sudo systemctl status systemd-resolved

I get this message:

  May 01 16:57:00 solbox systemd-resolved[89019]: mDNS-IPv4: There appears to be another mDNS responder running, or previously systemd-resolved crashed with some outstanding transfers.
  May 01 16:57:00 solbox systemd-resolved[89019]: mDNS-IPv6: There appears to be another mDNS responder running, or previously systemd-resolved crashed with some outstanding transfers.

When I check, I see this:

ermo@solbox:/etc/systemd/dnssd
$ sudo ss -plntu |grep 53
udp   UNCONN 0      0                             127.0.0.54:53         0.0.0.0:*    users:(("systemd-resolve",pid=91077,fd=23))
udp   UNCONN 0      0                             127.0.0.53:53         0.0.0.0:*    users:(("systemd-resolve",pid=91077,fd=21))
udp   UNCONN 0      0                                0.0.0.0:5353       0.0.0.0:*    users:(("systemd-resolve",pid=91077,fd=15))
udp   UNCONN 0      0                                0.0.0.0:5353       0.0.0.0:*    users:(("kdeconnectd",pid=3188,fd=20))     
udp   UNCONN 0      0                                0.0.0.0:5355       0.0.0.0:*    users:(("systemd-resolve",pid=91077,fd=11))
udp   UNCONN 0      0                                      *:5353             *:*    users:(("systemd-resolve",pid=91077,fd=16))
udp   UNCONN 0      0                                      *:5353             *:*    users:(("kdeconnectd",pid=3188,fd=21))     
udp   UNCONN 0      0                                      *:5355             *:*    users:(("systemd-resolve",pid=91077,fd=13))
udp   UNCONN 0      0                                      *:55304            *:*    users:(("kdeconnectd",pid=3188,fd=23))     
tcp   LISTEN 0      5                          192.168.1.226:5357       0.0.0.0:*    users:(("python3",pid=1252,fd=10))         
tcp   LISTEN 0      4096                       127.0.0.53%lo:53         0.0.0.0:*    users:(("systemd-resolve",pid=91077,fd=22))
tcp   LISTEN 0      4096                             0.0.0.0:5355       0.0.0.0:*    users:(("systemd-resolve",pid=91077,fd=12))
tcp   LISTEN 0      4096                          127.0.0.54:53         0.0.0.0:*    users:(("systemd-resolve",pid=91077,fd=24))
tcp   LISTEN 0      4096                                [::]:5355          [::]:*    users:(("systemd-resolve",pid=91077,fd=14))
tcp   LISTEN 0      5      [fe80::6043:6efc:8c0f:955]%enp3s0:5357          [::]:*    users:(("python3",pid=1252,fd=14))

Which raises the question:

Why is kdeconnectd apparently listening on UDP port 5353 already?
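
A couple of checks that might help narrow this down (a sketch; enp3s0 is the link visible in the ss output above):

$ sudo ss -plntu | grep 5353    # which processes are bound to the mDNS port
$ resolvectl mdns enp3s0        # whether resolved has mDNS enabled on this link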

@TraceyC77
Contributor

I can confirm that KDE Connect is also listening on port 5353 on my Plasma install. According to the KDE Connect docs:

KDE Connect uses dynamic ports in the range 1714-1764 for UDP and TCP.

This is most likely KDE Connect also taking advantage of mDNS, which is normal (confirmed with a brief internet search), and probably not the cause of the errors in your logs.
I also see it listening on the ports I'd expect:

❯ sudo ss -plntu |grep kdecon
udp   UNCONN 216448 0                          0.0.0.0:5353       0.0.0.0:*    users:(("kdeconnectd",pid=2435,fd=21))   
udp   UNCONN 0      0                          0.0.0.0:52796      0.0.0.0:*    users:(("kdeconnectd",pid=2435,fd=16))   
udp   UNCONN 0      0                                *:1716             *:*    users:(("kdeconnectd",pid=2435,fd=19))   
udp   UNCONN 0      0                                *:35291            *:*    users:(("kdeconnectd",pid=2435,fd=17))   
udp   UNCONN 214080 0                                *:5353             *:*    users:(("kdeconnectd",pid=2435,fd=22))   
udp   UNCONN 0      0                                *:5353             *:*    users:(("kdeconnectd",pid=2435,fd=23))   
tcp   LISTEN 0      50                               *:1716             *:*    users:(("kdeconnectd",pid=2435,fd=20))   

Also from my internet search, the actual error can be caused by a conflict between systemd-resolved's mDNS and avahi.
I realize from what you wrote that you had stopped avahi and restarted systemd-resolved. How far down were the errors in the log? AFAIK the status output shows the last however-many log lines, not just the ones from the current start (correct me if I'm wrong).

Restarting systemd-resolved seems to have resolved that error for at least one person.

On my system, I don't see that error from the status output. I searched the logs and don't see it there either.
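
One way to check only what the current instance has logged (a sketch):

$ journalctl -u systemd-resolved -b                        # full unit log for the current boot, with timestamps
$ journalctl -u systemd-resolved --since "10 minutes ago"  # or just the entries from after a recent restart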

@ermo
Contributor Author

ermo commented May 1, 2024

I'm going to assume that the Linux networking stack knows how to multiplex packets to multiple listeners on a single UDP port as a matter of design.

The other error I was referring to fixing was avahi and systemd-resolved fighting over the mDNS responder role on UDP port 5353. But that part is fixed.

In my test case, nothing but systemd-resolved was acting as an mDNS responder on port 5353, so that part confirms that the test case is sound and representative.

@ermo
Contributor Author

ermo commented May 1, 2024

Long story short, I think this can be closed now.

I'm open to working on removing avahi-daemon from the base system in concert with switching samba over to systemd-resolved MulticastDNS=true default settings; it will essentially just be a build-time setting change for systemd.
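
For reference, that build-time change would presumably be something along these lines for the systemd package (a sketch, assuming the meson option is still named default-mdns):

# meson configure option for systemd:
-Ddefault-mdns=yes

Alternatively, shipping a resolved.conf.d drop-in with MulticastDNS=yes would change the default without a rebuild.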

@TraceyC77
Contributor

Just to note that avahi-daemon and systemd-resolved are both running on my system, and I don't have errors in the logs for either service.

@silkeh
Member

silkeh commented May 1, 2024

Long story short, I think this can be closed now.

Maybe create two follow-up tasks: disabling LLMNR by default, and replacing avahi-daemon with systemd-resolved?

I'm open to working on removing avahi-daemon from the base system in concert with switching samba over to systemd-resolved MulticastDNS=true default settings; it will essentially just be a build-time setting change for systemd.

I'm also willing to work on this when I'm back from vacation. Note that the timeouts should be reduced, see #1736 (this comment specifically)

Just to note that avahi-daemon and systemd-resolved are both running on my system, and I don't have errors in the logs for either service.

This is normal when MulticastDNS is not enabled. You can check whether it is enabled by looking for +mDNS / -mDNS in the resolvectl output.
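
For example (a sketch):

$ resolvectl status | grep -E 'Link|Protocols'   # look for +mDNS / -mDNS on each link's Protocols line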

@ermo
Contributor Author

ermo commented May 1, 2024

Maybe create two follow-up tasks: disabling LLMNR by default, and replacing avahi-daemon with systemd-resolved?

Created #2442 and #2443 as suggested.

Closing this as completed.

@ermo ermo closed this as completed May 1, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Solus May 1, 2024