dns: overflowing header size #1350

Open
tompaah opened this issue Dec 5, 2024 · 1 comment

tompaah commented Dec 5, 2024

I have two RKE2 clusters. Internally, RKE2 uses a CoreDNS-based service (rke2-coredns-rke2-coredns, image: rancher/hardened-coredns:v1.11.1-build20240305) to provide name resolution for the pods. These, in turn, forward to the hosts' name resolution for upstream resolving.

Yesterday I switched the hosts' name resolution over to Gravity, and this morning we had quite a few problems in most of the RKE2 pods. The rke2-coredns-rke2-coredns pods were logging many errors of this exact form:
plugin/errors: 2 login.microsoftonline.com. A: dns: overflowing header size

Unfortunately I had to switch back from Gravity to the old resolution service; after restarting the pods, the error messages immediately disappeared.

This host (login.microsoftonline.com) returns quite a lot of answers, so perhaps the overflow issue is related to the response size?

For the moment I can't replicate the issue; the clusters are semi-production and I can't experiment with them further. But I'd imagine setting up a chain like coredns -> gravity -> upstream and then querying coredns for login.microsoftonline.com might reproduce it.

tompaah commented Dec 5, 2024

I spun up a coredns instance (here: 192.168.210.5, port 54) that forwards all queries to my Gravity instance (here: 192.168.210.8). So when querying the coredns instance, the path is client -> coredns -> Gravity -> internet.
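
For reference, the forwarding instance is roughly this Corefile (a minimal sketch; the listen port and the Gravity address match the values above, everything else is left at defaults):

.:54 {
    errors
    log
    forward . 192.168.210.8
}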

If I query coredns for something that yields a very large response, the output begins with "Truncated, retrying in TCP mode.", for example with the login.microsoftonline.com lookup (I have yet to find another name that produces such a large answer).

$ nslookup -port=54 login.microsoftonline.com 192.168.210.5
;; Truncated, retrying in TCP mode.
Server:		192.168.210.5
Address:	192.168.210.5#54

Non-authoritative answer:
login.microsoftonline.com	canonical name = login.mso.msidentity.com.
login.mso.msidentity.com	canonical name = ak.privatelink.msidentity.com.
ak.privatelink.msidentity.com	canonical name = www.tm.ak.prd.aadg.trafficmanager.net.
Name:	www.tm.ak.prd.aadg.trafficmanager.net
Address: 20.190.181.5
Name:	www.tm.ak.prd.aadg.trafficmanager.net
Address: 40.126.53.9
Name:	www.tm.ak.prd.aadg.trafficmanager.net
Address: 20.190.181.4
Name:	www.tm.ak.prd.aadg.trafficmanager.net
Address: 40.126.53.17
Name:	www.tm.ak.prd.aadg.trafficmanager.net
Address: 40.126.53.8
Name:	www.tm.ak.prd.aadg.trafficmanager.net
Address: 40.126.53.10
Name:	www.tm.ak.prd.aadg.trafficmanager.net
Address: 40.126.53.7
Name:	www.tm.ak.prd.aadg.trafficmanager.net
Address: 20.231.128.67

Querying my Gravity instance directly does not result in the truncation and TCP retry; the output is the same, just without the truncation.

$ nslookup login.microsoftonline.com 192.168.210.8
Server:		192.168.210.8
Address:	192.168.210.8#53

Non-authoritative answer:
login.microsoftonline.com	canonical name = login.mso.msidentity.com.
login.mso.msidentity.com	canonical name = ak.privatelink.msidentity.com.
ak.privatelink.msidentity.com	canonical name = www.tm.ak.prd.aadg.akadns.net.
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.149
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.83
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.22
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.10
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.85
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.19
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.5
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.9
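
To compare the two hops more directly, the TC bit and the UDP payload size can be checked with dig (a sketch using the same addresses as above; +ignore keeps dig from retrying over TCP, so the flags line and "MSG SIZE rcvd" show the raw UDP answer). Adding +noedns or +bufsize=512 can also show whether the behaviour depends on the advertised EDNS buffer size.

$ dig @192.168.210.5 -p 54 +ignore login.microsoftonline.com
$ dig @192.168.210.8 +ignore login.microsoftonline.com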

Now, if I alter the path to swap out Gravity for our old (and working) DNS forwarder, dnsmasq:
client -> coredns -> dnsmasq -> internet
even the long login.microsoftonline.com query works without the truncation and TCP retry.

$ nslookup -port=54 login.microsoftonline.com 192.168.210.5
Server:		192.168.210.5
Address:	192.168.210.5#54

Non-authoritative answer:
login.microsoftonline.com	canonical name = login.mso.msidentity.com.
login.mso.msidentity.com	canonical name = ak.privatelink.msidentity.com.
ak.privatelink.msidentity.com	canonical name = www.tm.ak.prd.aadg.akadns.net.
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.10
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.85
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.19
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.5
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.147.9
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.149
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.83
Name:	www.tm.ak.prd.aadg.akadns.net
Address: 20.190.177.22

So there appears to be something not working correctly when using Gravity that works fine when dnsmasq is substituted for Gravity. Let me know if I can provide logs, pcaps, or anything else.
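
If a pcap would be useful, I could capture on the Gravity host while repeating the lookup with something like this (a sketch; the interface and filter would need adjusting to the actual setup):

$ tcpdump -i any -w gravity-dns.pcap port 53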
