Improve _ChunkedValue's handling of split chunks with unexpected whitespace #1172

ross · 2024-05-08T22:09:54Z

This allows arbitrary whitespace before the opening ", after the closing ", or in between the chunks. Lots of tests.

For the life of me I couldn't find an RFC section that covered chunked TXT records details.

https://en.wikipedia.org/wiki/TXT_record points to https://datatracker.ietf.org/doc/html/rfc1035 & https://datatracker.ietf.org/doc/html/rfc6763, neither of which mentions it afaict (with a quick skim/search), and I searched all of the RFCs linked off of 1035 as well. 🤷

/cc octodns/octodns-hetzner#37 @peterablehmann @istr

istr · 2024-05-09T14:10:31Z

@ross

For the life of me I couldn't find an RFC section that covered chunked TXT records details.

You can find it distributed in RFC 1035.

5.1. Format
The format of these files is a sequence of entries. Entries are predominantly line-oriented, though parentheses can be used to continue a list of items across a line boundary, and text literals can contain CRLF within the text. Any combination of tabs and spaces act as a delimiter between the separate items that make up an entry.
[...]
Blank lines, with or without comments, are allowed anywhere in the file.
[...]
<character-string> is expressed in one or two ways: as a contiguous set of characters without interior spaces, or as a string beginning with a " and ending with a ".

3.3.14. TXT RDATA format
TXT-DATA One or more <character-string>s.

Since character-string is a separate item in the spec, and items can have "any combination of tabs and spaces as a delimiter", you have it. Either two character-strings are on separate lines (possibly with blank lines in between) or they are separated by at least one space or tab.

This specification only applies to master files, strictly speaking. Over the wire, there is no space between the elements. However, I think it makes sense to allow relaxed whitespace semantics when it comes to HTTP-based API transport.
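For contrast with the master-file form, here is a minimal Python sketch of how character-strings sit next to each other on the wire: each one is a single length octet followed by that many octets, with nothing in between. The helper name `txt_rdata_wire` is made up for illustration (it is not part of octoDNS), and ASCII content is assumed.

```python
def txt_rdata_wire(chunks):
    """Encode character-strings as TXT RDATA: length octet, then the bytes."""
    out = bytearray()
    for chunk in chunks:
        data = chunk.encode('ascii')  # assumption: ASCII-only content
        if len(data) > 255:
            # a single length octet cannot describe more than 255 octets
            raise ValueError('character-string longer than 255 octets')
        out.append(len(data))
        out += data
    return bytes(out)


print(txt_rdata_wire(['foo bar', 'baz']))  # b'\x07foo bar\x03baz'
```

The whitespace between `"foo bar"` and `"baz"` in a zone file therefore never reaches the wire; it only delimits the items.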

istr

I would add error handling for malformed data where v starts with " but still does not end with " after the strip operation.

istr · 2024-05-09T14:18:14Z

octodns/record/chunked.py

@@ -62,9 +63,11 @@ def validate(cls, data, _type):
    def process(cls, values):
        ret = []
        for v in values:
+            # remove leading/trailing whitespace
+            v = v.strip()
            if v and v[0] == '"':


Throw an error if v and v[0] == '"' and v[-1] != '"' instead of silently allowing a malformed value.
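A minimal sketch of that check, written as a helper that returns a list of reasons rather than raising, on the assumption that it would be called from validation. The name `check_quoted` and the message text are hypothetical, not octoDNS code.

```python
def check_quoted(value):
    """Return a list of problems found in a possibly-quoted chunked value."""
    v = value.strip()
    # a value that opens with a quote must also close with one
    if v and v[0] == '"' and (len(v) < 2 or v[-1] != '"'):
        return ['unbalanced quotes in chunked value']
    return []


print(check_quoted(' "foo" '))  # []
print(check_quoted('"foo'))     # ['unbalanced quotes in chunked value']
```

This only looks at the first and last characters, so a value whose final quote is itself escaped (`"foo\"`) would still slip through; catching that needs a real tokenizer.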

Makes sense & will look into this. Broadly speaking the error handling/checking happens in validate rather than process so that lenient=True can do a best effort conversion if told to do so by the user. This would need to be taken into account if there's a shift to a parser style setup as discussed above, it would need to be as forgiving as possible, and need to be able to (more) strictly check things during validate.

istr · 2024-05-09T14:21:10Z

octodns/record/chunked.py

@@ -34,6 +34,7 @@ def rr_values(self):

 class _ChunkedValue(str):
    _unescaped_semicolon_re = re.compile(r'\w;')
+    _chunk_sep_re = re.compile(r'"\s+"')


Strictly speaking, the spec would also allow "mixed case chunks" such as "foo bar baz" blablablablabla "yabba dabba dooh". Do we want to handle this as well or wait until we encounter this in the wild?

In any case I would change the implementation to properly tokenize character-string as per spec:

Treat the leading and trailing " to be part of each chunk (= character-string), not "\s+" to be a separator.

Check for wellformedness (unescaped " pair matching).

Unescape embedded \" as per spec.

Due to the definition of character-string I guess that a walk/state-based approach would work better than a regex based approach:

1. set target_value to empty
2. walk whitespace (i.e. strip leading whitespace)
3. set chunk_start_pos to pos
4. set quoted_flag if v[pos] == '"' else reset quoted_flag
5. walk to next " or \s depending on quoted flag
6. if quoted_flag and '\' == v[pos-1]: append to target_value from chunk_start_pos to pos - 2, repeat from 5. (this means: keep walking the current quoted chunk)
7. append to target_value from chunk_start_pos to pos - 1 (this means: finish any current chunk)
8. repeat from 2.
9. throw error if quoted_flag and '"' != v[-1]
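The walk described in the steps above could look roughly like the following Python sketch. It is one possible reading, not the code from this PR: `parse_character_strings` is a hypothetical name, escapes are handled by looking ahead from the backslash rather than back from the quote, and `\DDD` decimal escapes are not handled.

```python
def parse_character_strings(value):
    """Join whitespace-separated character-strings into one plain string."""
    pieces = []
    pos, n = 0, len(value)
    while pos < n:
        ch = value[pos]
        if ch.isspace():
            # whitespace only delimits items, skip it
            pos += 1
        elif ch == '"':
            # quoted chunk: walk to the matching unescaped quote
            pos += 1
            buf = []
            while pos < n and value[pos] != '"':
                if value[pos] == '\\' and pos + 1 < n:
                    pos += 1  # drop the backslash, keep the escaped char
                buf.append(value[pos])
                pos += 1
            if pos >= n:
                raise ValueError('unterminated quoted character-string')
            pos += 1  # step over the closing quote
            pieces.append(''.join(buf))
        else:
            # unquoted chunk: runs until the next whitespace
            start = pos
            while pos < n and not value[pos].isspace():
                pos += 1
            pieces.append(value[start:pos])
    return ''.join(pieces)


mixed = '"foo bar baz" blablablablabla "yabba dabba dooh"'
print(parse_character_strings(mixed))
# foo bar bazblablablablablayabba dabba dooh
```

Because quoted and unquoted chunks go through the same loop, the "mixed" case needs no special handling here.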

If you want to stick with the simple regex approach, I would tend to use "\s*" to be the separator rather than \s+.
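To make the difference concrete, a small Python comparison of the two separators. It assumes the surrounding logic strips the outer quotes and then removes every separator match; `join_chunks` is a stand-in written for this example, not the method from the diff.

```python
import re

plus_sep = re.compile(r'"\s+"')  # separator as in the diff
star_sep = re.compile(r'"\s*"')  # suggested alternative


def join_chunks(value, sep_re):
    v = value.strip()
    if len(v) >= 2 and v[0] == '"' and v[-1] == '"':
        v = v[1:-1]  # drop the outermost quotes
    return sep_re.sub('', v)  # remove every chunk boundary


print(join_chunks('"a" "b"', plus_sep))  # ab
print(join_chunks('"a""b"', plus_sep))   # a""b
print(join_chunks('"a""b"', star_sep))   # ab
```

With `\s+`, chunks that touch each other with no whitespace at all leave their inner quotes behind in the value; `\s*` joins them.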

Strictly speaking, the spec would also allow "mixed case chunks" such as "foo bar baz" blablablablabla "yabba dabba dooh". Do we want to handle this as well or wait until we encounter this in the wild?

Interesting.

In any case I would change the implementation to properly tokenize character-string as per spec:
...

Will explore this, and probably check and see what other DNS libs do for it as well.

Strictly speaking, the spec would also allow "mixed case chunks" such as "foo bar baz" blablablablabla "yabba dabba dooh". Do we want to handle this as well or wait until we encounter this in the wild?

Double checking my current read here. Unquoted values cannot contain spaces? The old code definitely allowed that, but tbh no clue if it actually ever saw it in the real world. Parser-wise I'm not really sure how we'd support mixing quoted and unquoted if the unquoted bits can have spaces.

Figured the best/easiest thing to do would be to turn to octodns-bind and zonefile provider to see how it behaves since it's using dnspython for reading the zonefile:

Zonefile

one IN TXT no quotes in this one
two IN TXT "this one was quoted"
three IN TXT "this one has both, this quoted bit" and this unquoted section
four IN TXT "multiple quoted" "sections in this one"
five IN TXT "same with multiple spaces" "between the quotes"
six IN TXT "leading spaces before" "and trailing spaces after"
seven IN TXT "first value"
seven IN TXT "second value"

Data loaded from that

[Rr<exxampled.com., NS, 86400, dns1.exxampled.com.,
 Rr<exxampled.com., NS, 86400, dns2.exxampled.com.,
 Rr<one.exxampled.com., TXT, 86400, "no" "quotes" "in" "this" "one",
 Rr<two.exxampled.com., TXT, 86400, "this one was quoted",
 Rr<three.exxampled.com., TXT, 86400, "this one has both, this quoted bit" "and" "this" "unquoted" "section",
 Rr<four.exxampled.com., TXT, 86400, "multiple quoted" "sections in this one",
 Rr<five.exxampled.com., TXT, 86400, "same with multiple spaces" "between the quotes",
 Rr<six.exxampled.com., TXT, 86400, "leading spaces before" "and trailing spaces after",
 Rr<seven.exxampled.com., TXT, 86400, "first value",
 Rr<seven.exxampled.com., TXT, 86400, "second value"]

And what octoDNS currently makes of that:

********************************************************************************
* exxampled.com.
********************************************************************************
* config (YamlProvider)
* Create Zone<exxampled.com.>
* Create <NsRecord NS 86400, exxampled.com., ['dns1.exxampled.com.', 'dns2.exxampled.com.']> ()
* Create <TxtRecord TXT 86400, five.exxampled.com., ['same with multiple spacesbetween the quotes']> ()
* Create <TxtRecord TXT 86400, four.exxampled.com., ['multiple quotedsections in this one']> ()
* Create <TxtRecord TXT 86400, one.exxampled.com., ['noquotesinthisone']> ()
* Create <TxtRecord TXT 86400, seven.exxampled.com., ['first value', 'second value']> ()
* Create <TxtRecord TXT 86400, six.exxampled.com., ['leading spaces beforeand trailing spaces after']> ()
* Create <TxtRecord TXT 86400, three.exxampled.com., ['this one has both, this quoted bitandthisunquotedsection']> ()
* Create <TxtRecord TXT 86400, two.exxampled.com., ['this one was quoted']> ()
* Summary: Creates=8, Updates=0, Deletes=0, Existing Records=0
********************************************************************************

That pretty well lays out the scope of possible behavior.

It is clear that octodns-bind wouldn't round trip unquoted items. I think the results would be semantically the same, but the text written out to the file will all be quoted:

five 86400 IN TXT "same with multiple spacesbetween the quotes"
four 86400 IN TXT "multiple quotedsections in this one"
one 86400 IN TXT "nospacesinthisone"
seven 86400 IN TXT "first value"
      86400 IN TXT "second value"
six 86400 IN TXT "leading spaces beforeand trailing spaces after"
three 86400 IN TXT "this one has both, this quoted bitandthisunquotedsection"
two 86400 IN TXT "this one was quoted"

I don't see a way this could be avoided, since octoDNS internally represents a value as a single arbitrary-length string, which I think is FAR preferable to making every user deal with all the complexities that we're now looking at.

So my plan for a path forward is to explore using dnspython's code for parsing the data. I remember looking into it back when I first did all the rdata stuff, but didn't end up using it and I don't recall exactly why. If that doesn't end up making sense I'll look at a parser setup that will match the behavior.

So I have a setup that's using dnspython to do the handling of the strings. It's not super straightforward since dnspython internally holds things as a list of bytes, but that can be decoded into what octoDNS needs. The other "extra" bit here is that dnspython does not escape ;, which is one of the biggest regrets I have with octoDNS. I so wish I didn't decide that they needed to be escaped in TXT values. I just put it on my TODO list to explore whether we can get rid of that as part of a major release, but ...
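Roughly what that conversion involves, as a hedged Python sketch: join the byte strings, decode them, and escape any semicolon that is not already escaped. The function takes a plain iterable of `bytes` rather than a real dnspython object, the name `from_byte_strings` is invented for this example, and UTF-8 decoding is an assumption.

```python
import re

# a semicolon that is not directly preceded by a backslash
_bare_semicolon_re = re.compile(r'(?<!\\);')


def from_byte_strings(strings):
    """Turn a sequence of bytes chunks into one octoDNS-style string."""
    joined = b''.join(strings).decode('utf-8')  # assumption: UTF-8 content
    return _bare_semicolon_re.sub(r'\\;', joined)


print(from_byte_strings([b'v=DKIM1; k=rsa; ', b'p=abc']))
# v=DKIM1\; k=rsa\; p=abc
```

Semicolons that already carry a backslash are left alone, so running the result through the function a second time does not double-escape.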

A side effect of all this digging work is that a bunch of the existing TXT rdata test cases, all the ones w/o quotes, fail with the changes as there was an incorrect assumption that whitespace should be preserved. I don't think any rdata commonly works with unquoted TXT values and looking at the code/uses I think that change is fine, but just wanted to call it out/record it here.

Scratch that, the dnspython route isn't viable, as there are validation checks preventing it from processing tokens > 255 chars. I can't feed it pieces of the value either, since the parsing that looks for quotes would have to span the chunks.
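For context on where that 255 limit bites: a single octoDNS value can be arbitrarily long, so it has to be cut into pieces of at most 255 characters before it can be written as character-strings. A minimal sketch of that direction, assuming ASCII content (so characters equal octets); `chunk_value` is a hypothetical helper, not the octoDNS implementation.

```python
def chunk_value(value, size=255):
    """Split a plain string into quoted character-strings of <= size chars."""
    # cut first, escape second, so an escape is never split across chunks
    pieces = [value[i:i + size] for i in range(0, len(value), size)] or ['']
    return [
        '"' + p.replace('\\', '\\\\').replace('"', '\\"') + '"'
        for p in pieces
    ]


chunks = chunk_value('a' * 600)
print(len(chunks))               # 3
print([len(c) for c in chunks])  # [257, 257, 92]
```

The lengths printed include the two surrounding quotes; the content of each chunk stays within 255.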

ross

Any combination of tabs and spaces act as a delimiter between the separate items that make up an entry.

Guess it's good that I added tests/handling for both spaces & tabs then.

is expressed in one or two ways: as a contiguous set of characters without interior spaces, or as a string beginning with a " and ending with a ".

Thanks for digging this up. I was looking for something more specifically tied to TXT/SPF records, wasn't really thinking about it being a more generalized separator.

ross · 2024-05-10T22:13:45Z

This turned out to be involved. The existing parse_rdata_text didn't follow the spec at all. It just took the rdata as-is after replacing ; with \\;. That bit is required due to poor decisions years ago, but otherwise not actually unquoting quoted sections and including them verbatim was wrong as was the lack of concatenating whitespace separated chunks.

It should now adhere to the spec and convert things to octoDNS's format (plain strings) correctly.

The Record.new rdata handling should now correctly parse quoted strings and handle escaped quotes. It does not/cannot deal with unquoted whitespace-separated chunks, as it doesn't know when it's getting rdata and when it's getting a plain octoDNS-style string. So basically: quoted means treat it as rdata items; unquoted means use it as-is, verbatim.
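That quoted-versus-unquoted rule can be illustrated with a short Python sketch. It is a regex-based approximation written for this comment, not the code in the PR; `to_plain` is a hypothetical name, and anything sitting between quoted sections is simply ignored.

```python
import re

# one quoted character-string; the group captures its still-escaped content
_quoted_re = re.compile(r'"((?:[^"\\]|\\.)*)"')
# a backslash escape, capturing the escaped character
_escape_re = re.compile(r'\\(.)')


def to_plain(value):
    """Quoted input is parsed as rdata items; anything else is kept as-is."""
    if not value.lstrip().startswith('"'):
        return value  # plain octoDNS-style string, used verbatim
    chunks = _quoted_re.findall(value)
    return ''.join(_escape_re.sub(r'\1', c) for c in chunks)


print(to_plain('no quotes in this one'))
# no quotes in this one
print(to_plain('"multiple quoted" "sections in this one"'))
# multiple quotedsections in this one
```

Note how the unquoted input keeps its spaces, while the zone-file reading of the same text shown earlier would drop them.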

I don't think anything actually uses non-quoted txt rdata so I think this is a non-issue or else problems would have come up before...

Anyway, this shouldn't change behavior beyond fixing previous problems/shortcomings, but it is a BIG change to the logic/code, so it needs lots of 👀 and careful thinking before moving forward.

Janik-Haag · 2024-05-19T17:04:36Z

en.wikipedia.org/wiki/TXT_record points to datatracker.ietf.org/doc/html/rfc1035 & datatracker.ietf.org/doc/html/rfc6763 neither of which mention it afaict (with a quick skim/search) and I search all of the RFCs linked off of 1035 as well. 🤷

https://datatracker.ietf.org/doc/html/rfc1035#section-3.3 is what you are looking for.
fwiw: I found the bind docs to be rather helpful https://kb.isc.org/docs/aa-00356

ross · 2024-05-20T20:11:42Z

I found the bind docs to be rather helpful https://kb.isc.org/docs/aa-00356

That linked to https://datatracker.ietf.org/doc/html/rfc4408#section-3.1.3, which does explicitly talk about multiple strings, though not the quoting part. That section also links to https://datatracker.ietf.org/doc/html/rfc1035#section-3.3 and https://datatracker.ietf.org/doc/html/rfc1035#section-3.3.14 which have previously been mentioned.

That 4408 bit in combination with the details of quoted strings from 1035 does combine together to spell out the full behavior, though nothing mentions whether quoted and unquoted can be mixed explicitly.

I think this PR gets things into as good a shape as it's possible to get them.

ross

/ci modules build

ross · 2024-06-04T21:41:54Z

DNS Made Easy is failing and looking at the test I'm not 100% sure whether this is a regression or that provider has incorrect expectations. I created a DNS Made Easy account, but can't get an API key for it. Filed a support ticket so I'll see where that goes.

5abf30b1 (Jessica Smith  2023-08-02 10:59:31 +0100 353)     "value": "\"This is a TXT record with \\\"quotes\\\" in it to ensure they are handled correctly\"",

Anyway, for the moment this is blocked.

ross · 2024-06-05T02:23:11Z

Anyway, for the moment this is blocked.

Set up a sandbox account; looks like TXT quote handling is generally broken in octodns-dnsmadeeasy, octodns/octodns-dnsmadeeasy#47.

Going to figure out what's up there before moving forward with this, but given that it's broken already I'm not going to consider the test failure blocking.

ross · 2024-06-05T02:44:10Z

I'm unable to get quotes into TXT values via the UI or API with DnsMadeEasy no matter what I try, so definitely not blocking... octodns/octodns-dnsmadeeasy#47 (comment)

ross · 2024-06-05T21:48:15Z

I'm unable to get quotes into TXT values via the UI or API with DnsMadeEasy no matter what I try, so definitely not blocking... octodns/octodns-dnsmadeeasy#47 (comment)

Looks like that's known/expected behavior for DNS Made Easy. Going to rework the tests as such and revisit if/when that changes octodns/octodns-dnsmadeeasy#47 (comment)

ross · 2024-06-07T04:11:15Z

Looks like that's known/expected behavior for DNS Made Easy. Going to rework the tests as such and revisit if/when that changes octodns/octodns-dnsmadeeasy#47 (comment)

/cc octodns/octodns-dnsmadeeasy#48 which removes the bad tests and adds a strict_supports check for quotes in TXT values.

Stale, changed since.

ross · 2024-06-12T15:09:51Z

As I've gotten further and further into the things that shake out of this change, it's touched more and more. I think I'm at the point where I'm going to say it should be a 2.x change, since it's fairly large and has tendrils. I need to sit down and figure out what other things I'd like to clean up as part of that, e.g. ; handling in values and a lot of deprecated/TODO remove from 2.x bits.

This will be incorporated into that planning.

Improve _ChunkedValue's handling of split chunks with unexpected whit…

89b3650

…espace

ross self-assigned this May 8, 2024

istr previously requested changes May 9, 2024

istr reviewed May 9, 2024

ross commented May 9, 2024

Near complete rework of chunked rdata handling/parsing

dd745f9

ross added 2 commits June 4, 2024 14:16

Clean up and address chunked test todos

db47c50

Merge remote-tracking branch 'origin/main' into chunked-variation-han…

60e8c63

…dling

ross commented Jun 4, 2024

ross mentioned this pull request Jun 5, 2024

Quote characters in TXT values aren't correctly handled octodns/octodns-dnsmadeeasy#47

Closed

ross requested a review from istr June 7, 2024 04:11

Merge remote-tracking branch 'origin/main' into chunked-variation-han…

9f8ac99

…dling

wip stuff to store progress, queuing for 2.x

a2e9af7

ross mentioned this pull request Jul 16, 2024

Chunk long TXT records when using ZoneFileProvider octodns/octodns-bind#65

Closed

github-actions bot added the Stale label Sep 17, 2024

github-actions bot closed this Sep 24, 2024

ross mentioned this pull request Oct 9, 2024

V2: Chunked value handling fixes #1219

Open
