Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve _ChunkedValue's handling of split chunks with unexpected whitespace #1172

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ross
Copy link
Contributor

@ross ross commented May 8, 2024

This allows arbitrary whitespace before the opening ", after the closing ", or in between the chunks. Lots of tests.

For the life of me I couldn't find an RFC section that covered chunked TXT records details.

https://en.wikipedia.org/wiki/TXT_record points to https://datatracker.ietf.org/doc/html/rfc1035 & https://datatracker.ietf.org/doc/html/rfc6763 neither of which mention it afaict (with a quick skim/search) and I search all of the RFCs linked off of 1035 as well. 🤷

/cc octodns/octodns-hetzner#37 @peterablehmann @istr

@ross ross self-assigned this May 8, 2024
@istr
Copy link

istr commented May 9, 2024

@ross

For the life of me I couldn't find an RFC section that covered chunked TXT records details.

You can find it distributed in RFC 1035.

  • 5.1. Format
    The format of these files is a sequence of entries. Entries are predominantly line-oriented, though parentheses can be used to continue a list of items across a line boundary, and text literals can contain CRLF within the text. Any combination of tabs and spaces act as a delimiter between the separate items that make up an entry.
    [...]
    Blank lines, with or without comments, are allowed anywhere in the file.
    [...]
    is expressed in one or two ways: as a contiguous set of characters without interior spaces, or as a string beginning with a " and ending with a ".
  • 3.3.14. TXT RDATA format
    TXT-DATA One or more s.

Since character-string is a separate item in the spec, and items can have "any combination of tabs and spaces as a delimiter", you have it. Either two character-strings are on separate lines (possibly with blank lines in between) or they are separated by at least one space or tab.

This specification only applies to master files, strictly speaking. Over the wire, there is no space between the elements. However, I think it makes sense to allow relaxed whitespace semantics when it comes to HTTP-based API transport.

Copy link

@istr istr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add error handling for malformed data where v starts with " but still does not end with " after the strip operation.

@@ -62,9 +63,11 @@ def validate(cls, data, _type):
def process(cls, values):
ret = []
for v in values:
# remove leading/trailing whitespace
v = v.strip()
if v and v[0] == '"':
Copy link

@istr istr May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw an error if v and v[0] == '"' and v[-1] != '"' instead of silently allowing a malformed value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense & will look into this. Broadly speaking the error handling/checking happens in validate rather than process so that lenient=True can do a best effort conversion if told to do so by the user. This would need to be taken into account if there's a shift to a parser style setup as discussed above, it would need to be as forgiving as possible, and need to be able to (more) strictly check things during validate.

@@ -34,6 +34,7 @@ def rr_values(self):

class _ChunkedValue(str):
_unescaped_semicolon_re = re.compile(r'\w;')
_chunk_sep_re = re.compile(r'"\s+"')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, the spec would also allow "mixed case chunks" such as "foo bar baz" blablablablabla "yabba dabba dooh". Do we want to handle this as well or wait until we encounter this in the wild?

Copy link

@istr istr May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case I would change the implementation to properly tokenize character-string as per spec:

  • Treat the leading and trailing " to be part of each chunk (= character-string), not "\s+" to be a separator.
  • Check for wellformedness (unescaped " pair matching).
  • Unescape embedded \" as per spec.

Due to the definition of character-string I guess that a walk/state-based approach would work better than a regex based approach:

  1. set target_value to empty
  2. walk whitespace (i.e. strip leading whitespace)
  3. set chunk_start_pos to pos
  4. set quoted_flag if v[pos] == '"' else reset quoted_flag
  5. walk to next " or \s depending on quoted flag
  6. if quoted_flag and '\' == v[pos-1]: append to target_value from chunk_start_pos to pos - 2, repeat from 5. (this means: keep walking the current quoted chunk)
  7. append to target_value from chunk_start_pos to pos - 1 (this means: finish any current chunk)
  8. repeat from 2.
  9. throw error if quoted_flag and '"' != v[-1]

If you want to stick with the simple regex approach, I would tend to use "\s*" to be the separator rather than \s+.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, the spec would also allow "mixed case chunks" such as "foo bar baz" blablablablabla "yabba dabba dooh". Do we want to handle this as well or wait until we encounter this in the wild?

Interesting.

In any case I would change the implementation to properly tokenize character-string as per spec:
...

Will explore this, and probably check and see what other DNS libs do for it as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, the spec would also allow "mixed case chunks" such as "foo bar baz" blablablablabla "yabba dabba dooh". Do we want to handle this as well or wait until we encounter this in the wild?

Double checking my current read here. Unquoted values cannot contain spaces? The old code definitely allowed that, but tbh no clue if it actually ever saw it in the real world. Parser-wise I'm not really sure how we'd support mixing quoted and unquoted if the unquoted bits can have spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figured the best/easiest thing to do would be to turn to octodns-bind and zonefile provider to see how it behaves since it's using dnspython for reading the zonefile:

Zonefile

one IN  TXT no quotes in this one
two IN  TXT "this one was quoted"
three IN  TXT "this one has both, this quoted bit" and this unquoted section
four IN  TXT "multiple quoted" "sections in this one"
five IN  TXT "same with multiple spaces"     "between the quotes"
six IN  TXT   "leading spaces before" "and trailing spaces after"    
seven IN  TXT   "first value"
seven IN  TXT   "second value"

Data loaded from that

[Rr<exxampled.com., NS, 86400, dns1.exxampled.com.,
 Rr<exxampled.com., NS, 86400, dns2.exxampled.com.,
 Rr<one.exxampled.com., TXT, 86400, "no" "quotes" "in" "this" "one",
 Rr<two.exxampled.com., TXT, 86400, "this one was quoted",
 Rr<three.exxampled.com., TXT, 86400, "this one has both, this quoted bit" "and" "this" "unquoted" "section",
 Rr<four.exxampled.com., TXT, 86400, "multiple quoted" "sections in this one",
 Rr<five.exxampled.com., TXT, 86400, "same with multiple spaces" "between the quotes",
 Rr<six.exxampled.com., TXT, 86400, "leading spaces before" "and trailing spaces after",
 Rr<seven.exxampled.com., TXT, 86400, "first value",
 Rr<seven.exxampled.com., TXT, 86400, "second value"]

And what octoDNS currently makes of that:

********************************************************************************
* exxampled.com.
********************************************************************************
* config (YamlProvider)
*   Create Zone<exxampled.com.>
*   Create <NsRecord NS 86400, exxampled.com., ['dns1.exxampled.com.', 'dns2.exxampled.com.']> ()
*   Create <TxtRecord TXT 86400, five.exxampled.com., ['same with multiple spacesbetween the quotes']> ()
*   Create <TxtRecord TXT 86400, four.exxampled.com., ['multiple quotedsections in this one']> ()
*   Create <TxtRecord TXT 86400, one.exxampled.com., ['noquotesinthisone']> ()
*   Create <TxtRecord TXT 86400, seven.exxampled.com., ['first value', 'second value']> ()
*   Create <TxtRecord TXT 86400, six.exxampled.com., ['leading spaces beforeand trailing spaces after']> ()
*   Create <TxtRecord TXT 86400, three.exxampled.com., ['this one has both, this quoted bitandthisunquotedsection']> ()
*   Create <TxtRecord TXT 86400, two.exxampled.com., ['this one was quoted']> ()
*   Summary: Creates=8, Updates=0, Deletes=0, Existing Records=0
********************************************************************************

That pretty well lays out the scope of possible behavior.

It is clear that octodns-bind wouldn't round trip unquoted items. I think results would semantically be the same, but the actual text written out file will all be quoted.:

five     86400 IN TXT      "same with multiple spacesbetween the quotes"
four     86400 IN TXT      "multiple quotedsections in this one"
one      86400 IN TXT      "nospacesinthisone"
seven    86400 IN TXT      "first value"
         86400 IN TXT      "second value"
six      86400 IN TXT      "leading spaces beforeand trailing spaces after"
three    86400 IN TXT      "this one has both, this quoted bitandthisunquotedsection"
two      86400 IN TXT      "this one was quoted"

I don't see a way this could be avoided since octoDNS internally represents a value as an single arbitrary length string which I think is FAR preferable to avoid every user having to deal will all the complexities that we're no looking at.

So my plan for a path forward is to explore using dnspython's code for parsing the data. I remember looking into it back when I first did all the rdata stuff, but didn't end up using it and I don't recall exactly why. If that doesn't end up making sense I'll look at a parser setup that will match the behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I have a setup that's using dnspython to do the handling of the strings. It's not super straightforward since dnspython internally holds things as a list of bytes, but that can be decoded into what octoDNS needs. The other "extra" bit here is that dnspython does not escape ;, which is one of the biggest regrets I have with octoDNS. I so wish I didn't decide that they needed to be escaped in TXT values. I just put it on my TODO list to explore whether we can get rid of that as part of a major release, but ...

A side effect of all this digging work is that a bunch of the existing TXT rdata test cases, all the ones w/o quotes, fail with the changes as there was an incorrect assumption that whitespace should be preserved. I don't think any rdata commonly works with unquoted TXT values and looking at the code/uses I think that change is fine, but just wanted to call it out/record it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that, dnspython route isn't viable as there are validation checks preventing it from processing tokens > 255 chars. I can't feed it pieces of the value since the parsing looking for quotes would span the chunks.

Copy link
Contributor Author

@ross ross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Any combination of tabs and spaces act as a delimiter between the separate items that make up an entry.

Guess it's good that I added tests/handling for both spaces & tabs then.

  • is expressed in one or two ways: as a contiguous set of characters without interior spaces, or as a string beginning with a " and ending with a ".

Thanks for digging this up. I was looking for something more specifically tied to TXT/SPF records, wasn't really thinking about it being a more generalized separator.

@@ -34,6 +34,7 @@ def rr_values(self):

class _ChunkedValue(str):
_unescaped_semicolon_re = re.compile(r'\w;')
_chunk_sep_re = re.compile(r'"\s+"')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, the spec would also allow "mixed case chunks" such as "foo bar baz" blablablablabla "yabba dabba dooh". Do we want to handle this as well or wait until we encounter this in the wild?

Interesting.

In any case I would change the implementation to properly tokenize character-string as per spec:
...

Will explore this, and probably check and see what other DNS libs do for it as well.

@@ -62,9 +63,11 @@ def validate(cls, data, _type):
def process(cls, values):
ret = []
for v in values:
# remove leading/trailing whitespace
v = v.strip()
if v and v[0] == '"':
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense & will look into this. Broadly speaking the error handling/checking happens in validate rather than process so that lenient=True can do a best effort conversion if told to do so by the user. This would need to be taken into account if there's a shift to a parser style setup as discussed above, it would need to be as forgiving as possible, and need to be able to (more) strictly check things during validate.

@ross
Copy link
Contributor Author

ross commented May 10, 2024

This turned out to be involved. The existing parse_rdata_text didn't follow the spec at all. It just took the rdata as-is after replacing ; with \\;. That bit is required due to poor decisions years ago, but otherwise not actually unquoting quoted sections and including them verbatim was wrong as was the lack of concatenating whitespace separated chunks.

It should now adhere to the spec and convert things to octoDNS's format (plain strings) correctly.

The Record.new rdata handling should now correctly parses quoted strings and handles escaped quotes. It does not/can not deal with unquoted whitespace separated chunks as it doesn't know when it's getting rdata and when it's getting a plain octoDNS style string. So basically quoted means treat it as rdata items. Unquoted means use it as-is verbatim.

I don't think anything actually uses non-quoted txt rdata so I think this is a non-issue or else problems would have come up before...

Anyway this should change behavior beyond fixing previous problems/shortcommings, but it is a BIG change to the logic/code so it needs lots of 👀 and careful thinking before moving forward.

@Janik-Haag
Copy link
Contributor

en.wikipedia.org/wiki/TXT_record points to datatracker.ietf.org/doc/html/rfc1035 & datatracker.ietf.org/doc/html/rfc6763 neither of which mention it afaict (with a quick skim/search) and I search all of the RFCs linked off of 1035 as well. 🤷

https://datatracker.ietf.org/doc/html/rfc1035#section-3.3 is what you are looking for.
fwiw: I found the bind docs to be rather helpful https://kb.isc.org/docs/aa-00356

@ross
Copy link
Contributor Author

ross commented May 20, 2024

I found the bind docs to be rather helpful https://kb.isc.org/docs/aa-00356

That linked to https://datatracker.ietf.org/doc/html/rfc4408#section-3.1.3, which does explicitly talk about multiple strings, though not the quoting part. That section also links to https://datatracker.ietf.org/doc/html/rfc1035#section-3.3 and https://datatracker.ietf.org/doc/html/rfc1035#section-3.3.14 which have previously been mentioned.

That 4408 bit in combination with the details of quoted strings from 1035 does combine together to spell out the full behavior, though nothing mentions whether quoted and unquoted can be mixed explicitly.

I think this PR gets things into as good as shape as it's possible to get them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants