-
-
Notifications
You must be signed in to change notification settings - Fork 391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve _ChunkedValue's handling of split chunks with unexpected whitespace #1172
base: main
Are you sure you want to change the base?
Conversation
You can find it distributed in RFC 1035.
Since This specification only applies to master files, strictly speaking. Over the wire, there is no space between the elements. However, I think it makes sense to allow relaxed whitespace semantics when it comes to HTTP-based API transport. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add error handling for malformed data where v starts with " but still does not end with " after the strip operation.
octodns/record/chunked.py
Outdated
@@ -62,9 +63,11 @@ def validate(cls, data, _type): | |||
def process(cls, values): | |||
ret = [] | |||
for v in values: | |||
# remove leading/trailing whitespace | |||
v = v.strip() | |||
if v and v[0] == '"': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throw an error if v and v[0] == '"' and v[-1] != '"'
instead of silently allowing a malformed value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense & will look into this. Broadly speaking the error handling/checking happens in validate rather than process so that lenient=True
can do a best effort conversion if told to do so by the user. This would need to be taken into account if there's a shift to a parser style setup as discussed above, it would need to be as forgiving as possible, and need to be able to (more) strictly check things during validate.
octodns/record/chunked.py
Outdated
@@ -34,6 +34,7 @@ def rr_values(self): | |||
|
|||
class _ChunkedValue(str): | |||
_unescaped_semicolon_re = re.compile(r'\w;') | |||
_chunk_sep_re = re.compile(r'"\s+"') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, the spec would also allow "mixed case chunks" such as "foo bar baz" blablablablabla "yabba dabba dooh"
. Do we want to handle this as well or wait until we encounter this in the wild?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case I would change the implementation to properly tokenize character-string
as per spec:
- Treat the leading and trailing " to be part of each chunk (=
character-string
), not"\s+"
to be a separator. - Check for wellformedness (unescaped " pair matching).
- Unescape embedded
\"
as per spec.
Due to the definition of character-string
I guess that a walk/state-based approach would work better than a regex based approach:
- set
target_value
to empty - walk whitespace (i.e. strip leading whitespace)
- set
chunk_start_pos
topos
- set
quoted_flag
ifv[pos] == '"'
else resetquoted_flag
- walk to next
"
or\s
depending on quoted flag - if
quoted_flag and '\' == v[pos-1]
: append totarget_value
fromchunk_start_pos
topos - 2
, repeat from 5. (this means: keep walking the current quoted chunk) - append to
target_value
fromchunk_start_pos
topos - 1
(this means: finish any current chunk) - repeat from 2.
- throw error if
quoted_flag and '"' != v[-1]
If you want to stick with the simple regex approach, I would tend to use "\s*"
to be the separator rather than \s+
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, the spec would also allow "mixed case chunks" such as
"foo bar baz" blablablablabla "yabba dabba dooh"
. Do we want to handle this as well or wait until we encounter this in the wild?
Interesting.
In any case I would change the implementation to properly tokenize
character-string
as per spec:
...
Will explore this, and probably check and see what other DNS libs do for it as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, the spec would also allow "mixed case chunks" such as
"foo bar baz" blablablablabla "yabba dabba dooh"
. Do we want to handle this as well or wait until we encounter this in the wild?
Double checking my current read here. Unquoted values cannot contain spaces? The old code definitely allowed that, but tbh no clue if it actually ever saw it in the real world. Parser-wise I'm not really sure how we'd support mixing quoted and unquoted if the unquoted bits can have spaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figured the best/easiest thing to do would be to turn to octodns-bind and zonefile provider to see how it behaves since it's using dnspython for reading the zonefile:
Zonefile
one IN TXT no quotes in this one
two IN TXT "this one was quoted"
three IN TXT "this one has both, this quoted bit" and this unquoted section
four IN TXT "multiple quoted" "sections in this one"
five IN TXT "same with multiple spaces" "between the quotes"
six IN TXT "leading spaces before" "and trailing spaces after"
seven IN TXT "first value"
seven IN TXT "second value"
Data loaded from that
[Rr<exxampled.com., NS, 86400, dns1.exxampled.com.,
Rr<exxampled.com., NS, 86400, dns2.exxampled.com.,
Rr<one.exxampled.com., TXT, 86400, "no" "quotes" "in" "this" "one",
Rr<two.exxampled.com., TXT, 86400, "this one was quoted",
Rr<three.exxampled.com., TXT, 86400, "this one has both, this quoted bit" "and" "this" "unquoted" "section",
Rr<four.exxampled.com., TXT, 86400, "multiple quoted" "sections in this one",
Rr<five.exxampled.com., TXT, 86400, "same with multiple spaces" "between the quotes",
Rr<six.exxampled.com., TXT, 86400, "leading spaces before" "and trailing spaces after",
Rr<seven.exxampled.com., TXT, 86400, "first value",
Rr<seven.exxampled.com., TXT, 86400, "second value"]
And what octoDNS currently makes of that:
********************************************************************************
* exxampled.com.
********************************************************************************
* config (YamlProvider)
* Create Zone<exxampled.com.>
* Create <NsRecord NS 86400, exxampled.com., ['dns1.exxampled.com.', 'dns2.exxampled.com.']> ()
* Create <TxtRecord TXT 86400, five.exxampled.com., ['same with multiple spacesbetween the quotes']> ()
* Create <TxtRecord TXT 86400, four.exxampled.com., ['multiple quotedsections in this one']> ()
* Create <TxtRecord TXT 86400, one.exxampled.com., ['noquotesinthisone']> ()
* Create <TxtRecord TXT 86400, seven.exxampled.com., ['first value', 'second value']> ()
* Create <TxtRecord TXT 86400, six.exxampled.com., ['leading spaces beforeand trailing spaces after']> ()
* Create <TxtRecord TXT 86400, three.exxampled.com., ['this one has both, this quoted bitandthisunquotedsection']> ()
* Create <TxtRecord TXT 86400, two.exxampled.com., ['this one was quoted']> ()
* Summary: Creates=8, Updates=0, Deletes=0, Existing Records=0
********************************************************************************
That pretty well lays out the scope of possible behavior.
It is clear that octodns-bind wouldn't round trip unquoted items. I think results would semantically be the same, but the actual text written out file will all be quoted.:
five 86400 IN TXT "same with multiple spacesbetween the quotes"
four 86400 IN TXT "multiple quotedsections in this one"
one 86400 IN TXT "nospacesinthisone"
seven 86400 IN TXT "first value"
86400 IN TXT "second value"
six 86400 IN TXT "leading spaces beforeand trailing spaces after"
three 86400 IN TXT "this one has both, this quoted bitandthisunquotedsection"
two 86400 IN TXT "this one was quoted"
I don't see a way this could be avoided since octoDNS internally represents a value as an single arbitrary length string which I think is FAR preferable to avoid every user having to deal will all the complexities that we're no looking at.
So my plan for a path forward is to explore using dnspython's code for parsing the data. I remember looking into it back when I first did all the rdata stuff, but didn't end up using it and I don't recall exactly why. If that doesn't end up making sense I'll look at a parser setup that will match the behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I have a setup that's using dnspython to do the handling of the strings. It's not super straightforward since dnspython internally holds things as a list of bytes
, but that can be decoded into what octoDNS needs. The other "extra" bit here is that dnspython does not escape ;
, which is one of the biggest regrets I have with octoDNS. I so wish I didn't decide that they needed to be escaped in TXT values. I just put it on my TODO list to explore whether we can get rid of that as part of a major release, but ...
A side effect of all this digging work is that a bunch of the existing TXT rdata test cases, all the ones w/o quotes, fail with the changes as there was an incorrect assumption that whitespace should be preserved. I don't think any rdata commonly works with unquoted TXT values and looking at the code/uses I think that change is fine, but just wanted to call it out/record it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scratch that, dnspython route isn't viable as there are validation checks preventing it from processing tokens > 255 chars. I can't feed it pieces of the value since the parsing looking for quotes would span the chunks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Any combination of tabs and spaces act as a delimiter between the separate items that make up an entry.
Guess it's good that I added tests/handling for both spaces & tabs then.
- is expressed in one or two ways: as a contiguous set of characters without interior spaces, or as a string beginning with a " and ending with a ".
Thanks for digging this up. I was looking for something more specifically tied to TXT/SPF records, wasn't really thinking about it being a more generalized separator.
octodns/record/chunked.py
Outdated
@@ -34,6 +34,7 @@ def rr_values(self): | |||
|
|||
class _ChunkedValue(str): | |||
_unescaped_semicolon_re = re.compile(r'\w;') | |||
_chunk_sep_re = re.compile(r'"\s+"') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, the spec would also allow "mixed case chunks" such as
"foo bar baz" blablablablabla "yabba dabba dooh"
. Do we want to handle this as well or wait until we encounter this in the wild?
Interesting.
In any case I would change the implementation to properly tokenize
character-string
as per spec:
...
Will explore this, and probably check and see what other DNS libs do for it as well.
octodns/record/chunked.py
Outdated
@@ -62,9 +63,11 @@ def validate(cls, data, _type): | |||
def process(cls, values): | |||
ret = [] | |||
for v in values: | |||
# remove leading/trailing whitespace | |||
v = v.strip() | |||
if v and v[0] == '"': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense & will look into this. Broadly speaking the error handling/checking happens in validate rather than process so that lenient=True
can do a best effort conversion if told to do so by the user. This would need to be taken into account if there's a shift to a parser style setup as discussed above, it would need to be as forgiving as possible, and need to be able to (more) strictly check things during validate.
This turned out to be involved. The existing It should now adhere to the spec and convert things to octoDNS's format (plain strings) correctly. The Record.new rdata handling should now correctly parses quoted strings and handles escaped quotes. It does not/can not deal with unquoted whitespace separated chunks as it doesn't know when it's getting rdata and when it's getting a plain octoDNS style string. So basically quoted means treat it as rdata items. Unquoted means use it as-is verbatim. I don't think anything actually uses non-quoted txt rdata so I think this is a non-issue or else problems would have come up before... Anyway this should change behavior beyond fixing previous problems/shortcommings, but it is a BIG change to the logic/code so it needs lots of 👀 and careful thinking before moving forward. |
https://datatracker.ietf.org/doc/html/rfc1035#section-3.3 is what you are looking for. |
That linked to https://datatracker.ietf.org/doc/html/rfc4408#section-3.1.3, which does explicitly talk about multiple strings, though not the quoting part. That section also links to https://datatracker.ietf.org/doc/html/rfc1035#section-3.3 and https://datatracker.ietf.org/doc/html/rfc1035#section-3.3.14 which have previously been mentioned. That 4408 bit in combination with the details of quoted strings from 1035 does combine together to spell out the full behavior, though nothing mentions whether quoted and unquoted can be mixed explicitly. I think this PR gets things into as good as shape as it's possible to get them. |
This allows arbitrary whitespace before the opening
"
, after the closing"
, or in between the chunks. Lots of tests.For the life of me I couldn't find an RFC section that covered chunked TXT records details.
https://en.wikipedia.org/wiki/TXT_record points to https://datatracker.ietf.org/doc/html/rfc1035 & https://datatracker.ietf.org/doc/html/rfc6763 neither of which mention it afaict (with a quick skim/search) and I search all of the RFCs linked off of 1035 as well. 🤷
/cc octodns/octodns-hetzner#37 @peterablehmann @istr