-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rec: dedup records #14617
base: master
Are you sure you want to change the base?
rec: dedup records #14617
Conversation
Pull Request Test Coverage Report for Build 11912381008Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic looks good to me. I guess we could try a few more heuristics before actually serializing the content, like checking if we have several records with the same (qtype, qname), but it might not be worth it. Having real-world numbers would indeed be useful.
|
||
string record; | ||
packetWriter.getContentWireFormat(record); // needs to be called before commit() | ||
return record; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go forward, we might want to consider a small refactoring to avoid duplicating code between serialize
and wireFormatContent
.
pdns/shuffle.hh
Outdated
@@ -29,4 +29,5 @@ namespace pdns | |||
{ | |||
void shuffle(std::vector<DNSZoneRecord>& rrs); | |||
void orderAndShuffle(std::vector<DNSRecord>& rrs, bool includingAdditionals); | |||
unsigned int dedup(std::vector<DNSRecord>& rrs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps dedupRecords
instead of dedup
?
863fd47
to
8089aeb
Compare
Rebased to fix conflict. |
Some observations: Software tested in default config |
1de6a90
to
5a98b0f
Compare
Speedtest results:
The measured slowdown is about 2.5 and is uniform over the various test case sizes. So the dedupping takes time as expected, but for the already pretty extreme case of 256, records, its absolute value is not a lot compared to the expected network latency. For the 4096 case we spent time that comes closer to the expected network latency. |
Can we rule out that deduping is an attack vector? |
Not completely, in the extreme case spending even a few CPU ms on a single auth result is quite a lot. |
080d63b
to
97795db
Compare
Are you considering an on/off switch for it? |
Yes, that would be one of the options. Another alternative would be to not do the dedupping on large answers as we already refuse to cache them anyway. |
I played a bit with a pre-scan on qtype and name only, but saw no speedup |
…ows for an unordered_set as well.
97795db
to
274b82b
Compare
274b82b
to
e26c334
Compare
Short description
This deduplicaties records in two places:
When testing this, I encountered a few cases where auths sent duplicate records.
Nothing actually breaks if we don't dedup afaik. So it is questionable of we want this part.
But on the client side, this fixes #14120 and maybe other cases I do not know.
The big question is if we want this. Dedupping is fundamentally not cheap, although I tried to optimize the dedup code.
Originally I played with the idea to change the data structure building the reply vector to avoid duplicates, but that requires changes in many places. Still the idea is not completely off the table.
On the receiving side the dedup internals could also be weaved into the sanitize code, at the cost of increasing complexity. That would avoid the separate dedup() call.
This will remain a draft until I have some some speed measurements and pondered the alternative approaches some more. This PR is mainly to share thoughts.
The test changes are needed as a few of them use duplicate records.
Checklist
I have: