-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Benchmark for C++ PDQ index #1699
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only some minor inline comments that I think will block extension. Otherwise looks good, thanks for doing it and for doing a README!
// Helper declarations | ||
static facebook::pdq::hashing::Hash256 generateRandomHash(std::mt19937& gen); | ||
static facebook::pdq::hashing::Hash256 addNoise( | ||
const facebook::pdq::hashing::Hash256& original, | ||
int numBitsToFlip, | ||
std::mt19937& gen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem like generally useful function, and might be worth including in some kind of pdq/utils method. Some of your peers have ended up building the same functions in python.
fprintf(fp, "Usage: %s [options]\n", argv0); | ||
fprintf(fp, "Options:\n"); | ||
fprintf(fp, " -v Verbose output\n"); | ||
fprintf(fp, " --seed N Random seed (default: 41)\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for providing a way to make this reproduceable
#ifndef _GNU_SOURCE | ||
#define _GNU_SOURCE | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally thought you were including this for getopt
. I'm a c++ dummy, what functions/functionalities are you including from GNU?
maxDistance, verbose, seed, indexSize, querySize, queries, index); | ||
} else if (method == "mih") { | ||
queryMIH(maxDistance, verbose, seed, indexSize, querySize, queries, index); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsure: Is there an else fail missing here?
std::vector<std::pair<facebook::pdq::hashing::Hash256, std::string>> matches; | ||
|
||
std::chrono::time_point<std::chrono::steady_clock> t1, t2; | ||
std::chrono::duration<double> elapsedSeconds; | ||
|
||
t1 = std::chrono::steady_clock::now(); | ||
for (const auto& it : queries) { | ||
for (const auto& it2 : index) { | ||
if (it.first.hammingDistance(it2.first) <= maxDistance) { | ||
matches.push_back(it2); | ||
} | ||
} | ||
} | ||
t2 = std::chrono::steady_clock::now(); | ||
elapsedSeconds = t2 - t1; | ||
double seconds = elapsedSeconds.count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to pull the timing functionality outside of the query method itself, so you have something like:
auto lookupFn = ...
// start timing
lookupFn(index, queries);
// End timing
size_t querySize = 1000; | ||
std::string method = "linear"; | ||
|
||
// Parse command line arguments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignorable: There are an abundance of command line helpers out there (I'm a fan of gflags), but none in the standard library, and I think it's better to handroll than to take another dependency.
for (int i = 0; i < numBitsToFlip; i++) { | ||
int wordIndex = wordDist(gen); | ||
int bitIndex = bitDist(gen); | ||
// Flip bit with xor | ||
noisy.w[wordIndex] ^= (1 << bitIndex); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Can you end up flipping the same bit twice and thus ending with less distance than you expect?
Why not do uniform random 256 and modulus instead of doing this in two passes?
A possible solution to blocker:
- Assert distance is less than some amount (64?)
- Copy the original
- Create another PDQ that represents the xor
- Repeat until X bits are flipped
- Randomly select an index in the xor
- If it's 0, flip it to 1
5 .return original XOR xor
Alternatively, create a sequence from 0-256 and then use https://en.cppreference.com/w/cpp/algorithm/ranges/sample, then flip those bits
@@ -0,0 +1,82 @@ | |||
# Benchmark for index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing a readme!
printf("METHOD: Linear query\n"); | ||
printf("QUERY COUNT: %d\n", (int)queries.size()); | ||
printf("INDEX COUNT: %d\n", (int)index.size()); | ||
printf("TOTAL MATCH COUNT: %d\n", (int)matches.size()); | ||
printf("TOTAL QUERY SECONDS: %.6lf\n", seconds); | ||
printf( | ||
"SECONDS PER QUERY: %.6lf\n", | ||
querySize > 0 ? seconds / querySize : 0); | ||
printf("\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Yeah, let's not force everyone to perfectly implement this print block, having this in shared code.
printf("TOTAL MATCH COUNT: %d\n", (int)matches.size()); | ||
printf("TOTAL QUERY SECONDS: %.6lf\n", seconds); | ||
printf( | ||
"SECONDS PER QUERY: %.6lf\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we want to do queries per second rather than seconds per query, given I think we'll be able to get at least 1 for most inputs, and our fastest approaches will have big numbers.
Summary
Resolve #1697
C++ query benchmark program for brute force linear scan and mih (
pdq/cpp/index/mih.h
)Benchmark implementation located in
pdq/cpp/bin/benchmark-query.cpp
Benchmark results in
pdq/cpp/index/README.md
To run:
Sample results (ran on Ubuntu 24.04.1 LTS, Intel Core i7-14700KF with 20 cores, 28 threads, 64GB RAM):