Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark for C++ PDQ index #1699

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Conversation

16BitNarwhal
Copy link
Contributor

@16BitNarwhal 16BitNarwhal commented Nov 28, 2024

Summary

Resolve #1697

C++ query benchmark program for brute force linear scan and mih (pdq/cpp/index/mih.h)
Benchmark implementation located in pdq/cpp/bin/benchmark-query.cpp
Benchmark results in pdq/cpp/index/README.md

To run:

cd pdq/cpp
make benchmark-query
./benchmark-query

Sample results (ran on Ubuntu 24.04.1 LTS, Intel Core i7-14700KF with 20 cores, 28 threads, 64GB RAM):

$ ./benchmark-query -m linear
METHOD: Linear query
QUERY COUNT:             1000
INDEX COUNT:             10000
TOTAL MATCH COUNT:       1000
TOTAL QUERY SECONDS:     0.258911
SECONDS PER QUERY:       0.000259

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only some minor inline comments that I think will block extension. Otherwise looks good, thanks for doing it and for doing a README!

Comment on lines +43 to +48
// Helper declarations
static facebook::pdq::hashing::Hash256 generateRandomHash(std::mt19937& gen);
static facebook::pdq::hashing::Hash256 addNoise(
const facebook::pdq::hashing::Hash256& original,
int numBitsToFlip,
std::mt19937& gen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem like generally useful function, and might be worth including in some kind of pdq/utils method. Some of your peers have ended up building the same functions in python.

fprintf(fp, "Usage: %s [options]\n", argv0);
fprintf(fp, "Options:\n");
fprintf(fp, " -v Verbose output\n");
fprintf(fp, " --seed N Random seed (default: 41)\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for providing a way to make this reproduceable

Comment on lines +1 to +3
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally thought you were including this for getopt. I'm a c++ dummy, what functions/functionalities are you including from GNU?

maxDistance, verbose, seed, indexSize, querySize, queries, index);
} else if (method == "mih") {
queryMIH(maxDistance, verbose, seed, indexSize, querySize, queries, index);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsure: Is there an else fail missing here?

Comment on lines +209 to +224
std::vector<std::pair<facebook::pdq::hashing::Hash256, std::string>> matches;

std::chrono::time_point<std::chrono::steady_clock> t1, t2;
std::chrono::duration<double> elapsedSeconds;

t1 = std::chrono::steady_clock::now();
for (const auto& it : queries) {
for (const auto& it2 : index) {
if (it.first.hammingDistance(it2.first) <= maxDistance) {
matches.push_back(it2);
}
}
}
t2 = std::chrono::steady_clock::now();
elapsedSeconds = t2 - t1;
double seconds = elapsedSeconds.count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to pull the timing functionality outside of the query method itself, so you have something like:

auto lookupFn = ...

// start timing
lookupFn(index, queries);
// End timing

size_t querySize = 1000;
std::string method = "linear";

// Parse command line arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignorable: There are an abundance of command line helpers out there (I'm a fan of gflags), but none in the standard library, and I think it's better to handroll than to take another dependency.

Comment on lines +316 to +321
for (int i = 0; i < numBitsToFlip; i++) {
int wordIndex = wordDist(gen);
int bitIndex = bitDist(gen);
// Flip bit with xor
noisy.w[wordIndex] ^= (1 << bitIndex);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Can you end up flipping the same bit twice and thus ending with less distance than you expect?

Why not do uniform random 256 and modulus instead of doing this in two passes?

A possible solution to blocker:

  1. Assert distance is less than some amount (64?)
  2. Copy the original
  3. Create another PDQ that represents the xor
  4. Repeat until X bits are flipped
    1. Randomly select an index in the xor
    2. If it's 0, flip it to 1
      5 .return original XOR xor

Alternatively, create a sequence from 0-256 and then use https://en.cppreference.com/w/cpp/algorithm/ranges/sample, then flip those bits

@@ -0,0 +1,82 @@
# Benchmark for index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing a readme!

Comment on lines +226 to +234
printf("METHOD: Linear query\n");
printf("QUERY COUNT: %d\n", (int)queries.size());
printf("INDEX COUNT: %d\n", (int)index.size());
printf("TOTAL MATCH COUNT: %d\n", (int)matches.size());
printf("TOTAL QUERY SECONDS: %.6lf\n", seconds);
printf(
"SECONDS PER QUERY: %.6lf\n",
querySize > 0 ? seconds / querySize : 0);
printf("\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Yeah, let's not force everyone to perfectly implement this print block, having this in shared code.

printf("TOTAL MATCH COUNT: %d\n", (int)matches.size());
printf("TOTAL QUERY SECONDS: %.6lf\n", seconds);
printf(
"SECONDS PER QUERY: %.6lf\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we want to do queries per second rather than seconds per query, given I think we'll be able to get at least 1 for most inputs, and our fastest approaches will have big numbers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[pdq][mlh] Add querying benchmark to /bin
4 participants