
GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 #5345


Open · wants to merge 1 commit into develop from feat/make-size-faster

Conversation

@odysa commented May 31, 2025

GitHub issue resolved: #5343
Briefly describe the changes proposed in this PR:

This PR introduces an optimization for the size(...) method in the LMDB store implementation. It adds a cardinalityExact method to calculate the exact size, leveraging mdb_stat when possible.

Key Changes

  • ChangeSet

    • Fast-path Optimization

      • When the current transaction has no approved or deprecated changes, or when no transaction is active, the method delegates directly to the derivedFrom store for fast cardinality estimation.
    • Fallback to Iterator

      • When changes exist in the transaction, the method falls back to streaming through matching statements using getStatements(...).stream().count(). This bypasses LMDB’s lazy evaluation to ensure consistency, even with uncommitted changes.
  • Low-level Size Calculation (cardinalityExact)

    • If the pattern is completely unspecified (i.e. all wildcards), the method uses LMDB's mdb_stat to return the total size efficiently.
    • For specific patterns, it iterates over both explicit and implicit triples and counts the results.
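A minimal, self-contained sketch of the two-level dispatch described above. All names here are hypothetical stand-ins, not the actual RDF4J/LMDB classes (the real implementation, which works on LMDB statement indexes, appears later in this thread); an in-memory list plays the role of the store.

```java
import java.util.ArrayList;
import java.util.List;

public class CardinalitySketch {
	static final long UNKNOWN_ID = -1; // wildcard marker, analogous to LmdbValue.UNKNOWN_ID

	final List<long[]> quads = new ArrayList<>(); // each entry: {subj, pred, obj, context}

	void add(long s, long p, long o, long c) {
		quads.add(new long[] { s, p, o, c });
	}

	long cardinalityExact(long subj, long pred, long obj, long context) {
		// Fast path: a fully unspecified pattern needs only a metadata
		// lookup, analogous to reading ms_entries from mdb_stat.
		if (subj == UNKNOWN_ID && pred == UNKNOWN_ID && obj == UNKNOWN_ID && context == UNKNOWN_ID) {
			return quads.size();
		}
		// Slow path: scan and count matching quads, analogous to iterating
		// the explicit and implicit indexes.
		long size = 0;
		for (long[] q : quads) {
			if ((subj == UNKNOWN_ID || q[0] == subj)
					&& (pred == UNKNOWN_ID || q[1] == pred)
					&& (obj == UNKNOWN_ID || q[2] == obj)
					&& (context == UNKNOWN_ID || q[3] == context)) {
				size++;
			}
		}
		return size;
	}
}
```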

Perf

I created an LMDBSail with 10M triples.
Original size(): 21802 ms

Optimized size():
685.6 μs to get the full size by leveraging mdb_stats.
274.2 ms to get the size of a context of 5000000 triples.

Total Size: 10000000, Time taken: 685.6 μs
Size in context: 5000000, Time taken: 274.2 ms
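The shape of these numbers (microseconds for the total, milliseconds for a per-context count) follows from the asymptotics: the total is a single metadata read, while counting within a context scans entries. A stdlib-only illustration of the same contrast, with a TreeMap standing in for an LMDB B+-tree index (the timings it prints are machine-dependent):

```java
import java.util.TreeMap;

public class StatVsScan {
	public static void main(String[] args) {
		// A sorted map standing in for an LMDB index; half of the
		// entries go to context 0, half to context 1.
		TreeMap<Long, long[]> index = new TreeMap<>();
		for (long i = 0; i < 1_000_000; i++) {
			index.put(i, new long[] { i, 0, 0, i % 2 });
		}

		long t0 = System.nanoTime();
		long total = index.size(); // O(1), like mdb_stat's ms_entries
		long t1 = System.nanoTime();

		long inContext = 0;
		for (long[] quad : index.values()) { // O(n), like a cursor scan
			if (quad[3] == 0) {
				inContext++;
			}
		}
		long t2 = System.nanoTime();

		System.out.printf("total=%d (%d ns), inContext=%d (%d ns)%n",
				total, t1 - t0, inContext, t2 - t1);
	}
}
```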

PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

@odysa odysa force-pushed the feat/make-size-faster branch from e9dab06 to ff4a793 Compare May 31, 2025 16:18
@odysa odysa changed the title GH-5343 Make LMDBSail Size 30x Faster 🚀🚀 GH-5343 Make LMDBSail Size() 30x Faster 🚀🚀 May 31, 2025
@kenwenzel (Contributor)

This is quite good. We could do even better if we don't materialize anything.
But for now we could/should stay with your solution?!

@odysa (Author) commented May 31, 2025

Let me further explore the materialize-nothing solution to see how fast it can be.

@kenwenzel (Contributor)

Let me further explore the materialize-nothing solution to see how fast it can be.

Maybe we can parameterize the cardinality() function with exact and reuse the existing code?

@odysa (Author) commented May 31, 2025

Ok.
Counting 10M triples without a context filter took only 685.6 μs.
Counting triples within a context requires scanning all entries. It took 274.2 ms to scan 5,000,000 triples.

Total Size: 10000000, Time taken: 685.6 μs
Size in context: 5000000, Time taken: 274.2 ms

@odysa (Author) commented May 31, 2025

My cardinalityExact implementation:

protected long cardinalityExact(final long subj, final long pred, final long obj, final long context)
		throws IOException {

	// Fully unspecified pattern: get the size of the entire DB via mdb_stat
	if (subj == LmdbValue.UNKNOWN_ID && pred == LmdbValue.UNKNOWN_ID && obj == LmdbValue.UNKNOWN_ID
			&& context == LmdbValue.UNKNOWN_ID) {
		return txnManager.doWith((stack, txn) -> {
			long cardinality = 0;
			final TripleIndex index = getBestIndex(subj, pred, obj, context);
			for (boolean explicit : new boolean[] { true, false }) {
				int dbi = index.getDB(explicit);
				MDBStat stat = MDBStat.mallocStack(stack);
				mdb_stat(txn, dbi, stat);
				cardinality += stat.ms_entries();
			}
			return cardinality;
		});
	}

	// Specific pattern: iterate over both explicit and implicit triples and count
	try (TxnManager.Txn txn = txnManager.createReadTxn()) {
		final RecordIterator explicitIter = getTriples(txn, subj, pred, obj, context, true);
		final RecordIterator implicitIter = getTriples(txn, subj, pred, obj, context, false);
		long size = 0;
		try {
			for (long[] quad = explicitIter.next(); quad != null; quad = explicitIter.next()) {
				size++;
			}
			for (long[] quad = implicitIter.next(); quad != null; quad = implicitIter.next()) {
				size++;
			}
		} finally {
			explicitIter.close();
			implicitIter.close();
		}
		return size;
	}
}

@odysa odysa force-pushed the feat/make-size-faster branch from ff4a793 to 6b638e0 Compare June 1, 2025 00:33
@odysa odysa changed the title GH-5343 Make LMDBSail Size() 30x Faster 🚀🚀 GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 Jun 1, 2025
@odysa odysa force-pushed the feat/make-size-faster branch 6 times, most recently from 2be6b32 to c8c107a Compare June 1, 2025 02:15
@odysa (Author) commented Jun 1, 2025

For the changeset:

If there are no approved or deprecated changes, it uses a fast direct lookup. Otherwise, it falls back to iterating over matching statements, which remains performant since it avoids LMDB’s lazy evaluation.

@Override
public long size(final Resource subj, final IRI pred, final Value obj, final Resource... contexts) {
	// Fast path: no approved or deprecated
	if (!changes.hasApproved() && !changes.hasDeprecated()) {
		return derivedFrom.size(subj, pred, obj, contexts);
	}

	// Fallback path: iterate over all matching statements; remains fast due to bypassing LMDB's lazy evaluation
	return getStatements(subj, pred, obj, contexts).stream().count();
}

@odysa odysa force-pushed the feat/make-size-faster branch from c8c107a to 8c4b328 Compare June 1, 2025 02:19
@kenwenzel (Contributor)

Hi @odysa,

the context size is also already tracked:

private void incrementContext(MemoryStack stack, long context) throws IOException {

Comment on lines 1038 to 1053
protected long calculateSize(final boolean includeInferred, final Resource... contexts) throws SailException {
	try (SailSource branch = branch(IncludeInferred.fromBoolean(includeInferred))) {
		return branch.dataset(getIsolationLevel()).size(null, null, null, contexts);
	}
}

Review comment (Contributor):

Any reason for why we need a new method here?

From a quick glance I saw that there is protected long sizeInternal(Resource... contexts) throws SailException {, which I wonder whether it would be a better place for this code?

Reply (Author):

This optimization is only for LMDB, so I made two different paths for size().
LMDB gets the size directly from branch.dataset.
Other sails call getStatementsInternal to count.

I don't want to accidentally break other sails, so I left sizeInternal unchanged.

@odysa odysa force-pushed the feat/make-size-faster branch from 8c4b328 to 87a524d Compare June 1, 2025 18:26
@odysa (Author) commented Jun 2, 2025

Done. The size is now obtained from the tracked context size.

return derivedFrom.size(subj, pred, obj, contexts);
}

// Fallback path: iterate over all matching statements; remains fast due to bypassing LMDB's lazy evaluation
Review comment (Contributor):

Comment is a bit misleading - it actually remains fast by using (some kind of) lazy evaluation.
Maybe just use:
// Fallback path: iterate over all matching statements

@kenwenzel (Contributor)

Besides my comment it looks good to me.
@hmottestad What do you think about the changes in the SAIL implementation?

@hmottestad (Contributor)

I need to take a closer look. It would be best to not have to make any changes to the common sail code. We have the ElasticsearchStore that could also leverage a faster way to count by offloading it to Elasticsearch.

Another aspect is to make sure that this new way of counting follows all the transactional isolation levels that are supported by the LMDB Store.

Finally I would like to see if we can't add a benchmark to our current suite so that this new optimisation won't get forgotten when we make changes in the future.

I'm a bit pressed for time right now though. I'll do my best to prioritise this.

@odysa (Author) commented Jun 3, 2025

Another aspect is to make sure that this new way of counting follows all the transactional isolation levels that are supported by the LMDB Store.

Lemme add more test cases to it.

@odysa odysa force-pushed the feat/make-size-faster branch from 87a524d to 44a16d2 Compare June 12, 2025 03:18
@odysa (Author) commented Jun 12, 2025

Added isolationLevel test.
Fixed the comment issue.

@odysa odysa requested review from kenwenzel and hmottestad June 12, 2025 03:18
@odysa odysa force-pushed the feat/make-size-faster branch from 44a16d2 to e4c1475 Compare June 15, 2025 01:34