
GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 #5345


Open · wants to merge 1 commit into develop from feat/make-size-faster

Conversation

@odysa commented May 31, 2025

GitHub issue resolved: #5343
Briefly describe the changes proposed in this PR:

This PR introduces an optimization for the size(...) method in the LMDB store implementation. It adds a cardinalityExact method to calculate the exact size, leveraging mdb_stat when possible.

Key Changes

  • ChangeSet

    • Fast-path Optimization

      • When the current transaction has no approved or deprecated changes, or when no transaction is active, the method delegates directly to the derivedFrom store for fast cardinality estimation.
    • Fallback to Iterator

      • When changes exist in the transaction, the method falls back to streaming through matching statements using getStatements(...).stream().count(). This bypasses LMDB’s lazy evaluation to ensure consistency, even with uncommitted changes.
  • Low-level Size Calculation (cardinalityExact)

    • If the pattern is completely unspecified (i.e. all wildcards), the method uses LMDB's mdb_stat to return the total size efficiently.
    • For specific patterns, it iterates over both explicit and implicit triples and counts the results.
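A minimal, self-contained sketch of the two-level dispatch described above. All names here are hypothetical stand-ins, not the actual RDF4J/LMDB classes (the real implementation, which works on LMDB statement indexes, appears later in this thread); an in-memory list plays the role of the store.

```java
import java.util.ArrayList;
import java.util.List;

public class CardinalitySketch {
	static final long UNKNOWN_ID = -1; // wildcard marker, analogous to LmdbValue.UNKNOWN_ID

	final List<long[]> quads = new ArrayList<>(); // each entry: {subj, pred, obj, context}

	void add(long s, long p, long o, long c) {
		quads.add(new long[] { s, p, o, c });
	}

	long cardinalityExact(long subj, long pred, long obj, long context) {
		// Fast path: a fully unspecified pattern needs only a metadata
		// lookup, analogous to reading ms_entries from mdb_stat.
		if (subj == UNKNOWN_ID && pred == UNKNOWN_ID && obj == UNKNOWN_ID && context == UNKNOWN_ID) {
			return quads.size();
		}
		// Slow path: scan and count matching quads, analogous to iterating
		// the explicit and implicit indexes.
		long size = 0;
		for (long[] q : quads) {
			if ((subj == UNKNOWN_ID || q[0] == subj)
					&& (pred == UNKNOWN_ID || q[1] == pred)
					&& (obj == UNKNOWN_ID || q[2] == obj)
					&& (context == UNKNOWN_ID || q[3] == context)) {
				size++;
			}
		}
		return size;
	}
}
```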

Perf

I created an LMDBSail with 10M triples.
Original size(): 21802 ms

Optimized size():
685.6 μs to get the full size by leveraging mdb_stats.
274.2 ms to get the size of a context of 5000000 triples.

Total Size: 10000000, Time taken: 685.6 μs
Size in context: 5000000, Time taken: 274.2 ms
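The shape of these numbers (microseconds for the total, milliseconds for a per-context count) follows from the asymptotics: the total is a single metadata read, while counting within a context scans entries. A stdlib-only illustration of the same contrast, with a TreeMap standing in for an LMDB B+-tree index (the timings it prints are machine-dependent):

```java
import java.util.TreeMap;

public class StatVsScan {
	public static void main(String[] args) {
		// A sorted map standing in for an LMDB index; half of the
		// entries go to context 0, half to context 1.
		TreeMap<Long, long[]> index = new TreeMap<>();
		for (long i = 0; i < 1_000_000; i++) {
			index.put(i, new long[] { i, 0, 0, i % 2 });
		}

		long t0 = System.nanoTime();
		long total = index.size(); // O(1), like mdb_stat's ms_entries
		long t1 = System.nanoTime();

		long inContext = 0;
		for (long[] quad : index.values()) { // O(n), like a cursor scan
			if (quad[3] == 0) {
				inContext++;
			}
		}
		long t2 = System.nanoTime();

		System.out.printf("total=%d (%d ns), inContext=%d (%d ns)%n",
				total, t1 - t0, inContext, t2 - t1);
	}
}
```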

PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

@odysa odysa force-pushed the feat/make-size-faster branch from e9dab06 to ff4a793 Compare May 31, 2025 16:18
@odysa odysa changed the title GH-5343 Make LMDBSail Size 30x Faster 🚀🚀 GH-5343 Make LMDBSail Size() 30x Faster 🚀🚀 May 31, 2025
@kenwenzel (Contributor)

This is quite good. We could do even better if we don't materialize anything.
But for now we could/should stay with your solution?!

@odysa (Author) commented May 31, 2025

Let me further explore the materialize-nothing solution to see how fast it can be.

@kenwenzel (Contributor)

Let me further explore the materialize-nothing solution to see how fast it can be.

Maybe we can parameterize the cardinality() function with exact and reuse the existing code?

@odysa (Author) commented May 31, 2025

Ok.
Counting 10M triples without a context filter took only 685.6 μs.
Counting triples within a context requires scanning all entries. It took 274.2 ms to scan 5,000,000 triples.

Total Size: 10000000, Time taken: 685.6 μs
Size in context: 5000000, Time taken: 274.2 ms

@odysa (Author) commented May 31, 2025

My cardinalityExact implementation:

protected long cardinalityExact(final long subj, final long pred, final long obj, final long context)
		throws IOException {

	// Fully unspecified pattern: get the size of the entire DB via mdb_stat
	if (subj == LmdbValue.UNKNOWN_ID && pred == LmdbValue.UNKNOWN_ID && obj == LmdbValue.UNKNOWN_ID
			&& context == LmdbValue.UNKNOWN_ID) {
		return txnManager.doWith((stack, txn) -> {
			long cardinality = 0;
			final TripleIndex index = getBestIndex(subj, pred, obj, context);
			for (boolean explicit : new boolean[] { true, false }) {
				int dbi = index.getDB(explicit);
				MDBStat stat = MDBStat.mallocStack(stack);
				mdb_stat(txn, dbi, stat);
				cardinality += stat.ms_entries();
			}
			return cardinality;
		});
	}

	// Specific pattern: iterate over both explicit and implicit triples and count
	try (TxnManager.Txn txn = txnManager.createReadTxn()) {
		final RecordIterator explicitIter = getTriples(txn, subj, pred, obj, context, true);
		final RecordIterator implicitIter = getTriples(txn, subj, pred, obj, context, false);
		long size = 0;
		try {
			for (long[] quad = explicitIter.next(); quad != null; quad = explicitIter.next()) {
				size++;
			}
			for (long[] quad = implicitIter.next(); quad != null; quad = implicitIter.next()) {
				size++;
			}
		} finally {
			explicitIter.close();
			implicitIter.close();
		}
		return size;
	}
}

@odysa odysa force-pushed the feat/make-size-faster branch from ff4a793 to 6b638e0 Compare June 1, 2025 00:33
@odysa odysa changed the title GH-5343 Make LMDBSail Size() 30x Faster 🚀🚀 GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 Jun 1, 2025
@odysa odysa force-pushed the feat/make-size-faster branch 6 times, most recently from 2be6b32 to c8c107a Compare June 1, 2025 02:15
@odysa (Author) commented Jun 1, 2025

For the changeset:

If there are no approved or deprecated changes, it uses a fast direct lookup. Otherwise, it falls back to iterating over matching statements, which remains performant since it avoids LMDB’s lazy evaluation.

@Override
public long size(final Resource subj, final IRI pred, final Value obj, final Resource... contexts) {
	// Fast path: no approved or deprecated
	if (!changes.hasApproved() && !changes.hasDeprecated()) {
		return derivedFrom.size(subj, pred, obj, contexts);
	}

	// Fallback path: iterate over all matching statements; remains fast due to bypassing LMDB's lazy evaluation
	return getStatements(subj, pred, obj, contexts).stream().count();
}

@odysa odysa force-pushed the feat/make-size-faster branch from c8c107a to 8c4b328 Compare June 1, 2025 02:19
@kenwenzel (Contributor)

Hi @odysa,

the context size is also already tracked:

private void incrementContext(MemoryStack stack, long context) throws IOException {

Comment on lines 1038 to 1053
protected long calculateSize(final boolean includeInferred, final Resource... contexts) throws SailException {
	try (SailSource branch = branch(IncludeInferred.fromBoolean(includeInferred))) {
		return branch.dataset(getIsolationLevel()).size(null, null, null, contexts);
	}
}

Review comment (Contributor):

Any reason for why we need a new method here?

From a quick glance I saw that there is protected long sizeInternal(Resource... contexts) throws SailException {, which I wonder whether it would be a better place for this code?

Reply (Author):

This optimization is only for LMDB, so I made two different paths for size().
LMDB gets the size directly from branch.dataset.
Other sails call getStatementsInternal to count.

I don't want to accidentally break other sails, so I left sizeInternal unchanged.

@odysa odysa force-pushed the feat/make-size-faster branch from 8c4b328 to 87a524d Compare June 1, 2025 18:26
@odysa (Author) commented Jun 2, 2025

Done. The size is now obtained from the tracked context size.

return derivedFrom.size(subj, pred, obj, contexts);
}

// Fallback path: iterate over all matching statements; remains fast due to bypassing LMDB's lazy evaluation
Review comment (Contributor):

Comment is a bit misleading - it actually remains fast by using (some kind of) lazy evaluation.
Maybe just use:
// Fallback path: iterate over all matching statements

@kenwenzel (Contributor)

Besides my comment it looks good to me.
@hmottestad What do you think about the changes in the SAIL implementation?

@hmottestad (Contributor)

I need to take a closer look. It would be best to not have to make any changes to the common sail code. We have the ElasticsearchStore that could also leverage a faster way to count by offloading it to Elasticsearch.

Another aspect is to make sure that this new way of counting follows all the transactional isolation levels that are supported by the LMDB Store.

Finally I would like to see if we can't add a benchmark to our current suite so that this new optimisation won't get forgotten when we make changes in the future.

I'm a bit pressed for time right now though. I'll do my best to prioritise this.

@odysa (Author) commented Jun 3, 2025

Another aspect is to make sure that this new way of counting follows all the transactional isolation levels that are supported by the LMDB Store.

Lemme add more test cases to it.

@odysa odysa force-pushed the feat/make-size-faster branch from 87a524d to 44a16d2 Compare June 12, 2025 03:18
@odysa (Author) commented Jun 12, 2025

Added isolationLevel test.
Fixed the comment issue.

@odysa odysa requested review from kenwenzel and hmottestad June 12, 2025 03:18
@odysa odysa force-pushed the feat/make-size-faster branch from 44a16d2 to e4c1475 Compare June 15, 2025 01:34