-
Notifications
You must be signed in to change notification settings - Fork 169
GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 #5345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
e9dab06
to
ff4a793
Compare
This is quite good. We could even get better if we don't materialize anything. |
Let me further explore the materialize-nothing solution to see how fast it can be. |
Maybe we can parameterize the cardinality() function with |
Ok.
|
My cardinalityExact implementation. protected long cardinalityExact(final long subj, final long pred, final long obj, final long context)
throws IOException {
// get size of entire db
if (subj == LmdbValue.UNKNOWN_ID && pred == LmdbValue.UNKNOWN_ID && obj == LmdbValue.UNKNOWN_ID
&& context == LmdbValue.UNKNOWN_ID) {
return txnManager.doWith((stack, txn) -> {
long cardinality = 0;
final TripleIndex index = getBestIndex(subj, pred, obj, context);
for (boolean explicit : new boolean[] { true, false }) {
int dbi = index.getDB(explicit);
MDBStat stat = MDBStat.mallocStack(stack);
mdb_stat(txn, dbi, stat);
cardinality += stat.ms_entries();
}
return cardinality;
});
}
try (TxnManager.Txn txn = txnManager.createReadTxn()) {
final RecordIterator explicitIter = getTriples(txn, subj, pred, obj, context, true);
final RecordIterator implicitIter = getTriples(txn, subj, pred, obj, context, false);
long size = 0;
try {
for (long[] quad = explicitIter.next(); quad != null; quad = explicitIter.next()) {
size++;
}
for (long[] quad = implicitIter.next(); quad != null; quad = implicitIter.next()) {
size++;
}
} finally {
explicitIter.close();
implicitIter.close();
}
return size;
}
} |
ff4a793
to
6b638e0
Compare
2be6b32
to
c8c107a
Compare
For changeset: If there are no approved or deprecated changes, it uses a fast direct lookup. Otherwise, it falls back to iterating over matching statements, which remains performant since it avoids LMDB’s lazy evaluation. @Override
public long size(final Resource subj, final IRI pred, final Value obj, final Resource... contexts) {
// Fast path: no approved or deprecated
if (!changes.hasApproved() && !changes.hasDeprecated()) {
return derivedFrom.size(subj, pred, obj, contexts);
}
// Fallback path: iterate over all matching statements; remains fast due to bypassing LMDB's lazy evaluation
return getStatements(subj, pred, obj, contexts).stream().count();
} |
c8c107a
to
8c4b328
Compare
Hi @odysa, the context size is also already tracked:
|
protected long calculateSize(final boolean includeInferred, final Resource... contexts) throws SailException { | ||
try (SailSource branch = branch(IncludeInferred.fromBoolean(includeInferred))) { | ||
return branch.dataset(getIsolationLevel()).size(null, null, null, contexts); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for why we need a new method here?
From a quick glance I saw that there is protected long sizeInternal(Resource... contexts) throws SailException {
which I wonder if would be a better place for this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This optimization is only for LMDB, so I made 2 different pathes for size.
LMDB gets size directly from banch.dataset
Other sails call getStatementsInternal
to count.
I don't want to accidently break other sails, so leave sizeInternal
unchanged.
8c4b328
to
87a524d
Compare
Done getting size from tracked context size |
return derivedFrom.size(subj, pred, obj, contexts); | ||
} | ||
|
||
// Fallback path: iterate over all matching statements; remains fast due to bypassing LMDB's lazy evaluation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is a bit misleading - it actually remains fast by using (some kind of) lazy evaluation.
Maybe just use:
// Fallback path: iterate over all matching statements
Besides my comment it looks good to me. |
I need to take a closer look. It would be best to not have to make any changes to the common sail code. We have the ElasticsearchStore that could also leverage a faster way to count by offloading it to Elasticsearch. Another aspect is to make sure that this new way of counting follows all the transactional isolation levels that are supported by the LMDB Store. Finally I would like to see if we can't add a benchmark to our current suite so that this new optimisation won't get forgotten when we make changes in the future. Im a bit pressed for time right now though. I'll do my best to prioritise this. |
Lemme add more test cases to it. |
87a524d
to
44a16d2
Compare
Added isolationLevel test. |
44a16d2
to
e4c1475
Compare
GitHub issue resolved: #
#5343
Briefly describe the changes proposed in this PR:
This PR introduces an optimization for the
size(...)
method in the LMDBstore implementation. Introduce acardinalityExact
to calcualte the exact size, leverageingmdb_stats
when possible.Key Changes
ChangeSet
Fast-path Optimization
approved
ordeprecated
changes in the current transaction or not in a transcation, the method directly delegates to thederivedFrom
store for fast cardinality estimation.Fallback to Iterator
getStatements(...).stream().count()
. This bypasses LMDB’s lazy evaluation to ensure consistency, even with uncommitted changes.Low-level Size Calculation (
cardinalityExact
)mdb_stat
to return the total size efficiently.Perf
I created a LMDBSail with 10M triples.

Original
size()
: 21802msOptimized
size()
:685.6 μs to get the full size by leveraging
mdb_stats
.274.2 ms to get the size of a context of 5000000 triples.
PR Author Checklist (see the contributor guidelines for more details):
mvn process-resources
to format from the command line)