Unexpected performance results for jvector on 100M dataset compared to on-disk knn

Is there any benchmark comparing the performance of the current jvector plugin with the [on-disk mode](https://docs.opensearch.org/latest/vector-search/optimizing-storage/disk-based-vector-search/) of the k-NN plugin? I ran a comparative test but did not observe a clear advantage for jvector in either performance or resource usage. Is this expected?
On the SIFT 1M dataset, I can see the benefits of jvector. However, at the 100M scale, jvector's performance is an order of magnitude worse than the on-disk approach, even under memory-constrained conditions with a high rate of page misses.
<img width="1189" height="790" alt="Image" src="https://github.com/user-attachments/assets/e8a325ac-0aae-4058-a84c-3b25494f4701" />

Additionally, during query stress testing under memory constraints, I observed that jvector's disk I/O throughput is an order of magnitude higher than that of the on-disk mode. jvector also uses more JVM heap and consumes an order of magnitude more storage space.
My understanding is that jvector should show a stronger advantage at very large data scales, but the results above do not support this. Are there other explanations for these results, or any suggestions?

Unexpected performance results for jvector on 100M dataset compared to on-disk knn #213