Add CBOR document format support for gRPC bulk operations #1063
finnegancarroll wants to merge 2 commits into
Adds support for encoding documents in CBOR binary format when using
gRPC bulk ingestion. CBOR provides more compact encoding than JSON,
particularly for vector/float-heavy documents (~57% smaller for
768-dim vectors).
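To get a feel for where the size difference comes from, here is a minimal stdlib-only sketch. It is not the code in this PR (which uses cbor2); it hand-encodes a CBOR array of double-precision floats following RFC 8949 and compares the result with `json.dumps`. The random vector is made-up data, so the exact saving will differ from the ~57% measured above.

```python
import json
import random
import struct


def cbor_encode_float_array(values):
    """Encode a list of floats as a CBOR array of float64 items (RFC 8949)."""
    n = len(values)
    if n < 24:
        header = bytes([0x80 | n])               # length fits in the initial byte
    elif n < 256:
        header = bytes([0x98, n])                # 1-byte length follows
    elif n < 65536:
        header = b"\x99" + struct.pack(">H", n)  # 2-byte length follows
    else:
        raise ValueError("array too long for this sketch")
    # 0xfb marks a double-precision float, followed by 8 big-endian bytes.
    body = b"".join(b"\xfb" + struct.pack(">d", v) for v in values)
    return header + body


random.seed(0)  # made-up data, not the benchmark dataset
vector = [random.uniform(-1.0, 1.0) for _ in range(768)]

json_bytes = json.dumps(vector).encode("utf-8")
cbor_bytes = cbor_encode_float_array(vector)
saving = 1 - len(cbor_bytes) / len(json_bytes)
print(f"JSON: {len(json_bytes)} B, CBOR: {len(cbor_bytes)} B, saving: {saving:.0%}")
```

Each component costs a fixed 9 bytes in this encoding, while its JSON text form typically needs around 20 characters including the separator, which is why float-heavy documents benefit most.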
Changes:
- Add cbor2 dependency
- ProtoBulkHelper: add 'document-format' param (json|cbor) for NDJSON bodies
- ProtoVectorBulkHelper: new helper for dict-list bodies (vector datasets)
- ProtoBulkVectorDataSet: new runner + operation type for gRPC vector ingestion
- Register param source for new operation type
Usage in workload operations:
{"operation-type": "proto-bulk", "document-format": "cbor", ...}
{"operation-type": "proto-bulk-vector-data-set", "document-format": "cbor", ...}
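As a rough illustration of how a `document-format` parameter could drive encoding of an NDJSON bulk body, consider the sketch below. The function names `resolve_document_format` and `encode_documents` are hypothetical and are not the helpers added in this PR; the JSON default and the strict action/document alternation are assumptions. The CBOR encoder is passed in as a callable (for example `cbor2.dumps`) so the sketch itself needs only the standard library.

```python
import json
from typing import Callable, List, Optional

SUPPORTED_FORMATS = ("json", "cbor")


def resolve_document_format(params: dict) -> str:
    """Read 'document-format' from operation params (the JSON default is an assumption)."""
    fmt = params.get("document-format", "json")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported document-format {fmt!r}; expected one of {SUPPORTED_FORMATS}"
        )
    return fmt


def encode_documents(
    ndjson_body: str,
    fmt: str,
    cbor_dumps: Optional[Callable[[object], bytes]] = None,
) -> List[bytes]:
    """Return the encoded document lines of an NDJSON bulk body.

    Assumes every action line is followed by exactly one document line
    (index or create actions only).
    """
    lines = [ln for ln in ndjson_body.splitlines() if ln.strip()]
    docs = lines[1::2]  # odd positions hold the document sources
    if fmt == "json":
        return [ln.encode("utf-8") for ln in docs]
    if cbor_dumps is None:
        raise ValueError("a CBOR encoder (for example cbor2.dumps) is required")
    return [cbor_dumps(json.loads(ln)) for ln in docs]


body = (
    '{"index": {"_index": "logs"}}\n{"status": 200}\n'
    '{"index": {"_index": "logs"}}\n{"status": 404}\n'
)
fmt = resolve_document_format({"operation-type": "proto-bulk", "document-format": "json"})
print(encode_documents(body, fmt))
```

Passing the encoder in keeps the format switch testable without the binary dependency installed; with cbor2 available, the call would be `encode_documents(body, "cbor", cbor2.dumps)`.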
Benchmark results (5-node r5.xlarge cluster, real network):
- http_logs: gRPC JSON 243K docs/s vs gRPC CBOR 132K docs/s
(CBOR slower for small text docs due to server-side parsing overhead)
- vectors (768-dim): gRPC JSON 2,578 docs/s vs gRPC CBOR 3,228 docs/s
(CBOR 25% faster throughput, 34% less wall-clock time)
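For reference, the relative throughput changes can be recomputed from the numbers above with a few lines of Python. The helper name `relative_change` is just for illustration. The 34% wall-clock figure is a separate measurement and cannot be derived from the throughput values alone.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change of candidate relative to baseline."""
    return candidate / baseline - 1.0


# Throughput in docs/s, taken from the benchmark results above.
http_logs = relative_change(243_000, 132_000)  # gRPC JSON -> gRPC CBOR
vectors = relative_change(2_578, 3_228)        # gRPC JSON -> gRPC CBOR

print(f"http_logs: {http_logs:+.1%}")  # -45.7%
print(f"vectors:   {vectors:+.1%}")    # +25.2%
```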
Signed-off-by: Finn Carroll <carrofin@amazon.com>
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit 6b0f8a8.
The table above displays the top 10 most important findings. Pull Request author(s): please update your pull request according to the report above. Thanks.
This PR adds a new dependency, cbor2, which needs maintainer review. It is the only package I found that provides the CBOR binary format (I could not find anything supporting Smile in Python). It is MIT licensed. PyPI: https://pypi.org/project/cbor2/ Pinned to the most recent version, 6.0.1.
Hi @rishabh6788 @OVI3D0, can you take a look when you have a chance?
Issues Resolved
N/A
Testing
Tested locally and by running against a CDK cluster.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.