Add CBOR document format support for gRPC bulk operations #1063

Draft
finnegancarroll wants to merge 2 commits into opensearch-project:main from finnegancarroll:feature/cbor-document-format

@finnegancarroll finnegancarroll commented May 8, 2026

Description

Adds support for encoding documents in CBOR binary format when using gRPC bulk ingestion. CBOR provides more compact encoding than JSON, particularly for vector/float-heavy documents (~57% smaller for 768-dim vectors).
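The size difference comes from how each format stores a float: JSON writes every value as decimal text, while CBOR can store it as a fixed-width binary number. The sketch below is a hand-rolled, standard-library-only illustration of the CBOR array and single-precision float layout from RFC 8949. It is not the PR's code and does not use cbor2; the function name `cbor_float32_array` is hypothetical. It assumes float32 precision is acceptable for the vector, which loses precision relative to Python's 64-bit floats. The JSON size depends on how many digits each float is printed with, so the ratio it prints will not necessarily match the ~57% figure above.

```python
import json
import random
import struct


def cbor_float32_array(values):
    """Encode a list of floats as a CBOR array of single-precision floats.

    Hypothetical helper for illustration only (not part of the PR or cbor2).
    """
    n = len(values)
    # Major type 4 (array): short lengths fit in the initial byte,
    # longer lengths follow in 1, 2, or 4 big-endian bytes.
    if n < 24:
        head = bytes([0x80 | n])
    elif n < 2**8:
        head = bytes([0x98, n])
    elif n < 2**16:
        head = b"\x99" + struct.pack(">H", n)
    elif n < 2**32:
        head = b"\x9a" + struct.pack(">I", n)
    else:
        raise ValueError("array too long for this sketch")
    # 0xfa marks an IEEE 754 single-precision float (4 big-endian bytes).
    body = b"".join(b"\xfa" + struct.pack(">f", v) for v in values)
    return head + body


random.seed(0)
vector = [random.uniform(-1.0, 1.0) for _ in range(768)]

json_size = len(json.dumps(vector, separators=(",", ":")).encode("utf-8"))
cbor_size = len(cbor_float32_array(vector))
print(f"JSON: {json_size} bytes, CBOR float32: {cbor_size} bytes")
```

With this layout a 768-dim vector always costs 3 header bytes plus 5 bytes per element (3,843 bytes), regardless of the values, whereas the JSON size grows with the number of printed digits.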

Changes:

  • Add cbor2 dependency
  • ProtoBulkHelper: add 'document-format' param (json|cbor) for NDJSON bodies
  • ProtoVectorBulkHelper: new helper for dict-list bodies (vector datasets)
  • ProtoBulkVectorDataSet: new runner + operation type for gRPC vector ingestion
  • Register param source for new operation type

Usage in workload operations:

  {"operation-type": "proto-bulk", "document-format": "cbor", ...}
  {"operation-type": "proto-bulk-vector-data-set", "document-format": "cbor", ...}
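As a rough illustration of how a runner might read this parameter, the sketch below validates 'document-format' against the two values named in the change list. The function name `resolve_document_format` and the choice of "json" as the default when the key is absent are assumptions made for this example; the PR text does not state the default or show the actual lookup code.

```python
# Values taken from the change list above: (json|cbor).
SUPPORTED_FORMATS = ("json", "cbor")


def resolve_document_format(operation):
    """Return the document format requested by a workload operation dict.

    Hypothetical helper: defaulting to "json" is an assumption, not
    something the PR description confirms.
    """
    fmt = str(operation.get("document-format", "json")).lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported document-format {fmt!r}; expected one of {SUPPORTED_FORMATS}"
        )
    return fmt


operation = {"operation-type": "proto-bulk", "document-format": "cbor"}
print(resolve_document_format(operation))
```

Failing fast on an unknown value keeps a typo in the workload file from silently falling back to JSON and skewing a benchmark comparison.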

Benchmark results (5-node r5.xlarge cluster, real network):

  • http_logs: gRPC JSON 243K docs/s vs gRPC CBOR 132K docs/s (CBOR slower for small text docs due to server-side parsing overhead)
  • vectors (768-dim): gRPC JSON 2,578 docs/s vs gRPC CBOR 3,228 docs/s (CBOR 25% faster throughput, 34% less wall-clock time)
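The relative throughput changes can be recomputed from the reported docs/s figures. This is only arithmetic on the numbers above; the helper name `percent_change` is made up for the example, and the 34% wall-clock figure is reported by the PR rather than derived here.

```python
def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Throughput in docs/s, as reported in the benchmark results above.
http_logs_change = percent_change(243_000, 132_000)
vectors_change = percent_change(2_578, 3_228)

print(f"http_logs: CBOR vs JSON throughput {http_logs_change:+.1f}%")
print(f"vectors:   CBOR vs JSON throughput {vectors_change:+.1f}%")
```

This gives roughly -45.7% for http_logs and +25.2% for the 768-dim vectors, consistent with the "25% faster throughput" stated above.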

Issues Resolved

N/A

Testing

  • New functionality includes testing

Local testing and executing against a CDK cluster


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Finn Carroll <carrofin@amazon.com>

github-actions Bot commented May 8, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 6b0f8a8.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| setup.py | 140 | high | New dependency 'cbor2==6.0.1' added. Per mandatory supply chain policy, all dependency additions must be flagged regardless of apparent legitimacy. Maintainers should verify the package name, version, and PyPI artifact hash before merging. |

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 1 | Medium: 0 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

Signed-off-by: Finn Carroll <carrofin@amazon.com>

New dependency added: cbor2>=5.6.0. Per mandatory supply chain review policy, all dependency additions must be flagged for maintainer verification regardless of apparent legitimacy. Maintainers should confirm the package name, version range, and PyPI artifact integrity before merging.

I added this new dependency, which needs maintainer review. It is the only package I found that provides the CBOR binary format (I could not find anything supporting Smile in Python). It is MIT licensed.

PyPI: https://pypi.org/project/cbor2/
Source: https://github.com/agronholm/cbor2

Pinning to most recent version 6.0.1.

Hi @rishabh6788 @OVI3D0, can you take a look when you have a chance?
I've added a new dependency for serializing documents to CBOR, which is flagged for review here.
