perf: Add memory optimizations for JSON encoding and websocket buffers #489

jorgecuesta · 2025-11-28T17:34:14Z

Summary

- Add sync.Pool-based JSON buffer pool for EVM responses to reduce allocations
  - Pre-allocates 1KB buffers (covers ~79% of responses per production metrics)
  - Buffers grow automatically for larger responses and get reused via pool
- Make websocket message observation buffer size configurable via router config
  - Default set to 100 (reduced from 1000 to prevent OOM: ~30MB vs 300MB at 100 connections)
  - Exposed as websocket_message_buffer_size in router YAML config
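The buffer pool described above can be sketched roughly as follows. This is a minimal illustration of the `sync.Pool` pattern, not the code from this PR: `jsonBufferPool`, `marshalPooled`, and `initialBufferSize` are hypothetical names, and only the 1KB starting capacity comes from the summary.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sync"
)

// initialBufferSize mirrors the 1KB pre-allocation mentioned in the summary.
const initialBufferSize = 1024

// jsonBufferPool (hypothetical name) hands out reusable buffers.
// New only runs when the pool has no buffer to offer.
var jsonBufferPool = sync.Pool{
	New: func() any {
		return bytes.NewBuffer(make([]byte, 0, initialBufferSize))
	},
}

// marshalPooled (hypothetical helper) encodes v into a pooled buffer and
// returns a copy of the bytes, so the buffer can go back to the pool safely.
func marshalPooled(v any) ([]byte, error) {
	buf := jsonBufferPool.Get().(*bytes.Buffer)
	buf.Reset() // a reused buffer may still hold a previous payload
	defer jsonBufferPool.Put(buf)

	if err := json.NewEncoder(buf).Encode(v); err != nil {
		return nil, err
	}
	// Encoder.Encode appends a newline; trim it to match json.Marshal output.
	out := bytes.TrimRight(buf.Bytes(), "\n")
	// Copy before returning: the pooled buffer's memory will be overwritten.
	return append([]byte(nil), out...), nil
}

func main() {
	b, err := marshalPooled(map[string]any{"jsonrpc": "2.0", "id": 1, "result": "0x1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

A buffer that grew for a large response keeps its larger capacity when it returns to the pool, which is how "grow automatically and get reused" plays out in this sketch. A real implementation might also drop oversized buffers instead of pooling them; whether this PR does so is not stated.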

Test plan

- Unit tests pass (make test_unit)
- Go linter passes (make go_lint)
- E2E tests pass (make e2e_test eth, 90.33% success rate)
- Race detector passes (go test -race ./qos/evm/...)
- JSON pool tests added with buffer reuse and large payload tests
- Config tests updated for new WebsocketMessageBufferSize field
- Validated buffer size (1KB) against production metrics showing ~79% coverage

Addresses memory exhaustion issues causing 12GB RAM OOM crashes:

1. Add 100MB request body size limits (supports Solana's ~75MB blocks)
2. Cap endpoint observations per request (uses MaxConcurrentRelaysPerRequest)
3. Reduce WebSocket observation channel buffer from 1000 to 50
4. Add hydrator graceful shutdown with context cancellation
5. Add 30s timeouts to hydrator operations
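One common way to enforce a request body size limit in Go is `http.MaxBytesReader`. The sketch below assumes that approach; the commit message does not say how the limit is implemented. `readLimitedBody`, `isBodyTooLarge`, and `readTestBody` are hypothetical helpers, and `http.MaxBytesError` requires Go 1.19 or later.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// maxRequestBodyBytes is the 100MB limit named in the commit message.
const maxRequestBodyBytes int64 = 100 << 20

// readLimitedBody (hypothetical helper) reads the request body but fails
// once more than limit bytes have been consumed.
func readLimitedBody(w http.ResponseWriter, r *http.Request, limit int64) ([]byte, error) {
	r.Body = http.MaxBytesReader(w, r.Body, limit)
	return io.ReadAll(r.Body)
}

// isBodyTooLarge reports whether err came from exceeding the body limit.
func isBodyTooLarge(err error) bool {
	var maxErr *http.MaxBytesError
	return errors.As(err, &maxErr)
}

// readTestBody builds an in-memory request so the limit can be exercised
// without a running server. It returns the number of bytes read.
func readTestBody(body string, limit int64) (int, error) {
	req := httptest.NewRequest(http.MethodPost, "/", strings.NewReader(body))
	data, err := readLimitedBody(httptest.NewRecorder(), req, limit)
	return len(data), err
}

func main() {
	n, err := readTestBody("0123456789", maxRequestBodyBytes)
	fmt.Println(n, err)

	_, err = readTestBody("0123456789", 4)
	fmt.Println("too large:", isBodyTooLarge(err))
}
```

In a handler, a too-large body would typically be answered with `http.StatusRequestEntityTooLarge` rather than being buffered in full.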

Additional fixes completing the OOM prevention release:

- 2.3 Session rollover: Add context for graceful shutdown of block height monitor
- 2.4 Observation goroutines: Add 30s timeout to prevent indefinite hanging
- 2.5 time.After leak: Replace with time.NewTimer + defer Stop()
- 2.6 WebSocket cleanup: Close client connection if endpoint connection fails
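Item 2.5 refers to a well-known pattern: before Go 1.23, a timer created by `time.After` could not be garbage collected until it fired, so a `select` that usually took the other branch left timers behind. A minimal sketch of the replacement, with `waitWithTimeout` and `errTimeout` as hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("timed out waiting for result")

// waitWithTimeout (hypothetical helper) waits for one value or gives up
// after timeout. The timer is stopped on return, so it is released
// immediately even when the result arrives first.
func waitWithTimeout(results <-chan string, timeout time.Duration) (string, error) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case r := <-results:
		return r, nil
	case <-timer.C:
		return "", errTimeout
	}
}

func main() {
	ready := make(chan string, 1)
	ready <- "observation"
	fmt.Println(waitWithTimeout(ready, 30*time.Second))

	never := make(chan string)
	fmt.Println(waitWithTimeout(never, 10*time.Millisecond))
}
```

The same shape covers item 2.4: a goroutine that waits through this helper cannot hang indefinitely on a channel that never delivers.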

Per JSON-RPC 2.0 spec (https://www.jsonrpc.org/specification), responses with null IDs are valid for error cases when the server couldn't parse the request ID. This is documented in Section 5 - Response object: "If there was an error in detecting the id in the Request object (e.g. Parse error/Invalid Request), it MUST be Null."

Changes:

- Update validateResponseIDs to treat null ID responses as "wildcards" that can match unmatched request IDs
- Update createResponseObservations to skip null ID responses gracefully with debug logging instead of error logging
- Downgrade "could not find request for response ID" from error to warn
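The wildcard idea can be illustrated with a small matcher. This is a sketch under assumptions, not the PR's `validateResponseIDs`: IDs are simplified to strings, a nil pointer stands for a JSON `null` ID, and the rule that the number of null responses must equal the number of unmatched requests is a guess at reasonable behavior. `matchResponseIDs` and `strPtr` are hypothetical names.

```go
package main

import "fmt"

// strPtr is a small helper for building non-null IDs in examples.
func strPtr(s string) *string { return &s }

// matchResponseIDs (hypothetical) checks a batch of response IDs against
// the request IDs. A nil entry represents a JSON null ID and acts as a
// wildcard for any request that received no explicit match.
func matchResponseIDs(requestIDs []string, responseIDs []*string) error {
	pending := make(map[string]struct{}, len(requestIDs))
	for _, id := range requestIDs {
		pending[id] = struct{}{}
	}

	nullCount := 0
	for _, id := range responseIDs {
		if id == nil {
			nullCount++ // resolved against leftovers below
			continue
		}
		if _, ok := pending[*id]; !ok {
			return fmt.Errorf("response ID %q does not match any pending request", *id)
		}
		delete(pending, *id)
	}

	// Assumption: every unmatched request must be covered by exactly one null response.
	if nullCount != len(pending) {
		return fmt.Errorf("%d null-ID responses cannot cover %d unmatched requests", nullCount, len(pending))
	}
	return nil
}

func main() {
	// Request "2" could not be parsed by the backend, so its error came back with a null ID.
	fmt.Println(matchResponseIDs([]string{"1", "2"}, []*string{strPtr("1"), nil}))
	// An unknown explicit ID is still rejected.
	fmt.Println(matchResponseIDs([]string{"1"}, []*string{strPtr("9")}))
}
```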

Preserves original HTTP status codes from backend endpoints instead of transforming them based on JSON-RPC error codes. This allows clients to receive accurate HTTP status information (e.g., 429 Too Many Requests, 503 Service Unavailable) from backend services.

Changes:

- Update RequestQoSContext interface to include httpStatusCode parameter
- Modify all QoS implementations (EVM, Solana, Cosmos, NoOp) to capture and propagate HTTP status codes
- Change protocol/shannon to pass through non-2xx responses instead of returning errors
- Make qos.HTTPResponse fields public for cross-package access
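As a rough illustration of status pass-through, the sketch below writes a backend's status code to the client unchanged. `backendResponse`, `writePassThrough`, and `clientSeesStatus` are hypothetical stand-ins and do not reflect the actual `qos.HTTPResponse` or `RequestQoSContext` definitions; the JSON-RPC error body is made up for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// backendResponse is a hypothetical stand-in for what was captured from a
// backend endpoint: its HTTP status code and raw body.
type backendResponse struct {
	StatusCode int
	Body       []byte
}

// writePassThrough copies the backend's status code and body to the client
// without deriving a new status from the JSON-RPC error code.
func writePassThrough(w http.ResponseWriter, resp backendResponse) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(resp.StatusCode)
	_, _ = w.Write(resp.Body)
}

// clientSeesStatus runs writePassThrough against an in-memory recorder and
// returns the status code a client would observe.
func clientSeesStatus(resp backendResponse) int {
	rec := httptest.NewRecorder()
	writePassThrough(rec, resp)
	return rec.Code
}

func main() {
	resp := backendResponse{
		StatusCode: http.StatusTooManyRequests,
		Body:       []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":-32005,"message":"rate limited"}}`),
	}
	fmt.Println(clientSeesStatus(resp))
}
```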

…rvices

Adds `is_batch_request` label and batch size histogram to all QoS services:

Cosmos:
- Add `is_batch_request` label to requestsTotal metric
- Add `cosmos_batch_request_size` histogram

Solana:
- Add `GetRequestMethods()` method to interpreter for batch support
- Add `is_batch_request` label to requestsTotal metric
- Add `solana_batch_request_size` histogram
- Update PublishMetrics to iterate through methods like EVM/Cosmos

EVM:
- Add `evm_batch_request_size` histogram (already had is_batch_request)

This enables consistent batch request visibility across all services:
- Filter batch vs single requests in Prometheus
- Analyze batch size distribution patterns
- Capacity planning based on batch request patterns

- protocol/shannon/context.go: Use rc.context instead of context.TODO() in sendHTTPRequest to respect parent context cancellation signals. This ensures HTTP relay requests are properly cancelled when the parent request is cancelled.
- cmd/main.go: Create unified backgroundCtx for pprof and hydrator services. This allows graceful shutdown of the pprof server which was previously unable to receive shutdown signals due to context.TODO().
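The difference between `context.TODO()` and a propagated context shows up as soon as the parent is cancelled. The sketch below builds the request with `http.NewRequestWithContext`; `sendWithContext`, `sendAfterCancel`, and `isCancelled` are hypothetical helpers, and the URL is a placeholder that is never actually contacted because the context is cancelled first.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// sendWithContext (hypothetical helper) ties the outgoing request to ctx,
// so cancelling the parent aborts the relay instead of letting it run on.
func sendWithContext(ctx context.Context, client *http.Client, url string) (int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// sendAfterCancel simulates a parent request that was cancelled before
// the relay was sent.
func sendAfterCancel(url string) error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := sendWithContext(ctx, http.DefaultClient, url)
	return err
}

// isCancelled reports whether err was caused by context cancellation.
func isCancelled(err error) bool {
	return errors.Is(err, context.Canceled)
}

func main() {
	err := sendAfterCancel("http://example.invalid/")
	fmt.Println("cancelled:", isCancelled(err))
}
```

With `context.TODO()` in place of `ctx`, the same call would ignore the parent's cancellation and proceed until the client's own timeout, if any.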

- Replace panic with error return in hydrateRouterDefaults() for invalid config values (system overhead exceeding timeouts)
- Fix error wrapping in websockets/bridge.go: use %w instead of %s to preserve error chains for errors.Is()/errors.As()
- Add nil check for Session() before accessing Application in Shannon context to prevent nil pointer dereference
- Improve invariant violation logging in sanctioned_endpoints_store with structured fields (cache_key, object_type)
- Track and log aggregate failure counts in data reporter for better visibility into silent failures
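The `%w` versus `%s` point is easy to verify in isolation. In this sketch, `errEndpointClosed` and the wrapper functions are hypothetical; only the formatting verbs and `errors.Is` behavior are standard library facts.

```go
package main

import (
	"errors"
	"fmt"
)

// errEndpointClosed is a hypothetical sentinel error for illustration.
var errEndpointClosed = errors.New("endpoint connection closed")

// wrapOpaque formats the cause as text; the original error value is lost.
func wrapOpaque(err error) error {
	return fmt.Errorf("websocket bridge: %s", err)
}

// wrapChained keeps the cause in the error chain via %w.
func wrapChained(err error) error {
	return fmt.Errorf("websocket bridge: %w", err)
}

// isEndpointClosed checks the chain for the sentinel.
func isEndpointClosed(err error) bool {
	return errors.Is(err, errEndpointClosed)
}

func main() {
	fmt.Println(isEndpointClosed(wrapOpaque(errEndpointClosed)))  // false
	fmt.Println(isEndpointClosed(wrapChained(errEndpointClosed))) // true
}
```

Both wrappers produce the same message text, which is why the `%s` form is easy to miss in review.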

- Switch to pointer receivers for GatewayConfig methods to align with Go best practices and improve consistency.

…ling

- Add mutex protection to success path in handleSuccessfulResponse() to prevent data race when UpdateWithResponse is called concurrently
- Return proper error instead of nil in ApplyHTTPObservations() when sanctioned endpoints store is not initialized (invariant violation)
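The data race fix follows the usual pattern of guarding shared state with a mutex on every path that touches it, including the success path. A minimal sketch, with `endpointState` and its methods as hypothetical stand-ins for the real types:

```go
package main

import (
	"fmt"
	"sync"
)

// endpointState is a hypothetical stand-in for per-endpoint state that many
// response handlers update at the same time.
type endpointState struct {
	mu        sync.Mutex
	successes int
}

// recordSuccess takes the lock before mutating shared state, so concurrent
// callers cannot interleave their read-modify-write steps.
func (s *endpointState) recordSuccess() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.successes++
}

// count reads the counter under the same lock.
func (s *endpointState) count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.successes
}

// runConcurrentUpdates fires n goroutines that each record one success and
// returns the final count.
func runConcurrentUpdates(n int) int {
	var state endpointState
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			state.recordSuccess()
		}()
	}
	wg.Wait()
	return state.count()
}

func main() {
	fmt.Println(runConcurrentUpdates(1000)) // 1000
}
```

Without the lock in `recordSuccess`, `go test -race` would flag the increment and the final count could fall short of n.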

- Add sync.Pool-based JSON buffer pool for EVM responses to reduce allocations
  - Pre-allocates 1KB buffers (covers ~79% of responses per production metrics)
  - Buffers grow automatically for larger responses and get reused
- Make websocket message observation buffer size configurable via router config
  - Default set to 100 (reduced from 1000 to prevent OOM: ~30MB vs 300MB)
  - Exposed as WebsocketMessageBufferSize in router config
- Add comprehensive tests for JSON buffer pooling and config validation
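A configurable channel buffer with a default might look like the sketch below. The field name and YAML key come from this PR; the `routerConfig` struct shape, `hydrateDefaults`, and `newObservationChan` are assumptions made for illustration. The quoted memory figures imply roughly 3KB per buffered message (300MB / (1000 slots × 100 connections)), which is an inference from the numbers rather than a stated measurement.

```go
package main

import "fmt"

// defaultWebsocketMessageBufferSize is the new default named in the PR.
const defaultWebsocketMessageBufferSize = 100

// routerConfig is a hypothetical, trimmed-down config struct. The struct tag
// shows how the Go field could map to the YAML key.
type routerConfig struct {
	WebsocketMessageBufferSize int `yaml:"websocket_message_buffer_size"`
}

// hydrateDefaults fills in the default when the value is unset or invalid.
func (c *routerConfig) hydrateDefaults() {
	if c.WebsocketMessageBufferSize <= 0 {
		c.WebsocketMessageBufferSize = defaultWebsocketMessageBufferSize
	}
}

// newObservationChan creates the per-connection observation channel with
// the configured capacity.
func newObservationChan(c *routerConfig) chan []byte {
	c.hydrateDefaults()
	return make(chan []byte, c.WebsocketMessageBufferSize)
}

func main() {
	unset := &routerConfig{}
	fmt.Println(cap(newObservationChan(unset))) // 100

	custom := &routerConfig{WebsocketMessageBufferSize: 250}
	fmt.Println(cap(newObservationChan(custom))) // 250
}
```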

jorgecuesta added 11 commits November 28, 2025 11:10

chore: use 'gateway' instead of 'gtw' in comments and log messages

5b1ee23

refactor: Use pointer receivers

3887a44

- Switch to pointer receivers for GatewayConfig methods to align with Go best practices and improve consistency.

oten91 approved these changes Nov 28, 2025

oten91 merged commit e1f0fe4 into main Nov 29, 2025
13 checks passed

oten91 deleted the fix/memory-optimization branch November 29, 2025 01:05

oten91 restored the fix/memory-optimization branch November 29, 2025 01:07

3 participants