
Conversation

@upgle
Contributor

@upgle upgle commented Dec 25, 2025

Summary

This PR upgrades the Redis client library from radix v3 to radix v4.

radix v3 has entered maintenance mode (only accepting bug fixes), and v4 introduces several key improvements:

  • More polished API with struct-based configuration
  • Full RESP3 support
  • context.Context support on all blocking operations
  • Connection sharing that works with Pipeline and EvalScript

Changes

Core Changes

  • Upgrade github.com/mediocregopher/radix/v3 to github.com/mediocregopher/radix/v4
  • Migrate from function-based options (radix.DialOpt, radix.PoolOpt) to struct-based configuration (radix.Dialer, radix.PoolConfig); see the sketch after this list
  • Add context.Context parameter to all Redis operations (required in v4)
  • Update FlatCmd call signature to align with the v4 API
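
A minimal sketch of the struct-based, context-aware setup described above. The names radix.Dialer, radix.PoolConfig, Config.New(), and WriteFlushInterval come from this PR and its commit messages; the field values and helper name are illustrative only and should be checked against the radix v4 docs.

package main

import (
	"context"
	"time"

	"github.com/mediocregopher/radix/v4"
)

// newRedisClient builds a connection pool using v4's struct-based
// configuration; every v4 operation takes a context.Context.
func newRedisClient(ctx context.Context, addr string) (radix.Client, error) {
	cfg := radix.PoolConfig{
		Size: 20, // e.g. REDIS_POOL_SIZE
		Dialer: radix.Dialer{
			// REDIS_PIPELINE_WINDOW maps to the write-buffer flush interval in v4.
			WriteFlushInterval: 150 * time.Microsecond,
		},
	}
	// v4 replaces radix.NewPool(...) with Config.New(ctx, network, addr).
	return cfg.New(ctx, "tcp", addr)
}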

Pipeline Behavior Changes

Pipeline mode is now automatically selected based on Redis deployment type:

Cluster mode:

  • Groups commands by key and pipelines same-key commands together
  • INCRBY + EXPIRE for the same key are sent in one pipeline (same slot; see the sketch below)
  • Reduces round-trips from 2 to 1 per key

Single/Sentinel mode:

  • Batches all commands in a single pipeline
  • Minimal latency for non-cluster deployments
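
The sketch below illustrates the same-key pipelining described above, using the radix v4 pipeline API (radix.NewPipeline and Append) referenced in the commits; the helper name and TTL handling are illustrative, not the actual ratelimit code.

package main

import (
	"context"
	"strconv"

	"github.com/mediocregopher/radix/v4"
)

// incrAndExpire pipelines INCRBY and EXPIRE for a single key. Because both
// commands hash to the same slot, cluster mode can send them in one
// round-trip instead of two.
func incrAndExpire(ctx context.Context, client radix.Client, key string, ttl int) (int64, error) {
	var hits int64
	p := radix.NewPipeline()
	p.Append(radix.Cmd(&hits, "INCRBY", key, "1"))
	p.Append(radix.Cmd(nil, "EXPIRE", key, strconv.Itoa(ttl)))
	if err := client.Do(ctx, p); err != nil {
		return 0, err
	}
	return hits, nil
}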

Deprecated:

  • REDIS_PIPELINE_LIMIT - no longer has any effect (warning logged if set)

Pool Behavior Changes

radix v4 uses a fixed-size pool that blocks when exhausted. Note the changes in REDIS_POOL_ON_EMPTY_BEHAVIOR:

| v3 Behavior | v4 Behavior |
| --- | --- |
| WAIT | Same (blocks until a connection is available) |
| CREATE | Not supported - will block instead of creating overflow connections |
| ERROR | Not supported - will block instead of failing fast |

Removed settings:

  • REDIS_POOL_ON_EMPTY_WAIT_DURATION - No longer used since v4 always blocks
  • REDIS_PERSECOND_POOL_ON_EMPTY_WAIT_DURATION - Same as above
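
Since the v4 pool always blocks when exhausted, callers that previously relied on fail-fast (ERROR) semantics can bound the wait with a context deadline instead. A minimal sketch, assuming the blocking acquisition honors the context passed to Do (the helper name is illustrative):

package main

import (
	"context"
	"time"

	"github.com/mediocregopher/radix/v4"
)

// doWithTimeout bounds how long a command may wait for a pooled connection.
// If the pool stays exhausted past the deadline, Do returns a context error
// instead of blocking indefinitely.
func doWithTimeout(client radix.Client, action radix.Action) error {
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()
	return client.Do(ctx, action)
}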

Breaking Changes

  1. REDIS_PIPELINE_LIMIT is deprecated - No longer has any effect (warning logged if set)
  2. Implicit pipelining is removed - pipeline mode is now selected automatically based on the Redis type
  3. REDIS_POOL_ON_EMPTY_WAIT_DURATION removed - No longer needed in v4
  4. Pool overflow not supported - If using REDIS_POOL_ON_EMPTY_BEHAVIOR=CREATE, increase REDIS_POOL_SIZE instead

Migration Guide

Most users need no changes. If you have custom settings:

Remove deprecated settings:

# Remove these if present
REDIS_PIPELINE_LIMIT=8
REDIS_POOL_ON_EMPTY_WAIT_DURATION=10s

Cluster mode continues to work:

REDIS_TYPE=cluster
REDIS_PIPELINE_WINDOW=150us  # Still supported

// ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
// defer cancel()
// client.Do(ctx, cmd)
// - Consider increasing REDIS_POOL_SIZE if using CREATE or ERROR previously
Contributor

Why not just return an error or panic if we detect CREATE or ERROR for REDIS_POOL_ON_EMPTY_BEHAVIOR? Otherwise, users who had set REDIS_POOL_ON_EMPTY_BEHAVIOR=ERROR/CREATE will just experience indefinite blocking.

Contributor Author

@collin-lee Thank you for reviewing this.
I've updated the logic to panic/return an error when REDIS_POOL_ON_EMPTY_BEHAVIOR is set to CREATE or ERROR.
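
A rough sketch of the shape of that check (the env var name and message wording follow this PR; the function name is illustrative):

// validatePoolOnEmptyBehavior fails fast at startup when a v3-only pool
// behavior is configured, rather than silently falling back to blocking.
func validatePoolOnEmptyBehavior(behavior string) {
	switch behavior {
	case "CREATE", "ERROR":
		panic("REDIS_POOL_ON_EMPTY_BEHAVIOR=" + behavior +
			" is not supported in radix v4; use WAIT and increase REDIS_POOL_SIZE instead")
	}
}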

// RedisPerSecondSentinelAuth is the password for authenticating to per-second Redis Sentinel nodes.
// This is separate from RedisPerSecondAuth which is used for authenticating to the Redis master/replica nodes.
// If empty, no authentication will be attempted when connecting to per-second Sentinel nodes.
RedisPerSecondSentinelAuth string `envconfig:"REDIS_PERSECOND_SENTINEL_AUTH" default:""`
Contributor

REDIS_PERSECOND_SENTINEL_AUTH should be mentioned in README.md too I think

Contributor Author

@collin-lee
This PR focuses on radix v4 migration and doesn't change sentinel auth functionality.
And REDIS_PERSECOND_SENTINEL_AUTH is already documented in README.md.

Is there something specific you'd like to see added or clarified?

Contributor

Thanks for pointing that out. I was reviewing this yesterday and wanted to make sure any newly introduced env variables are mentioned in README.md.

@upgle
Contributor Author

upgle commented Dec 26, 2025

Note: This benchmark focuses on Redis Single Instance. Performance evaluations for Redis Cluster are currently underway and will be shared in a future update.

Radix v4 Benchmark Results (Redis Single Instance)

Comprehensive performance testing comparing Radix v3 (main) vs Radix v4 under a controlled load of 500 RPS.

Executive Summary

With explicit pipelining enabled, Radix v4 delivers a 53-55% latency reduction compared to the current main branch (Radix v3).

| Branch | Configuration | Avg Latency | p99 Latency | Improvement |
| --- | --- | --- | --- | --- |
| main (v3) | Window-based | 2.65ms | 4.20ms | baseline |
| radixv4 | Write-buffer | 2.15ms | 3.89ms | -19% |
| radixv4 | Explicit-pipeline + Write-buffer | 1.22ms | 2.21ms | -54% |

Key Results

Fixed Key Scenario

main (v3):                         2.668ms ████████████████████████████
radixv4 (write-buffer):            2.123ms ██████████████████████ -20%
radixv4 (explicit+write-buffer):   1.189ms ████████████ -55%

Mixed2 Scenario (1 fixed + 1 variable key)

main (v3):                         2.638ms ███████████████████████████
radixv4 (write-buffer):            2.171ms ██████████████████████ -18%
radixv4 (explicit+write-buffer):   1.254ms ████████████ -52%

Detailed Comparison Table

Click to expand: Complete benchmark results

Fixed Key Scenario (500 RPS)

| Branch | Config | RPS | Avg | p50 | p75 | p90 | p95 | p99 | p999 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| main (v3) | Window | 500.00 | 2.668ms | 2.563ms | 2.891ms | 3.331ms | 3.538ms | 4.283ms | 7.459ms |
| radixv4 | Write-buffer | 499.90 | 2.123ms | 2.001ms | 2.32ms | 2.715ms | 3.005ms | 3.781ms | 7.626ms |
| radixv4 | Explicit + Write-buffer | 499.96 | 1.189ms | 1.15ms | 1.289ms | 1.439ms | 1.567ms | 1.995ms | 4.466ms |

Improvements (radixv4 Explicit+Write-buffer vs. radix v3 Window):

  • Avg: -55.4% | p50: -55.1% | p95: -55.7% | p99: -53.4%

Mixed2 Scenario (500 RPS)

| Branch | Config | RPS | Avg | p50 | p75 | p90 | p95 | p99 | p999 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| main (v3) | Window | 500.00 | 2.638ms | 2.527ms | 2.907ms | 3.317ms | 3.507ms | 4.125ms | 6.62ms |
| radixv4 | Write-buffer | 499.96 | 2.171ms | 2.056ms | 2.287ms | 2.651ms | 3.023ms | 3.99ms | 16.189ms |
| radixv4 | Explicit + Write-buffer | 499.90 | 1.254ms | 1.21ms | 1.355ms | 1.519ms | 1.653ms | 2.43ms | 4.565ms |

Improvements (radixv4 Explicit+Write-buffer vs. radix v3 Window):

  • Avg: -52.5% | p50: -52.1% | p95: -52.9% | p99: -41.1%

Test Configuration

Click to expand: Test environment and settings

Test Environment

  • Redis: Single instance (Docker, redis:7-alpine, port 6379)
  • Platform: macOS (Darwin 25.1.0)
  • Go Version: 1.23
  • Test Duration: 30s per scenario (with 2s warmup)
  • Target Rate: 500 RPS (fixed rate, controlled load)

gRPC Client Configuration

Workers (Concurrency):     20
gRPC Connections:          5
Target Rate:               500 req/s
Duration:                  30s
Warmup:                    2s

Worker count calculation (Little's Law): For 500 RPS at ~10ms expected latency, concurrent requests needed = 500 × 0.01s = 5. Workers = 5 × 4 (headroom) = 20, gRPC connections = 20 / 4 = 5.

Redis Configuration

main branch (Radix v3):

REDIS_TYPE=single
REDIS_URL=localhost:6379
REDIS_PIPELINE_WINDOW=150us
REDIS_PIPELINE_LIMIT=0          # No explicit limit
REDIS_POOL_SIZE=20
REDIS_POOL_ON_EMPTY_BEHAVIOR=WAIT # To fix pool size

radixv4 - Write-buffer:

REDIS_TYPE=single
REDIS_URL=localhost:6379
REDIS_USE_EXPLICIT_PIPELINE=false
REDIS_PIPELINE_WINDOW=150us
REDIS_POOL_SIZE=20

radixv4 - Explicit Pipeline + Write-buffer:

REDIS_TYPE=single
REDIS_URL=localhost:6379
REDIS_USE_EXPLICIT_PIPELINE=true    # Key difference
REDIS_PIPELINE_WINDOW=150us
REDIS_POOL_SIZE=20

Test Scenarios Explained

Click to expand: What are "fixed" and "mixed2" scenarios?

Scenario 1: fixed (Cache-friendly)

Workload: All requests use the same descriptor key-value pair.

gRPC Request:

descriptors: [
  {
    entries: [
      { key: "api_key", value: "fixed_key" }
    ]
  }
]

Rate Limit Config (config.yaml):

- key: api_key
  value: fixed_key
  rate_limit:
    unit: minute
    requests_per_unit: 100000000  # Very high to avoid rate limiting

Redis Behavior:

  • All requests increment the same Redis key

Scenario 2: mixed2 (Realistic workload)

Workload: Each request uses 1 fixed key + 1 variable key (nested descriptors).

gRPC Request:

descriptors: [
  {
    entries: [
      { key: "nested_fixed_1", value: "value_1" },
      { key: "var_1", value: "random_uuid_xyz123" }  # Random value per request
    ]
  }
]

Rate Limit Config (config.yaml):

- key: nested_fixed_1
  value: value_1
  descriptors:
    - key: var_1          # No value = wildcard match
      rate_limit:
        unit: minute
        requests_per_unit: 100000000

Redis Behavior:

  • Each request creates a unique Redis key (due to random var_1 value)

@upgle
Contributor Author

upgle commented Dec 27, 2025

I've improved performance in Redis Cluster as well by automatically applying pipelining for operations on the same key. Please refer to this commit: 2f597fd

Radix v4 Benchmark Results (Redis Cluster 🕸️)

Comprehensive performance testing comparing Radix v3 (main) vs Radix v4 in Redis Cluster mode under controlled 500 RPS load.

Executive Summary

Radix v4 delivers a 35-64% latency reduction across all scenarios compared to the main branch in Redis Cluster deployments.

| Branch | Scenario | Avg Latency | p99 Latency | Improvement |
| --- | --- | --- | --- | --- |
| main (v3) | fixed | 3.45ms | 7.34ms | baseline |
| radixv4 | fixed | 1.61ms | 4.96ms | -53% |
| main (v3) | mixed2 | 5.70ms | 15.07ms | baseline |
| radixv4 | mixed2 | 2.08ms | 5.83ms | -64% |
| main (v3) | mixed10 | 75.25ms | 127.35ms | baseline |
| radixv4 | mixed10 | 48.44ms | 75.35ms | -36% |

Key Results

(Figure: percentile comparison across scenarios)

Detailed Comparison Table

Click to expand: Complete benchmark results

Fixed Key Scenario (500 RPS)

| Branch | RPS | Avg | p50 | p95 | p99 |
| --- | --- | --- | --- | --- |
| main (v3) | 500.00 | 3.45ms | 3.33ms | 4.77ms | 7.34ms |
| radixv4 | 500.00 | 1.61ms | 1.38ms | 2.90ms | 4.96ms |

Improvements:

  • Avg: -53.4% | p50: -58.5% | p95: -39.2% | p99: -32.3%

Mixed2 Scenario (500 RPS)

| Branch | RPS | Avg | p50 | p95 | p99 |
| --- | --- | --- | --- | --- |
| main (v3) | 500.00 | 5.70ms | 5.12ms | 10.01ms | 15.07ms |
| radixv4 | 500.00 | 2.08ms | 1.82ms | 3.74ms | 5.83ms |

Improvements:

  • Avg: -63.5% | p50: -64.5% | p95: -62.6% | p99: -61.3%

Mixed10 Scenario (Note: RPS limited by workload complexity)

| Branch | Avg RPS | Avg | p50 | p95 | p99 |
| --- | --- | --- | --- | --- |
| main (v3) | ~260 | 75.25ms | 72.88ms | 101.28ms | 127.35ms |
| radixv4 | ~390 | 48.44ms | 47.52ms | 65.29ms | 75.35ms |

Improvements:

  • Throughput: +50% | Avg: -35.6% | p50: -34.8% | p95: -35.5% | p99: -40.8%

Note: mixed10 scenario could not sustain 500 RPS due to workload complexity (10 keys per request). Radix v4 shows significant throughput improvement (+50%) in addition to latency reduction.


Test Configuration

Click to expand: Test environment and settings

Test Environment

  • Redis: 3-node cluster (Docker, redis:7-alpine, ports 7001-7003)
  • Platform: macOS (Darwin 25.1.0)
  • Go Version: 1.23
  • Test Duration: 30s per scenario (with 2s warmup)
  • Target Rate: 500 RPS (fixed rate, controlled load)
  • Runs per scenario: 3 (outliers excluded from averages)

gRPC Client Configuration

Workers (Concurrency):     20
gRPC Connections:          5
Target Rate:               500 req/s
Duration:                  30s
Warmup:                    2s

Redis Configuration

main branch (Radix v3):

REDIS_TYPE=cluster
REDIS_URL=localhost:7001,localhost:7002,localhost:7003
REDIS_PIPELINE_WINDOW=150us
REDIS_PIPELINE_LIMIT=8
REDIS_POOL_SIZE=10

radixv4:

REDIS_TYPE=cluster
REDIS_URL=localhost:7001,localhost:7002,localhost:7003
REDIS_PIPELINE_WINDOW=150us
REDIS_POOL_SIZE=10
# Note: REDIS_PIPELINE_LIMIT removed (deprecated in v4)

Test Scenarios Explained

Click to expand: Scenario descriptions

Scenario 1: fixed (Cache-friendly)

All requests use the same descriptor key-value pair.

gRPC Request:

descriptors: [
  {
    entries: [
      { key: "api_key", value: "fixed_key" }
    ]
  }
]

Redis Behavior:

  • All requests increment the same Redis key
  • Best-case scenario for caching

Scenario 2: mixed2 (Realistic workload)

Each request uses 1 fixed key + 1 variable key (nested descriptors).

gRPC Request:

descriptors: [
  {
    entries: [
      { key: "nested_fixed_1", value: "value_1" },
      { key: "var_1", value: "random_uuid_xyz123" }  # Random value per request
    ]
  }
]

Redis Behavior:

  • Each request creates a unique Redis key
  • Simulates per-user rate limiting

Scenario 3: mixed10 (Heavy workload)

Each request uses 5 fixed keys + 5 variable keys.

gRPC Request:

descriptors: [
  {
    entries: [
      { key: "fixed_1", value: "value_1" },
      { key: "fixed_2", value: "value_2" },
      ...
      { key: "var_1", value: "random_uuid_1" },
      { key: "var_2", value: "random_uuid_2" },
      ...
    ]
  }
]

Redis Behavior:

  • 10 Redis operations per request
  • Tests cluster slot routing efficiency

upgle added 14 commits December 28, 2025 01:28
Upgrade radix Redis client from v3.8.1 to v4.1.4.

Main changes:
- Import paths: radix/v3 -> radix/v4
- Pool/Cluster/Sentinel use Config.New() instead of New()
- All client operations require context.Context parameter
- Dialer setup changed from functional options to struct config
- Pipelining uses radix.NewPipeline() and Append()
- Write buffering via Dialer.WriteFlushInterval

Breaking from v3:
- Pool on-empty behavior (WAIT/CREATE/ERROR) not available
- REDIS_PIPELINE_LIMIT setting deprecated (no effect in v4)

Tested with existing test suite - all tests passing.

Signed-off-by: seonghyun <[email protected]>
Update documentation to reflect radix v4's pipeline behavior:

- REDIS_PIPELINE_WINDOW now sets WriteFlushInterval (auto-flush timing)
- REDIS_PIPELINE_LIMIT deprecated - no effect in v4
- Add REDIS_USE_EXPLICIT_PIPELINE for manual pipeline control
- Required for Redis Cluster: PIPELINE_WINDOW must be non-zero

Update terminology from "implicit pipelining" to "write buffering"
to better match radix v4's actual behavior.

Signed-off-by: seonghyun <[email protected]>
- Add useExplicitPipeline parameter to test client creation
- Update error assertions for v4's error message format
  (v4 prefixes with "response returned from Conn:")
- Handle different connection errors (EOF, connection reset, broken pipe)
- Update radix.FlatCmd usage for v4 API

Signed-off-by: seonghyun <[email protected]>
Replace deprecated RedisPipelineLimit with RedisPipelineWindow in
configRedisCluster function. Radix v4 requires WriteFlushInterval
(RedisPipelineWindow) for cluster mode buffering instead of the
deprecated pipeline limit setting.

Signed-off-by: seonghyun <[email protected]>
Radix v4 does not support CREATE or ERROR behaviors for
REDIS_POOL_ON_EMPTY_BEHAVIOR. Previously, these settings were logged
as errors but the application would continue with blocking behavior,
which could cause unexpected issues in production.

Changes:
- Panic at startup when CREATE or ERROR is detected
- Prevent silent behavior changes that could cause blocking
- Update tests to verify panic behavior
- Improve migration documentation in comments

This ensures users are immediately notified of incompatible
configuration rather than experiencing unexpected blocking in production.

Signed-off-by: seonghyun <[email protected]>
The default value 'CREATE' is not supported in radix v4 and causes
integration tests to panic at startup. Changed default to 'WAIT' which
matches radix v4's actual pool behavior (always blocks when empty).

This fixes integration test failures where tests without explicit
REDIS_POOL_ON_EMPTY_BEHAVIOR settings would panic during initialization
with: "REDIS_POOL_ON_EMPTY_BEHAVIOR=CREATE is not supported in radix v4"

Also updated documentation to clarify that CREATE/ERROR are not supported
and marked RedisPoolOnEmptyWaitDuration as deprecated.

Signed-off-by: seonghyun <[email protected]>
- Fix WaitForTcpPort to use timeoutCtx instead of ctx
  This ensures the timeout parameter is actually respected when
  dialing TCP connections.

- Increase gRPC server startup timeout from 1s to 10s
  Radix v4 cluster connection initialization takes longer,
  especially when establishing connections to multiple cluster nodes.
  This prevents "connection refused" errors in integration tests.

Signed-off-by: seonghyun <[email protected]>
Consolidates Redis and Sentinel dialer setup into a reusable createDialer
helper function, eliminating ~30 lines of duplicated code. Improves logging
by including connection target details (e.g., "sentinel(master,host1,host2)")
instead of generic "sentinel" string.

Signed-off-by: seonghyun <[email protected]>
Remove the deprecated poolOnEmptyWaitDuration parameter and related
configuration settings as they have no effect in radix v4. The pool
always blocks until a connection is available when using WAIT behavior.

Signed-off-by: seonghyun <[email protected]>
Remove REDIS_USE_EXPLICIT_PIPELINE configuration option and
automatically determine pipeline mode based on Redis deployment type:

- Cluster mode: uses grouped pipeline (groups same-key commands)
  - INCRBY + EXPIRE for same key are pipelined together (same slot)
  - Reduces round-trips from 2 to 1 per key in cluster mode

- Single/Sentinel mode: uses explicit pipeline (batches all commands)
  - All commands in one pipeline for minimal latency
  - Optimal for non-cluster deployments

This simplifies configuration by removing user-facing options while
automatically choosing the optimal pipeline strategy for each Redis type.

Breaking changes:
- Remove REDIS_USE_EXPLICIT_PIPELINE env var
- Remove REDIS_PERSECOND_USE_EXPLICIT_PIPELINE env var
- Remove UseExplicitPipeline() interface method

Signed-off-by: seonghyun <[email protected]>
The REDIS_USE_EXPLICIT_PIPELINE and REDIS_PERSECOND_USE_EXPLICIT_PIPELINE
settings were documented in README but do not exist in settings.go.
Removed the documentation to match the actual implementation.

Signed-off-by: seonghyun <[email protected]>
@upgle
Contributor Author

upgle commented Dec 27, 2025

@collin-lee
Force-pushed to comply with DCO requirements. All 14 commits now include the required Signed-off-by line. Please review at your convenience. Thank you!

Contributor

@collin-lee collin-lee left a comment

LGTM

FYI @arkodg

@upgle
Contributor Author

upgle commented Dec 28, 2025

@collin-lee
Thank you for your quick review.
My team is working on performance tuning to handle tens to hundreds of thousands of requests per second.
In addition, I'll propose a few additional improvements as issues and open draft PRs for them.

@collin-lee collin-lee merged commit e9ce92c into envoyproxy:main Dec 31, 2025
6 checks passed