Commit ab9de2b

feat: add configurable sanction durations (#490)
## Summary

- Add `SanctionConfig` to allow operators to configure endpoint sanction timing
- `session_sanction_duration`: How long session sanctions last (default: 1h)
- `cache_cleanup_interval`: How often expired sanctions are purged (default: 10m)
- Enables tuning based on network conditions

## Changes

| File | Description |
|------|-------------|
| `protocol/shannon/config.go` | Add `SanctionConfig` struct with `HydrateDefaults()` |
| `protocol/shannon/sanctioned_endpoints_store.go` | Use config values instead of hardcoded defaults |
| `protocol/shannon/protocol.go` | Pass config to store constructors |
| `protocol/shannon/sanctioned_endpoints_store_test.go` | Tests verifying sanctions expire after configured duration |
| `docusaurus/docs/develop/configs/2_gateway_config.md` | Documentation for new config |
| `config/examples/config.shannon_example.yaml` | Example configuration |
| `config/config.schema.yaml` | Schema validation |

## Example Configuration

```yaml
shannon_config:
  gateway_config:
    sanction_config:
      session_sanction_duration: 30m  # Default: 1h
      cache_cleanup_interval: 5m      # Default: 10m
```

## Test plan

- [x] Unit tests pass (`go test ./protocol/shannon/...`)
- [x] New tests verify sanctions expire after configured duration
- [x] Full build passes (`go build ./...`)
- [x] Lint passes (`make go_lint`)
- [x] E2E tests pass (`make e2e_test eth`)
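The summary describes a `SanctionConfig` struct with a `HydrateDefaults()` method. A minimal sketch of how such defaulting could look, assuming zero-valued fields mean "not set"; the constant names, YAML tags, and method body are illustrative assumptions, not the code in `protocol/shannon/config.go`:

```go
package main

import (
	"fmt"
	"time"
)

// Defaults taken from the commit description. The constant names are hypothetical.
const (
	defaultSessionSanctionDuration = time.Hour
	defaultCacheCleanupInterval    = 10 * time.Minute
)

// SanctionConfig mirrors the two fields named in the commit description.
// The YAML tags are assumptions based on the example configuration keys.
type SanctionConfig struct {
	SessionSanctionDuration time.Duration `yaml:"session_sanction_duration"`
	CacheCleanupInterval    time.Duration `yaml:"cache_cleanup_interval"`
}

// HydrateDefaults fills unset (zero) fields with the documented defaults
// and leaves explicitly configured values untouched.
func (c *SanctionConfig) HydrateDefaults() {
	if c.SessionSanctionDuration == 0 {
		c.SessionSanctionDuration = defaultSessionSanctionDuration
	}
	if c.CacheCleanupInterval == 0 {
		c.CacheCleanupInterval = defaultCacheCleanupInterval
	}
}

func main() {
	// Only the sanction duration is configured; the cleanup interval falls back to 10m.
	cfg := SanctionConfig{SessionSanctionDuration: 30 * time.Minute}
	cfg.HydrateDefaults()
	fmt.Println(cfg.SessionSanctionDuration, cfg.CacheCleanupInterval) // 30m0s 10m0s
}
```

Treating the zero value as "unset" means an operator cannot configure a duration of exactly zero; whether the real implementation makes the same trade-off is not shown in this commit view.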
1 parent e1f0fe4 commit ab9de2b

30 files changed, +5718 -101 lines changed

cmd/main.go

Lines changed: 5 additions & 5 deletions
@@ -64,8 +64,12 @@ func main() {
 	// Log the config path
 	logger.Info().Msgf("Starting PATH using config file: %s", configPath)

+	// Create a context for background services (pprof, hydrator, reputation) that can be canceled during shutdown.
+	// This context is used to signal graceful shutdown to all background goroutines.
+	backgroundCtx, backgroundCancel := context.WithCancel(context.Background())
+
 	// Create Shannon protocol instance (now the only supported protocol)
-	protocol, err := getShannonProtocol(logger, config.GetGatewayConfig())
+	protocol, err := getShannonProtocol(backgroundCtx, logger, config.GetGatewayConfig())
 	if err != nil {
 		log.Fatalf(`{"level":"fatal","error":"%v","message":"failed to create protocol"}`, err)
 	}
@@ -82,10 +86,6 @@ func main() {
 		log.Fatalf(`{"level":"fatal","error":"%v","message":"failed to start metrics server"}`, err)
 	}

-	// Create a context for background services (pprof, hydrator) that can be canceled during shutdown.
-	// This context is used to signal graceful shutdown to all background goroutines.
-	backgroundCtx, backgroundCancel := context.WithCancel(context.Background())
-
 	// Setup the pprof server with the background context for graceful shutdown
 	setupPprofServer(backgroundCtx, logger, pprofAddr)

cmd/shannon.go

Lines changed: 3 additions & 2 deletions
@@ -1,6 +1,7 @@
 package main

 import (
+	"context"
 	"fmt"

 	"github.com/pokt-network/poktroll/pkg/polylog"
@@ -36,15 +37,15 @@ func getShannonFullNode(logger polylog.Logger, config *shannonconfig.ShannonGate
 }

 // getShannonProtocol returns an instance of the Shannon protocol using the supplied Shannon-specific configuration.
-func getShannonProtocol(logger polylog.Logger, config *shannonconfig.ShannonGatewayConfig) (gateway.Protocol, error) {
+func getShannonProtocol(ctx context.Context, logger polylog.Logger, config *shannonconfig.ShannonGatewayConfig) (gateway.Protocol, error) {
 	logger.Info().Msg("Starting PATH gateway with Shannon protocol")

 	fullNode, err := getShannonFullNode(logger, config)
 	if err != nil {
 		return nil, fmt.Errorf("failed to create a Shannon full node instance: %w", err)
 	}

-	protocol, err := shannon.NewProtocol(logger, config.GatewayConfig, fullNode)
+	protocol, err := shannon.NewProtocol(ctx, logger, config.GatewayConfig, fullNode)
 	if err != nil {
 		return nil, fmt.Errorf("failed to create a Shannon protocol instance: %w", err)
 	}
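
The `ctx` passed through `getShannonProtocol` into `shannon.NewProtocol` is the background context created in `main`, so background work started by the protocol can presumably stop when that context is canceled. A minimal sketch of the pattern, assuming a ticker-driven loop; `runUntilCanceled` and `startAndStop` are hypothetical names and do not appear in PATH:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runUntilCanceled calls task on every tick until ctx is canceled,
// then returns the number of times task ran.
func runUntilCanceled(ctx context.Context, interval time.Duration, task func()) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	runs := 0
	for {
		select {
		case <-ctx.Done():
			// Shutdown was requested: exit the goroutine cleanly.
			return runs
		case <-ticker.C:
			task()
			runs++
		}
	}
}

// startAndStop launches the loop, cancels the context, and reports
// whether the loop exited within one second.
func startAndStop() bool {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})

	go func() {
		defer close(done)
		runUntilCanceled(ctx, 10*time.Millisecond, func() {})
	}()

	cancel() // plays the role of backgroundCancel() during shutdown

	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("loop stopped after cancel:", startAndStop())
}
```

Creating the context before the protocol instance, as this commit does in `cmd/main.go`, is what makes it available to pass down at construction time.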

config/config.schema.yaml

Lines changed: 13 additions & 0 deletions
@@ -149,6 +149,19 @@ properties:
                   description: "Whether to send all traffic to fallback endpoints for this service, regardless of protocol endpoint health."
                   type: boolean
                   default: false
+          sanction_config:
+            description: "Configuration for the endpoint sanction system. Controls how long misbehaving endpoints are excluded from selection."
+            type: object
+            additionalProperties: false
+            properties:
+              session_sanction_duration:
+                description: "Duration that session-based sanctions remain active. Endpoints with session sanctions will be excluded from selection for this duration. Format: Go duration string (e.g., '30m', '1h', '2h'). Default: 1h"
+                type: string
+                pattern: "^[0-9]+[smh]$"
+              cache_cleanup_interval:
+                description: "Interval for purging expired sanction entries from the cache. Format: Go duration string (e.g., '5m', '10m'). Default: 10m"
+                type: string
+                pattern: "^[0-9]+[smh]$"
   # Logger Configuration (optional)
   logger_config:
     description: "Optional configuration for the logger. If not specified, info level will be used."

config/config_test.go

Lines changed: 12 additions & 0 deletions
@@ -11,6 +11,7 @@ import (
 	"github.com/pokt-network/path/network/grpc"
 	"github.com/pokt-network/path/protocol"
 	shannonprotocol "github.com/pokt-network/path/protocol/shannon"
+	"github.com/pokt-network/path/reputation"
 )

 // getTestDefaultGRPCConfig returns a GRPCConfig with default values applied
@@ -82,6 +83,17 @@ func Test_LoadGatewayConfigFromYAML(t *testing.T) {
 								},
 							},
 						},
+						SanctionConfig: shannonprotocol.SanctionConfig{
+							SessionSanctionDuration: 30 * time.Minute,
+							CacheCleanupInterval:    5 * time.Minute,
+						},
+						ReputationConfig: reputation.Config{
+							Enabled:         true,
+							StorageType:     "memory",
+							InitialScore:    80,
+							MinThreshold:    30,
+							RecoveryTimeout: 5 * time.Minute,
+						},
 					},
 				},
 				Router: RouterConfig{

config/examples/config.shannon_example.yaml

Lines changed: 42 additions & 0 deletions
@@ -74,6 +74,48 @@ shannon_config:
           # only `default_url` is specified and the other RPC type URLs are omitted.
           - default_url: "https://eth.rpc.backup.io"

+    # Optional sanction configuration
+    # Controls how long misbehaving endpoints are excluded from selection
+    sanction_config:
+      # How long session-based sanctions last before the endpoint can be selected again
+      # Default: 1h (1 hour)
+      session_sanction_duration: 30m
+      # How often expired sanctions are cleaned up from memory
+      # Default: 10m (10 minutes)
+      cache_cleanup_interval: 5m
+
+    # Optional reputation configuration
+    # Provides gradual endpoint scoring based on reliability patterns
+    # Works in addition to binary sanctions, not as a replacement
+    reputation_config:
+      # Enable/disable the reputation system
+      # Default: false (only binary sanctions are used)
+      enabled: true
+      # Storage backend for reputation data
+      # Options: "memory" (single instance) or "redis" (multi-instance deployments)
+      # Default: "memory"
+      storage_type: "memory"
+      # Starting score for new endpoints (0-100 scale)
+      # Default: 80
+      initial_score: 80
+      # Minimum score required for endpoint selection
+      # Endpoints below this threshold are filtered out
+      # Default: 30
+      min_threshold: 30
+      # Time after which inactive endpoint scores can be re-evaluated
+      # Default: 5m
+      recovery_timeout: 5m
+      # Redis configuration (only used when storage_type is "redis")
+      # redis:
+      #   address: "localhost:6379"
+      #   password: ""
+      #   db: 0
+      #   key_prefix: "path:reputation:"
+      #   pool_size: 10
+      #   dial_timeout: 5s
+      #   read_timeout: 3s
+      #   write_timeout: 3s
+
 # Optional logger configuration
 logger_config:
   # Valid values are: debug, info, warn, error

docusaurus/docs/develop/configs/2_gateway_config.md

Lines changed: 97 additions & 0 deletions
@@ -46,6 +46,13 @@ shannon_config:
         fallback_urls:
           - "https://eth.rpc.grove.city/v1/1a2b3c4d"
           - "https://eth.rpc.grove.city/v1/5e6f7a8b"
+    # (Optional) Endpoint reputation system
+    reputation_config:
+      enabled: true
+      storage_type: "memory"
+      initial_score: 80
+      min_threshold: 30
+      recovery_timeout: "5m"

 # (Optional) Logger Configuration
 logger_config:
@@ -146,6 +153,8 @@ shannon_config:
 | `gateway_private_key_hex` | string | Yes | - | 64-character hex-encoded `secp256k1` gateway private key |
 | `owned_apps_private_keys_hex` | string[] | Only in centralized mode | - | List of 64-character hex-encoded `secp256k1` application private keys |
 | `service_fallback` | array | No | - | Array of service fallback configurations (see below for details) |
+| `sanction_config` | object | No | - | Configuration for endpoint sanction system (see below for details) |
+| `reputation_config` | object | No | - | Configuration for endpoint reputation system (see below for details) |

 **`service_fallback` (optional)**

@@ -181,6 +190,94 @@ TODO_DOCUMENT(@adshmh): Update this section to clarify the request distribution
 - **Protocol bypass**: Fallback endpoints bypass protocol-level validation and are sent directly to the configured URLs
 - **Service-specific**: Each service ID can have its own set of fallback endpoints

+**`sanction_config` (optional)**
+
+Configures the endpoint sanction system parameters. The sanction system temporarily excludes misbehaving endpoints from selection. When an endpoint returns errors or behaves poorly, it receives a "session sanction" that prevents it from being selected for requests until the sanction expires.
+
+```yaml
+gateway_config:
+  # ... other fields ...
+  sanction_config:
+    session_sanction_duration: "30m" # How long session sanctions last
+    cache_cleanup_interval: "5m" # How often to purge expired sanctions
+```
+
+| Field | Type | Required | Default | Description |
+| --------------------------- | ------ | -------- | ------- | -------------------------------------------------------------------------------------------------------- |
+| `session_sanction_duration` | string | No | "1h" | Duration that session-based sanctions remain active. Format: Go duration string (e.g., "30m", "1h", "2h") |
+| `cache_cleanup_interval` | string | No | "10m" | Interval for purging expired sanction entries from the cache. Format: Go duration string |
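
Both fields are documented as Go duration strings, while `config/config.schema.yaml` in this commit validates them with the pattern `^[0-9]+[smh]$`. The two are not equivalent: the pattern accepts only a single integer followed by one of `s`, `m`, or `h`. The sketch below compares the schema pattern with `time.ParseDuration`; `CheckDuration` is a hypothetical helper written for this comparison, not part of PATH:

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// schemaDurationPattern is the pattern applied to both sanction fields
// in config/config.schema.yaml.
var schemaDurationPattern = regexp.MustCompile(`^[0-9]+[smh]$`)

// CheckDuration reports whether s passes the schema pattern and whether
// Go's time.ParseDuration accepts it.
func CheckDuration(s string) (matchesSchema bool, parsesInGo bool) {
	matchesSchema = schemaDurationPattern.MatchString(s)
	_, err := time.ParseDuration(s)
	return matchesSchema, err == nil
}

func main() {
	for _, s := range []string{"30m", "1h", "1h30m", "500ms", "10"} {
		schemaOK, goOK := CheckDuration(s)
		fmt.Printf("%-6s schema=%-5t go=%t\n", s, schemaOK, goOK)
	}
}
```

Values such as `1h30m` or `500ms` are valid Go durations but do not match the schema pattern, so single-unit values like `90m` are the safe choice when the schema is enforced.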
+
+**Key Features:**
+- **Automatic expiration**: Session sanctions automatically expire after the configured duration
+- **Configurable timing**: Operators can tune sanction duration based on their network conditions
+- **Memory efficient**: Expired sanctions are periodically cleaned up to prevent memory bloat
+
+**Use Cases:**
+- **Shorter durations** (e.g., `15m`): Use when endpoints frequently have temporary issues and you want faster recovery
+- **Longer durations** (e.g., `2h`): Use when you want to more aggressively exclude problematic endpoints
+- **Default** (`1h`): Balanced approach suitable for most deployments
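
The expiry and cleanup behavior described above can be illustrated with a small in-memory store. This is a sketch under stated assumptions, not the code in `protocol/shannon/sanctioned_endpoints_store.go`: the type `sanctionStore`, its methods, and the `demo` helper are hypothetical, and the current time is passed in explicitly so the behavior is deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sanctionStore is a hypothetical in-memory store of session sanctions.
type sanctionStore struct {
	mu       sync.Mutex
	duration time.Duration        // corresponds to session_sanction_duration
	expiry   map[string]time.Time // endpoint address -> time the sanction ends
}

func newSanctionStore(duration time.Duration) *sanctionStore {
	return &sanctionStore{duration: duration, expiry: make(map[string]time.Time)}
}

// Sanction excludes the endpoint from selection until now+duration.
func (s *sanctionStore) Sanction(endpoint string, now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.expiry[endpoint] = now.Add(s.duration)
}

// IsSanctioned reports whether the endpoint is still excluded at time now.
// An expired entry counts as not sanctioned even before it is purged.
func (s *sanctionStore) IsSanctioned(endpoint string, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	until, ok := s.expiry[endpoint]
	return ok && now.Before(until)
}

// PurgeExpired deletes expired entries and returns how many were removed.
// A caller would run it once per cache_cleanup_interval.
func (s *sanctionStore) PurgeExpired(now time.Time) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	removed := 0
	for endpoint, until := range s.expiry {
		if !now.Before(until) {
			delete(s.expiry, endpoint)
			removed++
		}
	}
	return removed
}

// demo sanctions one endpoint for 30m and checks it at +10m and +31m.
func demo() (activeAt10m, activeAt31m bool, purged int) {
	start := time.Date(2025, time.January, 1, 0, 0, 0, 0, time.UTC)
	store := newSanctionStore(30 * time.Minute)
	store.Sanction("endpoint-a", start)
	activeAt10m = store.IsSanctioned("endpoint-a", start.Add(10*time.Minute))
	activeAt31m = store.IsSanctioned("endpoint-a", start.Add(31*time.Minute))
	purged = store.PurgeExpired(start.Add(31 * time.Minute))
	return activeAt10m, activeAt31m, purged
}

func main() {
	a, b, n := demo()
	fmt.Println(a, b, n) // true false 1
}
```

In this sketch the cleanup interval affects only memory use: `IsSanctioned` compares against the expiry time directly, so an endpoint becomes selectable as soon as its sanction expires, regardless of when the next purge runs.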
+
+**`reputation_config` (optional)**
+
+Configures the endpoint reputation system. Unlike binary sanctions that simply exclude or include endpoints, the reputation system provides **gradual scoring** based on endpoint reliability patterns over time. This allows for more nuanced endpoint selection and softer handling of temporarily degraded endpoints.
+
+```yaml
+gateway_config:
+  # ... other fields ...
+  reputation_config:
+    enabled: true # Enable the reputation system
+    storage_type: "memory" # Storage backend (currently only "memory" supported)
+    initial_score: 80 # Starting score for new endpoints
+    min_threshold: 30 # Score below which endpoints are filtered out
+    recovery_timeout: "5m" # Time after which inactive endpoints can be re-evaluated
+```
+
+| Field | Type | Required | Default | Description |
+| ------------------ | ------- | -------- | ------- | -------------------------------------------------------------------------------------------------- |
+| `enabled` | boolean | No | false | Whether to enable the reputation system. When false, only binary sanctions are used. |
+| `storage_type` | string | No | "memory"| Storage backend for reputation data. Currently only "memory" is supported. |
+| `initial_score` | float64 | No | 80 | Starting reputation score for new endpoints (0-100 scale). |
+| `min_threshold` | float64 | No | 30 | Minimum score required for an endpoint to be considered for selection. |
+| `recovery_timeout` | string | No | "5m" | Duration after which inactive endpoint scores can be re-evaluated. Format: Go duration string. |
+
+**How Reputation Scoring Works:**
+
+The reputation system records **signals** for each endpoint interaction:
+
+| Signal Type | Impact | Description |
+| ---------------- | ------ | -------------------------------------------------------------- |
+| Success | +1 | Successful request/response |
+| Minor Error | -3 | Client errors, unknown errors (not endpoint's fault) |
+| Major Error | -10 | Timeouts, connection issues (recoverable) |
+| Critical Error | -25 | HTTP 5xx, validation errors (service degradation) |
+| Fatal Error | -50 | Service misconfiguration (previously "permanent sanction") |
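
The signal impacts in the table above can be worked through with a short example. This is a sketch, not the `reputation` package: the `Signal` type, `ApplySignal`, and `IsEligible` are hypothetical names, and clamping the score to the 0-100 range is an assumption based on the documented scale.

```go
package main

import "fmt"

// Signal is a hypothetical enum for the five signal types in the table.
type Signal int

const (
	Success Signal = iota
	MinorError
	MajorError
	CriticalError
	FatalError
)

// impact holds the per-signal score changes listed in the table.
var impact = map[Signal]float64{
	Success:       1,
	MinorError:    -3,
	MajorError:    -10,
	CriticalError: -25,
	FatalError:    -50,
}

// ApplySignal adds the signal's impact to the score and clamps the result
// to [0, 100]. The clamping is an assumption, not confirmed by the source.
func ApplySignal(score float64, s Signal) float64 {
	score += impact[s]
	if score < 0 {
		return 0
	}
	if score > 100 {
		return 100
	}
	return score
}

// IsEligible reports whether a score meets min_threshold. Scores strictly
// below the threshold are filtered out.
func IsEligible(score, minThreshold float64) bool {
	return score >= minThreshold
}

func main() {
	score := 80.0 // documented default initial_score
	for _, s := range []Signal{MajorError, CriticalError, Success, CriticalError} {
		score = ApplySignal(score, s)
		fmt.Printf("score=%.0f eligible=%t\n", score, IsEligible(score, 30))
	}
}
```

With the default `initial_score` of 80 and `min_threshold` of 30, a single fatal error lands exactly on the threshold, and it takes 25 consecutive successes to offset one critical error.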
+
+**Key Features:**
+- **Gradual scoring**: Endpoints build or lose reputation over time based on actual performance
+- **Soft degradation**: Instead of immediately excluding endpoints, scores gradually decrease
+- **Recovery path**: Endpoints can recover reputation through consistent successful responses
+- **Works with sanctions**: Reputation filtering is applied **in addition to** binary sanctions, not as a replacement
+
+**Prometheus Metrics:**
+
+When reputation is enabled, the following metrics are exported:
+
+| Metric Name | Type | Description |
+| -------------------------------------------- | --------- | -------------------------------------------------------- |
+| `path_shannon_reputation_signals_total` | Counter | Total signals by service_id, signal_type, endpoint_domain |
+| `path_shannon_reputation_endpoints_filtered_total` | Counter | Endpoints filtered vs allowed by service_id, action, domain |
+| `path_shannon_reputation_score_distribution` | Histogram | Distribution of endpoint scores by service_id |
+| `path_shannon_reputation_errors_total` | Counter | Errors in the reputation system by operation, error_type |
+
+**Use Cases:**
+- **Production deployments**: Enable to get gradual endpoint scoring and better resilience
+- **Debugging**: Use metrics to identify consistently problematic endpoints or domains
+- **Tuning**: Adjust `min_threshold` based on your network's reliability patterns
+
+:::warning E2E Testing
+When running E2E tests with reputation enabled, ensure `reputation_config.enabled: true` is set in your test configuration (e.g., `e2e/config/.shannon.config.yaml`). Without this, E2E tests will not exercise the reputation code path.
+:::
+
 ---

 ## `hydrator_config` (optional)
