# OpenTelemetry Integration Design for ToolHive MCP Server Proxies

## Problem Statement

ToolHive currently lacks observability into MCP server interactions, making it difficult to:
- Debug MCP protocol issues
- Monitor performance and reliability
- Track usage patterns and errors
- Correlate issues across the proxy-container boundary

## Goals

- Add comprehensive OpenTelemetry instrumentation to MCP server proxies
- Provide traces, metrics, and structured logging for all MCP interactions
- Maintain backward compatibility and minimal performance impact
- Support standard OTEL backends (Jaeger, Honeycomb, DataDog, etc.)

## Non-Goals

- Instrumenting MCP servers themselves (only the proxy layer)
- Custom telemetry formats or proprietary backends
- Breaking changes to existing APIs

## Architecture Overview

ToolHive uses HTTP proxies to front MCP servers running in containers:

```
Client → HTTP Proxy → Container (MCP Server)
              ↑
       OTEL Middleware
```

Two transport modes exist:
1. **SSE Transport**: `TransparentProxy` forwards HTTP directly to containers
2. **Stdio Transport**: `HTTPSSEProxy` bridges HTTP/SSE to container stdio

Both use the `types.Middleware` interface, providing a clean integration point.
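
The exact `types.Middleware` definition lives in ToolHive's codebase; the sketches in this document assume the common `func(http.Handler) http.Handler` shape, as in this minimal (non-authoritative) sketch:

```go
package types

import "net/http"

// Middleware wraps an http.Handler and returns a new handler.
// This is the shape assumed throughout this design, not necessarily
// the exact ToolHive definition.
type Middleware func(next http.Handler) http.Handler
```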

## Detailed Design

### 1. Telemetry Provider (`pkg/telemetry`)

Create a new telemetry package that provides:

```go
type Config struct {
    Enabled        bool
    Endpoint       string
    ServiceName    string
    ServiceVersion string
    SamplingRate   float64
    Headers        map[string]string
    Insecure       bool
}

type Provider struct {
    tracerProvider trace.TracerProvider
    meterProvider  metric.MeterProvider
}

func NewProvider(ctx context.Context, config Config) (*Provider, error)
func (p *Provider) Middleware() types.Middleware
func (p *Provider) Shutdown(ctx context.Context) error
```

The provider initializes OpenTelemetry with proper resource attribution and configures exporters for OTLP endpoints. It handles graceful shutdown and provides HTTP middleware for instrumentation.
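
As an illustration of that initialization, here is a minimal sketch of `NewProvider` using the OTLP/HTTP trace exporter and the OpenTelemetry SDK; metrics setup is analogous and omitted, and endpoint/option handling is simplified:

```go
package telemetry

import (
    "context"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// NewProvider wires up an OTLP trace exporter and a sampled tracer provider.
// Metric provider setup follows the same pattern and is omitted here.
func NewProvider(ctx context.Context, config Config) (*Provider, error) {
    // Endpoint is assumed to be host[:port]; URL/scheme handling is omitted.
    opts := []otlptracehttp.Option{otlptracehttp.WithEndpoint(config.Endpoint)}
    if config.Insecure {
        opts = append(opts, otlptracehttp.WithInsecure())
    }
    if len(config.Headers) > 0 {
        opts = append(opts, otlptracehttp.WithHeaders(config.Headers))
    }

    exporter, err := otlptracehttp.New(ctx, opts...)
    if err != nil {
        return nil, err
    }

    // Resource attribution: service name and version show up on every span.
    res := resource.NewWithAttributes(semconv.SchemaURL,
        semconv.ServiceNameKey.String(config.ServiceName),
        semconv.ServiceVersionKey.String(config.ServiceVersion),
    )

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.TraceIDRatioBased(config.SamplingRate)),
    )
    return &Provider{tracerProvider: tp}, nil
}
```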

### 2. HTTP Middleware Implementation

The middleware wraps HTTP handlers to provide comprehensive instrumentation:

**Request Processing:**
- Extract HTTP metadata (method, URL, headers)
- Start trace spans with semantic conventions
- Parse request bodies for MCP protocol information
- Record request metrics and active connections

**Response Processing:**
- Capture response metadata (status, size, duration)
- Record completion metrics
- Finalize spans with response attributes

**Error Handling:**
The middleware never fails the underlying request, even if telemetry operations encounter errors. All operations use timeouts and circuit breakers.
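
A minimal sketch of such a middleware, covering the request and response steps above; the ToolHive import path, span name, and status-capturing wrapper are illustrative, and a production wrapper would also need to forward `http.Flusher` for SSE streaming:

```go
package telemetry

import (
    "net/http"

    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
    "go.opentelemetry.io/otel/trace"

    "github.com/stacklok/toolhive/pkg/transport/types" // assumed import path
)

// Middleware returns HTTP middleware that traces every proxied request.
// Telemetry failures must never affect the proxied request itself.
func (p *Provider) Middleware() types.Middleware {
    tracer := p.tracerProvider.Tracer("github.com/stacklok/toolhive/pkg/telemetry")
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ctx, span := tracer.Start(r.Context(), "mcp.proxy.request",
                trace.WithSpanKind(trace.SpanKindServer),
                trace.WithAttributes(
                    semconv.HTTPMethodKey.String(r.Method),
                    semconv.HTTPURLKey.String(r.URL.String()),
                ),
            )
            defer span.End()

            // Wrap the ResponseWriter so the status code can be recorded.
            // A real implementation must also pass through http.Flusher for SSE.
            rw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
            next.ServeHTTP(rw, r.WithContext(ctx))

            span.SetAttributes(semconv.HTTPStatusCodeKey.Int(rw.status))
        })
    }
}

// statusWriter captures the response status code for span attributes.
type statusWriter struct {
    http.ResponseWriter
    status int
}

func (s *statusWriter) WriteHeader(code int) {
    s.status = code
    s.ResponseWriter.WriteHeader(code)
}
```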

### 3. MCP Protocol Instrumentation

Enhanced instrumentation for JSON-RPC calls:

```go
func extractMCPMethod(body []byte) (method, id string, err error)
func addMCPAttributes(span trace.Span, method string, serverName string)
```

This extracts MCP-specific information like method names (`tools/list`, `resources/read`), request IDs, and error codes to provide protocol-level observability beyond HTTP metrics.
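
A sketch of what that parsing could look like; the JSON-RPC field names come from the JSON-RPC 2.0 spec, the attribute keys mirror the data model below, and error handling is simplified:

```go
package telemetry

import (
    "encoding/json"
    "fmt"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// jsonRPCRequest captures the JSON-RPC 2.0 fields relevant for telemetry.
type jsonRPCRequest struct {
    JSONRPC string          `json:"jsonrpc"`
    ID      json.RawMessage `json:"id"` // may be a string or a number
    Method  string          `json:"method"`
}

// extractMCPMethod pulls the MCP method name and request ID out of a
// JSON-RPC request body. Non-JSON bodies simply go unannotated.
func extractMCPMethod(body []byte) (method, id string, err error) {
    var req jsonRPCRequest
    if err := json.Unmarshal(body, &req); err != nil {
        return "", "", fmt.Errorf("not a JSON-RPC message: %w", err)
    }
    // The raw ID is recorded as-is; quotes around string IDs could be trimmed.
    return req.Method, string(req.ID), nil
}

// addMCPAttributes records MCP-specific attributes on the active span.
func addMCPAttributes(span trace.Span, method string, serverName string) {
    span.SetAttributes(
        attribute.String("mcp.method", method),
        attribute.String("mcp.server.name", serverName),
        attribute.String("rpc.system", "jsonrpc"),
    )
}
```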

### 4. Configuration Integration

Add CLI flags to existing commands:

```bash
--otel-enabled        # Enable OpenTelemetry
--otel-endpoint       # OTLP endpoint URL
--otel-service-name   # Service name (default: toolhive-mcp-proxy)
--otel-sampling-rate  # Trace sampling rate (0.0-1.0)
--otel-headers        # Authentication headers
--otel-insecure       # Disable TLS verification
```

Environment variable support:
```bash
TOOLHIVE_OTEL_ENABLED=true
TOOLHIVE_OTEL_ENDPOINT=https://api.honeycomb.io
TOOLHIVE_OTEL_HEADERS="x-honeycomb-team=your-api-key"
```
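
A rough sketch of how these settings could populate the `Config` struct, assuming environment variables provide defaults that the CLI flags override; the helper and precedence rules are illustrative:

```go
package telemetry

import (
    "os"
    "strconv"
    "strings"
)

// configFromEnv maps the TOOLHIVE_OTEL_* variables onto Config.
// CLI flag values would be applied on top of the returned struct.
func configFromEnv() Config {
    cfg := Config{
        ServiceName:  "toolhive-mcp-proxy",
        SamplingRate: 0.1, // default 10% sampling, per the security section
    }
    cfg.Enabled, _ = strconv.ParseBool(os.Getenv("TOOLHIVE_OTEL_ENABLED"))
    cfg.Endpoint = os.Getenv("TOOLHIVE_OTEL_ENDPOINT")
    if v := os.Getenv("TOOLHIVE_OTEL_HEADERS"); v != "" {
        // Single "key=value" pair for simplicity; a full parser would also
        // handle comma-separated lists.
        if k, val, ok := strings.Cut(v, "="); ok {
            cfg.Headers = map[string]string{k: val}
        }
    }
    return cfg
}
```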

### 5. Integration Points

**Run Command Integration:**
The `thv run` command creates a telemetry provider when OTEL is enabled and adds its middleware to the chain alongside the authentication middleware.

**Proxy Command Integration:**
The standalone `thv proxy` command receives the same integration for proxy-only deployments.

**Transport Integration:**
Both SSE and stdio transports automatically inherit telemetry through the middleware system.
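
A sketch of the `thv run` wiring described above; the surrounding function, package name, and ToolHive import paths are illustrative, and only the `telemetry` API matches the design in this document:

```go
package runner

import (
    "context"

    "github.com/stacklok/toolhive/pkg/telemetry"       // assumed import path
    "github.com/stacklok/toolhive/pkg/transport/types" // assumed import path
)

// buildMiddlewares assembles the middleware chain for a proxied MCP server
// and returns a shutdown function so telemetry can be flushed on exit.
func buildMiddlewares(ctx context.Context, cfg telemetry.Config) ([]types.Middleware, func(context.Context) error, error) {
    // Existing middlewares (e.g. authentication) would already be in this slice.
    middlewares := []types.Middleware{}
    noop := func(context.Context) error { return nil }

    if !cfg.Enabled {
        return middlewares, noop, nil
    }

    provider, err := telemetry.NewProvider(ctx, cfg)
    if err != nil {
        return nil, nil, err
    }
    middlewares = append(middlewares, provider.Middleware())
    return middlewares, provider.Shutdown, nil
}
```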

## Data Model

### Trace Attributes

**HTTP Layer:**
```
service.name: toolhive-mcp-proxy
service.version: 1.0.0
http.method: POST
http.url: http://localhost:8080/sse
http.status_code: 200
```

**MCP Layer:**
```
mcp.server.name: github
mcp.server.image: ghcr.io/example/github-mcp:latest
mcp.transport: sse
mcp.method: tools/list
mcp.request.id: 123
rpc.system: jsonrpc
container.id: abc123def456
```

### Metrics

```
# Request count
toolhive_mcp_requests_total{method="POST",status_code="200",mcp_method="tools/list",server="github"}

# Request duration
toolhive_mcp_request_duration_seconds{method="POST",mcp_method="tools/list",server="github"}

# Active connections
toolhive_mcp_active_connections{server="github",transport="sse"}
```
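
For reference, a sketch of the instruments behind these metric names, using the OpenTelemetry metrics API; the `metrics` holder type and the descriptions are illustrative:

```go
package telemetry

import "go.opentelemetry.io/otel/metric"

// metrics bundles the instruments used by the middleware.
type metrics struct {
    requests    metric.Int64Counter
    duration    metric.Float64Histogram
    activeConns metric.Int64UpDownCounter
}

func newMetrics(meter metric.Meter) (*metrics, error) {
    requests, err := meter.Int64Counter("toolhive_mcp_requests_total",
        metric.WithDescription("Total MCP requests handled by the proxy"))
    if err != nil {
        return nil, err
    }
    duration, err := meter.Float64Histogram("toolhive_mcp_request_duration_seconds",
        metric.WithDescription("MCP request duration"),
        metric.WithUnit("s"))
    if err != nil {
        return nil, err
    }
    activeConns, err := meter.Int64UpDownCounter("toolhive_mcp_active_connections",
        metric.WithDescription("Currently active MCP connections"))
    if err != nil {
        return nil, err
    }
    return &metrics{requests: requests, duration: duration, activeConns: activeConns}, nil
}
```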

## Implementation Plan

### Phase 1: Core Infrastructure
- Create the `pkg/telemetry` package
- Implement basic HTTP middleware
- Add CLI flags and configuration
- Integrate with the `run` and `proxy` commands

### Phase 2: MCP Protocol Support
- JSON-RPC message parsing
- MCP-specific span attributes
- Enhanced metrics with MCP context

### Phase 3: Production Readiness
- Performance optimization
- Error handling and graceful degradation
- Documentation and examples

### Phase 4: Advanced Features
- Custom dashboards and alerts
- Sampling strategies
- Advanced correlation features

## Security Considerations

- **Data Sanitization**: Exclude sensitive headers and request bodies from traces
- **Sampling**: Default to 10% sampling to control costs and overhead
- **Authentication**: Support standard OTLP authentication headers
- **Graceful Degradation**: Continue normal operation if the telemetry endpoint is unavailable
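
As an illustration of the data-sanitization point, span attributes could be limited to an explicit header allowlist; the allowlist contents and attribute key prefix are assumptions:

```go
package telemetry

import (
    "net/http"

    "go.opentelemetry.io/otel/attribute"
)

// allowedHeaders is the set of request headers considered safe to record.
// Everything else (Authorization, cookies, API keys) is never attached to spans.
var allowedHeaders = map[string]bool{
    "Content-Type":   true,
    "Content-Length": true,
    "User-Agent":     true,
}

// sanitizedHeaderAttributes converts only allowlisted headers into span attributes.
func sanitizedHeaderAttributes(h http.Header) []attribute.KeyValue {
    attrs := make([]attribute.KeyValue, 0, len(allowedHeaders))
    for name := range allowedHeaders {
        if v := h.Get(name); v != "" {
            attrs = append(attrs, attribute.String("http.request.header."+name, v))
        }
    }
    return attrs
}
```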

## Implementation Benefits

1. **Non-Intrusive**: Leverages existing middleware system without architectural changes
2. **Comprehensive Coverage**: Captures all MCP traffic through proxy instrumentation
3. **Flexible Configuration**: Supports various OTEL backends
4. **Production Ready**: Includes sampling, authentication, and graceful degradation
5. **MCP-Aware**: Provides protocol-specific insights beyond generic HTTP metrics

## Success Metrics

- Zero performance regression in proxy throughput
- Complete trace coverage for all MCP interactions
- Successful integration with major OTEL backends
- Positive feedback from operators on debugging capabilities

## Alternatives Considered

1. **Container-level instrumentation**: Rejected due to complexity and MCP server diversity
2. **Custom telemetry format**: Rejected in favor of OTEL standards
3. **Sidecar approach**: Rejected due to deployment complexity

The middleware-based approach is elegant because it leverages existing infrastructure while providing comprehensive observability. OpenTelemetry packages are already available as indirect dependencies, so no major dependency changes are required.

## Prometheus Integration

Prometheus can be integrated with this OpenTelemetry design through multiple pathways:

### 1. OTEL Collector → Prometheus (Recommended)

The most robust approach uses the OpenTelemetry Collector as an intermediary:

```
ToolHive Proxy → OTEL Collector → Prometheus
```

**How it works:**
- ToolHive sends metrics via OTLP to an OTEL Collector
- The collector exports metrics to Prometheus using the `prometheusexporter`
- Prometheus scrapes the collector's `/metrics` endpoint

**Benefits:**
- Centralized metric processing and transformation
- Can aggregate metrics from multiple ToolHive instances
- Supports metric filtering, renaming, and enrichment
- Provides a buffer if Prometheus is temporarily unavailable

### 2. Direct Prometheus Exporter

ToolHive could expose metrics directly via a Prometheus endpoint:

```
ToolHive Proxy → Prometheus (direct scrape)
```

**Implementation:**
- Add a Prometheus exporter alongside the OTLP exporter
- Expose a `/metrics` endpoint on each proxy instance
- Configure Prometheus to scrape ToolHive instances directly

**Configuration Addition:**
```bash
--prometheus-enabled        # Enable Prometheus metrics endpoint
--prometheus-port 9090      # Port for /metrics endpoint
--prometheus-path /metrics  # Metrics endpoint path
```
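
A sketch of what the direct-export path could look like in Go, using the OpenTelemetry Prometheus exporter and the standard `promhttp` handler; the function name and the address/path wiring that would come from the flags above are illustrative:

```go
package telemetry

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
    otelprom "go.opentelemetry.io/otel/exporters/prometheus"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// servePrometheus exposes OTel metrics on a Prometheus scrape endpoint,
// e.g. addr ":9090" and path "/metrics" from the proposed flags.
func servePrometheus(addr, path string) (*sdkmetric.MeterProvider, error) {
    // The Prometheus exporter doubles as a metric.Reader for the SDK.
    exporter, err := otelprom.New()
    if err != nil {
        return nil, err
    }
    provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))

    mux := http.NewServeMux()
    mux.Handle(path, promhttp.Handler())
    go func() {
        // Listener errors are ignored in this sketch.
        _ = http.ListenAndServe(addr, mux)
    }()
    return provider, nil
}
```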

### 3. Prometheus Metric Examples

Metrics would follow Prometheus conventions:

```prometheus
# Counter metrics
toolhive_mcp_requests_total{server="github",method="tools_list",status="success",transport="sse"} 42

# Histogram metrics
toolhive_mcp_request_duration_seconds_bucket{server="github",method="tools_list",le="0.1"} 10
toolhive_mcp_request_duration_seconds_sum{server="github",method="tools_list"} 12.5
toolhive_mcp_request_duration_seconds_count{server="github",method="tools_list"} 42

# Gauge metrics
toolhive_mcp_active_connections{server="github",transport="sse"} 5
```

### 4. Example: tools/call Trace and Metrics

Here's how a `tools/call` MCP method would appear in traces and metrics:

**Distributed Trace:**
```
Span: mcp.proxy.request
├── service.name: toolhive-mcp-proxy
├── service.version: 1.0.0
├── http.method: POST
├── http.url: http://localhost:8080/messages?session_id=abc123
├── http.status_code: 200
├── mcp.server.name: github
├── mcp.server.image: ghcr.io/example/github-mcp:latest
├── mcp.transport: sse
├── container.id: container_abc123
└── Child Span: mcp.tools/call
    ├── mcp.method: tools/call
    ├── mcp.request.id: req_456
    ├── rpc.system: jsonrpc
    ├── rpc.service: mcp
    ├── mcp.tool.name: create_issue
    ├── mcp.tool.arguments: {"title":"Bug report","body":"..."}
    ├── span.kind: client
    └── duration: 1.2s
```

**Prometheus Metrics:**
```prometheus
# Request count for tools/call
toolhive_mcp_requests_total{server="github",method="tools_call",status="success",transport="sse"} 15

# Duration histogram for tools/call
toolhive_mcp_request_duration_seconds_bucket{server="github",method="tools_call",le="0.5"} 2
toolhive_mcp_request_duration_seconds_bucket{server="github",method="tools_call",le="1.0"} 7
toolhive_mcp_request_duration_seconds_bucket{server="github",method="tools_call",le="2.0"} 15
toolhive_mcp_request_duration_seconds_sum{server="github",method="tools_call"} 18.5
toolhive_mcp_request_duration_seconds_count{server="github",method="tools_call"} 15

# Tool-specific metrics
toolhive_mcp_tool_calls_total{server="github",tool="create_issue",status="success"} 8
toolhive_mcp_tool_calls_total{server="github",tool="create_issue",status="error"} 2
```

**JSON-RPC Request Body (parsed for instrumentation):**
```json
{
  "jsonrpc": "2.0",
  "id": "req_456",
  "method": "tools/call",
  "params": {
    "name": "create_issue",
    "arguments": {
      "title": "Bug report",
      "body": "Found an issue with the API"
    }
  }
}
```

This provides rich observability showing:
- HTTP-level metrics (request count, duration, status)
- MCP protocol details (method, tool name, request ID)
- Tool-specific usage patterns
- Error rates per tool
- Performance characteristics of different tools

### 5. Recommended Approach

Start with **OTEL Collector integration** because:
- Future-proof: Works with any observability backend
- Scalable: Centralized metric processing
- Flexible: Can add Prometheus direct export later if needed
- Standard: Follows OpenTelemetry best practices

The implementation would add a configuration option:

```bash
--otel-exporter otlp        # Send to OTEL Collector (default)
--otel-exporter prometheus  # Direct Prometheus export
--otel-exporter both        # Both OTLP and Prometheus
```