Commit 880adcd

feat: Add OpenTelemetry integration design proposal

This proposal outlines a comprehensive approach to add OpenTelemetry observability to ToolHive's MCP server proxies through middleware-based instrumentation. Key features:

- Leverages existing middleware system for clean integration
- Supports both SSE and stdio transport modes
- Provides traces, metrics, and structured logging
- Includes MCP-specific instrumentation beyond HTTP metrics
- Supports multiple OTEL backends and Prometheus integration
- Maintains backward compatibility with zero performance regression

The design includes detailed examples of traces and metrics for tools/call operations, showing rich observability into MCP protocol interactions.

Related-to: #474

Signed-off-by: Juan Antonio Osorio <[email protected]>

# OpenTelemetry Integration Design for ToolHive MCP Server Proxies

## Problem Statement

ToolHive currently lacks observability into MCP server interactions, making it difficult to:

- Debug MCP protocol issues
- Monitor performance and reliability
- Track usage patterns and errors
- Correlate issues across the proxy-container boundary

## Goals

- Add comprehensive OpenTelemetry instrumentation to MCP server proxies
- Provide traces, metrics, and structured logging for all MCP interactions
- Maintain backward compatibility and minimal performance impact
- Support standard OTEL backends (Jaeger, Honeycomb, DataDog, etc.)

## Non-Goals

- Instrumenting MCP servers themselves (only the proxy layer)
- Custom telemetry formats or proprietary backends
- Breaking changes to existing APIs

## Architecture Overview

ToolHive uses HTTP proxies to front MCP servers running in containers:

```
Client → HTTP Proxy → Container (MCP Server)
              ↑
       OTEL Middleware
```

Two transport modes exist:

1. **SSE Transport**: `TransparentProxy` forwards HTTP directly to containers
2. **Stdio Transport**: `HTTPSSEProxy` bridges HTTP/SSE to container stdio

Both use the `types.Middleware` interface, providing a clean integration point.

## Detailed Design

### 1. Telemetry Provider (`pkg/telemetry`)

Create a new telemetry package that provides:

```go
type Config struct {
    Enabled        bool
    Endpoint       string
    ServiceName    string
    ServiceVersion string
    SamplingRate   float64
    Headers        map[string]string
    Insecure       bool
}

type Provider struct {
    tracerProvider trace.TracerProvider
    meterProvider  metric.MeterProvider
}

func NewProvider(ctx context.Context, config Config) (*Provider, error)
func (p *Provider) Middleware() types.Middleware
func (p *Provider) Shutdown(ctx context.Context) error
```

The provider initializes OpenTelemetry with proper resource attribution and configures exporters for OTLP endpoints. It handles graceful shutdown and provides HTTP middleware for instrumentation.
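
For illustration, the tracer-provider half of that initialization could look roughly like this, assuming the OTLP/HTTP exporter; the metric side would follow the same pattern. This is a sketch, not final API:

```go
package telemetry

import (
    "context"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// newTracerProvider builds a TracerProvider that exports spans via OTLP/HTTP,
// using the Config fields defined above.
func newTracerProvider(ctx context.Context, cfg Config) (*sdktrace.TracerProvider, error) {
    opts := []otlptracehttp.Option{otlptracehttp.WithEndpoint(cfg.Endpoint)}
    if cfg.Insecure {
        opts = append(opts, otlptracehttp.WithInsecure())
    }
    if len(cfg.Headers) > 0 {
        opts = append(opts, otlptracehttp.WithHeaders(cfg.Headers))
    }
    exporter, err := otlptracehttp.New(ctx, opts...)
    if err != nil {
        return nil, err
    }

    // Resource attribution: identifies this proxy in every span it emits.
    res := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName(cfg.ServiceName),
        semconv.ServiceVersion(cfg.ServiceVersion),
    )

    return sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(cfg.SamplingRate))),
    ), nil
}
```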

### 2. HTTP Middleware Implementation

The middleware wraps HTTP handlers to provide comprehensive instrumentation:

**Request Processing:**
- Extract HTTP metadata (method, URL, headers)
- Start trace spans with semantic conventions
- Parse request bodies for MCP protocol information
- Record request metrics and active connections

**Response Processing:**
- Capture response metadata (status, size, duration)
- Record completion metrics
- Finalize spans with response attributes

**Error Handling:**
The middleware never fails the underlying request, even if telemetry operations encounter errors. All operations use timeouts and circuit breakers.
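
A condensed sketch of that flow, assuming `types.Middleware` is (or wraps) a standard `func(http.Handler) http.Handler`; instrument and attribute names are illustrative:

```go
package telemetry

import (
    "net/http"
    "time"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
    "go.opentelemetry.io/otel/trace"
)

// statusRecorder captures the status code written by the downstream handler.
// A real implementation must also pass through http.Flusher so SSE streaming
// keeps working through the wrapper.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

// middleware wraps an http.Handler with tracing and metrics. Telemetry
// failures never fail the proxied request.
func (p *Provider) middleware(next http.Handler) http.Handler {
    tracer := p.tracerProvider.Tracer("toolhive/telemetry")
    meter := p.meterProvider.Meter("toolhive/telemetry")
    duration, _ := meter.Float64Histogram("toolhive_mcp_request_duration_seconds") // error ignored for brevity

    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx, span := tracer.Start(r.Context(), "mcp.proxy.request",
            trace.WithSpanKind(trace.SpanKindServer),
            trace.WithAttributes(
                semconv.HTTPMethod(r.Method),
                semconv.HTTPURL(r.URL.String()),
            ),
        )
        defer span.End()

        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        start := time.Now()
        next.ServeHTTP(rec, r.WithContext(ctx))

        span.SetAttributes(semconv.HTTPStatusCode(rec.status))
        duration.Record(ctx, time.Since(start).Seconds(),
            metric.WithAttributes(attribute.String("method", r.Method)))
    })
}
```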

### 3. MCP Protocol Instrumentation

Enhanced instrumentation for JSON-RPC calls:

```go
func extractMCPMethod(body []byte) (method, id string, err error)
func addMCPAttributes(span trace.Span, method string, serverName string)
```

This extracts MCP-specific information like method names (`tools/list`, `resources/read`), request IDs, and error codes to provide protocol-level observability beyond HTTP metrics.
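
A sketch of that parsing, based on the JSON-RPC request body shown later in this document; the envelope struct and attribute names are illustrative, not final API:

```go
package telemetry

import (
    "encoding/json"
    "fmt"
    "strings"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// jsonRPCEnvelope is the subset of a JSON-RPC 2.0 request we care about.
// For tools/call, params.name carries the tool name (exported as mcp.tool.name).
type jsonRPCEnvelope struct {
    JSONRPC string          `json:"jsonrpc"`
    ID      json.RawMessage `json:"id"`
    Method  string          `json:"method"`
    Params  struct {
        Name string `json:"name"`
    } `json:"params"`
}

// extractMCPMethod pulls the MCP method and request ID out of a request body.
func extractMCPMethod(body []byte) (method, id string, err error) {
    var env jsonRPCEnvelope
    if err := json.Unmarshal(body, &env); err != nil {
        return "", "", fmt.Errorf("not a JSON-RPC message: %w", err)
    }
    // IDs may be strings or numbers; normalize to a string representation.
    return env.Method, strings.Trim(string(env.ID), `"`), nil
}

// addMCPAttributes annotates a span with protocol-level context.
func addMCPAttributes(span trace.Span, method, serverName string) {
    span.SetAttributes(
        attribute.String("rpc.system", "jsonrpc"),
        attribute.String("mcp.method", method),
        attribute.String("mcp.server.name", serverName),
    )
}
```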

### 4. Configuration Integration

Add CLI flags to existing commands:

```bash
--otel-enabled          # Enable OpenTelemetry
--otel-endpoint         # OTLP endpoint URL
--otel-service-name     # Service name (default: toolhive-mcp-proxy)
--otel-sampling-rate    # Trace sampling rate (0.0-1.0)
--otel-headers          # Authentication headers
--otel-insecure         # Disable TLS verification
```

Environment variable support:

```bash
TOOLHIVE_OTEL_ENABLED=true
TOOLHIVE_OTEL_ENDPOINT=https://api.honeycomb.io
TOOLHIVE_OTEL_HEADERS="x-honeycomb-team=your-api-key"
```
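
For illustration, the flags could be registered roughly as follows, assuming cobra/pflag-style command definitions; treating environment variables as flag defaults is one possible approach, not a settled design:

```go
package cmd

import (
    "os"

    "github.com/spf13/cobra"
)

var (
    otelEnabled      bool
    otelEndpoint     string
    otelServiceName  string
    otelSamplingRate float64
    otelHeaders      map[string]string
    otelInsecure     bool
)

// addOTELFlags registers the proposed flags on a command. Environment
// variables provide defaults; explicit flags win.
func addOTELFlags(cmd *cobra.Command) {
    flags := cmd.Flags()
    flags.BoolVar(&otelEnabled, "otel-enabled",
        os.Getenv("TOOLHIVE_OTEL_ENABLED") == "true", "Enable OpenTelemetry")
    flags.StringVar(&otelEndpoint, "otel-endpoint",
        os.Getenv("TOOLHIVE_OTEL_ENDPOINT"), "OTLP endpoint URL")
    flags.StringVar(&otelServiceName, "otel-service-name",
        "toolhive-mcp-proxy", "Service name")
    flags.Float64Var(&otelSamplingRate, "otel-sampling-rate",
        0.1, "Trace sampling rate (0.0-1.0)")
    flags.StringToStringVar(&otelHeaders, "otel-headers",
        nil, "Authentication headers (key=value)")
    flags.BoolVar(&otelInsecure, "otel-insecure",
        false, "Disable TLS verification")
}
```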

### 5. Integration Points

**Run Command Integration:**
The `thv run` command creates telemetry providers when OTEL is enabled and adds the middleware to the chain alongside authentication middleware.

**Proxy Command Integration:**
The standalone `thv proxy` command receives similar integration for proxy-only deployments.

**Transport Integration:**
Both SSE and stdio transports automatically inherit telemetry through the middleware system.
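
To illustrate the wiring described above, a hypothetical helper the run path could use; imports are omitted because `telemetry` and `types` are the packages proposed in this document, and the helper name and middleware ordering are placeholders:

```go
// Hypothetical helper invoked from `thv run` and `thv proxy`; the surrounding
// middleware-chain plumbing is simplified for illustration.
func withTelemetry(ctx context.Context, cfg telemetry.Config,
    middlewares []types.Middleware) ([]types.Middleware, func(context.Context) error, error) {

    if !cfg.Enabled {
        return middlewares, func(context.Context) error { return nil }, nil
    }
    provider, err := telemetry.NewProvider(ctx, cfg)
    if err != nil {
        return nil, nil, err
    }
    // Ordering relative to the auth middleware is a design choice; placing
    // telemetry first means spans also cover authentication time.
    middlewares = append([]types.Middleware{provider.Middleware()}, middlewares...)
    return middlewares, provider.Shutdown, nil
}
```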

## Data Model

### Trace Attributes

**HTTP Layer:**
```
service.name: toolhive-mcp-proxy
service.version: 1.0.0
http.method: POST
http.url: http://localhost:8080/sse
http.status_code: 200
```

**MCP Layer:**
```
mcp.server.name: github
mcp.server.image: ghcr.io/example/github-mcp:latest
mcp.transport: sse
mcp.method: tools/list
mcp.request.id: 123
rpc.system: jsonrpc
container.id: abc123def456
```

### Metrics

```
# Request count
toolhive_mcp_requests_total{method="POST",status_code="200",mcp_method="tools/list",server="github"}

# Request duration
toolhive_mcp_request_duration_seconds{method="POST",mcp_method="tools/list",server="github"}

# Active connections
toolhive_mcp_active_connections{server="github",transport="sse"}
```
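
These instruments would be created once per provider with the OTel metric API, roughly as follows; labels such as `server`, `mcp_method`, and `transport` are attached at record time. The instrument names match the metrics above, but this is a sketch:

```go
package telemetry

import (
    "go.opentelemetry.io/otel/metric"
)

// instruments holds the proxy's metric instruments, created once per Provider.
type instruments struct {
    requestCount      metric.Int64Counter
    requestDuration   metric.Float64Histogram
    activeConnections metric.Int64UpDownCounter
}

func newInstruments(meter metric.Meter) (*instruments, error) {
    requests, err := meter.Int64Counter("toolhive_mcp_requests_total",
        metric.WithDescription("Total number of MCP requests handled by the proxy"))
    if err != nil {
        return nil, err
    }
    duration, err := meter.Float64Histogram("toolhive_mcp_request_duration_seconds",
        metric.WithDescription("MCP request duration in seconds"),
        metric.WithUnit("s"))
    if err != nil {
        return nil, err
    }
    active, err := meter.Int64UpDownCounter("toolhive_mcp_active_connections",
        metric.WithDescription("Currently active client connections"))
    if err != nil {
        return nil, err
    }
    return &instruments{
        requestCount:      requests,
        requestDuration:   duration,
        activeConnections: active,
    }, nil
}
```

The middleware would then increment the counter with `requestCount.Add(ctx, 1, metric.WithAttributes(...))` and record durations into the histogram at the end of each request.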

## Implementation Plan

### Phase 1: Core Infrastructure
- Create the `pkg/telemetry` package
- Implement basic HTTP middleware
- Add CLI flags and configuration
- Integrate with the `run` and `proxy` commands

### Phase 2: MCP Protocol Support
- JSON-RPC message parsing
- MCP-specific span attributes
- Enhanced metrics with MCP context

### Phase 3: Production Readiness
- Performance optimization
- Error handling and graceful degradation
- Documentation and examples

### Phase 4: Advanced Features
- Custom dashboards and alerts
- Sampling strategies
- Advanced correlation features

## Security Considerations

- **Data Sanitization**: Exclude sensitive headers and request bodies from traces (see the sketch after this list)
- **Sampling**: Default to 10% sampling to control costs and overhead
- **Authentication**: Support standard OTLP authentication headers
- **Graceful Degradation**: Continue normal operation if the telemetry endpoint is unavailable
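
For the data-sanitization point, a sketch of the kind of header filtering the middleware could apply before attaching attributes to spans; the deny list is illustrative:

```go
package telemetry

import (
    "net/http"
    "strings"

    "go.opentelemetry.io/otel/attribute"
)

// sensitiveHeaders are never recorded on spans.
var sensitiveHeaders = map[string]bool{
    "authorization":    true,
    "cookie":           true,
    "x-api-key":        true,
    "x-honeycomb-team": true,
}

// sanitizedHeaderAttributes converts safe request headers into span attributes.
func sanitizedHeaderAttributes(h http.Header) []attribute.KeyValue {
    attrs := make([]attribute.KeyValue, 0, len(h))
    for name, values := range h {
        if sensitiveHeaders[strings.ToLower(name)] {
            continue // redact credentials and secrets
        }
        attrs = append(attrs, attribute.String(
            "http.request.header."+strings.ToLower(name),
            strings.Join(values, ",")))
    }
    return attrs
}
```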

## Implementation Benefits

1. **Non-Intrusive**: Leverages the existing middleware system without architectural changes
2. **Comprehensive Coverage**: Captures all MCP traffic through proxy instrumentation
3. **Flexible Configuration**: Supports various OTEL backends
4. **Production Ready**: Includes sampling, authentication, and graceful degradation
5. **MCP-Aware**: Provides protocol-specific insights beyond generic HTTP metrics

## Success Metrics

- Zero performance regression in proxy throughput
- Complete trace coverage for all MCP interactions
- Successful integration with major OTEL backends
- Positive feedback from operators on debugging capabilities

## Alternatives Considered

1. **Container-level instrumentation**: Rejected due to complexity and the diversity of MCP servers
2. **Custom telemetry format**: Rejected in favor of OTEL standards
3. **Sidecar approach**: Rejected due to deployment complexity

The middleware-based approach leverages existing infrastructure while providing comprehensive observability. The OpenTelemetry packages are already available as indirect dependencies, so no major dependency changes are required.

## Prometheus Integration

Prometheus can be integrated with this OpenTelemetry design through multiple pathways:

### 1. OTEL Collector → Prometheus (Recommended)

The most robust approach uses the OpenTelemetry Collector as an intermediary:

```
ToolHive Proxy → OTEL Collector → Prometheus
```

**How it works:**
- ToolHive sends metrics via OTLP to an OTEL Collector
- The collector exports metrics to Prometheus using the `prometheusexporter`
- Prometheus scrapes the collector's `/metrics` endpoint

**Benefits:**
- Centralized metric processing and transformation
- Can aggregate metrics from multiple ToolHive instances
- Supports metric filtering, renaming, and enrichment
- Provides a buffer if Prometheus is temporarily unavailable

### 2. Direct Prometheus Exporter

ToolHive could expose metrics directly via a Prometheus endpoint:

```
ToolHive Proxy → Prometheus (direct scrape)
```

**Implementation:**
- Add a Prometheus exporter alongside the OTLP exporter
- Expose a `/metrics` endpoint on each proxy instance
- Configure Prometheus to scrape ToolHive instances directly

**Configuration Addition:**
```bash
--prometheus-enabled          # Enable Prometheus metrics endpoint
--prometheus-port 9090        # Port for the /metrics endpoint
--prometheus-path /metrics    # Metrics endpoint path
```
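
A sketch of this direct-export option, using the OpenTelemetry Prometheus exporter with a dedicated registry and listener; the port and path would come from the flags above:

```go
package telemetry

import (
    "fmt"
    "net/http"

    promclient "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    otelprom "go.opentelemetry.io/otel/exporters/prometheus"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// startPrometheusEndpoint wires OTel metrics to a Prometheus /metrics endpoint.
func startPrometheusEndpoint(port int, path string) (*sdkmetric.MeterProvider, error) {
    registry := promclient.NewRegistry()
    exporter, err := otelprom.New(otelprom.WithRegisterer(registry))
    if err != nil {
        return nil, err
    }
    provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))

    mux := http.NewServeMux()
    mux.Handle(path, promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
    go func() {
        // Errors here should be logged; the proxy itself keeps running.
        _ = http.ListenAndServe(fmt.Sprintf(":%d", port), mux)
    }()
    return provider, nil
}
```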

### 3. Prometheus Metric Examples

Metrics would follow Prometheus conventions:

```prometheus
# Counter metrics
toolhive_mcp_requests_total{server="github",method="tools_list",status="success",transport="sse"} 42

# Histogram metrics
toolhive_mcp_request_duration_seconds_bucket{server="github",method="tools_list",le="0.1"} 10
toolhive_mcp_request_duration_seconds_sum{server="github",method="tools_list"} 12.5
toolhive_mcp_request_duration_seconds_count{server="github",method="tools_list"} 42

# Gauge metrics
toolhive_mcp_active_connections{server="github",transport="sse"} 5
```

### 4. Example: tools/call Trace and Metrics

Here's how a `tools/call` MCP method would appear in traces and metrics:

**Distributed Trace:**
```
Span: mcp.proxy.request
├── service.name: toolhive-mcp-proxy
├── service.version: 1.0.0
├── http.method: POST
├── http.url: http://localhost:8080/messages?session_id=abc123
├── http.status_code: 200
├── mcp.server.name: github
├── mcp.server.image: ghcr.io/example/github-mcp:latest
├── mcp.transport: sse
├── container.id: container_abc123
└── Child Span: mcp.tools/call
    ├── mcp.method: tools/call
    ├── mcp.request.id: req_456
    ├── rpc.system: jsonrpc
    ├── rpc.service: mcp
    ├── mcp.tool.name: create_issue
    ├── mcp.tool.arguments: {"title":"Bug report","body":"..."}
    ├── span.kind: client
    └── duration: 1.2s
```

**Prometheus Metrics:**
```prometheus
# Request count for tools/call
toolhive_mcp_requests_total{server="github",method="tools_call",status="success",transport="sse"} 15

# Duration histogram for tools/call
toolhive_mcp_request_duration_seconds_bucket{server="github",method="tools_call",le="0.5"} 8
toolhive_mcp_request_duration_seconds_bucket{server="github",method="tools_call",le="1.0"} 12
toolhive_mcp_request_duration_seconds_bucket{server="github",method="tools_call",le="2.0"} 15
toolhive_mcp_request_duration_seconds_sum{server="github",method="tools_call"} 18.5
toolhive_mcp_request_duration_seconds_count{server="github",method="tools_call"} 15

# Tool-specific metrics
toolhive_mcp_tool_calls_total{server="github",tool="create_issue",status="success"} 8
toolhive_mcp_tool_calls_total{server="github",tool="create_issue",status="error"} 2
```

**JSON-RPC Request Body (parsed for instrumentation):**
```json
{
  "jsonrpc": "2.0",
  "id": "req_456",
  "method": "tools/call",
  "params": {
    "name": "create_issue",
    "arguments": {
      "title": "Bug report",
      "body": "Found an issue with the API"
    }
  }
}
```

This provides rich observability showing:

- HTTP-level metrics (request count, duration, status)
- MCP protocol details (method, tool name, request ID)
- Tool-specific usage patterns
- Error rates per tool
- Performance characteristics of different tools

### 5. Recommended Approach

Start with **OTEL Collector integration** because:

- Future-proof: Works with any observability backend
- Scalable: Centralized metric processing
- Flexible: Can add direct Prometheus export later if needed
- Standard: Follows OpenTelemetry best practices

The implementation would add a configuration option:

```bash
--otel-exporter otlp          # Send to OTEL Collector (default)
--otel-exporter prometheus    # Direct Prometheus export
--otel-exporter both          # Both OTLP and Prometheus
```
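
A sketch of how that flag could select metric readers, assuming the OTLP/HTTP metric exporter; `both` simply attaches two readers to the same `MeterProvider`:

```go
package telemetry

import (
    "context"
    "fmt"

    "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
    otelprom "go.opentelemetry.io/otel/exporters/prometheus"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newMeterProvider builds a MeterProvider for the chosen exporter mode:
// "otlp", "prometheus", or "both".
func newMeterProvider(ctx context.Context, mode, endpoint string) (*sdkmetric.MeterProvider, error) {
    var readers []sdkmetric.Reader

    if mode == "otlp" || mode == "both" {
        exp, err := otlpmetrichttp.New(ctx, otlpmetrichttp.WithEndpoint(endpoint))
        if err != nil {
            return nil, err
        }
        readers = append(readers, sdkmetric.NewPeriodicReader(exp))
    }
    if mode == "prometheus" || mode == "both" {
        exp, err := otelprom.New() // registers with the default Prometheus registry
        if err != nil {
            return nil, err
        }
        readers = append(readers, exp)
    }
    if len(readers) == 0 {
        return nil, fmt.Errorf("unknown exporter mode %q", mode)
    }

    opts := make([]sdkmetric.Option, 0, len(readers))
    for _, r := range readers {
        opts = append(opts, sdkmetric.WithReader(r))
    }
    return sdkmetric.NewMeterProvider(opts...), nil
}
```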
