feat: Add OpenTelemetry integration design proposal #597

Open: wants to merge 1 commit into main

@JAORMX (Collaborator) commented Jun 3, 2025

This proposal outlines a comprehensive approach to add OpenTelemetry observability to ToolHive's MCP server proxies through middleware-based instrumentation.

Key Features

  • Leverages existing middleware system for clean integration
  • Supports both SSE and stdio transport modes
  • Provides traces, metrics, and structured logging
  • Includes MCP-specific instrumentation beyond HTTP metrics
  • Supports multiple OTEL backends and Prometheus integration
  • Maintains backward compatibility with zero performance regression

What's Included

  • Detailed technical design with implementation phases
  • Configuration options and CLI integration
  • Data model examples for traces and metrics
  • Concrete example of tools/call instrumentation
  • Prometheus integration pathways
  • Security considerations and production readiness

The design follows KISS and DRY principles by leveraging existing patterns and infrastructure, making it maintainable and extensible for future observability needs.
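
To make the middleware idea concrete, here is a minimal sketch, not the proposal's actual implementation. It uses only the Go standard library, so `spanRecord` is a hypothetical stand-in for an OpenTelemetry span, and `telemetryMiddleware`, the `mcp.<method>` span name, and the `mcp.method` attribute key are assumptions made for illustration. The other attribute keys follow the trace example discussed in this PR.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"time"
)

// spanRecord is a hypothetical stand-in for an OpenTelemetry span, used so
// this sketch needs only the standard library.
type spanRecord struct {
	Name       string
	Attributes map[string]string
	Duration   time.Duration
}

// statusRecorder wraps a ResponseWriter to capture the response status code.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// telemetryMiddleware (hypothetical name) records one spanRecord per request.
func telemetryMiddleware(serverName string, record func(spanRecord)) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			start := time.Now()

			// Peek at the JSON-RPC payload to find the MCP method, then
			// restore the body so the wrapped handler can still read it.
			body, _ := io.ReadAll(r.Body)
			r.Body = io.NopCloser(bytes.NewReader(body))
			var msg struct {
				Method string `json:"method"`
			}
			_ = json.Unmarshal(body, &msg) // non-JSON bodies leave Method empty

			rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
			next.ServeHTTP(rec, r)

			record(spanRecord{
				Name: "mcp." + msg.Method, // assumed naming scheme
				Attributes: map[string]string{
					"http.method":      r.Method,
					"http.url":         r.URL.String(),
					"http.status_code": strconv.Itoa(rec.status),
					"mcp.server.name":  serverName,
					"mcp.method":       msg.Method, // assumed attribute key
				},
				Duration: time.Since(start),
			})
		})
	}
}

// demoToolsCall sends one tools/call request through the middleware and
// returns the span it recorded. The tool name "create_issue" is made up.
func demoToolsCall() spanRecord {
	var got spanRecord
	proxy := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, `{"jsonrpc":"2.0","id":1,"result":{}}`)
	})
	handler := telemetryMiddleware("github", func(s spanRecord) { got = s })(proxy)

	payload := `{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"create_issue"}}`
	req := httptest.NewRequest(http.MethodPost,
		"http://localhost:8080/messages?session_id=abc123", strings.NewReader(payload))
	handler.ServeHTTP(httptest.NewRecorder(), req)
	return got
}

func main() {
	span := demoToolsCall()
	fmt.Println(span.Name, span.Attributes["http.status_code"], span.Attributes["mcp.server.name"])
}
```

Two caveats on this sketch: wrapping the `ResponseWriter` this way hides `http.Flusher`, which an SSE transport would need to keep exposed, and a real proxy would likely cap how much of the body it buffers.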

Related-to: #474

This proposal outlines a comprehensive approach to add OpenTelemetry
observability to ToolHive's MCP server proxies through middleware-based
instrumentation.

Key features:
- Leverages existing middleware system for clean integration
- Supports both SSE and stdio transport modes
- Provides traces, metrics, and structured logging
- Includes MCP-specific instrumentation beyond HTTP metrics
- Supports multiple OTEL backends and Prometheus integration
- Maintains backward compatibility with zero performance regression

The design includes detailed examples of traces and metrics for
tools/call operations, showing rich observability into MCP protocol
interactions.

Related-to: #474
Signed-off-by: Juan Antonio Osorio <[email protected]>
@JAORMX JAORMX force-pushed the feat/otel-integration-proposal branch from 6fabc17 to 2ef2ba2 Compare June 3, 2025 13:27
@JAORMX JAORMX requested review from yrobla, dmjb, lujunsan and amirejaz June 4, 2025 03:09
## Non-Goals

- Instrumenting MCP servers themselves (only the proxy layer)
- Custom telemetry formats or proprietary backends

@JAORMX @lujunsan While proprietary backends are not supported, a custom OTel Collector can accept OTLP traces and export them to a proprietary backend. I would like to confirm that this solution can be configured to send these traces to a custom OTel collector.

JAORMX (Collaborator, Author) replied:

Right! As long as they're OTLP traces and not something custom, it should be fine. A custom collector is fine; it was the format I was referring to.

Environment variable support:
```bash
TOOLHIVE_OTEL_ENABLED=true
TOOLHIVE_OTEL_ENDPOINT=https://api.honeycomb.io
```

@JAORMX @lujunsan Is this the variable we can set to send traces to a custom OTel Collector endpoint?

JAORMX (Collaborator, Author) replied:

Right! That's the idea. A command-line flag for this would be good too.
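
As a rough illustration of how the environment variables and a command-line flag could coexist, the sketch below reads the two variables quoted above as defaults and lets flags override them. The flag names `--otel-enabled` and `--otel-endpoint`, the `otelConfig` type, and the `loadOTelConfig` helper are hypothetical and not part of the proposal. The collector URL in the usage note is also made up.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
	"strconv"
)

// otelConfig is a hypothetical holder for the two settings discussed above.
type otelConfig struct {
	Enabled  bool
	Endpoint string
}

// loadOTelConfig takes defaults from the environment (through getenv) and
// lets command-line flags override them. The flag names are assumptions.
func loadOTelConfig(args []string, getenv func(string) string) (otelConfig, error) {
	// An unset or unparsable TOOLHIVE_OTEL_ENABLED is treated as false.
	enabled, _ := strconv.ParseBool(getenv("TOOLHIVE_OTEL_ENABLED"))
	cfg := otelConfig{Enabled: enabled, Endpoint: getenv("TOOLHIVE_OTEL_ENDPOINT")}

	fs := flag.NewFlagSet("otel-sketch", flag.ContinueOnError)
	fs.BoolVar(&cfg.Enabled, "otel-enabled", cfg.Enabled, "enable OpenTelemetry export")
	fs.StringVar(&cfg.Endpoint, "otel-endpoint", cfg.Endpoint, "OTLP endpoint, e.g. a custom collector")
	if err := fs.Parse(args); err != nil {
		return otelConfig{}, err
	}

	if cfg.Enabled && cfg.Endpoint == "" {
		return otelConfig{}, errors.New("telemetry is enabled but no OTLP endpoint is configured")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadOTelConfig(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("enabled=%v endpoint=%q\n", cfg.Enabled, cfg.Endpoint)
}
```

With this precedence, a team could keep `TOOLHIVE_OTEL_ENDPOINT` set globally and still point a single run at its own collector, for example with `--otel-endpoint http://otel-collector.internal:4318`.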


Here's how a `tools/call` MCP method would appear in traces and metrics:

**Distributed Trace:**

@JAORMX @lujunsan I assume this distributed trace isn't limited to the Prometheus integration; my understanding is that the OTel Collector can process the span in this structure regardless of the chosen exporter. Is that correct?

JAORMX (Collaborator, Author) replied:

It shouldn't be limited, no. This is correct.

├── http.method: POST
├── http.url: http://localhost:8080/messages?session_id=abc123
├── http.status_code: 200
├── mcp.server.name: github

@JAORMX @lujunsan Including user.hostname would be beneficial, as it would help us identify which servers are most frequently used by a specific team.

JAORMX (Collaborator, Author) replied:

Sure, that's a good point! I can make that configurable. I think we don't want to add that by default because of things like GDPR.
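
One way the opt-in could look, sketched under assumptions: the `attributeOptions` type, its `IncludeHostname` field, and the `spanAttributes` helper are hypothetical names, and the hostname lookup is passed in as a function so the behavior is easy to check. Only `mcp.server.name` and `user.hostname` come from this thread.

```go
package main

import (
	"fmt"
	"os"
)

// attributeOptions is a hypothetical settings struct. The zero value keeps
// the hostname out of telemetry, so it is off unless explicitly enabled.
type attributeOptions struct {
	IncludeHostname bool
}

// spanAttributes builds the attribute set for a span. user.hostname is only
// added when the operator opted in and the lookup succeeded.
func spanAttributes(serverName string, opts attributeOptions, hostname func() (string, error)) map[string]string {
	attrs := map[string]string{"mcp.server.name": serverName}
	if opts.IncludeHostname {
		if h, err := hostname(); err == nil && h != "" {
			attrs["user.hostname"] = h
		}
	}
	return attrs
}

func main() {
	// Default: no hostname attribute.
	fmt.Println(spanAttributes("github", attributeOptions{}, os.Hostname))
	// Opt-in: hostname attribute included when available.
	fmt.Println(spanAttributes("github", attributeOptions{IncludeHostname: true}, os.Hostname))
}
```

Making the zero value the privacy-preserving choice means a deployment that never touches the setting does not emit the hostname.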
