Skip to content

[codex] Document Grafana and OpenTelemetry on Control Plane#352

Open
justin808 wants to merge 20 commits into
mainfrom
docs/grafana-opentelemetry
Open

[codex] Document Grafana and OpenTelemetry on Control Plane#352
justin808 wants to merge 20 commits into
mainfrom
docs/grafana-opentelemetry

Conversation

@justin808

@justin808 justin808 commented Jun 2, 2026

Copy link
Copy Markdown
Member

Summary

Adds a generic Grafana/OpenTelemetry guide for Control Plane Flow.

The guide covers:

  • where OpenTelemetry complements Control Plane Grafana platform metrics
  • a recommended internal collector architecture
  • Rails app opt-in setup behind ENABLE_OPEN_TELEMETRY
  • Control Plane resource attributes to whitelist onto telemetry
  • collector workload/config shape
  • dashboard and alert starting points
  • rollout and validation checklists
  • safety notes for dashboards, alerts, production rollout, and secret handling

Also adds navigation links from README.md and docs/tips.md.

Safety

This is docs-only guidance. It does not apply templates, change Control Plane settings, save Grafana dashboards, edit alert rules, or touch any runtime environment.

The content is intentionally generic and avoids tenant-specific org names, app names, dashboard URLs, or secret values.

Validation

Ran locally:

script/check_cpln_links README.md docs/tips.md docs/grafana-opentelemetry.md

Result: exit 0.

git diff --check

Result: exit 0.

Also ran a local text scan for secret-looking terms and tenant-specific names. Matches were generic placeholders or existing docs language, not literal secret values or app/org-specific settings.

Notes

The independent auto-review helper was not run because local policy blocked exporting repository diff/code to an external AI review service. Since this is docs-only, I used local inspection plus the link and whitespace checks above.


Note

Low Risk
Documentation-only changes with no runtime, CLI, or infrastructure behavior changes.

Overview
Adds docs/grafana-opentelemetry.md, a generic playbook for app-level observability on Control Plane: how OpenTelemetry complements built-in Grafana platform metrics, an internal collector → Prometheus metrics → Grafana scrape flow, Rails opt-in behind ENABLE_OPEN_TELEMETRY, Control Plane resource attributes, collector workload/config patterns (including spanmetrics and cardinality pitfalls), plus rollout, validation, dashboard/alert starters, and safety notes.

README.md and docs/tips.md link to the new guide (Tips gets a short intro section and renumbered table of contents). docs/commands.md only updates the update-github-actions example release tag from v5.1.0 to v5.1.1.

Reviewed by Cursor Bugbot for commit 4596591. Bugbot is set up for automated code reviews on this repo. Configure here.

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@justin808, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 24 minutes and 39 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eee9f636-f543-4496-9d45-4bf2bdef286b

📥 Commits

Reviewing files that changed from the base of the PR and between 7cbc46c and 4596591.

📒 Files selected for processing (4)
  • README.md
  • docs/commands.md
  • docs/grafana-opentelemetry.md
  • docs/tips.md
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/grafana-opentelemetry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Code Review: docs/grafana-opentelemetry.md

This is a solid, well-structured observability guide with good safety guardrails and a sensible rollout order. The content is actionable and appropriately scoped. A few technical accuracy issues in the code examples are worth fixing before merge — they could cause silent failures or confusion for readers who copy the snippets.

Correctness issues

1. Gem list vs. config.use mismatch (high impact)
The Gemfile section includes opentelemetry-instrumentation-active_support, opentelemetry-instrumentation-faraday, and opentelemetry-instrumentation-http, but none of these three appear in the config.use block below. A reader who copies both sections verbatim installs three gems but never activates them. Either add the config.use calls or remove the gems from the Gemfile example.

2. Generic service_name default (medium impact)
ENV.fetch("OTEL_SERVICE_NAME", "rails") defaults every Rails service to the name "rails". In a multi-workload GVC, all services would share the same service.name, making dashboards unfilterable by service. The fallback should be something like ENV.fetch("CPLN_WORKLOAD_NAME", "rails-app") or at minimum the comment should call out that this default must be overridden per workload.

3. 0ms histogram bucket (low impact)
A 0ms lower boundary in an explicit histogram is semantically meaningless — all latency observations are ≥ 0ms by definition, so this bucket will always be empty. Some OTel Collector versions also reject zero-value duration boundaries. Start from 5ms or 10ms.

4. Placeholder script paths in Validation section (low impact)
spec/path/to/open_telemetry_specs.rb, ./script/check_open_telemetry_config, and ./script/validate_open_telemetry_config are generic placeholders that do not exist in this repository. Readers may run them and be confused by the command not found errors. A note that these are illustrative names to be replaced — or using # example comments — would help.

Minor suggestions

  • config.use_all: The initializer snippet activates each instrumentation with individual config.use calls. This works, but the SDK's config.use_all auto-loads every installed instrumentation gem and is the more idiomatic shorthand. Even a brief mention would help readers who are not yet familiar with the API.

  • OTEL_SERVICE_NAME env var overlap: OTEL_SERVICE_NAME is an official OpenTelemetry spec environment variable that the Ruby SDK picks up automatically. Wrapping it in ENV.fetch and passing it to config.service_name = works, but readers may not realize the SDK already handles this env var natively. A short note avoids confusion.

What's good

  • Architecture diagram is clear and shows the internal-only collector correctly.
  • The template-promotion rule ("wait until two apps need the same shape") is practical and guards against premature abstraction.
  • Dashboard and alert checklists are genuinely useful and not obvious.
  • The cardinality safety notes are important and well placed.
  • Rollout order is cautious: non-production first, then explicit production approval.

Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
@justin808 justin808 requested review from borela, dzirtusss and frullah June 2, 2026 03:46
justin808 added 2 commits June 3, 2026 16:00
…metry

* origin/main:
  Bump version to 5.1.1
  Update CHANGELOG.md for 5.1.1 (#362)
  Replace maintenance mode sleep with domain polling (#337)
  Handle completed runner jobs before replicas appear
  Warn when staging image lacks commit suffix
  Address production promotion review follow-ups
  Harden production promotion image copy and readiness
  Add Control Plane Flow logo and brand icon assets (#358)
  Bump version to 5.1.0
  Update CHANGELOG.md for 5.1.0 (#357)
  Regenerate docs/commands.md for cpflow 5.0.4 (#344)
  Add shared secret grants (#354)
  Harden production promotion image copy (#355)
  Fix production promotion environment secret access (#353)
  Clarify production environment secret setup (#351)
@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Code Review

Overview: Documentation-only PR adding a Grafana/OpenTelemetry guide for Control Plane Rails apps. The structure and safety notes are solid. A few technical points in the Ruby example and supporting content are worth addressing before merging.


Technical issues

1. OTLP exporter endpoint construction

The code builds the endpoint as "#{collector}/v1/traces" where collector defaults to http://localhost:4318. This works in recent gem versions, but it couples the custom OPEN_TELEMETRY_COLLECTOR_ADDRESS env var to a manually-appended path. The standard approach is to pass the base URL only and let the gem append /v1/traces automatically, or rely on the standard OTEL_EXPORTER_OTLP_ENDPOINT env var instead of a custom one. If a reader sets OPEN_TELEMETRY_COLLECTOR_ADDRESS to a URL that already includes the path, they will hit a double-path error that is hard to diagnose.

2. Gemfile group omitted

The Gemfile example shows all OTel gems in the default group. This loads them unconditionally in test and development, adding boot overhead and requiring the gems to be present in all environments. A note recommending the production group (or conditional loading matching the ENABLE_OPEN_TELEMETRY guard) would reflect how teams typically handle this.

3. original_cpln_* attribute names unexplained

The resource attribute list uses an original_ prefix (original_cpln_org, original_cpln_gvc, etc.) without explaining where this convention comes from or how readers should derive these values from the CPLN env vars that workloads actually expose (CPLN_ORG, CPLN_GVC, etc.). One sentence connecting the env var names to the attribute names would prevent confusion.

4. OPEN_TELEMETRY_CONFIG_HASH purpose unexplained

The collector env var block lists OPEN_TELEMETRY_CONFIG_HASH: "<hash-of-config>" without explaining its purpose. This appears to be the Control Plane pattern for forcing a workload restart when a secret referenced by cpln://secret/ changes (since a secret content change alone does not trigger a restart). Without that context, readers will not know whether this is optional or why they need it.

5. config.use_all note missing required gem

The doc mentions config.use_all as an alternative to the explicit config.use list but does not mention that this requires the opentelemetry-instrumentation-all gem. A reader who removes the explicit list and calls use_all on the stock Gemfile from the example will get a load error.


Minor points

  • The Ruby initializer block does not mention the file it belongs in (e.g. config/initializers/open_telemetry.rb). A one-line comment or filename note would help readers know where to place the snippet.
  • The collector script names in the suggested .controlplane/open_telemetry/ layout (build_main_collector_config, check_main_collector_config, validate_main_collector_config) have no file extension. A note that these are shell scripts would reduce ambiguity.
  • The docs/commands.md version bump from v5.1.0 to v5.1.1 in an example is unrelated to the PR's stated purpose and would be cleaner in a separate commit.

What works well

  • Staged rollout order (non-production first, explicit production approval) is exactly right.
  • Safety notes on cardinality, secrets, and dashboard JSON are thorough and appropriately cautious.
  • Template guidance (wait for two apps before extracting shared templates) is a sensible default.
  • Keeping the guide free of org/app-specific values is the correct call for a shared docs repo.
  • Dashboard and alert checklists are practical and actionable.

Summary: The content is well-organized and covers the right topics. Addressing the endpoint construction note, Gemfile group guidance, original_cpln_* naming explanation, and config hash purpose would make this guide safe to follow without surprises.

Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md Outdated
@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Code Review: Document Grafana and OpenTelemetry on Control Plane

Overall: Well-structured, docs-only PR. The coverage is thorough — architecture, Rails setup, collector config, rollout checklist, dashboards, alerts, and safety notes all follow sensible patterns. A few technical details are worth a second look before merging.


1. Gemfile group limits dev/test use

Placing OpenTelemetry gems inside group :production means they are unavailable in development and test environments. This prevents running the app locally with ENABLE_OPEN_TELEMETRY=true and any RSpec specs that exercise the OTel initializer.

If the intent is prod-only for cost/noise reasons, a note explaining the trade-off would help. Many teams instead use the default group (or group :default, :production) with require: false so the initializer can be exercised in all environments.

2. Port 9292 conflicts with Puma/Rack default

Port 9292 is Rack/Puma's conventional development default, so recommending it for the collector's Prometheus endpoint will cause confusion in Rails projects. The OTel Prometheus exporter defaults to 8889; other common choices are 9464 or any non-conflicting 9xxx port. Consider a different port or add a note that 9292 is only an example and should not collide with existing workloads.

3. OTTL syntax in transform/normalize needs verification

The transform snippet uses span.attributes["key"] and scope.name. In standard OTTL, inside trace_statements a context: span block is usually required, and within that context attributes are written as attributes["key"], not span.attributes["key"]. The instrumentation scope accessor is instrumentation_scope.name, not scope.name. A mismatch here silently produces no output without a startup error.


Minor Observations

  • The exclude_dimensions field in the spanmetrics connector is worth double-checking — some collector releases expose it, others do not.
  • set(span.attributes["resource.name"], span.name) sets resource.name to the span name. For OTel resource-level attributes the OTTL accessor is resource.attributes["key"].
  • Rollout step 1 says "behind ENABLE_OPEN_TELEMETRY=false" but the initializer checks == "true", so the var being absent already disables it. Wording like "without setting ENABLE_OPEN_TELEMETRY=true" is less ambiguous.

What Looks Good

  • Architecture overview and signal taxonomy (Metrics/Traces/Logs) are clear and accurate.
  • The ENABLE_OPEN_TELEMETRY guard and two-level ENV.fetch endpoint fallback are good patterns.
  • Resource attribute naming conventions (original_cpln_*) and the single-helper-module guidance are solid.
  • Rollout sequencing (collector first, app second, dashboard before alerts) is operationally sound.
  • Safety notes — no secrets in dashboard JSON and low-cardinality label discipline — are the right things to emphasize.
  • The docs/tips.md insertion and README link are consistent with the existing document index.

Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md Outdated
@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Code Review

This is a well-structured, documentation-only PR. The guide is thorough and the rollout/safety checklists are practical. A couple of technical issues in the Ruby code examples worth addressing before merging.

Bug: require: false + config.use mismatch

The Gemfile block marks every instrumentation gem with require: false, but the initializer only explicitly requires opentelemetry/sdk and opentelemetry/exporter/otlp. The subsequent config.use "OpenTelemetry::Instrumentation::Rails" (and PG, Redis, etc.) calls rely on Object.const_get under the hood — if those constants are not loaded, the configure block raises a NameError at startup.

Two ways to fix:

Option A — add explicit requires in the initializer (keeps require: false in Gemfile):

require "opentelemetry/sdk"
require "opentelemetry/exporter/otlp"
require "opentelemetry/instrumentation/rails"
require "opentelemetry/instrumentation/pg"
require "opentelemetry/instrumentation/redis"
require "opentelemetry/instrumentation/sidekiq"
require "opentelemetry/instrumentation/faraday"
require "opentelemetry/instrumentation/http"

Option B — drop require: false from the instrumentation gems (Bundler auto-loads them).

Option A is more explicit and pairs well with the env-gated initializer pattern shown in the guide.


Minor: original_ attribute prefix is non-standard and unexplained

The guide recommends original_cpln_org, original_cpln_gvc, etc., but doesn't explain why original_ was chosen. The prefix reads as though the values are copies of something else. Alternatives worth considering:

  • Dot-namespace: cpln.org, cpln.gvc, cpln.workload — consistent with OTel semantic conventions (e.g., cloud.provider, k8s.pod.name).
  • Flat: cpln_org, cpln_gvc — simpler, no confusing prefix.

If original_ is intentional (e.g., to avoid collision with a future upstream OTel CPLN convention), a brief explanatory note would help readers understand the choice.


Nits

  • The Validation section references placeholder paths (spec/open_telemetry/, ./script/check_open_telemetry_config). The inline comments say "Replace with your app's actual..." — a callout block (GitHub > [!NOTE]) would make these easier to scan.
  • Worth mentioning config.use_all as a one-liner alternative to the individual config.use calls — helpful for apps that already have all instrumentation gems in their Gemfile.

Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md Outdated
@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Code Review

Well-structured guide overall — good rollout sequencing, checklists, and safety notes. A few technical issues worth addressing before merging.

Issues

1. OTLP Exporter endpoint mismatch (lines 117–126)
The code manually fetches OTEL_EXPORTER_OTLP_ENDPOINT then passes it as the endpoint: constructor arg to Exporter.new. These two paths behave differently in the Ruby OTLP exporter gem:

  • The OTEL_EXPORTER_OTLP_ENDPOINT env var is a base URL — the SDK automatically appends /v1/traces.
  • The endpoint: constructor param is treated as a full signal-specific URL in most gem versions, so the SDK does not append the path.

The result: if the env var is set to http://otel-collector:4318, the exporter will POST to http://otel-collector:4318 instead of http://otel-collector:4318/v1/traces — a 404 at startup. Simplest fix: remove the manual fetch and use Exporter.new with no endpoint: argument so the SDK reads the env var correctly, or explicitly append /v1/traces.

2. resource.name is a Datadog convention, not a standard OTel attribute (line 252)
The OTTL statement set(attributes["resource.name"], name) copies the span name into resource.name. This attribute name originates in Datadog APM and is not part of the OpenTelemetry semantic conventions. In a Prometheus/Grafana stack it won't be interpreted specially. Either document it as a custom/Datadog-derived attribute, or replace it with a standard attribute like span.name (already available) or a clearly namespaced custom key.

3. spanmetrics connector doesn't filter to root spans (lines 258–281)
The connector is named http_root_span_latency and the heading reads "Generate a root request latency metric", but there is no filter in the config — it will include all spans, not just root or HTTP server spans. Either add a filter (the transform processor already sets root_span=true and can be used as a condition on the pipeline that feeds this connector) or adjust the description to match the actual behavior.

4. original_cpln_replica is high-cardinality for metrics (lines 163–171)
original_cpln_replica is valuable for traces where high cardinality is fine. As a metric dimension it will cause a cardinality explosion (one time series per replica). The Safety Notes warn about high-cardinality labels, but don't explicitly call out replica as an example. Recommend adding a note that original_cpln_replica (and original_cpln_image/original_commit_hash) should be kept in trace/log attributes only and excluded from metric dimensions.

Minor

5. Inconsistent env var prefix in Collector Workload section (lines 195–199)
The Rails section correctly uses the standard OTEL_ prefix. The Collector Workload section introduces OPEN_TELEMETRY_COLLECTOR_RECEIVER_ENDPOINT, OPEN_TELEMETRY_CONFIG, and OPEN_TELEMETRY_CONFIG_HASH. Readers may wonder whether these are standard OTEL vars. A brief note clarifying these are custom/application-defined vars (not OTEL-standard) would reduce confusion.

6. Rollout step 1 wording (line 311)
"behind ENABLE_OPEN_TELEMETRY=false" implies explicitly setting the variable to false. Since the guard is if ENV["ENABLE_OPEN_TELEMETRY"] == "true", the safe default is simply not setting the var. Suggest: "with ENABLE_OPEN_TELEMETRY unset (disabled by default)".

Comment thread docs/grafana-opentelemetry.md Outdated
Consolidate guidance that was stated multiple times: Bundler group choice,
OTLP endpoint env vars, collector-image validation, error_mode lifecycle,
processor ordering, http.route cardinality check, and high-cardinality
safety bullets. Add a copy-pasteable app workload env block, align the
validation script names with the suggested directory structure, note that
the minimal pipeline drops child spans, and link the existing Tips (logs,
RAM/CPU alerts) and Secrets docs instead of leaving those unexplained.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Code Review

This is a well-structured, genuinely useful guide. The architecture is sound, the rollout order is thoughtful, and the safety/cardinality warnings are the right things to emphasize. A few issues worth addressing before merge are called out as inline comments below.

Summary of findings

Needs fixing before merge

  • The env example block says "Leave ENABLE_OPEN_TELEMETRY unset" in a comment but shows it set to "true" — contradictory and likely to confuse readers on first pass.
  • The service.pipelines example references zpages in service.extensions but has no corresponding top-level extensions: block. Copying the snippet verbatim produces a collector config that fails validation at startup.
  • The older-collector workaround for IsRootSpan() says "replace it with an explicit parent-span-id check" but gives no OTTL expression. A reader on a pinned older image has nothing to copy.

Worth addressing

  • The guide sets up a full 100%-sampled trace pipeline with no mention of head- or tail-based sampling. At production request rates this is a common cause of collector OOM and unexpectedly high ingestion cost. Even a short "consider sampling" note with a pointer would make the guide safer to follow.
  • error_mode: ignore in production is presented as the natural end-state. Silently swallowing OTTL expression errors can mask a broken filter that drops all spans undetected. The trade-off (startup resilience vs. silent misbehaviour) should be stated so readers make a deliberate choice.

Minor

  • Long prose line at line 200 breaks mid-sentence (word wrap artifact).

Overall the guide is in good shape — the findings above are fixable in place.

Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md
…ing, OTTL

Resolve the open review threads on the Grafana/OpenTelemetry guide:

- Fix the contradictory app-workload env block: comment out
  ENABLE_OPEN_TELEMETRY and explain it is enabled only after the collector
  is deployed (the comment previously said "leave unset" above a "true" value).
- Note the initializer belongs in config/initializers/opentelemetry.rb.
- Warn that a stale OPEN_TELEMETRY_CONFIG_HASH silently keeps the old config,
  and show how to compute it with sha256sum.
- Give the explicit parent-span-id OTTL fallback for collectors without
  IsRootSpan().
- Spell out that error_mode: ignore can turn the root-span filter into a
  no-op that inflates metric cardinality.
- Add the top-level extensions: zpages block so service.extensions resolves.
- Call out that the pipeline is metrics-only (no trace storage), link back
  to Signals: Traces, and add a head/tail sampling pointer for high traffic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@justin808

Copy link
Copy Markdown
Member Author

Review follow-up — summary

Worked through the remaining open review threads on the OpenTelemetry guide. Pushed one commit (62ed600) and resolved 12 threads; one is left open for a decision (below).

Fixed in 62ed600 (Address OpenTelemetry guide review: env contradiction, hashing, sampling, OTTL):

  • App-workload env block no longer contradicts itself — ENABLE_OPEN_TELEMETRY is commented out with a note to enable it only after the collector is deployed (previously the comment said "leave unset" directly above a "true" value).
  • Initializer placement called out (config/initializers/opentelemetry.rb).
  • OPEN_TELEMETRY_CONFIG_HASH: warned that a stale hash silently keeps the old config running, and added a sha256sum … | awk '{print $1}' compute example.
  • Added the explicit parent-span-id OTTL fallback for collectors without IsRootSpan().
  • error_mode: ignore: spelled out that a broken condition becomes a no-op that passes every span into the spanmetrics connector and inflates cardinality with no error.
  • Added the top-level extensions: zpages block so service.extensions resolves.
  • Called out that the pipeline is metrics-only (no raw trace storage), linked back to Signals: Traces, and added a head/tail sampling pointer for high-traffic workloads.

Resolved as duplicate/stale (no code change):

  • Two status.code / error-rate comments and two "traces discarded" comments were duplicates; consolidated into the canonical threads.
  • The "app workload env example is missing" comment predates the Recommended app workload env block that already sets ENABLE_OPEN_TELEMETRY, OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, and OTEL_EXPORTER_OTLP_PROTOCOL.

Left open for a decision: the status.code thread on the spanmetrics exclude_dimensions block. The connector currently excludes status.code, but the Dashboard section lists "error count and error rate" as a panel — and you can't split a histogram by success/error without that dimension. status.code is low-cardinality (Ok/Error/Unset), so keeping it is cheap. Deciding whether to drop it from exclude_dimensions vs. add a separate error-counter connector is a design call, so I left that thread for review rather than changing the example unilaterally.

@claude

claude Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Review: Grafana and OpenTelemetry Guide

Well-researched, comprehensive observability guide. The safety-first framing — collector-first deployment, non-production rollout, ENABLE_OPEN_TELEMETRY guard — is exactly right. The cardinality warnings, http.route pattern check, OPEN_TELEMETRY_CONFIG_HASH restart trick, and processor-order dependency explanation are all strong operational details that save real pain in production.

Technical Accuracy

  • Ports, gem names, and SDK API calls are all correct.
  • The IsRootSpan() version note with an explicit parent-span-id fallback is solid.
  • The filter-ordering dependency on transform/normalize is clearly explained.
  • SpanID(0x0000000000000000) (8-byte zero) is the correct OTTL literal for a root span.

Issues Found

Moderate — worth addressing before merge:

  1. status.code excluded from spanmetrics dimensions (line 368) — The dashboard section recommends an "error count and error rate" row, but excluding status.code means the generated histogram cannot distinguish errors from successes. status.code has only three values (UNSET / OK / ERROR), so it is actually one of the safest labels to keep. Without it, an error-rate PromQL query against this metric is not possible. See inline comment.

  2. Batch processor absent from the traces pipeline (lines 403–410) — The minimum-components list includes a batch processor, but the pipeline example places it only on the metrics pipeline. For any meaningful traffic volume, individual-span forwarding into the spanmetrics connector wastes CPU; the batch processor is nearly free to add on the traces side. See inline comment.

  3. error_mode: ignore recommendation without a monitoring caveat (lines 340–341) — Silently swallowing filter-expression errors in production prevents startup failures, but it also silently passes every span (including child spans) through a broken filter, inflating cardinality with no signal. A self-metrics alert on otelcol_processor_dropped_spans (or similar) should accompany this recommendation. See inline comment.

Minor:

  • Line 203 has a notably long prose line that breaks the otherwise consistent ~100-char wrapping style.
  • The original_ attribute prefix is a reasonable workaround for Prometheus sanitizing dots to underscores in label names, but the reason is never stated; readers who wonder "why not cpln.org?" won't find the answer.
  • The OTTL set(attributes["root_span"], false) where not IsRootSpan() branch is technically redundant — a span with no root_span attribute already satisfies != true for the downstream filter. Harmless, but can confuse readers new to OTTL.

Security

No concerns. The guide consistently keeps secrets out of telemetry dimensions, recommends closing external collector ingress, and uses cpln://secret/ for the collector config.

Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md
justin808 and others added 2 commits June 13, 2026 15:32
The spanmetrics connector excluded status.code, but the Dashboard section
lists "error count and error rate" as a panel — and a histogram cannot be
split by success/error without that dimension. status.code is low-cardinality
(Ok/Error/Unset), so keeping it is cheap.

- Remove status.code from exclude_dimensions.
- Explain how to compute error rate from the request-count series split by
  status.code, with a note to confirm metric/label names per collector version.
- Note that OTel sets server span status to Error for 5xx but not 4xx, and to
  add http.response.status_code for 4xx granularity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- The traces pipeline omitted the batch processor that is listed as a minimum
  component and already present on the metrics pipeline. Add it between
  filter/non_root_spans and the spanmetrics connector for consistency.
- Add an Alert Starting Point bullet: collector span throughput spiking or the
  filter's dropped-span count falling to zero is an early signal that an
  error_mode: ignore filter silently became a no-op.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@justin808 justin808 marked this pull request as ready for review June 14, 2026 03:02
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@greptile-apps

greptile-apps Bot commented Jun 14, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds docs/grafana-opentelemetry.md, a generic playbook for app-level observability on Control Plane, and wires navigation to it from README.md and docs/tips.md. A one-line example version bump in docs/commands.md is also included.

  • New guide covers the full OTLP → internal collector → spanmetrics → Prometheus → Grafana pipeline, including a Rails initializer gated on ENABLE_OPEN_TELEMETRY, Control Plane resource-attribute whitelisting, collector workload/config patterns, cardinality pitfalls, dashboard/alert starters, rollout checklist, and safety notes.
  • docs/tips.md inserts a short "Grafana and OpenTelemetry" section at item 9 and renumbers the TOC to 13 items; all anchors resolve correctly.
  • docs/commands.md updates the update-github-actions example tag from v5.1.0 to v5.1.1.

Confidence Score: 5/5

Documentation-only changes with no runtime, CLI, or infrastructure behavior changes — safe to merge.

All four files contain only documentation. The new guide's technical content (OTTL expressions, spanmetrics pipeline wiring, processor ordering, Rails initializer, cardinality warnings) is accurate and internally consistent. All internal anchor links resolve, referenced files exist, and the TOC renumbering in tips.md is correct. No code is executed, no configs are applied, and no runtime behavior changes.

No files require special attention.

Important Files Changed

Filename Overview
docs/grafana-opentelemetry.md New 604-line guide covering OpenTelemetry + Grafana on Control Plane: Rails initializer, collector workload config, spanmetrics pipeline, rollout checklist, and safety notes — technically accurate with no issues found.
docs/tips.md Adds a short "Grafana and OpenTelemetry" section at line 207, inserts it at TOC position 9, and renumbers items 9–13 correctly. All anchors resolve.
README.md Appends item 18 linking to the new Grafana/OpenTelemetry guide; no other changes.
docs/commands.md Updates the update-github-actions example release tag from v5.1.0 to v5.1.1 in one line of prose.

Sequence Diagram

sequenceDiagram
    participant RailsApp as Rails Workload
    participant Collector as OTel Collector Workload
    participant CPMetrics as Control Plane Metrics Scrape
    participant Grafana as Grafana

    RailsApp->>Collector: OTLP HTTP traces/logs (port 4318)
    Note over Collector: transform/normalize (set root_span attr)
    Note over Collector: filter/non_root_spans (drop child spans)
    Note over Collector: spanmetrics connector → Prometheus metrics
    CPMetrics->>Collector: GET /metrics (port 8889)
    Collector-->>CPMetrics: Prometheus metric samples
    CPMetrics->>Grafana: metrics ingestion
    Grafana-->>Grafana: "dashboards & alerts"
Loading

Reviews (1): Last reviewed commit: "Add batch to traces pipeline and an aler..." | Re-trigger Greptile

@claude

claude Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Code Review: Grafana and OpenTelemetry on Control Plane

Overall: High-quality, well-structured documentation. The guide is thorough, covers the right failure modes, and the safety/rollout guidance is especially good. A few technical issues to address before merging.


Issues

1. error_mode: ignore in production — recommend against it

The guide recommends switching the filter processor from error_mode: propagate to error_mode: ignore for production "so a later expression regression cannot fail collector startup." The tradeoff runs the wrong way: with ignore, a broken OTTL expression silently becomes a no-op and every span — including high-cardinality child spans — floods the spanmetrics connector with no error logged. That is harder to detect and recover from than a collector restart failure.

The guide's own validation checklist (run known test trace, confirm child spans are dropped) is the correct mitigation. If that validation is in CI, keeping error_mode: propagate in production is safe and gives fast, visible failure if a regression slips through. Suggest removing the recommendation to switch to ignore in production.

2. batch processor missing from traces pipeline in the service.pipelines example

The "Minimum collector config components" section lists the batch processor as required, and the surrounding prose places it in the traces pipeline. The final service.pipelines example omits - batch under the traces pipeline processors, listing it only under metrics. This inconsistency will confuse readers assembling the full config. Either add - batch to the traces pipeline or add a note explaining the intentional omission.

3. Unrelated version bump in docs/commands.md

The update-github-actions example tag changes from v5.1.0 to v5.1.1. This looks like it belongs in a separate PR or a gem-update commit — it is not related to the Grafana/OTel guide.


Minor suggestions

  • The collector config snippets are intentionally split across multiple code blocks (receivers, processors, connectors, pipelines). A one-line note that they must be merged into a single top-level config.yaml would prevent the most common beginner mistake.
  • config.use_all is an alternative to listing each instrumentation separately — worth a footnote for readers who want a quick start.
  • The comment on the config.resource= approach (lines 127–129) correctly explains why explicit resource assignment is used. A one-line clarification that the explicit resource attribute overrides the SDK's automatic reading of OTEL_SERVICE_NAME would make the reasoning complete for readers unfamiliar with OTel SDK merge order.

Summary

Blocking: items 1 and 2 — the error_mode: ignore recommendation and the batch processor inconsistency introduce real operational risk. Item 3 is a cleanup request. Everything else is non-blocking.

`error_mode: propagate` in development and staging, and include a known trace
with one root span and one child span in collector validation. Once the
expression is verified, production configs can switch to `ignore` so a later
expression regression cannot fail collector startup.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommending error_mode: ignore for production runs the tradeoff backwards. With ignore, a broken OTTL expression silently becomes a no-op: every span (including child spans) passes into the spanmetrics connector with no log or alert. The resulting cardinality explosion is much harder to detect and recover from than a collector that fails to restart.

The guide already provides the right mitigation: validate the filter with a known test trace in CI. If that check is green, error_mode: propagate is safe in production and gives fast, visible failure if a regression ever slips through.

Suggest removing the recommendation to switch to ignore in production and instead linking the error_mode guidance directly to the CI validation step.

Comment thread docs/commands.md
from the currently installed cpflow gem. Use this after updating the
cpflow gem so checked-in workflow wrappers move to the matching upstream
release tag, for example `v5.1.0`.
release tag, for example `v5.1.1`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version bump from v5.1.0 to v5.1.1 is unrelated to the Grafana/OpenTelemetry documentation added by this PR. Consider moving it to a dedicated gem-update commit/PR to keep the change history clean.

processors:
- transform/normalize
- filter/non_root_spans
- batch

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The batch processor is missing from this traces pipeline, but the "Minimum collector config components" section lists it as required and the surrounding prose describes it as part of the traces pipeline.

Without batch here, every span is forwarded to the spanmetrics connector immediately, one-by-one. At production request rates this creates unnecessary CPU and memory pressure on the connector.

Suggested change
- batch
- transform/normalize
- filter/non_root_spans
- batch

Comment on lines +383 to +388
`span.name` is excluded because raw Rails span names can carry too much
cardinality. `status.code` is kept as a dimension so the dashboard's "error
count and error rate" panel is derivable; it has only a few values
(`STATUS_CODE_OK`, `STATUS_CODE_ERROR`, `STATUS_CODE_UNSET`), so the cardinality
cost is negligible. Compute error rate from the connector's request-count series
split by `status.code` (errored calls ÷ total calls), and confirm the exact

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status.code is described as "kept as a dimension," but it is not listed under dimensions: in the YAML block above (only service.name, original_cpln_workload, and http.route appear there), and it is also absent from exclude_dimensions:. Whether it is included by default depends on the specific collector-contrib version — older releases used different default dimension sets.

Consider making the intent explicit and version-independent by adding it to dimensions::

Suggested change
`span.name` is excluded because raw Rails span names can carry too much
cardinality. `status.code` is kept as a dimension so the dashboard's "error
count and error rate" panel is derivable; it has only a few values
(`STATUS_CODE_OK`, `STATUS_CODE_ERROR`, `STATUS_CODE_UNSET`), so the cardinality
cost is negligible. Compute error rate from the connector's request-count series
split by `status.code` (errored calls ÷ total calls), and confirm the exact
`span.name` is excluded because raw Rails span names can carry too much
cardinality. `status.code` is explicitly listed as a dimension so the dashboard's "error
count and error rate" panel is derivable; it has only a few values
(`STATUS_CODE_OK`, `STATUS_CODE_ERROR`, `STATUS_CODE_UNSET`), so the cardinality
cost is negligible. Compute error rate from the connector's request-count series

And add it to the dimensions: block:

    dimensions:
      - name: service.name
      - name: original_cpln_workload
      - name: http.route
      - name: status.code

This makes the intent clear regardless of the pinned collector version.

Comment on lines +339 to +341
with one root span and one child span in collector validation. Once the
expression is verified, production configs can switch to `ignore` so a later
expression regression cannot fail collector startup.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommending error_mode: ignore in production so "a later expression regression cannot fail collector startup" trades one risk for a worse one: a broken OTTL expression silently becomes a no-op that passes every span — including child spans — into the spanmetrics connector, which can cause a sudden cardinality explosion with zero log output. The alert on line 544 ("filter's dropped-span count falling to zero") is the only safety net, but it is easy to configure this setting before that alert is in place.

Consider strengthening the guidance with a cross-reference:

Suggested change
with one root span and one child span in collector validation. Once the
expression is verified, production configs can switch to `ignore` so a later
expression regression cannot fail collector startup.
expression is verified, production configs can switch to `ignore` **only when the
alert in [Alert Starting Point](#alert-starting-point) (collector filter's dropped-span
count falling to zero) is already active and routed to a paging contact point**.
Without that alert, a future expression regression causes a silent cardinality explosion.

Comment on lines +401 to +426
```yaml
extensions:
zpages:
endpoint: "0.0.0.0:55679"

service:
extensions:
- zpages
pipelines:
traces:
receivers:
- otlp
processors:
- transform/normalize
- filter/non_root_spans
- batch
exporters:
- spanmetrics/http_root_span_latency
metrics:
receivers:
- spanmetrics/http_root_span_latency
processors:
- batch
exporters:
- prometheus
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The service.pipelines block references receivers: [otlp] and exporters: [prometheus], but no receivers: or exporters: blocks are shown anywhere in the guide. The minimum components list (line 286) names both, but without a YAML snippet readers must look up the syntax separately.

Adding stubs would make the config self-contained:

receivers:
  otlp:
    protocols:
      http:
        endpoint: "${OPEN_TELEMETRY_COLLECTOR_RECEIVER_ENDPOINT}"

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

extensions:
  zpages:
    endpoint: "0.0.0.0:55679"

service:
  extensions:
    - zpages
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - transform/normalize
        - filter/non_root_spans
        - batch
      exporters:
        - spanmetrics/http_root_span_latency
    metrics:
      receivers:
        - spanmetrics/http_root_span_latency
      processors:
        - batch
      exporters:
        - prometheus

@claude

claude Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Code Review

This is a well-structured, comprehensive observability guide. The content is logically ordered, safety considerations are thorough, and several real-world pitfalls (cardinality, processor ordering, error_mode behavior) are called out proactively. Three technical issues are worth addressing before merge — see inline comments for details.

Issues (inline comments posted)

  1. status.code not explicit in the spanmetrics dimensions YAML (line 383–388): The prose says it is "kept as a dimension," but it is absent from both the dimensions: list and exclude_dimensions:. Whether it is included by default is version-dependent. Explicitly listing - name: status.code in dimensions: makes the intent clear regardless of the pinned collector image.

  2. error_mode: ignore in production needs a stronger guard (lines 339–341): Switching to ignore means a broken OTTL expression silently becomes a no-op — every span passes through into spanmetrics with zero log output, risking a cardinality explosion. The alert for "filter's dropped-span count falling to zero" (line 544) is the only safety net. The recommendation should cross-reference that alert and require it to be active before switching to ignore.

  3. OTLP receiver and Prometheus exporter config blocks are missing (lines 401–426): The service.pipelines YAML references both receivers: [otlp] and exporters: [prometheus], but neither top-level block is ever shown in the guide. Readers new to the OTel collector will need to look these up separately. Adding minimal stubs (see inline comment) would make the guide self-contained.

Minor

  • Line 203: The long inline-code run wraps awkwardly in some Markdown renderers. A soft line break before it would help.
  • docs/commands.md version bump (v5.1.0 to v5.1.1): Unrelated to the PR's stated scope. A brief mention in the description would clarify why it is included.

Strengths

  • db_statement: :obfuscate on the PG instrumentation is the right default — prevents SQL from leaking into traces.
  • The rollout order (deploy collector first, enable app telemetry second) is the correct sequence and avoids the common "app emitting spans to nothing" failure mode.
  • The cardinality warnings around http.route and high-cardinality resource attributes are exactly the right things to warn about.
  • The OPEN_TELEMETRY_CONFIG_HASH mechanism for forcing a collector restart on config change is a practical solution to a real Control Plane limitation, and documenting the staleness risk is important.

@claude

claude Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Review: [codex] Document Grafana and OpenTelemetry on Control Plane

Overall: Well-structured, technically accurate guide. The cardinality warnings, processor-order explanation, rollout checklist, and safety notes are all solid. A few issues worth addressing before merging.


What's good

  • Explicit db_statement: :obfuscate for PG instrumentation — the right safe default
  • error_mode: propagate vs error_mode: ignore trade-off is discussed accurately and in depth
  • OPEN_TELEMETRY_CONFIG_HASH pattern for forcing workload restarts on config change is a smart operational detail
  • Cardinality footguns (http.route pattern vs raw path, high-cardinality attributes in metric labels) are the right things to call out prominently
  • Processor order requirement (transform before filter) is clearly explained with consequences
  • Rollout order is sequential and safe; staging-before-production gate is explicit

Issues

Security — Redis data sanitization

The PG instrumentation is configured with db_statement: :obfuscate, which prevents SQL (including parameter values) from appearing in traces. The Redis instrumentation has no equivalent option set. Redis values can hold session tokens, user-identifying payloads, or cache content. The opentelemetry-instrumentation-redis gem supports db_statement: :obfuscate (same option key). Consider adding it or explicitly calling it out as a team decision.

Reliability — error_mode: ignore in production

The guide recommends switching the filter processor to error_mode: ignore in production to prevent collector startup failure on expression regression. The stated mitigation is monitoring the filter's dropped-span count for a silent no-op. That mitigation is reasonable — but it depends on the team actually setting up and maintaining that alert reliably. If the alert lapses, a broken OTTL condition silently passes every span into the spanmetrics connector, inflating cardinality with no signal. Consider instead keeping error_mode: propagate everywhere and relying on a well-tested OTTL expression rather than suppressing errors. Collector startup failure is a loud, obvious signal; silent cardinality inflation is not.

Clarity — status.code dimension is implicit

The prose says "status.code is kept as a dimension" but the connector config does not list it in dimensions:. It survives because it is a spanmetrics default and is absent from exclude_dimensions:, but this is subtle and version-dependent. A reader who accidentally adds it to exclude_dimensions (or who uses an older collector image where defaults differ) will only notice the error rate panel is broken at query time. Making status.code explicit in dimensions: — alongside service.name, original_cpln_workload, and http.route — removes the ambiguity.


Minor

  • Line 203 has an overlong prose line (~117 chars) relative to the rest of the file's wrapping style.
  • The sampling recommendation (head-based or tail-based for high-traffic workloads) appears at the very end of the Collector Config section. This is operationally critical and easy to miss; consider adding a bullet to the Rollout Order or Validation checklist so teams don't skip it.
  • docs/commands.md: the version bump from v5.1.0v5.1.1 in the example is fine, but it's unrelated to the PR's stated scope. Worth noting in case it was accidental.


config.use "OpenTelemetry::Instrumentation::Rails"
config.use "OpenTelemetry::Instrumentation::PG", { db_statement: :obfuscate }
config.use "OpenTelemetry::Instrumentation::Redis"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PG instrumentation above uses db_statement: :obfuscate to prevent SQL parameter values from leaking into traces, but Redis gets no equivalent option. Redis values can hold session tokens, user-identifying keys, or cached payloads.

The opentelemetry-instrumentation-redis gem supports the same option:

Suggested change
config.use "OpenTelemetry::Instrumentation::Redis"
config.use "OpenTelemetry::Instrumentation::Redis", { db_statement: :obfuscate }

If the team decides raw Redis commands are acceptable in traces (e.g. keys are never sensitive), add a comment explaining that decision so future readers don't add obfuscation by mistake.

Comment on lines +339 to +341
with one root span and one child span in collector validation. Once the
expression is verified, production configs can switch to `ignore` so a later
expression regression cannot fail collector startup.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recommendation deserves a stronger caveat. Switching to error_mode: ignore means a future OTTL regression (typo, collector-contrib API change, etc.) will silently pass every span — including all child spans — through the filter into the spanmetrics connector. Cardinality explodes with no error log and no alert unless the team has the dropped-span monitor in place and working.

The trade-off framing here is: loud startup failure vs. silent cardinality storm. Startup failure is immediately visible in the workload health dashboard; cardinality inflation is not. Consider recommending propagate everywhere and instead investing in a thorough OTTL expression test in the collector validation script, which the guide already calls for. That makes the expression regression-safe without needing the monitor as a backstop.

Comment on lines +384 to +385
cardinality. `status.code` is kept as a dimension so the dashboard's "error
count and error rate" panel is derivable; it has only a few values

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status.code survives here because it is a spanmetrics connector default dimension and is absent from exclude_dimensions, but that implicit dependency is version-sensitive. Readers who study the config example may not realise it's there, and someone adding it to exclude_dimensions later will silently break the error-rate panel.

Consider listing it explicitly in dimensions: to make the intent self-documenting:

Suggested change
cardinality. `status.code` is kept as a dimension so the dashboard's "error
count and error rate" panel is derivable; it has only a few values
dimensions:
- name: service.name
- name: status.code
- name: original_cpln_workload
- name: http.route

This also makes the "confirm the exact metric and label names against your collector image" advice below easier to act on, since the reader can see exactly which labels to verify.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant