[codex] Document Grafana and OpenTelemetry on Control Plane#352
[codex] Document Grafana and OpenTelemetry on Control Plane#352justin808 wants to merge 20 commits into
Conversation
|
Warning Review limit reached
More reviews will be available in 24 minutes and 39 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review: docs/grafana-opentelemetry.mdThis is a solid, well-structured observability guide with good safety guardrails and a sensible rollout order. The content is actionable and appropriately scoped. A few technical accuracy issues in the code examples are worth fixing before merge — they could cause silent failures or confusion for readers who copy the snippets. Correctness issues1. Gem list vs. 2. Generic 3. 4. Placeholder script paths in Validation section (low impact) Minor suggestions
What's good
|
…metry * origin/main: Bump version to 5.1.1 Update CHANGELOG.md for 5.1.1 (#362) Replace maintenance mode sleep with domain polling (#337) Handle completed runner jobs before replicas appear Warn when staging image lacks commit suffix Address production promotion review follow-ups Harden production promotion image copy and readiness Add Control Plane Flow logo and brand icon assets (#358) Bump version to 5.1.0 Update CHANGELOG.md for 5.1.0 (#357) Regenerate docs/commands.md for cpflow 5.0.4 (#344) Add shared secret grants (#354) Harden production promotion image copy (#355) Fix production promotion environment secret access (#353) Clarify production environment secret setup (#351)
Code ReviewOverview: Documentation-only PR adding a Grafana/OpenTelemetry guide for Control Plane Rails apps. The structure and safety notes are solid. A few technical points in the Ruby example and supporting content are worth addressing before merging. Technical issues1. OTLP exporter endpoint construction The code builds the endpoint as 2. Gemfile group omitted The Gemfile example shows all OTel gems in the default group. This loads them unconditionally in test and development, adding boot overhead and requiring the gems to be present in all environments. A note recommending the 3. The resource attribute list uses an 4. The collector env var block lists 5. The doc mentions Minor points
What works well
Summary: The content is well-organized and covers the right topics. Addressing the endpoint construction note, Gemfile group guidance, |
|
Code Review: Document Grafana and OpenTelemetry on Control Plane Overall: Well-structured, docs-only PR. The coverage is thorough — architecture, Rails setup, collector config, rollout checklist, dashboards, alerts, and safety notes all follow sensible patterns. A few technical details are worth a second look before merging. 1. Gemfile group limits dev/test use Placing OpenTelemetry gems inside If the intent is prod-only for cost/noise reasons, a note explaining the trade-off would help. Many teams instead use the default group (or 2. Port 9292 conflicts with Puma/Rack default Port 9292 is Rack/Puma's conventional development default, so recommending it for the collector's Prometheus endpoint will cause confusion in Rails projects. The OTel Prometheus exporter defaults to 8889; other common choices are 9464 or any non-conflicting 9xxx port. Consider a different port or add a note that 9292 is only an example and should not collide with existing workloads. 3. OTTL syntax in transform/normalize needs verification The transform snippet uses Minor Observations
What Looks Good
|
Code ReviewThis is a well-structured, documentation-only PR. The guide is thorough and the rollout/safety checklists are practical. A couple of technical issues in the Ruby code examples worth addressing before merging. Bug:
|
Code ReviewWell-structured guide overall — good rollout sequencing, checklists, and safety notes. A few technical issues worth addressing before merging. Issues1. OTLP Exporter endpoint mismatch (lines 117–126)
The result: if the env var is set to 2. 3. 4. Minor5. Inconsistent env var prefix in Collector Workload section (lines 195–199) 6. Rollout step 1 wording (line 311) |
Consolidate guidance that was stated multiple times: Bundler group choice, OTLP endpoint env vars, collector-image validation, error_mode lifecycle, processor ordering, http.route cardinality check, and high-cardinality safety bullets. Add a copy-pasteable app workload env block, align the validation script names with the suggested directory structure, note that the minimal pipeline drops child spans, and link the existing Tips (logs, RAM/CPU alerts) and Secrets docs instead of leaving those unexplained. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code ReviewThis is a well-structured, genuinely useful guide. The architecture is sound, the rollout order is thoughtful, and the safety/cardinality warnings are the right things to emphasize. A few issues worth addressing before merge are called out as inline comments below. Summary of findingsNeeds fixing before merge
Worth addressing
Minor
Overall the guide is in good shape — the findings above are fixable in place. |
…ing, OTTL Resolve the open review threads on the Grafana/OpenTelemetry guide: - Fix the contradictory app-workload env block: comment out ENABLE_OPEN_TELEMETRY and explain it is enabled only after the collector is deployed (the comment previously said "leave unset" above a "true" value). - Note the initializer belongs in config/initializers/opentelemetry.rb. - Warn that a stale OPEN_TELEMETRY_CONFIG_HASH silently keeps the old config, and show how to compute it with sha256sum. - Give the explicit parent-span-id OTTL fallback for collectors without IsRootSpan(). - Spell out that error_mode: ignore can turn the root-span filter into a no-op that inflates metric cardinality. - Add the top-level extensions: zpages block so service.extensions resolves. - Call out that the pipeline is metrics-only (no trace storage), link back to Signals: Traces, and add a head/tail sampling pointer for high traffic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review follow-up — summaryWorked through the remaining open review threads on the OpenTelemetry guide. Pushed one commit (62ed600) and resolved 12 threads; one is left open for a decision (below). Fixed in 62ed600 (
Resolved as duplicate/stale (no code change):
Left open for a decision: the |
Review: Grafana and OpenTelemetry GuideWell-researched, comprehensive observability guide. The safety-first framing — collector-first deployment, non-production rollout, Technical Accuracy
Issues FoundModerate — worth addressing before merge:
Minor:
SecurityNo concerns. The guide consistently keeps secrets out of telemetry dimensions, recommends closing external collector ingress, and uses |
The spanmetrics connector excluded status.code, but the Dashboard section lists "error count and error rate" as a panel — and a histogram cannot be split by success/error without that dimension. status.code is low-cardinality (Ok/Error/Unset), so keeping it is cheap. - Remove status.code from exclude_dimensions. - Explain how to compute error rate from the request-count series split by status.code, with a note to confirm metric/label names per collector version. - Note that OTel sets server span status to Error for 5xx but not 4xx, and to add http.response.status_code for 4xx granularity. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- The traces pipeline omitted the batch processor that is listed as a minimum component and already present on the metrics pipeline. Add it between filter/non_root_spans and the spanmetrics connector for consistency. - Add an Alert Starting Point bullet: collector span throughput spiking or the filter's dropped-span count falling to zero is an early signal that an error_mode: ignore filter silently became a no-op. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Greptile SummaryThis PR adds
Confidence Score: 5/5Documentation-only changes with no runtime, CLI, or infrastructure behavior changes — safe to merge. All four files contain only documentation. The new guide's technical content (OTTL expressions, spanmetrics pipeline wiring, processor ordering, Rails initializer, cardinality warnings) is accurate and internally consistent. All internal anchor links resolve, referenced files exist, and the TOC renumbering in tips.md is correct. No code is executed, no configs are applied, and no runtime behavior changes. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant RailsApp as Rails Workload
participant Collector as OTel Collector Workload
participant CPMetrics as Control Plane Metrics Scrape
participant Grafana as Grafana
RailsApp->>Collector: OTLP HTTP traces/logs (port 4318)
Note over Collector: transform/normalize (set root_span attr)
Note over Collector: filter/non_root_spans (drop child spans)
Note over Collector: spanmetrics connector → Prometheus metrics
CPMetrics->>Collector: GET /metrics (port 8889)
Collector-->>CPMetrics: Prometheus metric samples
CPMetrics->>Grafana: metrics ingestion
Grafana-->>Grafana: "dashboards & alerts"
Reviews (1): Last reviewed commit: "Add batch to traces pipeline and an aler..." | Re-trigger Greptile |
Code Review: Grafana and OpenTelemetry on Control PlaneOverall: High-quality, well-structured documentation. The guide is thorough, covers the right failure modes, and the safety/rollout guidance is especially good. A few technical issues to address before merging. Issues1. The guide recommends switching the filter processor from The guide's own validation checklist (run known test trace, confirm child spans are dropped) is the correct mitigation. If that validation is in CI, keeping 2. The "Minimum collector config components" section lists the batch processor as required, and the surrounding prose places it in the traces pipeline. The final 3. Unrelated version bump in The Minor suggestions
SummaryBlocking: items 1 and 2 — the |
| `error_mode: propagate` in development and staging, and include a known trace | ||
| with one root span and one child span in collector validation. Once the | ||
| expression is verified, production configs can switch to `ignore` so a later | ||
| expression regression cannot fail collector startup. |
There was a problem hiding this comment.
Recommending error_mode: ignore for production runs the tradeoff backwards. With ignore, a broken OTTL expression silently becomes a no-op: every span (including child spans) passes into the spanmetrics connector with no log or alert. The resulting cardinality explosion is much harder to detect and recover from than a collector that fails to restart.
The guide already provides the right mitigation: validate the filter with a known test trace in CI. If that check is green, error_mode: propagate is safe in production and gives fast, visible failure if a regression ever slips through.
Suggest removing the recommendation to switch to ignore in production and instead linking the error_mode guidance directly to the CI validation step.
| from the currently installed cpflow gem. Use this after updating the | ||
| cpflow gem so checked-in workflow wrappers move to the matching upstream | ||
| release tag, for example `v5.1.0`. | ||
| release tag, for example `v5.1.1`. |
There was a problem hiding this comment.
This version bump from v5.1.0 to v5.1.1 is unrelated to the Grafana/OpenTelemetry documentation added by this PR. Consider moving it to a dedicated gem-update commit/PR to keep the change history clean.
| processors: | ||
| - transform/normalize | ||
| - filter/non_root_spans | ||
| - batch |
There was a problem hiding this comment.
The batch processor is missing from this traces pipeline, but the "Minimum collector config components" section lists it as required and the surrounding prose describes it as part of the traces pipeline.
Without batch here, every span is forwarded to the spanmetrics connector immediately, one-by-one. At production request rates this creates unnecessary CPU and memory pressure on the connector.
| - batch | |
| - transform/normalize | |
| - filter/non_root_spans | |
| - batch |
| `span.name` is excluded because raw Rails span names can carry too much | ||
| cardinality. `status.code` is kept as a dimension so the dashboard's "error | ||
| count and error rate" panel is derivable; it has only a few values | ||
| (`STATUS_CODE_OK`, `STATUS_CODE_ERROR`, `STATUS_CODE_UNSET`), so the cardinality | ||
| cost is negligible. Compute error rate from the connector's request-count series | ||
| split by `status.code` (errored calls ÷ total calls), and confirm the exact |
There was a problem hiding this comment.
status.code is described as "kept as a dimension," but it is not listed under dimensions: in the YAML block above (only service.name, original_cpln_workload, and http.route appear there), and it is also absent from exclude_dimensions:. Whether it is included by default depends on the specific collector-contrib version — older releases used different default dimension sets.
Consider making the intent explicit and version-independent by adding it to dimensions::
| `span.name` is excluded because raw Rails span names can carry too much | |
| cardinality. `status.code` is kept as a dimension so the dashboard's "error | |
| count and error rate" panel is derivable; it has only a few values | |
| (`STATUS_CODE_OK`, `STATUS_CODE_ERROR`, `STATUS_CODE_UNSET`), so the cardinality | |
| cost is negligible. Compute error rate from the connector's request-count series | |
| split by `status.code` (errored calls ÷ total calls), and confirm the exact | |
| `span.name` is excluded because raw Rails span names can carry too much | |
| cardinality. `status.code` is explicitly listed as a dimension so the dashboard's "error | |
| count and error rate" panel is derivable; it has only a few values | |
| (`STATUS_CODE_OK`, `STATUS_CODE_ERROR`, `STATUS_CODE_UNSET`), so the cardinality | |
| cost is negligible. Compute error rate from the connector's request-count series |
And add it to the dimensions: block:
dimensions:
- name: service.name
- name: original_cpln_workload
- name: http.route
- name: status.codeThis makes the intent clear regardless of the pinned collector version.
| with one root span and one child span in collector validation. Once the | ||
| expression is verified, production configs can switch to `ignore` so a later | ||
| expression regression cannot fail collector startup. |
There was a problem hiding this comment.
Recommending error_mode: ignore in production so "a later expression regression cannot fail collector startup" trades one risk for a worse one: a broken OTTL expression silently becomes a no-op that passes every span — including child spans — into the spanmetrics connector, which can cause a sudden cardinality explosion with zero log output. The alert on line 544 ("filter's dropped-span count falling to zero") is the only safety net, but it is easy to configure this setting before that alert is in place.
Consider strengthening the guidance with a cross-reference:
| with one root span and one child span in collector validation. Once the | |
| expression is verified, production configs can switch to `ignore` so a later | |
| expression regression cannot fail collector startup. | |
| expression is verified, production configs can switch to `ignore` **only when the | |
| alert in [Alert Starting Point](#alert-starting-point) (collector filter's dropped-span | |
| count falling to zero) is already active and routed to a paging contact point**. | |
| Without that alert, a future expression regression causes a silent cardinality explosion. |
| ```yaml | ||
| extensions: | ||
| zpages: | ||
| endpoint: "0.0.0.0:55679" | ||
|
|
||
| service: | ||
| extensions: | ||
| - zpages | ||
| pipelines: | ||
| traces: | ||
| receivers: | ||
| - otlp | ||
| processors: | ||
| - transform/normalize | ||
| - filter/non_root_spans | ||
| - batch | ||
| exporters: | ||
| - spanmetrics/http_root_span_latency | ||
| metrics: | ||
| receivers: | ||
| - spanmetrics/http_root_span_latency | ||
| processors: | ||
| - batch | ||
| exporters: | ||
| - prometheus | ||
| ``` |
There was a problem hiding this comment.
The service.pipelines block references receivers: [otlp] and exporters: [prometheus], but no receivers: or exporters: blocks are shown anywhere in the guide. The minimum components list (line 286) names both, but without a YAML snippet readers must look up the syntax separately.
Adding stubs would make the config self-contained:
receivers:
otlp:
protocols:
http:
endpoint: "${OPEN_TELEMETRY_COLLECTOR_RECEIVER_ENDPOINT}"
exporters:
prometheus:
endpoint: "0.0.0.0:8889"
extensions:
zpages:
endpoint: "0.0.0.0:55679"
service:
extensions:
- zpages
pipelines:
traces:
receivers:
- otlp
processors:
- transform/normalize
- filter/non_root_spans
- batch
exporters:
- spanmetrics/http_root_span_latency
metrics:
receivers:
- spanmetrics/http_root_span_latency
processors:
- batch
exporters:
- prometheus
Code ReviewThis is a well-structured, comprehensive observability guide. The content is logically ordered, safety considerations are thorough, and several real-world pitfalls (cardinality, processor ordering, Issues (inline comments posted)
Minor
Strengths
|
Review: [codex] Document Grafana and OpenTelemetry on Control PlaneOverall: Well-structured, technically accurate guide. The cardinality warnings, processor-order explanation, rollout checklist, and safety notes are all solid. A few issues worth addressing before merging. What's good
IssuesSecurity — Redis data sanitization The PG instrumentation is configured with Reliability — The guide recommends switching the filter processor to Clarity — The prose says " Minor
|
|
|
||
| config.use "OpenTelemetry::Instrumentation::Rails" | ||
| config.use "OpenTelemetry::Instrumentation::PG", { db_statement: :obfuscate } | ||
| config.use "OpenTelemetry::Instrumentation::Redis" |
There was a problem hiding this comment.
The PG instrumentation above uses db_statement: :obfuscate to prevent SQL parameter values from leaking into traces, but Redis gets no equivalent option. Redis values can hold session tokens, user-identifying keys, or cached payloads.
The opentelemetry-instrumentation-redis gem supports the same option:
| config.use "OpenTelemetry::Instrumentation::Redis" | |
| config.use "OpenTelemetry::Instrumentation::Redis", { db_statement: :obfuscate } |
If the team decides raw Redis commands are acceptable in traces (e.g. keys are never sensitive), add a comment explaining that decision so future readers don't add obfuscation by mistake.
| with one root span and one child span in collector validation. Once the | ||
| expression is verified, production configs can switch to `ignore` so a later | ||
| expression regression cannot fail collector startup. |
There was a problem hiding this comment.
This recommendation deserves a stronger caveat. Switching to error_mode: ignore means a future OTTL regression (typo, collector-contrib API change, etc.) will silently pass every span — including all child spans — through the filter into the spanmetrics connector. Cardinality explodes with no error log and no alert unless the team has the dropped-span monitor in place and working.
The trade-off framing here is: loud startup failure vs. silent cardinality storm. Startup failure is immediately visible in the workload health dashboard; cardinality inflation is not. Consider recommending propagate everywhere and instead investing in a thorough OTTL expression test in the collector validation script, which the guide already calls for. That makes the expression regression-safe without needing the monitor as a backstop.
| cardinality. `status.code` is kept as a dimension so the dashboard's "error | ||
| count and error rate" panel is derivable; it has only a few values |
There was a problem hiding this comment.
status.code survives here because it is a spanmetrics connector default dimension and is absent from exclude_dimensions, but that implicit dependency is version-sensitive. Readers who study the config example may not realise it's there, and someone adding it to exclude_dimensions later will silently break the error-rate panel.
Consider listing it explicitly in dimensions: to make the intent self-documenting:
| cardinality. `status.code` is kept as a dimension so the dashboard's "error | |
| count and error rate" panel is derivable; it has only a few values | |
| dimensions: | |
| - name: service.name | |
| - name: status.code | |
| - name: original_cpln_workload | |
| - name: http.route |
This also makes the "confirm the exact metric and label names against your collector image" advice below easier to act on, since the reader can see exactly which labels to verify.
Summary
Adds a generic Grafana/OpenTelemetry guide for Control Plane Flow.
The guide covers:
ENABLE_OPEN_TELEMETRYAlso adds navigation links from
README.mdanddocs/tips.md.Safety
This is docs-only guidance. It does not apply templates, change Control Plane settings, save Grafana dashboards, edit alert rules, or touch any runtime environment.
The content is intentionally generic and avoids tenant-specific org names, app names, dashboard URLs, or secret values.
Validation
Ran locally:
Result: exit 0.
Result: exit 0.
Also ran a local text scan for secret-looking terms and tenant-specific names. Matches were generic placeholders or existing docs language, not literal secret values or app/org-specific settings.
Notes
The independent auto-review helper was not run because local policy blocked exporting repository diff/code to an external AI review service. Since this is docs-only, I used local inspection plus the link and whitespace checks above.
Note
Low Risk
Documentation-only changes with no runtime, CLI, or infrastructure behavior changes.
Overview
Adds
docs/grafana-opentelemetry.md, a generic playbook for app-level observability on Control Plane: how OpenTelemetry complements built-in Grafana platform metrics, an internal collector → Prometheus metrics → Grafana scrape flow, Rails opt-in behindENABLE_OPEN_TELEMETRY, Control Plane resource attributes, collector workload/config patterns (including spanmetrics and cardinality pitfalls), plus rollout, validation, dashboard/alert starters, and safety notes.README.mdanddocs/tips.mdlink to the new guide (Tips gets a short intro section and renumbered table of contents).docs/commands.mdonly updates theupdate-github-actionsexample release tag fromv5.1.0tov5.1.1.Reviewed by Cursor Bugbot for commit 4596591. Bugbot is set up for automated code reviews on this repo. Configure here.