Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update otelcontribcol to latest #1391

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

tiffanny29631
Copy link
Contributor

@tiffanny29631 tiffanny29631 commented Aug 13, 2024

Updating otelcontribcol to latest to include fix of open-telemetry/opentelemetry-collector-contrib#35006 and other essential security updates in components like Prometheus exporter.

The upgrade covers a breaking change in otelcontribcol of defaulting to local host instead of 0.0.0.0, the feature is now stable and feature gate has been removed, Config Sync should explicitly set endpoint to host 0.0.0.0 to expose port for external pod access. This applies to the metric receiver opencensus, and the exporter prometheus.

This update also separates the ConfigMap for the otel-agent, distinguishing the one used by the reconcilers from the ConfigMap used by the reconciler manager and the resource group controller. As a result, only the ConfigMap for the reconcilers will include the Config Sync-related resource tags.

go/cs-upgrade-otelcontribcol-from-103

@google-oss-prow google-oss-prow bot requested a review from nan-yu August 13, 2024 23:20
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from tiffanny29631. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karlkfi
Copy link
Contributor

karlkfi commented Aug 13, 2024

Why id this safe to re-apply now?

@tiffanny29631
Copy link
Contributor Author

Why id this safe to re-apply now?

Are you referring to the upstream fix or our release schedule?

@karlkfi
Copy link
Contributor

karlkfi commented Aug 14, 2024

I mean, if we reverted it before, why is it now safe to reapply? What changed?

@tiffanny29631
Copy link
Contributor Author

I mean, if we reverted it before, why is it now safe to reapply? What changed?

To ensure correct behavior, we've included a no_op_label in the metricstransform processor's label_set filter (explained in the description). This is a temporary solution until the upstream fix is available, but we need to update to the latest otelcontribcol regardless.

@sdowell
Copy link
Contributor

sdowell commented Aug 14, 2024

I mean, if we reverted it before, why is it now safe to reapply? What changed?

To ensure correct behavior, we've included a no_op_label in the metricstransform processor's label_set filter (explained in the description). This is a temporary solution until the upstream fix is available, but we need to update to the latest otelcontribcol regardless.

This isn't really a revert then, since new changes were added to the original commit. Suggest just creating a new commit instead of reapplying.

we need to update to the latest otelcontribcol regardless.

why?

@tiffanny29631
Copy link
Contributor Author

Re @sdowell While not currently anticipated, there's a possibility that future vulnerability fixes need dependency upgrades that are incompatible with current version we use (0.103). In such cases, the version upgrade process will be similar. Otelcontribcol versions beyond 0.104.0 will introduce breaking changes.

I could start a new PR.

@tiffanny29631 tiffanny29631 changed the title Reapply "Bump otelcontribcol version to v0.106.0-gke.2 (#1369)" (#1376) Update otelcontribcol to 0.106.0-gke.2 Aug 16, 2024
@tiffanny29631
Copy link
Contributor Author

Recreated the commit and message.

@tiffanny29631 tiffanny29631 changed the title Update otelcontribcol to 0.106.0-gke.2 Update otelcontribcol to latest Dec 16, 2024
@tiffanny29631 tiffanny29631 marked this pull request as draft December 16, 2024 22:33
@tiffanny29631 tiffanny29631 marked this pull request as ready for review December 17, 2024 01:46
pkg/metrics/otel.go Outdated Show resolved Hide resolved
@@ -81,6 +81,7 @@ spec:
ports:
- containerPort: 55678 # Default OpenCensus receiver port.
- containerPort: 8888 # Metrics.
- containerPort: 13133 # Health check
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the otel-agent readiness probe port?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. The readiness check would fail and container will be marked as notReady by k8s if this port is not specified. It has something to do with the otelcontribcol switching from 0.0.0.0 to default using localhost, even though the healthcheck configuration has specified to listen to 0.0.0.0:13133 the port still need to be specified in the manifest.

@tiffanny29631
Copy link
Contributor Author

/blocked open-telemetry/opentelemetry-collector-contrib#34844

Generated files (go.mod etc.) were removed, which would break AutoVC integration. No affect to the continuous build of the internal mirror so far.

A CI can be added to restore the generated files to a new branch, along with base image update for CVE scan purpose.

@janetkuo
Copy link
Contributor

@code-review-assist review

Copy link

@code-review-assist code-review-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request updates the otelcontribcol version and introduces several important changes to address a breaking change in the default endpoint behavior. The upgrade to v0.115.0-gke.0 defaults the opencensus receiver and prometheus exporter endpoints to localhost, requiring explicit configuration for external access. The PR description effectively communicates this change and its implications. The subsequent modifications to set the opencensus endpoint to 0.0.0.0:55678, the prometheus endpoint to 0.0.0.0:8675, and the health_check endpoint to 0.0.0.0:13133 are necessary and correctly implemented.

The introduction of the otel-agent-reconciler ConfigMap is a valuable addition. The accompanying comments explaining the rationale for the changes, including the removal of the attributes processor in the otel-agent ConfigMap and the use of no_op_label in the metric pipelines, enhance code clarity and maintainability. The comprehensive updates to the test files ensure consistency and validate the implemented changes.

To further improve this PR, I recommend the following:

  • README Update: Document the new otel-agent-reconciler ConfigMap, its purpose, and the changes to the default endpoints. This will aid users in understanding the modifications and configuring their deployments accordingly.
  • Version Bump Justification: While the PR description mentions the breaking change, it's beneficial to explicitly state the reason for the version bump to v0.115.0-gke.0 in the commit message or PR description. This helps track changes and provides valuable context for future maintainers.
  • NetworkPolicy Documentation: Add a comment to the NetworkPolicy in manifests/templates/otel-collector.yaml explaining its purpose and the ports it opens. This enhances readability and maintainability.
  • ConfigMap Clarification: Add a comment to test/kustomization/expected.yaml clarifying the relationship between the updated otel-agent-config.yaml and the otel-agent-reconciler-config.yaml.
  • Upstream Issue Link: Include a link to the upstream issue ([processor/metricstransform] Aggregation skipped when label_set is empty but not nil open-telemetry/opentelemetry-collector-contrib#34430) related to the use of no_op_label directly in the comments within pkg/metrics/otel.go. This facilitates quick access to the relevant context.

Makefile Outdated Show resolved Hide resolved
manifests/otel-agent-cm.yaml Outdated Show resolved Hide resolved
manifests/templates/otel-collector.yaml Outdated Show resolved Hide resolved
pkg/metrics/otel.go Outdated Show resolved Hide resolved
test/kustomization/expected.yaml Outdated Show resolved Hide resolved
@tiffanny29631 tiffanny29631 force-pushed the otel branch 3 times, most recently from 8e9c40b to d22faf6 Compare January 16, 2025 19:19
@tiffanny29631
Copy link
Contributor Author

/retest

This change updates otelcoontribcol to latest and modifies to fit breaking changes from 0.104.0 and 0.106.0.

Breaking change in 0.104.0 https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.104.0
Breaking change in 0.106.0 open-telemetry/opentelemetry-collector-contrib#34430

- Localhost is now the default setting, while otel-agent and otel-collector require 0.0.0.0, so the feature gate has been removed.
- The format of the environment variable was updated to meet the new syntax requirements. The otel-agent ConfigMap was split between the reconciler and controllers, ensuring that sync-related labels are only applied to reconcilers.
- A `no_op_label` has been added to ensure that the aggregation in the metricstransform processor filters on all metric labels. This is a temporary workaround until a permanent fix is implemented upstream.
Also specifying host 0.0.0.0 instead of the new default localhost.
For all receivers and health_check.

Adding networkpolicy for ingress on essential ports.
- port 13133: health_check for readiness check. Apply to otel-agent in reconciler-manager and reconcilers; Otel-collector.
- port 55678: opencensus receiver for otel-collector to receive metric from other pods

Fix test ConfigMap.
Only keep networkpolicy for otel-collector
Version 0.117.0-gke.x all have container start up issue. See http://gkecl/1242950 for more context.
@tiffanny29631
Copy link
Contributor Author

/retest-required

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants