Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InitContainer Ordering Issue with CloudSQL Operator 1.6.0 in Istio-Managed Environments #641

Open
utmaks opened this issue Nov 26, 2024 · 9 comments
Assignees

Comments

@utmaks
Copy link

utmaks commented Nov 26, 2024

Expected Behavior

Right order for initContainers / integration with Istio.

Actual Behavior

Description:
During the upgrade of CloudSQL Operator to version 1.6.0, issues were encountered when deploying new instances of the
application. While the existing application instances continued functioning as expected, attempts to start new pods
failed due to critical container startup errors, which would be unacceptable in a production environment.

Details:
After upgrading to version 1.6.0 to utilize the minSigtermDelay parameter, the following problem was observed:

  • New instances of the application failed to initialize due to startup probe errors from the CloudSQL container:
    Startup probe failed: Get "http://10.210.0.89:15020/app-health/csql-apps-apps/startupz": dial tcp 10.210.0.89:15020: connect: connection refused.
  • Existing pods already in operation continued to function correctly, with no observed disruptions in their database
    interactions.

The issue seems tied to Istio sidecar injection. Specifically:

  • New pods attempted to start the istio-init container but failed before Istio was fully initialized.
  • The behavior differs from previous versions of the operator, which allowed greater compatibility in Istio-managed
    environments.

Resolution Attempt:
Disabling Istio temporarily resolved the problem, allowing the CloudSQL container to start correctly in the new pods.
However, this is not an acceptable solution for production workloads where Istio is required.

Proposed Fix for CloudSQL Operator team:
Reintroduce
user-configurable options for sidecar compatibility (e.g., sidecarType), as seen in earlier commits, to
prevent such failures in Istio environments and ensure robust behavior during deployment in production scenarios.

Steps to Reproduce the Problem

  1. Install Istio (1.19.3)
  2. Install Operator
  3. Create a Deployment with CloudSQL annotation in namespace managed by Istio

Specifications

  • Version: 1.6.0
  • Platform: v1.30.5-gke.1443001
  • Istio version: 1.19.3
@jackwotherspoon
Copy link
Collaborator

Thanks for this @utmaks 👏

@hessjcg is OOO this week but will take a look next week when he is back

@hessjcg
Copy link
Collaborator

hessjcg commented Dec 2, 2024

@utmaks,

This seems like a good reason to add a configuration parameter to the AuthProxyWorkload CRD to allow you and others to make the proxy run as a PodSpec.Container container instead of a PodSpec.InitContainer sidecar container.

I am curious to understand how Istio 1.19.3 and Operator 1.6.0 conspire to create pods that won't start. Would you mind posting the pod declaration for one of these failing pods (redacted, of course)?

I wonder if Istio adding its sidecar container to PodSpec.Container or PodSpec.InitContainer? What is the order of the containers, and is Istio starting first? Are there issues with the health checks?

@azunna1
Copy link

azunna1 commented Dec 23, 2024

Using these annotations on the pod template fixes the issue - sidecar.istio.io/rewriteAppHTTPProbers: 'false', traffic.sidecar.istio.io/excludeInboundPorts: '9801'. The problem is that the istio-validation init container always starts first.
If you have strict mTls mode enabled, healthchecks may fail due to the startup probe rewrite not being present.

@azunna1
Copy link

azunna1 commented Dec 23, 2024

For a more permanent fix, the csql sidecar containers should run as user 1337 as stated here: https://cloud.google.com/knowledge/kb/pod-are-fail-to-start-due-to-init-containers-not-starting-with-istio-cni-enabled-000007358
I tried using the AuthProxyWorkload's AuthProxyWorkloadSpec container field to set it but this overrides the default configuration. Would be nice if it did a merge instead. @hessjcg

@azunna1
Copy link

azunna1 commented Jan 28, 2025

Hi @hessjcg just wanted to check for updates on this?

@hessjcg
Copy link
Collaborator

hessjcg commented Jan 31, 2025

Hi @azunna1,

Thank you for proposing a solution. I'm glad to hear that fixing the sidecar is as simple as setting the user ID on the container. I would like to update the operator to detect Istio and set user id '1337'. Do you have any suggestions on how the operator can query for istio?

I think that an istio-specific fix would be better than allowing any free-form modification to the proxy container by merging the Container field. While merging container specs would enable a convenient hacks like this, it could also make unexpected errors more likely when you upgrade the operator in the future.

@azunna1
Copy link

azunna1 commented Feb 6, 2025

I believe to detect istio, you can check for the presence of some annotations in the pod such as istio.io/rev or labels such as sidecar.istio.io/inject=true and service.istio.io/canonical-name. Also istio sidecars have the container name istio-proxy. Not sure if this would be a useful approach for the cloudsql proxy mutating webhook.

@azunna1
Copy link

azunna1 commented Feb 13, 2025

Hi @hessjcg, would you like some assistance with this? My Golang is a bit rough but i could give this issue a try with some guidance. We use istio in our environment and i'm avoiding making an update to the versions that make use of the initContainer sidecar because it would break all our deployments.

@hessjcg
Copy link
Collaborator

hessjcg commented Feb 13, 2025

@azunna1 By all means, give it a try! Here are some pointers to get you started.

I can see two approaches. We can implement both. Approach #1 allows the override of the SecurityContext on the proxy container. Approach #2 makes the container istio-compatible by default.

Approach 1: Allow override of container SecurityContext

This way, users can set their own SecurityContext for the proxy containers in the AuthProxyWorkload resource.

  1. Add a new SecurityContext field to AuthProxyContainerSpec similar to the AuthProxyContainerSpec.Resources field.
  2. Modify the applyContainerSpec() method to set the container.SecurityContext field to set the SecurityContext field. similar to how it is handled for Resources
  3. Add unit tests to podspec_update_tests.go, again similar to TestResourcesFromSpec.

Approach 2: Detect Istio

This way, the operator automatically configures the container when Istio is present.

  1. Add a field istioPresent bool the updateState struct.
  2. Detect if istio is present from the pod metadata in updateState.update(). On this line you can get the entire corev1.Pod resource as wl.Pod. I'm not sure how to determine from the Pod metadata whether Istio is installed. You will need to do some research.
  3. Modify the applyContainerSpec() method to set the container.SecurityContext field to set the user field appropriately if istioPresent is true. link
  4. Add unit tests to podspec_update_tests.go, You can copy TestPodUpdateWorkload() as a starting point.
  5. Test this end-to-end manually on your own Istio-enabled cluster and let us know the results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants