postStart hook commands timeout #1440

Open
wants to merge 11 commits into
base: main
from

Conversation

akurinnoy
Contributor

What does this PR do?

This PR addresses postStart hook failures in DevWorkspaces: when hook commands do not exit within the timeout period, the workspace pod gets stuck in the Terminating state and is never deleted.

This PR resolves the issue by:

  • Introducing a timeout for the postStart hook. User-provided commands are now wrapped with the timeout utility, so postStart hook commands are terminated if they exceed a configurable duration. The duration is set in the DevWorkspaceOperatorConfig (a value of 0 means no timeout):
    # DevWorkspaceOperatorConfig
    # ...
    config:
      workspace:
        postStartTimeout: 30 # Timeout in seconds
  • Adding parsing logic that interprets the various Kubelet messages to extract the exact reason or exit code of a lifecycle hook failure.
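
The wrapping described in the first bullet can be sketched in plain shell. This is only an illustration, assuming GNU coreutils `timeout` is available in the container; the helper name `run_with_timeout` and the `--preserve-status` flag are assumptions made for the example, and the wrapper the operator actually generates may look different.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): run a user-provided command
# line under GNU `timeout`. A limit of 0 disables the timeout, mirroring the
# "0 means no timeout" semantics described above.
run_with_timeout() {
  limit="$1"   # timeout in seconds; 0 means no timeout
  cmd="$2"     # user-provided command line
  if [ "$limit" -eq 0 ]; then
    sh -c "$cmd"
  else
    # With --preserve-status, timeout exits with the command's own status
    # (128 + signal number when it was killed) instead of its default 124.
    timeout --preserve-status "$limit" sh -c "$cmd"
  fi
}

# A command that would otherwise never return is stopped after 1 second.
run_with_timeout 1 "echo 'PostStart: starting'; sleep 30"
status=$?
echo "exit code: $status"
```

Without `--preserve-status`, GNU `timeout` reports 124 when the limit is hit, so the choice of flag determines which exit code surfaces in the failure message.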

What issues does this PR fix or reference?

https://issues.redhat.com/browse/CRW-8329

Is it tested? How?

  1. Install DWO from this PR:
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: devworkspace-operator-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/okurinny/devworkspace-operator-index:postStartHookTimeout
  publisher: Red Hat
  displayName: DevWorkspace Operator Catalog
  updateStrategy:
    registryPoll:
      interval: 5m
EOF
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: devworkspace-operator
  namespace: openshift-operators
spec:
  channel: next
  installPlanApproval: Automatic
  name: devworkspace-operator
  source: devworkspace-operator-catalog
  sourceNamespace: openshift-marketplace
EOF
  2. Create a DevWorkspaceOperatorConfig with the postStart hook timeout duration (in seconds):
oc apply -f - <<EOF
apiVersion: controller.devfile.io/v1alpha1
kind: DevWorkspaceOperatorConfig
metadata:
  name: devworkspace-operator-config
  namespace: openshift-operators
config:
  workspace:
    postStartTimeout: 30
EOF
  3. Create a problematic DevWorkspace designed to have its postStart hook time out:
oc apply -f - <<EOF
apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: problematic-workspace
spec:
  started: true
  template:
    components:
      - name: tools
        container:
          image: quay.io/devfile/universal-developer-image:ubi9-latest
          memoryLimit: "1Gi"
          memoryRequest: "512Mi"
          cpuRequest: "250m"
          cpuLimit: "1000m"
    commands:
      - id: sleep-infinity-cmd
        exec:
          component: tools
          commandLine: "echo 'PostStart: Starting infinite sleep...'; sleep infinity; echo 'PostStart: Sleep finished (should not be reached)'"
    events:
      postStart:
        - sleep-infinity-cmd
EOF
  4. Watch the DevWorkspace:
oc get dw problematic-workspace -w
  5. The DevWorkspace should eventually enter the Failed phase.
  6. The status.message of the DevWorkspace should give a reason for the failure that indicates a timeout. For example: Error creating DevWorkspace deployment: Container tools has state [postStart hook] Commands terminated by SIGTERM (likely timed out after 30s). Exit code 143.
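
The exit code in the example message follows the common shell convention that a process killed by signal N reports 128 + N, so 143 corresponds to SIGTERM (signal 15). A minimal sketch of that mapping follows; `describe_exit_code` is a hypothetical helper written for illustration and is not the operator's Kubelet message parsing logic.

```shell
#!/bin/sh
# Hypothetical helper (not part of the operator): turn a shell exit code
# into a short human-readable reason.
describe_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "succeeded" ;;
    124) echo "timed out (status GNU timeout reports by default)" ;;
    137) echo "terminated by SIGKILL (128 + 9)" ;;
    143) echo "terminated by SIGTERM (128 + 15)" ;;
    *)
      if [ "$code" -gt 128 ]; then
        # Other codes above 128 conventionally mean "killed by a signal".
        echo "terminated by signal $((code - 128))"
      else
        echo "failed with exit code $code"
      fi
      ;;
  esac
}

describe_exit_code 143
describe_exit_code 1
```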

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

@akurinnoy akurinnoy self-assigned this May 29, 2025
@akurinnoy akurinnoy requested review from dkwon17 and ibuziuk as code owners May 29, 2025 13:06

openshift-ci bot commented May 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: akurinnoy
Once this PR has been reviewed and has the lgtm label, please assign dkwon17 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@akurinnoy akurinnoy requested a review from rohanKanojia May 29, 2025 13:07
@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from e90b773 to c342798 Compare May 29, 2025 13:56
@rohanKanojia
Contributor

I tried the above-mentioned steps and was able to see the problematic workspace failing with a [postStart hook] message:

oc get pods -w
NAME                                               READY   STATUS              RESTARTS   AGE
devworkspace-controller-manager-6c948bbf56-k6262   2/2     Running             0          32m
devworkspace-webhook-server-8597b84fc4-kglmf       2/2     Running             0          32m
devworkspace-webhook-server-8597b84fc4-m9rc5       2/2     Running             0          32m
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     ContainerCreating   0          6s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     ContainerCreating   0          9s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     PostStartHookError   0          15s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Terminating          0          15s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Terminating          1 (14s ago)   28s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Error                1             28s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Error                1             29s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Error                1             29s

oc get dw
NAME                    DEVWORKSPACE ID             PHASE    INFO
problematic-workspace   workspace35712747d3d64d73   Failed   Error creating DevWorkspace deployment: Container tools has state [postStart hook] Commands failed (Kubelet reported exit code 1)
