You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
the controllers in 'tekton-pipelines-remote-resolvers' and then consumer controllers in 'tekton-pipeline-controller' should retry on well known transient k8s errors like timeouts/deadlines-exceeded
Actual Behavior
right now permanent errors are thrown and no retry occurs
Steps to Reproduce the Problem
unfortunately, you need a system with enough load to incur such a transient error here
Additional Info
Kubernetes version:
minimally 26.x and later
Tekton Pipeline version:
minimally v0.58.0 and later
On our production system, during both sides of remote resolution (core controller and resolver) typically transient kubernetes errors were being treated as permanent knative errors and no attempts at trying to reconcile again were made, leading to failures which could be avoided.
An example log snippet from the core controller:
Pipeline rh-acs-tenant/operator-on-pull-request-bwqxj can't be Run; it contains Tasks that don't exist: Couldn't retrieve Task "": retryable error validating referenced object source-build: Internal error occurred: failed calling webhook "validation.webhook.pipeline.tekton.dev": failed to call webhook: Post "https://tekton-pipelines-webhook.openshift-pipelines.svc:443/resource-validation?timeout=10s": context deadline exceeded
Accompanying log snippet from the resolver
{"level":"error","ts":"2024-04-17T10:50:05.866Z","logger":"controller","caller":"controller/controller.go:566","msg":"Reconcile error","commit":"f0a1d64","knative.dev/traceid":"b893d6a6-2eb7-4a53-b502-1348803a7085","knative.dev/key":"rh-acs-tenant/bundles-780a1fe396cb0f8c702b34e9289fc770","duration":"10.3628985s","error":"error updating resource request \"rh-acs-tenant/bundles-780a1fe396cb0f8c702b34e9289fc770\" with data: Internal error occurred: failed calling webhook \"webhook.pipeline.tekton.dev\": failed to call webhook: Post \"https://tekton-pipelines-webhook.openshift-pipelines.svc:443/defaulting?timeout=10s\": context deadline exceeded","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:491"}
The text was updated successfully, but these errors were encountered:
Expected Behavior
the controllers in 'tekton-pipelines-remote-resolvers' and then consumer controllers in 'tekton-pipeline-controller' should retry on well known transient k8s errors like timeouts/deadlines-exceeded
Actual Behavior
right now permanent errors are thrown and no retry occurs
Steps to Reproduce the Problem
Additional Info
On our production system, during both sides of remote resolution (core controller and resolver) typically transient kubernetes errors were being treated as permanent knative errors and no attempts at trying to reconcile again were made, leading to failures which could be avoided.
An example log snippet from the core controller:
Pipeline rh-acs-tenant/operator-on-pull-request-bwqxj can't be Run; it contains Tasks that don't exist: Couldn't retrieve Task "": retryable error validating referenced object source-build: Internal error occurred: failed calling webhook "validation.webhook.pipeline.tekton.dev": failed to call webhook: Post "https://tekton-pipelines-webhook.openshift-pipelines.svc:443/resource-validation?timeout=10s": context deadline exceeded
Accompanying log snippet from the resolver
{"level":"error","ts":"2024-04-17T10:50:05.866Z","logger":"controller","caller":"controller/controller.go:566","msg":"Reconcile error","commit":"f0a1d64","knative.dev/traceid":"b893d6a6-2eb7-4a53-b502-1348803a7085","knative.dev/key":"rh-acs-tenant/bundles-780a1fe396cb0f8c702b34e9289fc770","duration":"10.3628985s","error":"error updating resource request \"rh-acs-tenant/bundles-780a1fe396cb0f8c702b34e9289fc770\" with data: Internal error occurred: failed calling webhook \"webhook.pipeline.tekton.dev\": failed to call webhook: Post \"https://tekton-pipelines-webhook.openshift-pipelines.svc:443/defaulting?timeout=10s\": context deadline exceeded","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:491"}
The text was updated successfully, but these errors were encountered: