Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove provisioning paused annotation #6306

Conversation

carbonin
Copy link
Member

@carbonin carbonin commented May 10, 2024

This was added to avoid a race between the provisioning stack running on the bootstrap control plane and the real one running in the installed cluster. During IPI this annotation is removed when bootstrap is finished, but for assisted and agent installs it was never getting removed.

This was causing baremetal operator to never start in the installed cluster which prevented the machine API from recognizing that all machines were running. This, in turn, resulted in the cluster never being able to finish finalizing with the machine API operator reporting an degraded condition like:

- lastTransitionTime: "2024-05-09T22:11:21Z"
  message: 'Failed when progressing towards operator: 4.16.0-0.nightly-2024-05-08-222442
    because minimum worker replica count (2) not yet met: current running replicas
    0, waiting for [test-infra-cluster-96-64zpj-worker-0-59bzh test-infra-cluster-96-64zpj-worker-0-cktvz]'
  reason: SyncingFailed
  status: "True"
  type: Degraded

List all the issues related to this PR

Caused by https://issues.redhat.com/browse/OCPBUGS-33157
Related to https://issues.redhat.com/browse/OCPBUGS-33493

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see [CONTRIBUTING] guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 10, 2024
@openshift-ci openshift-ci bot requested review from eranco74 and tsorya May 10, 2024 16:13
Copy link

openshift-ci bot commented May 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carbonin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2024
@carbonin carbonin force-pushed the remove_provisioning_paused_annotation branch from b45aad7 to 69d9d6c Compare May 10, 2024 16:17
Copy link

codecov bot commented May 10, 2024

Codecov Report

Attention: Patch coverage is 60.52632% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 68.28%. Comparing base (904ee71) to head (5e534e3).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6306      +/-   ##
==========================================
- Coverage   68.29%   68.28%   -0.01%     
==========================================
  Files         241      241              
  Lines       35863    35878      +15     
==========================================
+ Hits        24491    24499       +8     
- Misses       9212     9215       +3     
- Partials     2160     2164       +4     
Files Coverage Δ
internal/ignition/ignition.go 60.30% <60.52%> (-0.28%) ⬇️

... and 1 file with indirect coverage changes

This was added to avoid a race between the provisioning stack running
on the bootstrap control plane and the real one running in the installed
cluster. During IPI this annotation is removed when bootstrap is
finished, but for assisted and agent installs it was never getting
removed.

This was causing baremetal operator to never start in the installed
cluster which prevented the machine API from recognizing that all
machines were running. This, in turn, resulted in the cluster never
being able to finish finalizing with the machine API operator reporting
an degraded condition like:

```
- lastTransitionTime: "2024-05-09T22:11:21Z"
  message: 'Failed when progressing towards operator: 4.16.0-0.nightly-2024-05-08-222442
    because minimum worker replica count (2) not yet met: current running replicas
    0, waiting for [test-infra-cluster-96-64zpj-worker-0-59bzh test-infra-cluster-96-64zpj-worker-0-cktvz]'
  reason: SyncingFailed
  status: "True"
  type: Degraded
```

Caused by https://issues.redhat.com/browse/OCPBUGS-33157
Related to https://issues.redhat.com/browse/OCPBUGS-33493
@carbonin carbonin force-pushed the remove_provisioning_paused_annotation branch from 69d9d6c to 5e534e3 Compare May 10, 2024 20:09
@carbonin carbonin changed the title [WIP] Remove provisioning paused annotation Remove provisioning paused annotation May 10, 2024
@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 10, 2024
Copy link

openshift-ci bot commented May 10, 2024

@carbonin: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@@ -885,13 +887,18 @@ func (g *installerGenerator) updateBootstrap(ctx context.Context, bootstrapPath
// drop this from the list of Files because we don't want to run BMO
continue
}
if provErr := removeProvisioningPausedAnnotation(&config.Storage.Files[i]); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, whenever assisted is creating a BMH resource the BMH status is unmanaged (SaaS) or externally provisioned (ZTP). if assisted behavior change or ironic/metal3 will try to update the BMh status we will end up with the original issue (https://issues.redhat.com//browse/OCPBUGS-33157).
Perhaps removing this annotation from the assited-installer-controller after all control plane nodes are ready is preferable since it aligns with the desired CBO behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The annotation is purely to prevent BMO in the cluster interfering with the BMO on the baremetal bootstrap host while it is still doing its work.
In a system with no baremetal bootstrap host it is not required.

func fileToBMH(file *config_latest_types.File) (*bmh_v1alpha1.BareMetalHost, error) {
func removeProvisioningPausedAnnotation(file *config_latest_types.File) error {
provisioning := &metal3iov1alpha1.Provisioning{}
if err := deserializeToObject(file, provisioning); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause us problems in future, because the API definition will have to be kept up to date every time a field is added to the Provisioning resource, and we will certainly forget that this is here and then wonder why the new data is getting silently dropped.

This might still be the best thing to do in the short term, but it's not a great substitute for a long-term fix that doesn't require adding this annotation at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long do you think it will be until we get a proper long-term fix?
Do you think doing this in the controller against the actual cluster with a patch API call will be any more maintainable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We got a successful test on openshift/installer#8391 but there's a few moving parts so it might take a minute to land.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a patch API call would avoid the out-of-date-API-definition problem.

@carbonin
Copy link
Member Author

Closing this as the issue was fixed in the most recent 4.16 payloads

@carbonin carbonin closed this May 22, 2024
@carbonin carbonin deleted the remove_provisioning_paused_annotation branch May 22, 2024 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants