
MCO-1131: Merge implementation machineOSBuild and machineOSConfig in MCO #4327

Merged (6 commits) on May 9, 2024

Conversation

@inesqyx (Contributor) commented Apr 17, 2024

A further cleanup and rebase with the main branch on MCO-1042: ocb-api implementation in MCO. It is kept separate instead of being merged into @cdoern 's PR because (1) the functionality of @cdoern 's PR has been verified, and keeping a good copy of it without directly modifying it will help with future debugging; and (2) a large rebase onto the kube/FG bump change in the master branch has happened in this PR, so bringing it back to @cdoern 's PR would involve another round of conflict resolution and introduce unverified changes into the original solution.

For reviewers

For testing

Current Progress

  • Code Cleanup: Done, ready for review
  • Functionality Testing: WIP, not ready for testing
  • Test: WIP, not ready for review

@openshift-ci-robot (Contributor) commented Apr 17, 2024

@inesqyx: This pull request references MCO-1042 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

(The PR description above is quoted here verbatim.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Apr 17, 2024.
@openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Apr 17, 2024.
openshift-ci bot commented Apr 17, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 18, 2024.
Review thread on pkg/controller/build/build_controller_test.go (outdated; resolved):
<-ctx.Done()
// is this... allowed?
// since users can specify different settings per pool, we need to run a controller PER pool. Otherwise, settings will be conflated, as will failures and builds.
for _, ctrl := range controllers {
Contributor:

So if I'm understanding this correctly, we now potentially return multiple build controllers, each tied to a different MachineOSConfig? Remind me why we made this change again?

Review thread on pkg/apihelpers/machineosbuild_apihelpers.go (outdated; resolved):
}

// If the pool is not layered, this annotation should not exist.
return val == "" || !ok
Contributor:

Do we need to distinguish between a legitimately empty/missing value and an erroneous state? I remember running into an issue like this during the revert work and needing to differentiate.
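For illustration, a hedged sketch of one way to keep those states distinct (the helper and annotation key are hypothetical, not from this PR):

func layeredAnnotationState(pool *mcfgv1.MachineConfigPool) string {
	val, ok := pool.Annotations[layeredPoolAnnotationKey] // hypothetical key constant
	switch {
	case !ok:
		return "absent" // expected for a non-layered pool
	case val == "":
		return "empty" // present but empty: likely an erroneous state worth surfacing
	default:
		return "layered" // present with a value
	}
}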

Contributor:

@cdoern look at this!!

Review threads (resolved):
pkg/controller/common/layered_node_state.go
pkg/controller/common/mos_state.go (outdated)
pkg/controller/common/mos_state.go
@openshift-merge-robot removed the needs-rebase label on Apr 22, 2024.
@yuqi-zhang (Contributor) left a comment:

Leaving some comments from our live debug session

Review threads:
pkg/controller/node/status.go (outdated; resolved)
pkg/controller/node/node_controller.go (resolved)
return nil, nil, err
}

for _, config := range configList.Items {
Contributor:

Consider looking for a duplicate pool name, and either error out or handle it explicitly.
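As a sketch of that check, assuming the surrounding function returns three values as in the excerpt above (the pool-name field follows the v1alpha1 API shape, but treat it as illustrative):

seen := make(map[string]struct{})
for _, config := range configList.Items {
	poolName := config.Spec.MachineConfigPool.Name
	if _, dup := seen[poolName]; dup {
		// Two MachineOSConfigs targeting one pool is ambiguous; error out.
		return nil, nil, fmt.Errorf("multiple MachineOSConfigs target pool %q", poolName)
	}
	seen[poolName] = struct{}{}
	// ... existing per-config handling ...
}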


for _, build := range buildList.Items {
if build.Spec.MachineOSConfig.Name == ourConfig.Name {
if build.Spec.DesiredConfig.Name == pool.Spec.Configuration.Name {
Contributor:

Consider disambiguating, since we may have the same MachineConfig render but a different base image in a build.
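A hedged sketch of that disambiguation; the base-image comparison is hypothetical, since the exact field name is not shown in this excerpt:

for _, build := range buildList.Items {
	sameConfig := build.Spec.MachineOSConfig.Name == ourConfig.Name
	sameRender := build.Spec.DesiredConfig.Name == pool.Spec.Configuration.Name
	sameBase := baseImageFor(build) == wantBaseImage // hypothetical accessor and value
	if sameConfig && sameRender && sameBase {
		// Only a build matching config, rendered MachineConfig, AND base image counts.
	}
}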

Review thread on pkg/controller/node/status.go (outdated; resolved).
@inesqyx marked this pull request as ready for review on April 23, 2024 at 06:13.
@openshift-ci bot removed the do-not-merge/work-in-progress label on Apr 23, 2024.
@inesqyx force-pushed the MCO-1042-2 branch 4 times, most recently from 7399b1c to f368c74, on April 23, 2024 at 20:44.
@yuqi-zhang (Contributor) commented:

Need a make go-deps, I think.

@inesqyx (Contributor, Author) commented Apr 24, 2024:

/retest-required

@inesqyx force-pushed the MCO-1042-2 branch 2 times, most recently from 805b215 to ba183da, on April 25, 2024 at 19:05.
@cheesesashimi (Member) left a comment:

@inesqyx / @umohnani8 / @cdoern I took the closest look at this that I could before my PTO. After writing e2e tests and getting better acquainted with this code, I was better able to find a few issues that should be at least discussed, if not addressed completely.

Review threads:
cmd/machine-os-builder/start.go (resolved)
cmd/machine-os-builder/start.go (outdated; resolved)
go ctrl.Run(ctx, 5)
<-ctx.Done()
// is this... allowed?
// since users can specify different settings per pool, we need to run a controller PER pool. Otherwise, settings will be conflated, as will failures and builds.
Member:

question: What happens when a MachineOSConfig is added after we've started the BuildControllers? How does it get a BuildController instance assigned to it? Conversely, if one of several MachineOSConfigs is then deleted, how does the BuildController associated with it get stopped?
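One conceivable shape for that lifecycle, purely as a sketch (none of these names exist in the PR; assumes the standard context and sync packages): key a cancel function per MachineOSConfig and drive it from informer add/delete handlers.

type perConfigControllers struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc
}

func (p *perConfigControllers) onConfigAdd(name string, start func(ctx context.Context)) {
	p.mu.Lock()
	defer p.mu.Unlock()
	ctx, cancel := context.WithCancel(context.Background())
	p.cancels[name] = cancel
	go start(ctx) // the controller runs until its context is cancelled
}

func (p *perConfigControllers) onConfigDelete(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if cancel, ok := p.cancels[name]; ok {
		cancel() // stops the controller bound to the deleted config
		delete(p.cancels, name)
	}
}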

Review threads:
pkg/controller/build/build_controller.go (outdated; resolved)
pkg/controller/build/build_controller.go (outdated; resolved)
pkg/controller/build/image_build_request.go (resolved)
pkg/controller/build/image_build_request.go (resolved)
pkg/controller/node/node_controller.go (resolved)
pkg/controller/build/build_controller.go (outdated; resolved)
@openshift-merge-robot added the needs-rebase label on Apr 27, 2024.

return build.NewWithCustomPodBuilder(cfg, buildClients), nil
for range machineOSConfigs.Items {
controllersToStart = append(controllersToStart, build.NewWithCustomPodBuilder(cfg, buildClients))
Member:

issue (blocking): If the intent is to have a single BuildController handle a single MachineOSConfig, how does a BuildController instance know which MachineOSConfig to handle? It doesn't look like the instantiated BuildController has any knowledge about a particular MachineOSConfig.

Member:

Until we have a clear answer on how these get bound to a specific BuildController instance, we may want to start only a single BuildController instance. The reason is that when multiple MachineOSConfigs are created, the multiple BuildController instances don't know which MachineOSConfig to target, so they target whichever one has an event associated with it. This results in a lot of log messages like this:

I0506 20:45:50.806184       1 build_controller.go:471] Error syncing machineosbuild pool-0: machineosbuilds.machineconfiguration.openshift.io "pool-0-rendered-pool-0-aff68db9c6081b0d56cd746090f860c0-builder" already exists
I0506 20:45:50.847069       1 build_controller.go:484] Started syncing build "pool-0" (2024-05-06 20:45:50.847055006 +0000 UTC m=+543.347506299)
I0506 20:45:50.876267       1 build_controller.go:486] Finished syncing machineOSBuilder "pool-0" (29.201665ms)
I0506 20:45:50.878236       1 build_controller.go:471] Error syncing machineosbuild pool-0: machineosbuilds.machineconfiguration.openshift.io "pool-0-rendered-pool-0-aff68db9c6081b0d56cd746090f860c0-builder" already exists
I0506 20:45:50.971138       1 build_controller.go:484] Started syncing build "pool-0" (2024-05-06 20:45:50.971123152 +0000 UTC m=+543.471574548)
I0506 20:45:50.984172       1 build_controller.go:486] Finished syncing machineOSBuilder "pool-0" (13.039368ms)
E0506 20:45:50.984291       1 build_controller.go:476] machineosbuilds.machineconfiguration.openshift.io "pool-0-rendered-pool-0-aff68db9c6081b0d56cd746090f860c0-builder" already exists
I0506 20:45:50.984316       1 build_controller.go:477] Dropping machineosbuild "pool-0" out of the queue: machineosbuilds.machineconfiguration.openshift.io "pool-0-rendered-pool-0-aff68db9c6081b0d56cd746090f860c0-builder" already exists

Basically, all of the BuildControllers react to a change, create a new object, etc. The first one to effect the change "wins". The rest end up returning errors until they drop the request from their queues. While this doesn't appear to affect how builds are started and lifecycled, each of those messages has numerous API calls associated with it. In resource-constrained environments such as SNO, that can be problematic, since we make more API calls than are necessary. As a side-effect, other controllers that watch for MachineOSBuilds / MachineOSConfigs can also react multiple times and cause unintended side-effects.

EDIT: I tested a version of this code that only has a single BuildController instance and everything seemed to work fine without causing the additional API requests.
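Under that single-instance approach, the per-config loop above would collapse to something like this sketch (the slice element type is assumed):

// One BuildController watches all MachineOSConfigs rather than one per config.
return []*build.Controller{build.NewWithCustomPodBuilder(cfg, buildClients)}, nil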

umohnani8 and others added 4 commits May 8, 2024 07:57
Signed-off-by: Urvashi Mohnani <[email protected]>
This adds the capability for BuildController to use the RHEL entitlement
secrets to allow cluster admins to inject RHEL content into their builds
that they are entitled to receive. This also allows the injection /
consumption of content into /etc/yum.repos.d as well as
/etc/pki/rpm-gpg. There are a few notes about the implementation that I
would like to have at a higher level:

- Because we run rootless Buildah, we're more prone to running into
  SELinux complications. This makes it more difficult to directly mount
  the contents of /etc/yum.repos.d, /etc/pki/entitlement, and
  /etc/pki/rpm-gpg directly into the build context. With that in mind,
  we copy everything into a series of temp directories first, and then
  mount those temp directories into the build context as a volume.
- We also create an emptyDir which is mounted into the build pod at
  /home/build/.local/share/containers. It is unclear why this is
  necessary, but as mentioned before, I suspect that this is due to
  SELinux issues.
- The e2e test suite now has the capability to stream the container logs
  from the build pod to the filesystem as there is useful information
  contained within those logs if the e2e test fails. In OpenShift CI,
  this location will be determined by the ARTIFACT_DIR env var. If this
  env var is not present, it will default to the current directory.
- For right now, the etc-pki-entitlement flow (specifically, the
  TestEntitledBuild test) is being skipped in OpenShift CI because the
  test clusters do not have that cred available. The test suite will
  automatically detect the presence (or lack thereof) of that cred in
  the openshift-config-managed namespace and run the test if it is
  available. However, the TestYumRepos test targets a very similar flow
  and can do its own setup and teardown regardless of creds preexisting.

Additionally, I took care to ensure that this does not break OKD by taking the following actions:

- I observed that the addition of the
  /home/build/.local/share/containers volume mount to the build pod
  prevented the wait-for-done container from starting when running on FCOS.
  With this in mind, I modified the build pod instantiation to not
  connect this volume mount to the wait-for-done container.
- I added a TestOnClusterBuildsOnOKD e2e test which will only run
  against an OKD cluster. Conversely, I excluded other tests from
  running against an OKD cluster since those tests make assumptions
  about things that would only be present within an OCP cluster.
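The ARTIFACT_DIR fallback described in the commit message might look like this sketch (the helper name is illustrative; assumes the standard os package):

func artifactDir() string {
	if dir, ok := os.LookupEnv("ARTIFACT_DIR"); ok && dir != "" {
		return dir // set by OpenShift CI for test artifacts
	}
	return "." // default to the current directory
}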
@sergiordlr commented:
Pre-merge testing has been tracked here: https://issues.redhat.com/browse/MCO-1149

/label qe-approved

@openshift-ci bot added the qe-approved label (signifies that QE has signed off on this PR) on May 8, 2024.
@openshift-ci-robot (Contributor) commented May 8, 2024

@inesqyx: This pull request references MCO-1131 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

(The PR description above is quoted here verbatim.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

// Determines if we should run a build, then starts a build pod to perform the
// build, and updates the MachineConfigPool with an object reference for the
// build pod.
func (ctrl *Controller) startBuildForMachineConfigPool(mosc *mcfgv1alpha1.MachineOSConfig, mosb *mcfgv1alpha1.MachineOSBuild) error {
Member:

suggestion (non-blocking): We should change the name of this method to something like startBuild or similar since the MachineConfigPool is no longer the source of truth for build state.

(We can address this later though.)

IsBuildRunning(*mcfgv1.MachineConfigPool) (bool, error)
}, oldPool, curPool *mcfgv1.MachineConfigPool) (bool, error) {
ps := newPoolState(curPool)
func (ctrl *Controller) doesMOSBExist(mosc *mcfgv1alpha1.MachineOSConfig) (*mcfgv1alpha1.MachineOSBuild, bool) {
@cheesesashimi (Member) commented May 9, 2024:

praise: Nicely done with this!

We'll probably want to make a more canonical shared implementation of this in the future since anything that works with MachineOSConfigs and MachineOSBuilds will need to perform similar operations. That doesn't need to be done at this time though.
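Such a shared helper might look like this sketch; the signature and package placement are illustrative, while the field access mirrors the earlier excerpt:

func findBuildForConfig(builds []mcfgv1alpha1.MachineOSBuild, config *mcfgv1alpha1.MachineOSConfig) (*mcfgv1alpha1.MachineOSBuild, bool) {
	for i := range builds {
		if builds[i].Spec.MachineOSConfig.Name == config.Name {
			return &builds[i], true
		}
	}
	return nil, false
}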


if !poolStateSuggestsBuild {
return false, nil
func (ctrl *Controller) getConfigAndBuildForPool(pool *mcfgv1.MachineConfigPool) (*mcfgv1alpha1.MachineOSConfig, *mcfgv1alpha1.MachineOSBuild, error) {
Member:

praise: Again, nicely done with this!

As mentioned before, we'll probably want to hoist this out of this package in the future. But that does not need to be done right now.

return false
}
}
// for _, condition := range getMachineConfigPoolBuildConditions() {
Member:

suggestion (non-blocking): If this is commented out, we should probably just remove it.

@@ -1,335 +0,0 @@
package build
Member:

note: I'm not sad to see this go away 😄

RunAsUser: &uid,
RunAsGroup: &gid,
// Octal: 0755.
var mountMode int32 = 493
Member:

note (non-blocking): One can use an octal literal by doing something like var mountMode int32 = 0o755. Not sure if that would work here, but it's just a thought.

(This does not need to be addressed right now though.)
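For reference, the two spellings denote the same value; a standalone check:

package main

import "fmt"

func main() {
	var decimalMode int32 = 493  // the form used in the diff
	var octalMode int32 = 0o755  // octal literal (Go 1.13+)
	fmt.Println(decimalMode == octalMode) // prints: true
}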

@@ -513,7 +433,11 @@ func (i ImageBuildRequest) toBuildahPod() *corev1.Pod {
Command: append(command, buildahBuildScript),
ImagePullPolicy: corev1.PullAlways,
SecurityContext: securityContext,
VolumeMounts: volumeMounts,
// Only attach the buildah-cache volume mount to the buildah container.
Member:

note: In the future, I want to be more decisive about what volume mounts, env vars, etc., are attached to which container.

(Does not need to be addressed right now)

@@ -1,96 +1,86 @@
package common

import (
"testing"
// func TestMachineOSBuildState(t *testing.T) {
@cheesesashimi (Member) commented May 9, 2024:

note: I want to get these tests running again.

(Does not need to be addressed right now because the existing test suites / etc. exercise most of the paths here. I don't think the code coverage analysis takes that into consideration though.)

@@ -1079,8 +1214,8 @@ func getAllCandidateMachines(pool *mcfgv1.MachineConfigPool, nodesInPool []*core
}

// getCandidateMachines returns the maximum subset of nodes which can be updated to the target config given availability constraints.
func getCandidateMachines(pool *mcfgv1.MachineConfigPool, nodesInPool []*corev1.Node, maxUnavailable int) []*corev1.Node {
nodes, capacity := getAllCandidateMachines(pool, nodesInPool, maxUnavailable)
func getCandidateMachines(pool *mcfgv1.MachineConfigPool, config *mcfgv1alpha1.MachineOSConfig, build *mcfgv1alpha1.MachineOSBuild, nodesInPool []*corev1.Node, maxUnavailable int, layered bool) []*corev1.Node {
Member:

suggestion (non-blocking): Make getCandidateMachines() accept a struct like candidateMachineOpts or similar. This makes it very clear at each call-site what each parameter is for.

(This does not need to be done right now though.)
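A sketch of that suggestion; the struct name comes from the comment, and the fields mirror the current parameter list:

type candidateMachineOpts struct {
	pool           *mcfgv1.MachineConfigPool
	config         *mcfgv1alpha1.MachineOSConfig
	build          *mcfgv1alpha1.MachineOSBuild
	nodesInPool    []*corev1.Node
	maxUnavailable int
	layered        bool
}

// Call sites then name every argument explicitly, e.g.
// getCandidateMachines(candidateMachineOpts{pool: pool, nodesInPool: nodes, layered: true})
func getCandidateMachines(opts candidateMachineOpts) []*corev1.Node {
	// ... existing body, reading from opts ...
	return nil
}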

@@ -1046,44 +1053,49 @@ func TestShouldMakeProgress(t *testing.T) {
expectTaintsAddPatch: false,
},
{
description: "node not at desired config, patch on annotation and taints",
description: "node not at desired config, patch on annotation and taints", // failing
Member:

question: Is this test still failing? If not, we may want to remove the // failing comment.

var updating int32
for _, set := range mcs.Status.PinnedImageSets {
if set.CurrentGeneration != set.DesiredGeneration {
updating++
Member:

suggestion (non-blocking): Once we encounter the first condition where set.CurrentGeneration != set.DesiredGeneration, we could probably return true early, since we don't need to examine the rest of the conditions. Only after examining all of the conditions and finding none unequal can we return false.

(Does not need to be addressed right now.)
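A sketch of the early-return form; a local struct stands in for the real status type, which lives in the MCO API:

type pinnedImageSetStatus struct {
	CurrentGeneration int64
	DesiredGeneration int64
}

func anyPinnedImageSetUpdating(sets []pinnedImageSetStatus) bool {
	for _, set := range sets {
		if set.CurrentGeneration != set.DesiredGeneration {
			return true // the first mismatch is enough; no counting needed
		}
	}
	return false
}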

}

// this is for the FG
// informers = append(informers, moscInformer.Informer())
@cheesesashimi (Member) commented May 9, 2024:

question: Is this informer still being instantiated?

EDIT: Instantiating it is unnecessary because all of the functions that handle the Machine OS Builder check for the presence of the featuregate before doing anything.
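The guard being described presumably resembles this sketch (the gate name and checker are illustrative, not the PR's actual API):

if !featureGates.Enabled("OnClusterBuild") { // hypothetical featuregate check
	return nil // Machine OS Builder handlers are no-ops when the gate is off
}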

@cheesesashimi (Member) commented:
/lgtm
/approval

@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on May 9, 2024.
openshift-ci bot commented May 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, inesqyx, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cheesesashimi,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit deebbcb into openshift:master May 9, 2024
15 checks passed
@openshift-bot (Contributor) commented:

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.16.0-202405100946.p0.gdeebbcb.assembly.stream.el9 for distgit ose-machine-config-operator.
All builds following this will include this PR.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
qe-approved: Signifies that QE has signed off on this PR.