MCO-1131: Merge implementation machineOSBuild and machineOSConfig in MCO #4327
Conversation
@inesqyx: This pull request references MCO-1042 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
cmd/machine-os-builder/start.go
<-ctx.Done()
// is this... allowed?
// since users can specify different settings per pool, we need to run a controller PER pool. Otherwise, settings will be conflated, as will failures and builds.
for _, ctrl := range controllers {
So if I'm understanding this correctly, we now potentially return multiple build controllers, each tied to a different MachineOSConfig? Remind me why we made this change again?
}

// If the pool is not layered, this annotation should not exist.
return val == "" || !ok
Do we need to distinguish between a legitimately empty/missing value and an erroneous state? I remember running into an issue like this on the revert work and needing to differentiate.
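A minimal sketch of the distinction the comment is asking about: the two-value map lookup already separates "annotation absent" from "annotation present but empty", so the check can surface three states instead of collapsing them. The annotation key and helper name here are hypothetical, for illustration only.

```go
package main

import "fmt"

// Hypothetical key for illustration; not the MCO's actual annotation name.
const layeredAnnotationKey = "machineconfiguration.openshift.io/layered"

// annotationState reports whether the annotation is missing, empty, or set,
// so callers can tell a legitimately absent annotation apart from an
// erroneously empty one.
func annotationState(annos map[string]string, key string) string {
	val, ok := annos[key]
	switch {
	case !ok:
		return "missing"
	case val == "":
		return "empty"
	default:
		return "set"
	}
}

func main() {
	fmt.Println(annotationState(map[string]string{}, layeredAnnotationKey))                          // missing
	fmt.Println(annotationState(map[string]string{layeredAnnotationKey: ""}, layeredAnnotationKey))  // empty
	fmt.Println(annotationState(map[string]string{layeredAnnotationKey: "x"}, layeredAnnotationKey)) // set
}
```

With a tri-state result, `val == "" || !ok` can still be recovered as `state != "set"`, but an erroneously empty annotation becomes observable.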
@cdoern look at this!!
Leaving some comments from our live debug session
return nil, nil, err
}

for _, config := range configList.Items {
Consider looking for a duplicate pool name, and either error out or parse it.
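The duplicate check the comment suggests could look something like the sketch below: instead of silently taking the last matching config in the loop, fail loudly when two configs target the same pool. The `config` struct and function name are stand-ins for the real `MachineOSConfig` list items.

```go
package main

import "fmt"

// Illustrative stand-in; real code iterates configList.Items of MachineOSConfigs.
type config struct {
	Name     string
	PoolName string
}

// findConfigForPool returns an error if more than one MachineOSConfig targets
// the same pool, instead of silently taking the last match.
func findConfigForPool(configs []config, pool string) (*config, error) {
	var found *config
	for i := range configs {
		if configs[i].PoolName != pool {
			continue
		}
		if found != nil {
			return nil, fmt.Errorf("duplicate MachineOSConfigs %q and %q target pool %q", found.Name, configs[i].Name, pool)
		}
		found = &configs[i]
	}
	return found, nil
}

func main() {
	_, err := findConfigForPool([]config{{"a", "worker"}, {"b", "worker"}}, "worker")
	fmt.Println(err != nil) // true: duplicate detected
}
```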
for _, build := range buildList.Items {
	if build.Spec.MachineOSConfig.Name == ourConfig.Name {
		if build.Spec.DesiredConfig.Name == pool.Spec.Configuration.Name {
Consider disambiguating, since we may have the same MachineConfig render but a different base image in a build.
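One way to disambiguate, sketched below under assumptions: key the build on both the rendered MachineConfig name and the base image, e.g. by folding a digest of the pair into the build name. The naming scheme here is hypothetical, not the MCO's actual one.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// buildKey derives a name that is unique per (rendered MachineConfig, base
// image) pair, so two builds of the same render against different base
// images do not collide. Illustrative only.
func buildKey(renderedConfig, baseImage string) string {
	sum := sha256.Sum256([]byte(renderedConfig + "@" + baseImage))
	return fmt.Sprintf("%s-%x", renderedConfig, sum[:4])
}

func main() {
	a := buildKey("rendered-worker-abc", "quay.io/example/base:v1")
	b := buildKey("rendered-worker-abc", "quay.io/example/base:v2")
	fmt.Println(a != b) // true: same render, different base image, distinct keys
}
```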
install/0000_80_machine-config_01_machineosbuild-TechPreviewNoUpgrade.crd.yaml
Force-pushed from 7399b1c to f368c74.
Need a
/retest-required
Force-pushed from 805b215 to ba183da.
@inesqyx / @umohnani8 / @cdoern I took the closest look at this that I could before my PTO. After writing e2e tests and getting better acquainted with this code, I was better able to find a few issues that should be at least discussed, if not addressed completely.
cmd/machine-os-builder/start.go
go ctrl.Run(ctx, 5)
<-ctx.Done()
// is this... allowed?
// since users can specify different settings per pool, we need to run a controller PER pool. Otherwise, settings will be conflated, as will failures and builds.
question: What happens when a MachineOSConfig is added after we've started the BuildControllers? How does it get a BuildController instance assigned to it? Conversely, if one of several MachineOSConfigs is then deleted, how does the BuildController associated with it get stopped?
cmd/machine-os-builder/start.go
return build.NewWithCustomPodBuilder(cfg, buildClients), nil
for range machineOSConfigs.Items {
	controllersToStart = append(controllersToStart, build.NewWithCustomPodBuilder(cfg, buildClients))
issue (blocking): If the intent is to have a single BuildController handle a single MachineOSConfig, how does a BuildController instance know which MachineOSConfig to handle? It doesn't look like the instantiated BuildController has any knowledge about a particular MachineOSConfig.
Until we have a clear answer on how these get bound to a specific BuildController instance, we may want to start only a single BuildController instance. This is because when multiple MachineOSConfigs are created, the multiple BuildController instances don't know which MachineOSConfig to target, so they target whichever one has an event associated with it. This results in a lot of log messages like this:
I0506 20:45:50.806184 1 build_controller.go:471] Error syncing machineosbuild pool-0: machineosbuilds.machineconfiguration.openshift.io "pool-0-rendered-pool-0-aff68db9c6081b0d56cd746090f860c0-builder" already exists
I0506 20:45:50.847069 1 build_controller.go:484] Started syncing build "pool-0" (2024-05-06 20:45:50.847055006 +0000 UTC m=+543.347506299)
I0506 20:45:50.876267 1 build_controller.go:486] Finished syncing machineOSBuilder "pool-0" (29.201665ms)
I0506 20:45:50.878236 1 build_controller.go:471] Error syncing machineosbuild pool-0: machineosbuilds.machineconfiguration.openshift.io "pool-0-rendered-pool-0-aff68db9c6081b0d56cd746090f860c0-builder" already exists
I0506 20:45:50.971138 1 build_controller.go:484] Started syncing build "pool-0" (2024-05-06 20:45:50.971123152 +0000 UTC m=+543.471574548)
I0506 20:45:50.984172 1 build_controller.go:486] Finished syncing machineOSBuilder "pool-0" (13.039368ms)
E0506 20:45:50.984291 1 build_controller.go:476] machineosbuilds.machineconfiguration.openshift.io "pool-0-rendered-pool-0-aff68db9c6081b0d56cd746090f860c0-builder" already exists
I0506 20:45:50.984316 1 build_controller.go:477] Dropping machineosbuild "pool-0" out of the queue: machineosbuilds.machineconfiguration.openshift.io "pool-0-rendered-pool-0-aff68db9c6081b0d56cd746090f860c0-builder" already exists
Basically, all of the BuildControllers react to a change, create a new object, etc. The first one to effect the change "wins". The rest end up returning errors until they drop the request from their queues. While this doesn't appear to affect how builds are started and lifecycled, each of those messages has numerous API calls associated with it. In resource-constrained environments such as SNO, that can be problematic since we make more API calls than necessary. As a side-effect, other controllers that watch for MachineOSBuilds / MachineOSConfigs can also react multiple times and cause unintended side-effects.
EDIT: I tested a version of this code that only has a single BuildController instance and everything seemed to work fine without causing the additional API requests.
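The race described above is the classic duplicate-reconciler pattern: every controller races to create the same object and all but the winner get "already exists" errors. One common mitigation (besides running a single controller, as tested above) is to treat "already exists" as success so duplicate reconcilers converge instead of requeueing. A self-contained sketch with a fake in-memory API server:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errAlreadyExists = errors.New("already exists")

// store mimics the API server: the first Create for a name wins, later ones fail.
type store struct {
	mu   sync.Mutex
	objs map[string]bool
}

func (s *store) create(name string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.objs[name] {
		return errAlreadyExists
	}
	s.objs[name] = true
	return nil
}

// syncBuild treats "already exists" as success, so a duplicate reconciler
// converges instead of requeueing with errors. This is a common controller
// idiom, not the fix this PR landed on (which was a single controller).
func syncBuild(s *store, name string) error {
	if err := s.create(name); err != nil && !errors.Is(err, errAlreadyExists) {
		return err
	}
	return nil
}

func main() {
	s := &store{objs: map[string]bool{}}
	fmt.Println(syncBuild(s, "pool-0-builder")) // <nil>: created
	fmt.Println(syncBuild(s, "pool-0-builder")) // <nil>: duplicate treated as success
}
```

Note this only silences the error noise; the redundant API calls from each duplicate controller would remain, which is why a single controller instance is the cleaner fix here.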
Signed-off-by: Urvashi Mohnani <[email protected]>
This adds the capability for BuildController to use the RHEL entitlement secrets, allowing cluster admins to inject RHEL content that they are entitled to receive into their builds. It also allows the injection / consumption of content into /etc/yum.repos.d as well as /etc/pki/rpm-gpg. A few higher-level notes about the implementation:
- Because we run rootless Buildah, we're more prone to running into SELinux complications. This makes it more difficult to mount the contents of /etc/yum.repos.d, /etc/pki/entitlement, and /etc/pki/rpm-gpg directly into the build context. With that in mind, we copy everything into a series of temp directories first, and then mount those temp directories into the build context as a volume.
- We also create an emptyDir which is mounted into the build pod at /home/build/.local/share/containers. It is unclear why this is necessary, but as mentioned before, I suspect that this is due to SELinux issues.
- The e2e test suite can now stream the container logs from the build pod to the filesystem, as those logs contain useful information if the e2e test fails. In OpenShift CI, the location is determined by the ARTIFACT_DIR env var; if this env var is not present, it defaults to the current directory.
- For right now, the etc-pki-entitlement flow (specifically, the TestEntitledBuild test) is being skipped in OpenShift CI because the test clusters do not have that cred available. The test suite will automatically detect the presence (or absence) of that cred in the openshift-config-managed namespace and run the test if it is available. However, the TestYumRepos test targets a very similar flow and can do its own setup and teardown regardless of preexisting creds.
Additionally, I took care to ensure that this does not break OKD by taking the following actions:
- I observed that adding the /home/build/.local/share/containers volume mount to the build pod prevented the wait-for-done container from starting when running on FCOS. With this in mind, I modified the build pod instantiation to not connect this volume mount to the wait-for-done container.
- I added a TestOnClusterBuildsOnOKD e2e test which will only run against an OKD cluster. Conversely, I excluded other tests from running against an OKD cluster, since those tests make assumptions about things that would only be present within an OCP cluster.
Pre-merge testing has been tracked here: https://issues.redhat.com/browse/MCO-1149 /label qe-approved
@inesqyx: This pull request references MCO-1131 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
// Determines if we should run a build, then starts a build pod to perform the
// build, and updates the MachineConfigPool with an object reference for the
// build pod.
func (ctrl *Controller) startBuildForMachineConfigPool(mosc *mcfgv1alpha1.MachineOSConfig, mosb *mcfgv1alpha1.MachineOSBuild) error {
suggestion (non-blocking): We should change the name of this method to something like startBuild or similar, since the MachineConfigPool is no longer the source of truth for build state. (We can address this later though.)
IsBuildRunning(*mcfgv1.MachineConfigPool) (bool, error)
}, oldPool, curPool *mcfgv1.MachineConfigPool) (bool, error) {
ps := newPoolState(curPool)
func (ctrl *Controller) doesMOSBExist(mosc *mcfgv1alpha1.MachineOSConfig) (*mcfgv1alpha1.MachineOSBuild, bool) {
praise: Nicely done with this!
We'll probably want to make a more canonical shared implementation of this in the future since anything that works with MachineOSConfigs and MachineOSBuilds will need to perform similar operations. That doesn't need to be done at this time though.
if !poolStateSuggestsBuild {
	return false, nil
func (ctrl *Controller) getConfigAndBuildForPool(pool *mcfgv1.MachineConfigPool) (*mcfgv1alpha1.MachineOSConfig, *mcfgv1alpha1.MachineOSBuild, error) {
praise: Again, nicely done with this!
As mentioned before, we'll probably want to hoist this out of this package in the future. But that does not need to be done right now.
return false
}
}
// for _, condition := range getMachineConfigPoolBuildConditions() {
suggestion (non-blocking): If this is commented out, we should probably just remove it.
@@ -1,335 +0,0 @@
package build
note: I'm not sad to see this go away 😄
RunAsUser: &uid,
RunAsGroup: &gid,
// Octal: 0755.
var mountMode int32 = 493
note (non-blocking): One can use an octal literal by doing something like var mountMode int32 = 0o755. Not sure if that would work here, but it's just a thought. (This does not need to be addressed right now though.)
@@ -513,7 +433,11 @@ func (i ImageBuildRequest) toBuildahPod() *corev1.Pod {
Command: append(command, buildahBuildScript),
ImagePullPolicy: corev1.PullAlways,
SecurityContext: securityContext,
VolumeMounts: volumeMounts,
// Only attach the buildah-cache volume mount to the buildah container.
note: In the future, I want to be more decisive about what volume mounts, env vars, etc., are attached to which container.
(Does not need to be addressed right now)
@@ -1,96 +1,86 @@
package common

import (
"testing"
// func TestMachineOSBuildState(t *testing.T) {
note: I want to get these tests running again.
(Does not need to be addressed right now because the existing test suites / etc. exercise most of the paths here. I don't think the code coverage analysis takes that into consideration though.)
@@ -1079,8 +1214,8 @@ func getAllCandidateMachines(pool *mcfgv1.MachineConfigPool, nodesInPool []*core
}

// getCandidateMachines returns the maximum subset of nodes which can be updated to the target config given availability constraints.
func getCandidateMachines(pool *mcfgv1.MachineConfigPool, nodesInPool []*corev1.Node, maxUnavailable int) []*corev1.Node {
nodes, capacity := getAllCandidateMachines(pool, nodesInPool, maxUnavailable)
func getCandidateMachines(pool *mcfgv1.MachineConfigPool, config *mcfgv1alpha1.MachineOSConfig, build *mcfgv1alpha1.MachineOSBuild, nodesInPool []*corev1.Node, maxUnavailable int, layered bool) []*corev1.Node {
suggestion (non-blocking): Make getCandidateMachines() accept a struct like candidateMachineOpts or similar. This makes it very clear at each call-site what each parameter is for. (This does not need to be done right now though.)
@@ -1046,44 +1053,49 @@ func TestShouldMakeProgress(t *testing.T) {
expectTaintsAddPatch: false,
},
{
description: "node not at desired config, patch on annotation and taints",
description: "node not at desired config, patch on annotation and taints", // failing
question: Is this test still failing? If not, we may want to remove the // failing comment.
var updating int32
for _, set := range mcs.Status.PinnedImageSets {
	if set.CurrentGeneration != set.DesiredGeneration {
		updating++
suggestion (non-blocking): Once we encounter the first condition where set.CurrentGeneration != set.DesiredGeneration, we could probably return true early, since we don't need to examine the rest of the conditions. Only once we've examined all of the conditions and found none to be unequal can we return false. (Does not need to be addressed right now.)
}

// this is for the FG
// informers = append(informers, moscInformer.Informer())
question: Is this informer still being instantiated?
EDIT: Instantiating it is unnecessary because all of the functions that handle the Machine OS Builder check for the presence of the featuregate before doing anything.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cheesesashimi, inesqyx, sinnykumari The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
[ART PR BUILD NOTIFIER] This PR has been included in build ose-machine-config-operator-container-v4.16.0-202405100946.p0.gdeebbcb.assembly.stream.el9 for distgit ose-machine-config-operator.
A further cleanup and rebase with the main branch on MCO-1042: ocb-api implementation in MCO. It is kept separate instead of being merged into @cdoern's PR because (1) the functionality of @cdoern's PR has been verified, and keeping a good copy of it without directly modifying it will help with future debugging; (2) a big rebase onto the kube/FG bump change in the master branch has happened in this PR, and bringing it back to @cdoern's PR would involve another round of conflict resolution and introduce unverified changes into the original solution.
For reviewers
For testing
Current Progress