enhance flow for custom machineconfigs for specific machinesets #1619

Open

cgwalters opened this issue Apr 6, 2020 · 19 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@cgwalters
Member

The CI team is trying to use new AWS m5d.xlarge instances which have two NVMe disks attached. We crafted a custom RAID partition machineconfig to enable that.
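For illustration, such a config looks roughly like this (a hedged sketch, not the exact config we used - the Ignition spec version, device names, and mount point are assumptions that vary by release and instance type; the actual config is referenced later in this thread as openshift/release#8102):

```bash
# Hedged sketch only. Assumes Ignition 3.x syntax, two instance-store devices
# at /dev/nvme1n1 and /dev/nvme2n1, and /var/lib/containers as the mount point.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 98-worker-instance-store-raid
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      raid:
        - name: data
          level: raid0
          devices:
            - /dev/nvme1n1
            - /dev/nvme2n1
      filesystems:
        - device: /dev/md/data
          format: xfs
          path: /var/lib/containers
          wipeFilesystem: true
    systemd:
      units:
        # Ignition creates the array and filesystem; a mount unit keeps it
        # mounted on the running node.
        - name: var-lib-containers.mount
          enabled: true
          contents: |
            [Unit]
            Before=local-fs.target
            [Mount]
            What=/dev/md/data
            Where=/var/lib/containers
            Type=xfs
            [Install]
            WantedBy=local-fs.target
EOF
```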

We added this as part of the main worker pool - the MCD will fail to roll out the partitioning on the existing workers, but that's fine because the plan was to "roll" the worker pool. Basically get the new MC in the pool, have new workers come online with that config, then scale down the old workers.

However, there are a few issues here.

First, this whole thing would obviously be a lot better if we had machineset-specific machineconfigs. That would solve a bunch of races and be much more elegant.

What we're seeing right now is that one new m5d node went OutOfDisk=true because it was booted with just a 16G root volume from the old config. That unschedulable node then blocks rollout of further changes.

I think we can unstick ourselves here by deleting that node and getting the MCO to roll out the new config.
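Concretely, the unstick step would look something like this (a sketch with placeholder names):

```bash
# Delete the Machine backing the OutOfDisk node so the machineset provisions
# a replacement.
oc get nodes -l node-role.kubernetes.io/worker -o wide
oc -n openshift-machine-api get machines
oc -n openshift-machine-api delete machine <stuck-machine>
# The machineset controller recreates the machine; the replacement node asks
# the MCS for its Ignition config on first boot (but see the next comment for
# a caveat about which rendered config the MCS actually serves).
```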

@cgwalters
Member Author

And the real problem here is that the MCS serves:

currConf := mp.Status.Configuration.Name

So new nodes will only get the last config that successfully rolled out, and this attempt deadlocks.
We really need machineset-specific MCs.
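A quick way to see the divergence (a hedged sketch; the pool's spec points at the newly rendered config, while status is what the MCS hands to new nodes):

```bash
oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'
oc get mcp worker -o jsonpath='{.status.configuration.name}{"\n"}'
```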

@cgwalters
Member Author

Another idea: some sort of "force MCS to serve mp.Spec.Configuration.Name" config option?

@cgwalters
Member Author

First, this whole thing would obviously be a lot better if we had machineset-specific machineconfigs.

This, though, gets into the "node identity" problem - see also #784.
That identity could be e.g. a secret value passed through user data and then provided to the MCS, which could link it to a machine object, and from there to a machineset.

@cgwalters
Member Author

cgwalters commented Apr 6, 2020

If someone wants to change the machine type of their workers today, I think it'd require a custom worker-new pool and also creating a custom user-data secret that does s/worker/worker-new/ and is referenced from the new machineset.

Alternatively, scaling down to 0 workers then back up should work, but obviously that's a lot more disruptive.
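Roughly (placeholder machineset name and replica count):

```bash
oc -n openshift-machine-api scale machineset <worker-machineset> --replicas=0
# wait for the old machines and their nodes to drain and be deleted, then:
oc -n openshift-machine-api scale machineset <worker-machineset> --replicas=3
```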

@cgwalters
Member Author

The other alternative would be having something like a spec.machineSet in a pool - then the MCO could understand how to say "okay, the new config has partitioning changes, so let me destroy the machine objects and provide new configs to newly provisioned nodes". That'd be the most elegant option, but it would require the tightest integration between the MCO and the machineAPI.

@cgwalters
Member Author

cgwalters commented Apr 6, 2020

An entirely different idea is for something in the stack to have a high level tunable knob like:

storage:
  instanceAttached: /var

This would be a MachineConfig fragment.

This logic could potentially live in the MCD; rather than having a machineConfig object describe a specific number of disks, the MCD would dynamically look at the node and say "oh this instance has 2" and set things up accordingly.

The tricky part with this is that simply waiting for "all disks" gets racy. It'd really be best done with knowledge of how many disks to expect. We may need some integration with the machineAPI so we know how many disks to expect per instance type (per configuration) - it feels dangerous to just unilaterally take over block devices we see, though maybe there are provider-specific ways to know they're ephemeral disks.
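For example, on AWS something like the following could identify instance-store devices without guessing device names (illustrative only; instance-store NVMe devices report a different model string than EBS volumes):

```bash
lsblk -d -n -p -o NAME,MODEL | grep 'Amazon EC2 NVMe Instance Storage' | awk '{print $1}'
```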

@michaelgugino
Contributor

The machineconfig provided in the first comment (openshift/release#8102) and the custom worker pool instructions (https://github.com/openshift/machine-config-operator/blob/master/docs/custom-pools.md) work together well.

All you need to do is create a custom user-data secret, which can be done easily.

First, find an existing machineset you want to modify, and download the user-data secret.

./oc get secrets -n openshift-machine-api worker-user-data -oyaml > worker-user-data.out.yaml

Scrape the secret contents into a new file, base64-decode those contents (you'll get some JSON), and edit the URL near the beginning of the JSON to specify the new MCP name you created. Base64 re-encode that modified JSON (make sure to use the -w0 option so you don't get line wrapping), paste it into the user-data YAML file, update the name to something sensible, and oc apply.

Update your preferred machineset to use the m5d.4xlarge instance type and the new user-data secret you created. That's it.
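Spelled out as commands, the above looks roughly like this (a hedged sketch; "worker-new" and <your-machineset> are placeholders):

```bash
oc -n openshift-machine-api get secret worker-user-data -o yaml > worker-user-data.out.yaml

# Decode the userData field; the JSON it contains points at the MCS URL,
# which ends in /config/worker.
oc -n openshift-machine-api get secret worker-user-data \
  -o jsonpath='{.data.userData}' | base64 -d > userData.json

# Point the URL at the new pool, then re-encode without line wrapping.
sed 's|/config/worker|/config/worker-new|' userData.json | base64 -w0 > userData.b64

# Paste the contents of userData.b64 over the userData value in
# worker-user-data.out.yaml, rename the secret (e.g. worker-new-user-data),
# and apply it.
oc apply -f worker-user-data.out.yaml

# Finally, edit the machineset to use the m5d.4xlarge instance type
# (providerSpec.value.instanceType) and the new secret
# (providerSpec.value.userDataSecret.name).
oc -n openshift-machine-api edit machineset <your-machineset>
```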

@cgwalters
Member Author

Scrape the secret contents into a new file, base64-decode those contents (you'll get some JSON), and edit the URL near the beginning of the JSON to specify the new MCP name you created.

This part is pretty user-hostile, I think. But doing better requires much tighter MCO/machineAPI integration; something like having the machineAPI ask the MCO "please give me the user data secret for pool X" or so.

@michaelgugino
Contributor

Scrape the secret contents into a new file, base64-decode those contents (you'll get some JSON), and edit the URL near the beginning of the JSON to specify the new MCP name you created.

This part is pretty user-hostile, I think. But doing better requires much tighter MCO/machineAPI integration; something like having the machineAPI ask the MCO "please give me the user data secret for pool X" or so.

@cgwalters

We more or less are doing that by specifying a user-data secret.

The MCO could stamp out user-data for each pool, and users could just update/create a machineset that points to that secret.

@cgwalters
Member Author

I think in general a good pattern would be putting /var on an ephemeral drive - but to make that work right we need to move e.g. the pull secret into /etc. Same reason we want to do that for #1190.

@michaelgugino
Contributor

I think /var/lib/containers should be its own drive. All of /var is too broad and might make the system impossible to recover if that volume is lost for whatever reason.

@cgwalters
Member Author

I think /var/lib/containers should be its own drive.

Agree this is the most obvious thing to start with for OpenShift.

All of /var is too broad and might make the system impossible to recover if that volume is lost for whatever reason.

Remember that on CoreOS systems, the system boots with an empty /var - the default files and directories there are populated e.g. by systemd-tmpfiles. We don't explicitly test nuking it though, and actually doing so gets into issues I raised in internal email, like the fact that it will remove your ssh keys and, for that matter, the whole /var/home/core directory.

@michaelgugino
Contributor

actually doing so gets into issues I raised in internal email, like the fact that it will remove your ssh keys and, for that matter, the whole /var/home/core directory.

Right, that's why I think maybe it's not the best idea to use all of /var as an ephemeral volume. That opens a totally different can of worms.

I think if we identify the primary purpose of using these drives, it can help us better align on what the implementation should look like. For me, the primary purpose is increased IOPS for containers. Since the containers themselves are ephemeral, an ephemeral disk seems like a good choice. If a pod needs persistent storage, it should use a PVC (static pods notwithstanding, etcd, etc.).

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Nov 5, 2020
@cgwalters
Member Author

/remove-lifecycle stale

openshift-ci-robot removed the lifecycle/stale label on Nov 24, 2020
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label on Feb 23, 2021
@cgwalters
Member Author

/lifecycle frozen

openshift-ci-robot added the lifecycle/frozen label (Indicates that an issue or PR should not be auto-closed due to staleness.) and removed the lifecycle/stale label on Mar 22, 2021
@cgwalters
Member Author

One thing that was pointed out in a chat about this is that today, one can also do this by providing a custom user data Ignition config for the desired machineset.

A core problem today is that nothing really "owns" the pointer data. But, we could make this cleaner if we had an explicit way to include a machineconfig fragment in a particular machineset's pointer data. That would get us out of the MCO having to be aware of this. But, it would mean "day 2" changes for that machineconfig object wouldn't work, or at least would only work for new nodes.
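To make that concrete, here is a hedged sketch of what machineset-specific pointer data could look like (Ignition 3.x syntax assumed; the URL, CA, and the inline file are placeholders). The pointer config is itself an Ignition config, so a fragment can ride alongside the directive that merges the pool's rendered config from the MCS:

```bash
# Illustration only. Normally the pointer config just merges the MCS URL;
# here a machineset-specific fragment (one file) is carried inline. New nodes
# apply it at first boot; later edits would not reach existing nodes.
cat > worker-new-user-data.ign <<'EOF'
{
  "ignition": {
    "version": "3.1.0",
    "config": {
      "merge": [
        { "source": "https://api-int.<cluster-domain>:22623/config/worker" }
      ]
    },
    "security": {
      "tls": {
        "certificateAuthorities": [
          { "source": "data:text/plain;charset=utf-8;base64,<root-ca>" }
        ]
      }
    }
  },
  "storage": {
    "files": [
      {
        "path": "/etc/machineset-specific.conf",
        "mode": 420,
        "contents": { "source": "data:,example%20per-machineset%20setting%0A" }
      }
    ]
  }
}
EOF
# This file would then be base64-encoded into the machineset's user-data secret.
```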

@cgwalters
Member Author

Also tangentially related to this, I did do a spike on https://github.com/cgwalters/coreos-cloud-instance-store-provisioner/
