Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Azure VMSS orchestration mode VM (VMO) based Shoot clusters #217

Merged
merged 1 commit into from
Jan 21, 2021

Conversation

dkistner
Copy link
Member

@dkistner dkistner commented Dec 7, 2020

How to categorize this PR?

/kind enhancement
/priority normal
/platform azure

What this PR does / why we need it:
Add support for Azure virtual machine scaleset orchestration mode vm (VMO) based Shoot clusters.
Each worker pool will get its own VMO instead its of a shared resource like for AvailabilitySets based clusters.

The Shoot cluster need to be non-zoned and have the annotation alpha.azure.provider.extensions.gardener.cloud/vmo=true to enable VMO which we currently treat as a preview feature.

Details:

  • In case the corresponding fault domain count in the CloudProfile changes then the VMO will be replaced. This will lead to a rolling update of the machines in each worker pool.
  • Orphan Gardener managed VMOs will be removed during the Worker reconciliation. So it is safe to be deployed into existing resource group without leaking resources.
  • VMO based clusters will make use of the Standard SKU Loadbalancer.
  • The NatGateway will be available for VMO clusters. This is a difference to AvailabilitySet based clusters which does not support NatGateway.
  • Accelerated network will work with VMO clusters. We can't enable it for AvailabilitySet based clusters as all machines assigned to it need to be compatible/running with accelerated network support which is we can't archive for existing clusters with a central AvailabilitySet, see Properly Enable Accelerated Networking for AvSet Clusters #101

More details:

  • The Azure SDK is updated to support compute client in version 2020-06-30 (min 2019-12-01 is required to use VMO)
  • VMO clusters will use the vmss version of the vm-controller in the Azure cloud-controller-manager (activated in the cloud-provider config via vmType=vmss).
  • A vmss client has been added to the client factory.
  • Added mocks for the client factory and the vmss client.
  • The Worker controller tests have been restructured partially to allow more reuse for the machine dependency tests.
  • The machineclass chart structure to configure AvailabilitySets has been aligned to the way to configure VMOs.

Which issue(s) this PR fixes:
Intend to fix #61

Release note:

Gardener is now supporting Azure clusters based on Virtual MachineScale Orchestration Mode Vm (VMO) as a preview feature. 
VMO should be perspectively a replacement for AvailabilitySet based clusters. You need to make sure that your subscription is eligible for VMO usage. You can find more details here: https://github.com/gardener/gardener-extension-provider-azure/blob/master/docs/usage-as-end-user.md#preview-shoot-clusters-with-vmss-orchestration-mode-vm-vmo

/invite @kon-angelo

@dkistner dkistner requested review from a team as code owners December 7, 2020 14:10
@gardener-robot gardener-robot added kind/api-change API change with impact on API users needs/second-opinion Needs second review by someone else labels Dec 7, 2020
@gardener-robot gardener-robot added kind/enhancement Enhancement, improvement, extension needs/review Needs review platform/azure Microsoft Azure platform/infrastructure priority/normal size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) labels Dec 7, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 7, 2020
@dkistner dkistner force-pushed the feature-vmo-per-workerpool branch from 19e9136 to d7bb323 Compare December 7, 2020 14:43
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 7, 2020
Copy link
Member

@ialidzhikov ialidzhikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits from my initial "go through".

@ialidzhikov
Copy link
Member

/test

@testmachinery
Copy link

testmachinery bot commented Dec 23, 2020

Testrun: e2e-sdh4d
Workflow: e2e-sdh4d-wf
Phase: Failed

+---------------------+---------------------+--------+----------+
|        NAME         |        STEP         | PHASE  | DURATION |
+---------------------+---------------------+--------+----------+
| infrastructure-test | infrastructure-test | Failed | 12m26s   |
+---------------------+---------------------+--------+----------+

Copy link
Member

@ialidzhikov ialidzhikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dkistner , the infrastructure test needs also to be adapted. Ref https://tm.gardener.cloud/testrun/default/e2e-sdh4d

Other than that I played locally with the change and it worked well for lifecycle operations for Shoot with enabled VMO.

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Dec 23, 2020
@dkistner
Copy link
Member Author

Awesome. Thanks @ialidzhikov for the review.
I will address the changes in the new year :)

Copy link
Member

@ialidzhikov ialidzhikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @dkistner !
Adding one last comment that I missed to add in the previous review.

Add support for Azure virtual machine scaleset orchestration mode vm (VMO) based Shoot clusters.
Each worker pool will get its own VMO instead its of a shared resource like for AvailabilitySets based clusters.
The Shoot cluster need to be non-zoned and have the annotation "alpha.azure.provider.extensions.gardener.cloud/vmo=true" to enable VMO which we currently treat as a preview feature.

- In case the corresponding fault domain count in the CloudProfile changes then the VMO will be replaced. This will lead to a rolling update of the machines in each worker pool.
- Orphan Gardener managed VMOs will be removed during the Worker reconcilation. So it is safe to be deployed into existing resource group without leaking resources.
- VMO based clusters will make use of the Standard SKU Loadbalancer.
- The NatGateway will be available for VMO clusters. This is a difference to AvailabilitySet based clusters which does not support NatGateway.
- Accelerated network will work with VMO clusters.

Other changes:
- The Azure SDK is updated to support compute client in version 2020-06-30 (required to support VMO is min 2019-12-01)
- VMO clusters will use the "vmss" version of the vm-controller in the Azure cloud-controller-manager (activated in the cloud-provider config via "vmType=vmss").
- A vmss client has been added to the client factory.
- Added mocks for the client factory and the vmss client.
- The Worker controller tests have been restructured partially to allow more reuse for the machine dependency tests.
- The machineclass chart structure to configure AvailabilitySets has been alligned to the way to configure VMOs.
@dkistner dkistner force-pushed the feature-vmo-per-workerpool branch from d7bb323 to cf0beff Compare January 12, 2021 16:47
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 12, 2021
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 12, 2021
@dkistner
Copy link
Member Author

Thanks for the feedback. I have addressed it.
Also I added infrastructure integration tests for VMO as they are missing before.

/test

@testmachinery
Copy link

testmachinery bot commented Jan 12, 2021

Testrun: e2e-mw78k
Workflow: e2e-mw78k-wf
Phase: Succeeded

+---------------------+---------------------+-----------+----------+
|        NAME         |        STEP         |   PHASE   | DURATION |
+---------------------+---------------------+-----------+----------+
| infrastructure-test | infrastructure-test | Succeeded | 36m6s    |
+---------------------+---------------------+-----------+----------+

Copy link
Member

@ialidzhikov ialidzhikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging needs/changes Needs (more) changes and removed needs/changes Needs (more) changes needs/review Needs review needs/second-opinion Needs second review by someone else reviewed/lgtm Has approval for merging labels Jan 13, 2021
@rfranzke
Copy link
Member

@dkistner @kon-angelo Can this PR be merged?

Copy link
Contributor

@kon-angelo kon-angelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm too

Let's wait a final confirmation from @dkistner.

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/changes Needs (more) changes labels Jan 21, 2021
@dkistner
Copy link
Member Author

Thanks for confirming. Yes lets go on and merge :)

@dkistner dkistner merged commit 9363d41 into gardener:master Jan 21, 2021
@dkistner dkistner deleted the feature-vmo-per-workerpool branch January 21, 2021 11:21
@danielfoehrKn
Copy link
Contributor

Very nice PR!

I could not find the place in the MCM azure machine driver where the VM is created with a reference to the ScaleSet. Can you by any chance point me to it? Thanks!

@dkistner
Copy link
Member Author

@gardener-robot gardener-robot added priority/3 Priority (lower number equals higher priority) and removed priority/3 Priority (lower number equals higher priority) labels Mar 8, 2021
Copy link

@Chokiis82 Chokiis82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • ``

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/api-change API change with impact on API users kind/enhancement Enhancement, improvement, extension kind/test Test needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) platform/azure Microsoft Azure platform/infrastructure reviewed/lgtm Has approval for merging size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiple (Availability)Sets
9 participants