From 0aaaccc76a24b500c6337c22e1fb6893b20bfbc3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pierre=20Cr=C3=A9gut?=
Date: Thu, 28 Mar 2024 18:06:43 +0100
Subject: [PATCH] definition of the HostClaim resource
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The HostClaim resource is introduced to address multi-tenancy in Metal3
and the definition of hybrid clusters with cluster-api.
Introduces four scenarios representing the three main use cases and how
pivoting is performed. Presentation of the design will be addressed in
another merge request.

Co-authored-by: Pierre Crégut
Co-authored-by: Laurent Roussarie
Signed-off-by: Pierre Crégut
---
 ...tclaim-multitenancy-and-hybrid-clusters.md | 307 ++++++++++++++++++
 1 file changed, 307 insertions(+)
 create mode 100644 design/hostclaim-multitenancy-and-hybrid-clusters.md

diff --git a/design/hostclaim-multitenancy-and-hybrid-clusters.md b/design/hostclaim-multitenancy-and-hybrid-clusters.md
new file mode 100644
index 00000000..dfb17b71
--- /dev/null
+++ b/design/hostclaim-multitenancy-and-hybrid-clusters.md
@@ -0,0 +1,307 @@

# HostClaim: multi-tenancy and hybrid clusters

## Status

provisional

## Summary

We introduce a new Custom Resource (named HostClaim) which will facilitate
the creation of multiple clusters for different tenants. It also provides
a framework for building clusters with different kinds of compute resources:
baremetal servers, but also virtual machines hosted in private or public
clouds.

A HostClaim decouples the client need from the actual implementation of the
compute resource: it establishes a security boundary and provides a way to
migrate nodes between different kinds of compute resources.

A HostClaim expresses that one wants to start a given
OS image with an initial configuration (typically cloud-init or ignition
configuration files) on a compute resource that meets a set of requirements
(host selectors). These requirements can be interpreted either as labels
for a recyclable resource (such as a bare-metal server) or as characteristics
for a disposable resource (such as a virtual machine created for the workload).
The status and metadata of the HostClaim provide the necessary information
for the end user to define and manage their workload on the compute resource,
but they do not grant full control over the resource (typically, BMC
credentials of servers are not exposed to the tenant).
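For orientation, the following condensed sketch shows what a HostClaim could
look like. It only combines fields that appear in the user stories later in
this document; the resource name, URLs, secret name, and label value are
placeholders.

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: example-host
spec:
  # Power on the underlying compute resource once it is associated.
  online: true
  # Kind of compute resource requested (a bare-metal server here).
  kind: baremetal
  # OS image and initial configuration to run on the resource.
  image:
    url: https://example.com/my-image.qcow2
    checksum: https://example.com/my-image.qcow2.md5
    format: qcow2
  userData:
    name: example-user-data
  # Requirements interpreted as labels of a recyclable resource or
  # characteristics of a disposable one.
  hostSelector:
    matchLabels:
      infra-kind: medium
```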
## Motivation

So far, the primary use case of cluster-api-provider-metal3 is the creation
of a single target cluster from a temporary management cluster. The pivot
process transfers the resources describing the target cluster from the
management cluster to the target cluster. Once the pivot process is complete,
the target cluster takes over all the servers. It can scale based on its
workload but it cannot share its servers with other clusters.

There is another model where a single management cluster is used to create and
manage several clusters across a set of bare-metal servers. This is the focus
of the [Sylva Project](https://sylvaproject.org/) of the Linux Foundation.
Another example is [Das Schiff](https://github.com/telekom/das-schiff).

One of the issues encountered today is that the compute resources
(BareMetalHosts) and the cluster definition (Cluster, MachineDeployment,
Machines, Metal3Machines, etc.) must be in the same namespace. Since the goal
is to share the compute resources, this means that a single namespace is used
for all resources. Consequently, unless very complex access control
rules are defined, cluster administrators have visibility over all clusters
and full control over the servers, as the BMC credentials are stored in the
same namespace.

The solution so far is to completely proxy the access to the Kubernetes
resources that define the clusters.

Another unrelated problem is that Cluster-API has been designed
to define clusters using homogeneous compute resources: it is challenging to
define a cluster with both bare-metal servers and virtual machines in a private
or public cloud.
This [blog post](https://metal3.io/blog/2022/07/08/One_cluster_multiple_providers.html)
proposes several approaches but none is entirely satisfactory.

On the other hand, workloads can be easily defined in terms of OS images and
initial configurations, and standards such as qcow or cloud-init have emerged
and are used by various infrastructure technologies.
Due to the unique challenges of managing bare-metal, the Metal3 project has
developed a set of abstractions and tools that could be used in different
settings. The main mechanism is the selection process that occurs between
the Metal3Machine and the BareMetalHost, which assigns a disposable workload
(a Kubernetes node) to a recyclable compute resource (a server).

This proposal introduces a new resource called HostClaim that solves
both problems by decoupling the definition of the workload performed by
the Metal3 machine controller from the actual compute resource.
This resource acts as both a security boundary and a way to hide the
implementation details of the compute resource.

### Goals

* Split responsibilities between infrastructure teams, who manage servers, and
  cluster administrators, who create/update/scale baremetal clusters deployed
  on those servers, using traditional Kubernetes RBAC to ensure isolation
  (see the sketch after this list).
* Provide a framework where cluster administrators can consume compute
  resources that are not baremetal servers, as long as they offer similar APIs,
  using the cluster-api-provider-metal3 to manage the life-cycle of those
  resources.
* Define a resource where a user can request a compute resource to execute
  an arbitrary workload described by an OS image and an initial configuration.
  The user does not need to know exactly which resource is used and may not
  have full control over this resource (typically no BMC access).
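As a rough illustration of this split of responsibilities, the sketch below
shows a namespaced Role that a tenant could be granted. The ``hostclaims``
resource name, the ``tenant-a`` namespace, and the exact verb list are
assumptions for the purpose of the example. Because the Role is namespaced,
it cannot grant access to the BareMetalHosts or to the BMC credential secrets
that live in the namespaces managed by the infrastructure team.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hostclaim-user
  namespace: tenant-a
rules:
  # Tenants manage HostClaims in their own namespace only.
  - apiGroups: ["metal3.io"]
    resources: ["hostclaims"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Secrets holding the userData/metaData/networkData created by the tenant.
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "update", "delete"]
```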
### Non-Goals

* How to implement HostClaim for specific compute resources that are not
  BareMetalHosts.
* Discovery of the capabilities exposed by the cluster.
  Which kinds of compute resources are available and the semantics of the
  selectors are not handled.
* Compute resource quotas. The HostClaim resource should make it possible to
  develop a framework to limit the number/size of compute resources allocated
  to a tenant, similar to how quotas work for pods. However, the specification
  of such a framework will be addressed in another design document.
* Pivoting client cluster resources (managed clusters that are not the
  initial cluster).

## Proposal

### User Stories

#### As a user I would like to execute a workload on an arbitrary server

The OS image is available in qcow format on a remote server at ``url_image``.
It supports cloud-init, and a script can launch the workload at boot time
(e.g., a systemd service).

The cluster offers bare-metal as a service using the Metal3 baremetal-operator.
However, as a regular user, I am not allowed to directly access the definitions
of the servers. All servers are labeled with an ``infra-kind`` label whose
value depends on the characteristics of the computer.

* I create a resource with the following content:

  ```yaml
  apiVersion: metal3.io/v1alpha1
  kind: HostClaim
  metadata:
    name: my-host
  spec:
    online: false
    kind: baremetal
    hostSelector:
      matchLabels:
        infra-kind: medium
  ```

* After a while, the system associates the claim with a real server, and
  the resource's status is populated with the following information:

  ```yaml
  status:
    addresses:
    - address: 192.168.133.33
      type: InternalIP
    - address: fe80::6be8:1f93:7f65:59cf%ens3
      type: InternalIP
    - address: localhost.localdomain
      type: Hostname
    - address: localhost.localdomain
      type: InternalDNS
    bootMACAddress: "52:54:00:01:00:05"
    conditions:
    - lastTransitionTime: "2024-03-29T14:33:19Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-03-29T14:33:19Z"
      status: "True"
      type: AssociateBMH
    lastUpdated: "2024-03-29T14:33:19Z"
    nics:
    - MAC: "52:54:00:01:00:05"
      ip: 192.168.133.33
      name: ens3
  ```

* I also examine the annotations and labels of the HostClaim resource. They
  have been enriched with information from the BareMetalHost resource.
* I create three secrets in the same namespace: ``my-user-data``,
  ``my-meta-data``, and ``my-network-data``. I use the information from the
  status and metadata to customize the scripts they contain.
* I modify the HostClaim to point to those secrets and start the server:

  ```yaml
  apiVersion: metal3.io/v1alpha1
  kind: HostClaim
  metadata:
    name: my-host
  spec:
    online: true
    image:
      checksum: https://url_image.qcow2.md5
      url: https://url_image.qcow2
      format: qcow2
    userData:
      name: my-user-data
    networkData:
      name: my-network-data
    kind: baremetal
    hostSelector:
      matchLabels:
        infra-kind: medium
  ```

* The workload is launched. When the machine is fully provisioned, the boolean
  field ``ready`` in the status becomes true. I can stop the server by changing
  the ``online`` field. I can also perform a reboot by targeting specific
  annotations in the reserved ``host.metal3.io`` domain.
* When I destroy the host, the association is broken and another user can take
  over the server.

#### As an infrastructure administrator I would like to host several isolated clusters

All the servers in the data-center are registered as BareMetalHosts in one or
several namespaces under the control of the infrastructure manager. A namespace
is created for each tenant of the infrastructure, and tenants create
standard cluster definitions in those namespaces. The only difference from
standard baremetal cluster definitions is the presence of a ``kind`` field in
the Metal3Machine templates. The value of this field is set to ``baremetal``,
as illustrated by the sketch below.
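The following sketch shows what such a template could look like in a tenant
namespace. It follows the same simplified presentation as the hybrid-cluster
example later in this document; the resource names and the ``tenant-a``
namespace are placeholders, and the ``infra-kind`` label reuses the label
introduced in the previous story.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: tenant-cluster-workers
  namespace: tenant-a
spec:
  dataTemplate:
    name: tenant-cluster-workers-template
  # Only addition compared to a standard baremetal cluster definition.
  kind: baremetal
  hostSelector:
    matchLabels:
      infra-kind: medium
  image:
    checksum: https://url_image.qcow2.md5
    format: qcow2
    url: https://url_image.qcow2
```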
When the cluster is started, a HostClaim is created for each Metal3Machine
associated with the cluster. The ``hostSelector`` and ``kind`` fields are
inherited from the Metal3Machine. They are used to select the BareMetalHost
associated with the cluster. The associated BareMetalHost is not in the same
namespace as the HostClaim. The exact definition of the BareMetalHost remains
hidden from the cluster user, but parts of its status and metadata are copied
back to the HostClaim namespace. With this information,
the data template controller has enough details to compute the different
secrets (userData, metaData and networkData) associated with the Metal3Machine.
Those secrets are linked to the HostClaim and, ultimately, to the
BareMetalHost.

When the cluster is modified, new Machine and Metal3Machine resources replace
the previous ones. The HostClaims follow the life-cycle of the Metal3Machines
and are destroyed unless they are tagged for node reuse. The BareMetalHosts are
recycled and are bound to new HostClaims, potentially belonging to other
clusters.

#### As a cluster administrator I would like to build a cluster with different kinds of nodes

This scenario assumes that:

* the cloud technologies CT_i use qcow images and cloud-init to
  define workloads.
* clouds C_i implementing CT_i are accessible through
  credentials and endpoints described in a resource Cr_i.
* a HostClaim controller exists for each CT_i. Compute resources can
  be parameterized through arguments arg_ij in the HostClaim.

One can build a cluster where each machine deployment MD_i
contains Metal3Machine templates referring to kind CT_i.
The arguments identify the credentials to use (Cr_i):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: md-i
spec:
  dataTemplate:
    name: dt-i
  kind: CT_i
  args:
    arg_i1: v1
    ...
    arg_ik: vk
  hostSelector:
    matchLabels:
      ...
  image:
    checksum: https://image_url.qcow2.md5
    format: qcow2
    url: https://image_url.qcow2
```

The Metal3Machine controllers will create HostClaims with different kinds,
handled by different controllers creating the different compute resources.
Connectivity must be established between the different subnets where each
controller creates its compute resources.

The argument extension is not strictly necessary but is cleaner than using
matchLabels in specific domains to convey information to controllers.
Controllers for disposable resources such as virtual machines typically do not
use hostSelectors. Controllers for a "bare-metal as a service" offering
may use selectors.

#### As a cluster administrator I would like to install a new baremetal cluster from a transient cluster

The bootstrap process can be performed as usual from an ephemeral cluster
(e.g., a KinD cluster). The constraint that all resources must be in the same
namespace (Cluster and BareMetalHost resources) must be respected. The
BareMetalHosts should be marked as movable.

The only difference from the behavior without HostClaims is the presence of an
intermediate HostClaim resource, but the chain of resources is preserved during
the transfer and the pause annotation is used to stop Ironic.

Because this operation is only performed by the administrator of a cluster
manager, the fact that the cluster definition and the BareMetalHosts are in
the same namespace should not be an issue.

The tenant clusters cannot be pivoted, which can be expected from a security
point of view as pivoting would give the bare-metal server credentials to the
tenants. Partial pivoting can be achieved with the help of a controller
replicating HostClaims on other clusters, but the specification of such a
controller is beyond the scope of this document.

## Design Details

TBD.