All right. Thanks, Vikrant, for covering the benchmark data. In this talk we are going to cover the logical and physical architecture, and we have three speakers. I'm Reddy, and I'll start off very quickly. Then Vikrant is going to cover the mapping of the logical architecture to Kubernetes, and Siamak will cover the physical architecture for the rest of the talk.
We decompose the logical architecture into two basic constructs, which are probably familiar to most of you. When we look at the memory hierarchy, we split it into two groupings. One is near memory: anything connected close to the CPU. This is essentially host DRAM, whether on the local node or a remote node, or high-bandwidth memory in the GPU. Far memory is anything that still has load/store semantics but is connected over a CXL-type or memory-type fabric, and it has higher latency than near memory. That is how we segregate things in the logical architecture. Last but not least, there is also block storage, which joins the hierarchy when we look at NVM-over-CXL types of media later on.
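To make the near/far grouping concrete, here is a minimal Python sketch of how a host agent might classify memory nodes into the hierarchy described above. The node names, attach types, and latency numbers are purely illustrative assumptions, not values from the white paper.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    NEAR = "near"    # host DRAM (local or remote NUMA node) or GPU HBM
    FAR = "far"      # load/store memory reached over a CXL-style fabric
    BLOCK = "block"  # block storage at the bottom of the hierarchy


@dataclass
class MemoryNode:
    name: str
    attach: str        # "ddr", "hbm", "cxl", "nvme" (illustrative labels)
    load_store: bool   # True if reachable with load/store semantics
    latency_ns: float  # hypothetical access latency, for illustration only


def classify(node: MemoryNode) -> Tier:
    """Place a memory node into the near/far/block hierarchy."""
    if not node.load_store:
        return Tier.BLOCK
    if node.attach in ("ddr", "hbm"):
        return Tier.NEAR
    return Tier.FAR    # CXL or other memory-fabric attach


if __name__ == "__main__":
    nodes = [
        MemoryNode("local-dram", "ddr", True, 100.0),
        MemoryNode("gpu-hbm", "hbm", True, 120.0),
        MemoryNode("cxl-pool", "cxl", True, 350.0),
        MemoryNode("nvme-ssd", "nvme", False, 80000.0),
    ]
    for n in nodes:
        print(f"{n.name}: {classify(n).value}")
```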
This is the work that we have done in the CMS project, so I won't go into the details; it is captured in the white paper. At the highest level, there is the host that consumes the memory. It has the firmware that knows how to enumerate all the connected devices, and of course the operating system and VMM that actually consume the memory; like Vikrant said, it could show up as a logical, memory-only NUMA domain, as an example. Then you have the management controller. We didn't want to call it the baseboard management controller and make that an implementation artifact, so in the logical architecture it appears as a management controller, primarily there for out-of-band manageability. You can see there is an agent on it that connects to the data center fabric manager. The architecture includes a direct-attached CXL memory buffer, as well as multi-headed and fabric-based buffers, and CXL over alternate transports. Our focus is to include all of those elements in the logical architecture and to ensure there are solution architectures around them. Last but not least, there is a data center memory fabric manager that stitches all of these pieces together and works with the orchestrator to provision and deprovision memory. That is the logical architecture; you can look at the white paper for the details.
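As a rough illustration of the provision/deprovision flow between the orchestrator and the data center memory fabric manager, here is a toy Python model. The class and method names (FabricManager, provision, deprovision) and the GiB accounting are assumptions made for this sketch; they are not APIs defined by the white paper or the data center management API workstream.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBuffer:
    """A CXL memory buffer known to the fabric manager (direct-attached,
    multi-headed, or fabric-based). Capacities are in GiB."""
    buffer_id: str
    capacity_gib: int
    allocated_gib: int = 0

    @property
    def free_gib(self) -> int:
        return self.capacity_gib - self.allocated_gib


@dataclass
class FabricManager:
    """Toy stand-in for the data center memory fabric manager that the
    orchestrator calls to provision and deprovision memory."""
    buffers: dict = field(default_factory=dict)
    grants: dict = field(default_factory=dict)  # grant_id -> (buffer_id, gib)

    def register(self, buf: MemoryBuffer) -> None:
        # In the logical architecture the management-controller agent would
        # report the buffer out of band; here we simply record it.
        self.buffers[buf.buffer_id] = buf

    def provision(self, host: str, gib: int) -> str:
        for buf in self.buffers.values():
            if buf.free_gib >= gib:
                buf.allocated_gib += gib
                grant_id = f"{host}:{buf.buffer_id}:{len(self.grants)}"
                self.grants[grant_id] = (buf.buffer_id, gib)
                return grant_id
        raise RuntimeError("no buffer with enough free capacity")

    def deprovision(self, grant_id: str) -> None:
        buffer_id, gib = self.grants.pop(grant_id)
        self.buffers[buffer_id].allocated_gib -= gib


if __name__ == "__main__":
    fm = FabricManager()
    fm.register(MemoryBuffer("pool-0", capacity_gib=1024))
    grant = fm.provision("host-17", 256)
    print(grant, fm.buffers["pool-0"].free_gib)  # 768 GiB still free
    fm.deprovision(grant)
```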
With that, I will hand it over to Vikrant, who will talk about mapping this architecture to Kubernetes.
OK, so I'll talk about how we use the logical architecture that Reddy just showed. The takeaway for me in this slide is the whole hardware-software co-design philosophy and how we are embracing it; I know Siamak talks about it on every call, and this is bringing all of that together. When I presented the logical architecture diagram to a couple of engineers on the Kubernetes team, their immediate feedback is captured in the top two bullets: if we really have to consume pool memory, or any sort of CXL device, we need to make sure our service owners do not know that it is a special type of memory. That was the guiding principle around which we built this particular topology. I also had help from SK Hynix; their team has done some prototyping using Kubernetes, so I reached out and saw how they had done this last year when they presented at FMS. All of that helped narrow down the workflow I'm putting together. And again, this is open for discussion; it is a first attempt. But the hardware-software design really is key if we want to make this container use case successful with pool memory.

One of the other requirements from our Kubernetes team was that they want to use the existing semantics within Kubernetes, so cpuset.mems is essential. They don't want any sort of flag that says this is special pool memory or local DDR memory, because that would be a big tax for us when considering pool memory within our infrastructure. Given that, the way I was thinking about this topology is that you have a memory appliance that registers itself with the Kubernetes master. This draws inspiration from how the storage vendors have done it, where the appliance shows up as an object class within the Kubernetes master and lets you create slices of memory using the knobs in the Kubernetes master. All of this is bundled, managed, and orchestrated by the Kubernetes master. When a pod spec or config file is submitted by our service owners and they just ask for memory, it is transparently orchestrated by the Kubernetes master: the topology manager sees the zero-core NUMA domain memory as well as the existing direct-attached memory and figures out how many slices need to be hot-plugged into the host to support that specific pod spec. That is the high-level idea. Then, of course, there are the internals that have already been ratified within CXL, like the hot-plug process and how the memory shows up in the Linux kernel as a zero-core NUMA node; all of that is already built out, and Reddy helped me on that as well. So this is where we are, and I'm hoping that as time goes by we get clearer definitions and APIs from the data center management API workstream that will allow us to embed them within the Kubernetes open-source code. I think that is it.
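To picture that scheduling decision, here is a small Python sketch of the arithmetic such topology-aware orchestration might perform: given a pod's plain memory request, the host's direct-attached capacity, and an assumed pool-slice size, how many slices would need to be hot-plugged. The 64 GiB slice size and all names here are hypothetical; this is not Kubernetes code, just the shape of the decision.

```python
import math
from dataclasses import dataclass


@dataclass
class HostTopology:
    """Memory visible to the topology manager on one host (GiB):
    direct-attached DRAM plus any already hot-plugged pool slices that
    show up as a CPU-less ("zero-core") NUMA node."""
    direct_attached_gib: int
    pooled_gib: int = 0


SLICE_GIB = 64  # assumed pool-slice granularity, purely illustrative


def slices_to_hotplug(pod_request_gib: int, host: HostTopology) -> int:
    """How many pool slices must be hot-plugged so the host can satisfy
    the pod's plain memory request, with no special flag in the pod spec."""
    available = host.direct_attached_gib + host.pooled_gib
    shortfall = max(0, pod_request_gib - available)
    return math.ceil(shortfall / SLICE_GIB)


if __name__ == "__main__":
    host = HostTopology(direct_attached_gib=256)
    # The pod spec just asks for memory; the service owner never sees "pool".
    print(slices_to_hotplug(pod_request_gib=384, host=host))  # -> 2 slices
```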
OK, very good. So far we have talked about logical concepts and how AI or large-memory databases need more memory, or a larger number of devices, and we believe CXL is a way to interconnect them. And we are here at OCP, and OCP is the place where we bring technologies into real systems. So today I'll talk a little bit about the system aspects of these concepts.
So we invite you all to join us. Bring your challenges, bring your use cases. Within OCP we have a number of work groups and workstreams that are attacking individual components of this. Extended Connectivity is the workstream that looks at how to cable these things together and how many chassis we might have. Composable Memory Systems, this track today, covers the software elements, benchmarking, and how to manage these things. And DC-MHS, the Data Center Modular Hardware System subgroup, is the one that brings in the hardware component aspects.
We have maybe a minute if there are any questions. We can just do a quick Q&A and transition over to the next topic. Any questions for the presenters? Manoj, do you have a question?
I have a question. Oh my god.
For this group called CMS and whatever it is working on, what would be the next deliverable for this work?
Yeah, so the next deliverable will be on the architecture specification. We will essentially look at the fabric side of the scope and refine the architecture to include that, including AI fabrics as well. We don't have that very well covered in the initial spec-- I mean, the initial white paper, sorry.
One important piece: Manoj talked about how AI systems require different system components. Some require more compute, some more connectivity, some more memory bandwidth, some more memory capacity. That is all true. One thing I will talk about this afternoon, as a continuation of this, is that fact: as system suppliers and system architects we can specialize for different workloads, but you all know these workloads change very fast. So another approach is a balanced system, one that is modular and flexible, so that different use cases can land on the same hardware. That helps investment protection for a lot of people, both suppliers and consumers.
All right, thank you.