YouTube: https://youtu.be/anjz7auFFjk
Text:
So, we are talking about what the overall ecosystem looks like when it comes to managing composable memory. And this is in the context of how CXL is progressing as we go from 1.1 to 2.0 to 3.0 and beyond. Most of my discussion will be architecture and workload related; I won't be talking about anything product specific. Thank you.
So, we will talk about how what people these days call the data center of the next generation is shaping up, how the traditional server-based architectures are giving way to more composable, software-addressable sorts of architectures, and how CXL will play a role in creating fabrics and ensuring that you make optimal use of the hardware resources when it comes to deployment at scale. So, let's jump straight into it.
So, basically, what is CXL trying to do? We are trying to make data appear closer, or we are trying to take compute near data, and then basically act on it sooner. Those are the two major trends that we see when it comes to the CXL interconnect. And this, in turn, will lead to a whole new set of accelerators, a whole new set of GPU, FPGA, and DPU technologies, which will make use of CXL-based connectivity to access data, get learnings from it, and process it independent of the server CPU.

What it also means, to some of the questions that were asked earlier: you don't need one type of driver for managing storage, one type of driver to manage persistent memory, something for DDR5, something for DDR4. It all has to be homogenized so that the application sees this as completely transparent. There should not be a need to change the application because you managed to add in a module for CXL or attach something over NVMe over Fabrics. It should be completely transparent to that.

So, that's the goal. And as we march towards it, basically the whole ecosystem has to play a role to bring it all together.
So, left to right, this is just a progression of how we do it. Direct attached memory to the socket, everybody knows about it. There is only so much you can scale using it. Some of you must have done DDR controllers and layouts, and expanding that to 20 channels is crazy.

So, what's the next best step? You can do some sort of scale-out memory. This is where the first generation of CXL comes into play. And this is where hyperscalers, enterprise, and HPC workloads have specific bandwidth-per-core or capacity-per-core numbers. As the number of cores increases, these are the ways in which you can manage those requirements. If someone has a workload which requires, say, 5 gigabits per second, and you can't meet that bandwidth, then that application suffers. And hence scale-out memory using direct attached, CXL-capable buffers is the first step.

Now, things get really exciting when you talk about pooling and disaggregation, because in that scenario not only are you able to meet the application performance, you are able to meet it most optimally. The earlier speaker talked about how we can tier DDR5 with DDR4; this is one way of mitigating the cost of DDR5. Or how do you manage to tier with some sort of low latency flash? What that does is drop the BOM cost of the system dramatically, and whatever the impact to the application, you now have a sliding scale. Maybe I get hit by 5%, but my BOM cost dropped by 30%. Is that acceptable?

The other thing we should note is that many of these applications have always run on two-socket systems, so the farthest NUMA hop they have seen is one hop away. Maybe they don't need that latency; maybe they can tolerate higher latencies. All of these things have to be tried out, tuned, and made acceptable. And again, as Siamak was saying, it takes a village. Given the number of stakeholders who show up in the CXL ecosystem, this is not something which one vendor can do alone.
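To make that scale-out step a little more concrete, here is a minimal sketch, not from the talk, of how an application might explicitly place a buffer on CXL-attached expansion memory. On Linux such memory typically appears as a CPU-less NUMA node, and libnuma can target it directly; the node number used here is an assumption, so check the real topology on your system.

```c
/*
 * Sketch only: place a buffer on an assumed CXL memory node with libnuma.
 * Build with: gcc cxl_alloc.c -o cxl_alloc -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int cxl_node = 1;              /* assumed node id of the CXL expander */
    size_t len = 64UL << 20;       /* 64 MiB scratch buffer */

    /* Ask the kernel to back this buffer with pages from the CXL node. */
    void *buf = numa_alloc_onnode(len, cxl_node);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    memset(buf, 0xA5, len);        /* touch the pages so they actually fault in */
    printf("placed %zu MiB on node %d\n", len >> 20, cxl_node);

    numa_free(buf, len);
    return 0;
}
```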
So, coming to runtime memory management. As I mentioned, we had direct attached, scale-out, and pooling, and that's how system memory composability increases. Now there are various ways of doing it. You can do memory tiering and page migration; we'll talk about that. There is multi-type memory management. I think some of the options on multi-type memory have been getting whittled down recently, but you can use SSDs or NVMe over Fabrics to create that. And finally, there have been many public papers where folks have talked about how memory has been left stranded, how memory is not able to be optimally attached to a given server. So how do you borrow that memory? How do you allocate that memory on demand as required, and in general reduce the overall capex costs for a data center?
So when I talk about tiered memory, we can call it hot, warm, and cold, or just hot and cold. One way of mitigating the longer latencies is to do some sort of page migration, where you detect that a particular page is looking hotter, in which case you move it to closer memory, closer in latency. If a certain page is not getting many accesses, then you say, let's demote that page to a colder memory tier.
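As a hedged illustration of what that promote/demote loop looks like when the kernel drives it, the sketch below enables demotion during reclaim and the memory-tiering mode of NUMA balancing on recent Linux kernels. The knob paths and values reflect my understanding of upstream kernels and may differ by version; this is not something the talk prescribes.

```c
/*
 * Sketch only: turn on kernel-driven tiering on a recent Linux kernel.
 * Must run as root; paths and values are assumptions that vary by kernel.
 */
#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fputs(val, f);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Let reclaim demote cold pages to a slower tier instead of evicting them. */
    write_knob("/sys/kernel/mm/numa/demotion_enabled", "1");
    /* Mode 2 = NUMA balancing in memory-tiering mode: promote hot pages upward. */
    write_knob("/proc/sys/kernel/numa_balancing", "2");
    return 0;
}
```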
So this is one way of describing it. I'm in marketing, so I'm not going to go too much into this. But enough to say that we have the static resource affinity tables (SRAT) and the heterogeneous memory attribute tables (HMAT), which tell you what type of memory it is, how much bandwidth it has, and how much latency it has, and then the processor accesses are allocated to that type of memory accordingly.
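For a rough sense of how software consumes those attributes, the latency and bandwidth values that firmware publishes this way are exported by Linux under sysfs, so a tiering policy can discover them at runtime instead of hard-coding "near" and "far". The paths and units below are my assumption of the common layout; verify them on the target kernel.

```c
/* Sketch only: read per-node HMAT-derived attributes from an assumed sysfs layout. */
#include <stdio.h>

static long read_attr(int node, const char *attr)
{
    char path[256];
    long val = -1;
    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/access0/initiators/%s",
             node, attr);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;                  /* attribute not published for this node */
    if (fscanf(f, "%ld", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

int main(void)
{
    for (int node = 0; node < 4; node++) {   /* assume up to 4 nodes for the demo */
        /* Latency in ns, bandwidth in MB/s, per the sysfs ABI as I understand it. */
        long lat = read_attr(node, "read_latency");
        long bw  = read_attr(node, "read_bandwidth");
        if (lat < 0 && bw < 0)
            continue;
        printf("node %d: read latency %ld ns, read bandwidth %ld MB/s\n",
               node, lat, bw);
    }
    return 0;
}
```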
And so when it comes to tiered memory, we talk about page migration, and one of the ways of doing it is either you give the control to the software or you do it in hardware. There are pros and cons to both approaches. If you do it in software, typically your hypervisor or your application has a far better understanding of when it is seeing a slowdown in runtime performance. On the other hand, if you do it in hardware, then the performance is much better. It may lead to unexpected behaviour at times, but both approaches can be used, and we have seen both types being explored in the ecosystem.
So when it comes to page migration, typically, what do you do? You track the accesses going to a particular page. You see if there are page misses, hits, errors. And sometimes it's the hypervisor, sometimes it is some kernel-level module, which determines whether that page gets to stay in its hot or cold place or whether it needs to move. Whether you do page migration also depends on how long that application is running. We have heard of microservices and these serverless use cases where applications live for seconds; there might be scenarios where before you can migrate the page, the job is done. So again, it's a process where we have to go application by application and workload by workload to figure out if page migration will indeed be useful, or how it works to improve the performance.

One thing we also get asked about is, what is the security in this whole process? People are used to a core with DRAM attached to it, and that is part of a trusted stack: if you are hosted on public cloud, the cloud host doesn't know what application you are running, and that same privacy has to extend when you are dealing with CXL-attached or fabric-attached memory. That particular secure environment has to be extended to the DRAM, to the direct attached CXL buffers, or over the fabric. And that shows up as a lot of use cases where a lot deeper dive is required and a lot of debug is required.
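To make the mechanics concrete, here is a minimal, hedged sketch of software-driven promotion on Linux: once some policy has judged a page hot, move_pages(2) asks the kernel to relocate it to a faster node. The hotness detection itself is out of scope here, and the node numbers are illustrative assumptions, not anything stated in the talk.

```c
/*
 * Sketch only: "promote" a single page from an assumed far (CXL) node to the
 * local DRAM node. Build with: gcc promote.c -o promote -lnuma
 */
#include <numaif.h>
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int far_node  = 1;   /* assumed CXL / far tier */
    int near_node = 0;   /* assumed local DRAM tier */

    /* Start with one page on the far tier, as if it had gone cold earlier. */
    void *p = numa_alloc_onnode(page, far_node);
    if (!p) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(p, 1, page);                 /* fault the page in */

    /* Promote it: ask the kernel to migrate that one page to the near node. */
    void *pages[1]  = { p };
    int   nodes[1]  = { near_node };
    int   status[1] = { -1 };
    if (move_pages(0 /* this process */, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");
    else
        printf("page migrated, per-page status %d\n", status[0]);

    numa_free(p, page);
    return 0;
}
```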
So in this particular case, I would say the right-hand side represents the simplest way in which you could allocate a memory pool. It's a multi-headed device, and you have two hosts accessing distinct regions in the memory. Fairly simple. And that should be the first step, a solid step that the whole ecosystem should take to ensure that a memory pool can indeed be created. The middle picture is a more generic view where you have CXL switches, and you are able to have multiple hosts and multiple targets in terms of endpoints, and all of them can be accessed from any host. Naturally, this leads to other considerations such as loaded latencies, buffering of memory, how many reads to how many writes, and how the switch handles it.

The fun really starts in the first picture, where you now have multiple tiers and multiple hosts, with so many applications accessing different types of memories. What are the read-write patterns? What are the latency requirements? All of these become harder and harder to quantify. And again, this is the place where more application-level tuning is required, and sometimes you may have to put bounding boxes on it: these are the four types of applications that work on that particular system. Many of the enterprise use cases are fairly limited in the type of applications they use. So that's one way to achieve a high-performance system with pooled memory as you take the initial steps.
There are new functions, new devices, and new modules which will be required in hardware and software as we look at runtime memory allocation and pooling. Some people talk about hot plugging of memory; that's a use case which exposes a whole bunch of things and shows how compliant you are with the CXL spec. What also happens at times is that there are certain devices which may stay physically attached but change their functions. In that case, even though the hot plug is not a physical hot plug, it still looks like a logical hot plug. And as a result, you still have to deal with the fact that there was memory which was visible as part of the system, suddenly it has vanished, and maybe it came back as memory of a different type.

So this is something we had software vendors talk about, we had memory vendors talk about, and it has to be all pulled together in terms of some sort of a bake-off or some sort of a demo where everybody benefits from the learnings that have been gained as part of the CXL ecosystem.
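For a sense of what memory appearing and disappearing at runtime looks like to host software, the sketch below walks Linux memory blocks in sysfs and brings any offline blocks online, which is roughly what has to happen after a hot-add event before the allocator can use the new capacity. The paths follow the common memory-hotplug layout; the block range and the online_movable policy are my assumptions for illustration.

```c
/* Sketch only: report memory-block state and online offline blocks (needs root). */
#include <stdio.h>
#include <string.h>

int main(void)
{
    for (int blk = 0; blk < 64; blk++) {       /* assumed block range for the demo */
        char path[128], state[32] = "";
        snprintf(path, sizeof(path),
                 "/sys/devices/system/memory/memory%d/state", blk);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                           /* block does not exist */
        if (fgets(state, sizeof(state), f))
            state[strcspn(state, "\n")] = '\0';
        fclose(f);

        printf("memory%d: %s\n", blk, state);

        /* A newly hot-added (or logically re-exposed) block starts offline;
         * onlining it as movable keeps it easy to offline again later. */
        if (strcmp(state, "offline") == 0) {
            FILE *w = fopen(path, "w");
            if (w) {
                fputs("online_movable", w);
                fclose(w);
            }
        }
    }
    return 0;
}
```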
And then finally, what I would say is that the goal is to get to composable and disaggregated modes not because it's cool architecture, but because it solves problems related to cost, performance, and power. And if there is money to be saved or money to be made, people come together and they solve the problem.

And what we want is many, many use cases. We don't want just a cloud focus or an HPC focus. I have searched high and low for, say, a telco use case which required memory, or a fintech use case, or something related to healthcare, so we have to look across the spectrum in terms of verticals and applications to figure out what will work for CXL and what won't. And finally, as hardware vendors are extremely aware, you cannot pay anyone enough money to change software, so you have to work by keeping the application as it is. Many people have tried the option of "I'll give you more performance if you change your application," and it never really works.

So, application transparency: it shouldn't matter whether you're using CXL or not. It shouldn't matter whether you're running on direct attached memory, some sort of scale-out memory, or completely out of memory which is attached to a fabric. It should still look the same. And that's again the goal when we look at runtime composable memory.
I think that was my last slide. If we have time for questions, I can take them. All right. Thank you.
So, is AMD going to look at some development kit, like other vendors are, for memory pooling, tiering, and optimization of the software stack on the host? Are there any plans at AMD for such a development kit?
Of course. I really can't talk about anything product specific, but it's actually important. We are an active member of the consortium, and we see where the trends are going.
You mentioned the disaggregation and the composable part of it, and you mentioned security. But from a security point of view, what do you think are the problems to be solved, and is the consortium at least looking at solving them? Is it access control? Is it ensuring a trusted execution environment and things like that? Has the consortium already solved it, or is it all still to be solved? And what about link encryption as well? Maybe I asked that question earlier.
I really don't know. I just know that, just the way when you boot your server you say I have so much DRAM and so much CXL memory, there should not be any hesitancy from anybody saying, well, my data is residing in CXL memory, so is it safe? Is it good? I think it should be treated on par with regular tier-one, local socket attached DRAM. That's the goal. Again, I—
So you're saying it's more of a hardware platform service to assure the security of the components, and then the software will just consume it?
Correct. And sometimes there are cases where people ask whether you will use PCIe-based IDE link encryption or your own technology. These are sort of problems in flight; I don't think we have the final answer on all of these. But in general, from a server platform point of view, most of the ecosystem looks towards the server vendors to figure out how the security is being determined.
Okay. Thank you.