YouTube: https://www.youtube.com/watch?v=ksNts1CD6aY
Text:
Hello. I am Anil Godbole. I'm in Intel's Xeon product planning, specifically the CXL product planning and marketing group, and I'm also the co-chair of the CXL Consortium Marketing Workgroup. I want to thank Frank for inviting me to this MemVerge Memory Fabric Seminar. My main motivation is to convey to the audience the big momentum that is building behind CXL and the emerging use cases in the marketplace for the CXL value proposition.
So I'll begin by pointing out that the CXL spec has evolved from 1.0 to 3.1 today in a span of just four years, which speaks to the volume of interest behind the CXL protocol. As you know, PCI Express took more than eight years to go from version one to version three. So there is definitely a lot of momentum. A lot of companies are part of the consortium now, 250 plus, and they keep bringing in new ideas, which is why the spec keeps advancing like this. One consequence, however, is that the CPU manufacturers, the memory device manufacturers, and the other accelerator manufacturers need to keep up. Here we are in 2024, and the ecosystem is really at version 2.0. As you know, Intel is in the process of launching its Birch Stream platform with various CPUs that will support CXL 2.0, and on the device side, the switch makers and the memory device makers are doing the same as we speak.
So, I want to start by saying that we at Intel have done our best to keep up with the CXL specification roadmap; our roadmap is fully aligned with it. We were the first to introduce CXL, at version 1.1, in our Eagle Stream platform. As you can see, our Sapphire Rapids and the follow-on Emerald Rapids Gen 4 and Gen 5 Xeon CPUs were the first in the market to show off the CXL value proposition for memory addition. And here we are now with our sixth-generation Xeon launch on the Birch Stream platform. This time we have full support for all the CXL 2.0 spec features. Beyond basic CXL memory support, we have what I call enhanced support for CXL memory: built-in, hardware-based features for handling memory tiering inside the CPU. That is why I call it enhanced support for CXL memory. And of course we will also support memory pooling, which is a big feature of the CXL 2.0 spec. It is a big feature because, for the first time, CXL links will travel up from the back of the server node, up the rack, all the way to a card which is just a bunch of memory. That is a big event. To date, PCIe and CXL links have not traveled up the rack; only Ethernet and storage-related links do. So this is a big step for the adoption of CXL 2.0. And then, in the future, we will put out next-generation Xeons to support the 3.0 spec.
Next, I want to reiterate the main value propositions of CXL-attached memory, as I show here; some of you probably already know them. The first is increased memory capacity. Once you add more memory to the processor, everything usually improves: any process executes faster, and in a virtualized server you can run more VMs on the processor. One big effect is that your hits to the SSDs are reduced, because your memory footprint is so much bigger than with native DRAM alone. And in today's world we have all these memory-hungry workloads, listed below, like in-memory databases, AI/ML, high-performance computing, and media; anything with a big data footprint will benefit from this. The second, subtler value is improving the processor's memory bandwidth. It is intuitive that once you add more ways to access memory, the memory bandwidth typically goes up, but here we can also use a technique called address interleaving to tune the workload to use the full memory bandwidth of the system, which goes up dramatically, rather than just having a two-tier memory. Bandwidth-hungry workloads like AI/ML, HPC, and non-relational databases will certainly benefit from that. CXL also offers a way to lower TCO. CXL memory prices may not really be lower today, but CXL allows DDR4 memory, other previous-generation memory, or cheaper memory like persistent memory to be used behind a CXL controller. So it is hoped that, going ahead, using such techniques the cost of a DIMM with CXL memory will be lower compared to DDR5. And then there is the other big use case, which I won't talk about too much today: memory pooling. As I mentioned earlier, you will have "just a bunch of memory" (JBOM) cards within your server rack. This way the CSPs can optimally provision their motherboards; there are hundreds of thousands of them in a data center. DRAM, as you know, is the single highest-priced component on a server board today, so even if they can reduce it by about 25%, they will come out ahead. And for those workloads that need more memory, they can go to this memory pool at the top of the rack, borrow that memory, use it, and then release it. Those are the three main value props, I would say.
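To make the capacity point a little more concrete, here is a minimal sketch (not from the talk) of how an application might place a large, colder buffer on CXL-attached memory, which Linux typically exposes as a CPU-less NUMA node. The node ID, buffer size, and build command are assumptions for illustration only; check your own topology with numactl -H.

```c
/*
 * Minimal sketch: treating CXL-attached memory as a far, CPU-less NUMA node
 * and explicitly placing a large buffer there with libnuma.
 * Assumption: the CXL expander shows up as NUMA node 1 on this system;
 * node IDs will differ per platform.
 * Build: gcc cxl_alloc.c -o cxl_alloc -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CXL_NODE 1              /* hypothetical node ID of the CXL memory */
#define BUF_SIZE (1UL << 30)    /* 1 GiB working buffer */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return EXIT_FAILURE;
    }

    /* Capacity expansion: back a large, colder data structure with CXL memory. */
    void *buf = numa_alloc_onnode(BUF_SIZE, CXL_NODE);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return EXIT_FAILURE;
    }

    memset(buf, 0, BUF_SIZE);   /* touch the pages so they are faulted in on the CXL node */
    printf("1 GiB placed on NUMA node %d (CXL expander)\n", CXL_NODE);

    numa_free(buf, BUF_SIZE);
    return EXIT_SUCCESS;
}
```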
And now a little more education. Once you have CXL-based memory, when you boot up it natively comes up as two memory tiers, and there are two ways to handle that: a software-controlled tiering option and a hardware-controlled way. For the hardware-controlled way, I'll put in a plug for my employer: these modes are unique to Intel Xeons today. The software tiering option anyone can use, with any CPU. The first one, on the left, is about handling the second, far tier of memory, which has higher latency, so you don't really want to execute too much out of it. You use hot and cold page movement techniques to fool the workload into thinking all the memory is really local. The equivalent on the right side is the Intel flat memory mode, which does the same thing but, I think, more efficiently, in the sense that it moves data at cache-line granularity rather than waiting for a whole 4K page to transfer, and it is done in hardware, so there is no dependency on the OS. The second technique, on the lower left, is about increasing the bandwidth of the memory system. As I mentioned earlier, we use address interleaving here at page granularity; in the OS the page size is some 2K or 4K, whatever is chosen, and you interleave addresses at that granularity between the native DRAM and the CXL memory. Our processors also have a built-in hetero-interleaving feature: when you set the BIOS for the server to come up in this mode, the system address decoder does the address interleaving for you. Again, as I said, these two hardware modes are unique to Intel CPUs, and there is no reliance on the capabilities of the OS. Yes, the OS is still required for these modes, because error handling and such things are always required, but at least the OS is not being asked to move data or do things like that.
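As a rough illustration of the page-level interleaving idea described above, here is a hedged sketch of the software-visible analogue: spreading a buffer's pages across the DRAM and CXL NUMA nodes with Linux's MPOL_INTERLEAVE policy. The node IDs are assumptions; the hardware hetero-interleaving mode the talk describes is configured in the BIOS instead and needs no application code at all.

```c
/*
 * Minimal sketch: page-granularity interleaving of one buffer across the
 * native DRAM node and the CXL node using the kernel's MPOL_INTERLEAVE policy.
 * Assumptions: DRAM is NUMA node 0 and CXL memory is NUMA node 1.
 * Build: gcc interleave.c -o interleave -lnuma
 */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE (1UL << 30)    /* 1 GiB */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return EXIT_FAILURE;
    }

    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Interleave the pages of this mapping across nodes 0 (DRAM) and 1 (CXL). */
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 0);
    numa_bitmask_setbit(nodes, 1);

    if (mbind(buf, BUF_SIZE, MPOL_INTERLEAVE, nodes->maskp, nodes->size + 1, 0) != 0) {
        perror("mbind");
        return EXIT_FAILURE;
    }

    memset(buf, 0, BUF_SIZE);   /* fault the pages in; placement now alternates per page */
    printf("Buffer interleaved across DRAM and CXL nodes\n");

    numa_free_nodemask(nodes);
    munmap(buf, BUF_SIZE);
    return EXIT_SUCCESS;
}
```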
And then, just to show that the Xeon supports all four of those modes, I'm showing some examples that match the arrangement on the previous slide. The upper left is software-based memory tiering: an example Astera Labs showed on a Supermicro server using Memory Machine, where the number of transactions went up dramatically once CXL memory was used together with this page-movement technique. On the right side we show the same thing with a SAP HANA database, first running fully out of native memory, and then running half out of native memory and half out of CXL memory. We measured the first version, running fully out of DRAM, as the 100% baseline, and with this hardware-based memory tiering we saw only a 2% performance degradation. And there is a TCO play: we purposely did it using DDR4 DIMMs from an old system, so one could argue those DIMMs came for free; if you are a cloud service provider, you will have a lot of DDR4 DIMMs that will need to be recycled in the coming days. On the lower left is the demo I did with Micron; both Micron and MemVerge contributed to this Linux patch in version 6.9. We showed a Llama workload getting 23% more tokens when we augmented it with CXL memory and interleaved the pages between native and CXL memory. And lastly, for the hardware-based memory interleaving technique, I'm showing an AI performance workload related to bonus assessment and how it went up by 23%.
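Since the talk mentions the Linux 6.9 patch that Micron and MemVerge contributed to, here is a hedged sketch of the weighted-interleave interface that work is generally associated with, where per-node weights let pages be spread across DRAM and CXL in proportion to their bandwidth rather than strictly 1:1. The sysfs path, the MPOL_WEIGHTED_INTERLEAVE fallback value, the node IDs, and the 3:1 ratio are all assumptions for illustration, not details from the talk.

```c
/*
 * Sketch of weighted interleaving (Linux >= 6.9): set per-node weights via
 * sysfs, then bind a buffer with MPOL_WEIGHTED_INTERLEAVE so its pages are
 * distributed across DRAM and CXL in that ratio.
 * Assumptions: DRAM is node 0, CXL is node 1, 3:1 weighting chosen for
 * illustration; run as root so the sysfs writes succeed.
 * Build: gcc weighted.c -o weighted -lnuma
 */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6      /* assumed uapi value; absent from older headers */
#endif

/* Write an interleave weight for one NUMA node through the sysfs knob. */
static void set_node_weight(int node, int weight)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", node);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fprintf(f, "%d\n", weight);
    fclose(f);
}

int main(void)
{
    if (numa_available() < 0)
        return EXIT_FAILURE;

    set_node_weight(0, 3);              /* DRAM gets 3 pages ...            */
    set_node_weight(1, 1);              /* ... for every 1 page put on CXL  */

    size_t len = 1UL << 30;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 0);
    numa_bitmask_setbit(nodes, 1);

    /* Pages of this mapping are now spread roughly 3:1 across DRAM and CXL. */
    if (mbind(buf, len, MPOL_WEIGHTED_INTERLEAVE,
              nodes->maskp, nodes->size + 1, 0) != 0)
        perror("mbind (kernel older than 6.9?)");

    memset(buf, 0, len);
    numa_free_nodemask(nodes);
    munmap(buf, len);
    return EXIT_SUCCESS;
}
```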
So that is basically what I wanted to say to the audience, and it brings me to my last slide. To summarize: the CXL protocol has evolved rapidly over five years, and the ecosystem players are doing their best to keep up. They have done quite well; version 2.0 devices are already out there. With memory-intensive workloads dominating the computing landscape today, CXL definitely has a big value proposition. And yes, the Intel Xeon roadmap fully supports CXL, and we have certain hardware-based modes that do not depend on the OS's data-movement capabilities. And yes, the CXL protocol is here to stay; it has full support from all the major players in the computing industry. Go check out the consortium's website. Okay, that's my talk.