Right. Hello, everyone. Good afternoon. So, this is Vishnu Balraj from Micron. I'm part of the product management team. So, it's going to be exciting today. We'll talk about how to enhance AI and database performance using CXL devices.
It was because of your new label. I didn't...
Yeah, this is a new Micron label, yeah. So, we'll see how to... Specifically, the focus is going to be on CMMs—CXL memory modules are what we're going to talk about.
So, let's look at the high-level data center challenges we have today. Most of you know about the memory and storage hierarchy, so I'll just call it out so that you have context. Following that, Micron has two CMM products: one is the CZ120, which we have already released and is in production, and the other is the CZ122, which we'll talk about. Then the exciting part of this presentation will be the use cases—we have some demo booths outside as well. We have vetted them internally and are working with many partners on them, so we'll talk about that. And then finally, how to enable CXL in your company. So, those are the topics.
So, just to touch upon the data center challenges: in-memory databases, SAS, and then AI—specifically RAG. We'll talk about RAG, so if you're curious about how CXL plays in AI, I'll be emphasizing the usage model around retrieval-augmented generation. These workloads require more capacity, and that's one of the major challenges: as the compute grows, the memory capacity also has to grow, and databases require a lot more capacity. That's the first challenge. The second one is bandwidth—you know that last week AMD released the Turin processor, where the core count tops out at 192 cores. That much compute requires more bandwidth, so how do you address the bandwidth needs of the additional cores? At the same time, these workloads need to be optimized for TCO while demanding higher bandwidth and capacity. So how do you ensure we optimize the TCO using CXL—how can we overcome these challenges using CXL?
So, if you look at it, this is the memory and storage hierarchy—all of you know about it. As you go up toward HBM, you get lots of bandwidth, and as you go down, latency increases and capacity increases as well. CXL sits in kind of a sweet spot, where you can get the best of both capacity expansion and bandwidth expansion, so we'll see how to achieve that.
So, the first one here—CZ120—is the product that's in production now. It's a CMM in the E3.S 2T form factor, a CXL memory module you can plug in; later on, I'll show that we have worked with Supermicro so it can be plugged into a server. We are offering 128 GB and 256 GB capacities, and this is based on the CXL 2.0 specification. We are offering industry-leading low latency and high memory bandwidth—we top out at 37 GB per second, again in the CMM category, over PCIe Gen 5 x8 lanes. The way we measured this was with the MLC benchmark at a 2:1 read-write ratio, which is close to the roughly 70/30 mix most workloads require, and we are able to demonstrate that. It's a standard E3.S 2T form factor, and it's already launched and in production, so you will be able to get it.
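To make this concrete from the software side, here is a minimal sketch, assuming a Linux host where a CMM like this enumerates as a CPU-less NUMA node, of how you might confirm the module is visible and how much memory it contributes before pointing a bandwidth tool such as Intel MLC at that node. The sysfs paths are standard Linux interfaces; nothing here is Micron-specific, and the "CPU-less node" assumption is mine, not from the talk.

```python
# Minimal discovery sketch: list NUMA nodes and their memory from sysfs so
# you can spot the CXL-backed node (it normally shows up with no CPUs).
from pathlib import Path

def numa_nodes() -> dict[int, dict[str, str]]:
    """Return {node_id: {"cpus": ..., "mem_total_kB": ...}} read from sysfs."""
    nodes = {}
    for node_dir in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        cpulist = (node_dir / "cpulist").read_text().strip()
        mem_kb = "unknown"
        for line in (node_dir / "meminfo").read_text().splitlines():
            if "MemTotal" in line:
                mem_kb = line.split()[-2]  # value is second-to-last token, in kB
        nodes[node_id] = {
            "cpus": cpulist or "(none - likely CXL memory)",
            "mem_total_kB": mem_kb,
        }
    return nodes

if __name__ == "__main__":
    for nid, info in numa_nodes().items():
        print(f"node{nid}: cpus={info['cpus']}, MemTotal={info['mem_total_kB']} kB")
```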
So, the second product that we have here is the CZ122. QS is already done and we are sampling it, so this product will be available soon. The difference between the CZ120 and CZ122 is that this one has hetero-interleaving support as well as metadata support. Some major platform vendors require metadata, and with metadata you can do the hetero interleaving. A lot of speakers talked about heterogeneous interleaving earlier; what they are describing is hardware-based heterogeneous interleaving, and with the CZ122 you will be able to enable hardware-based interleaving on all major CPU platforms. From a usage-model standpoint, what does that give you? You can just plug it in and get both capacity and bandwidth expansion without any software-based tiering. Software-based tiering can still be done, but you get both bandwidth and capacity out of the box, and that's the main advantage of the CZ122. In addition to that, we have the RAS capability I've called out here. In terms of RAS, we could talk for an hour about the features we are supporting on the CZ122, so I'll just call out a couple of them. One is the internal CVME threshold we support: you can program a memory-error threshold, and once the device hits that number of memory errors, it automatically sends a notification back to the OS to do page offlining. That is really an OS feature, and we built support into the device to make sure it works seamlessly with it. The second one is device-initiated PPR—hard PPR. When the device boots, it enables hardware-based post-package repair and fixes any failing memory rows it finds. So, those are two of the features we are offering in the CZ122. QS is done, and it will be in production soon; we'll keep you posted on that.
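To make the page-offlining flow concrete, here is a minimal, hypothetical sketch of the OS-side action such a notification would trigger on Linux: soft-offlining the affected physical page via the kernel's standard memory-failure sysfs interface. This is not Micron driver or firmware code, and the physical address is a placeholder—in practice it would come from the error record delivered with the notification.

```python
# Hypothetical illustration of the OS-side response to a corrected-error
# threshold notification: soft-offline the affected physical page so new
# allocations avoid it. The sysfs file is a standard Linux interface
# (CONFIG_MEMORY_FAILURE) and requires root.

SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"

def soft_offline_page(phys_addr: int) -> None:
    """Ask the kernel to migrate data off the page at phys_addr and retire it."""
    with open(SOFT_OFFLINE, "w") as f:
        f.write(f"0x{phys_addr:x}\n")

if __name__ == "__main__":
    # Placeholder physical address, for illustration only.
    soft_offline_page(0x1_2345_6000)
```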
All right, so this slide got a little messed up here. Okay, so I'm going to talk about scale-up with CXL. I do have the picture here; this is something we have collaborated on with AMD and Supermicro. What we have done here is: if you take a baseline today with 128 GB RDIMMs, you can only run a certain number of VMs. By adding CXL modules, you can run a greater number of VMs—that's what we are trying to demonstrate. As the number of cores increases, you can get better utilization of the processor by adding CXL, because some of these database workloads require more memory. I apologize for the blank here, actually; I need to check. The second one we are talking about is RocksDB performance; for this one, we have worked with H3 Platform. We are introducing a concept called Famfs, and I think this is probably the first time, with CXL 2.0, that we are enabling a sharing model over CXL. I'm going to show how we can use Famfs to enable sharing over CXL and improve overall system performance. The third one is RAG—Retrieval-Augmented Generation, which everyone knows about. With CXL, you can improve the time to first token as well as the overall latency of a RAG-plus-LLM system. That's what I'm going to talk about later, and it's something we are working on with MemVerge.
So, the first use case: let's look at this. Here the VMs are running in direct-attach mode; that's the baseline we're talking about. Using TPC-H as the benchmark, we're able to fire up around 45 cores or so, and each TPC-H instance requires around 34 GB. On the direct attach it's a 12-channel system with 128 GB DIMMs, so about 1.5 terabytes total. You quickly run out of that, because the number of cores is so high and the workload needs more memory to get good enough performance. The DRAM is called out as NUMA node 0—that's the baseline—and the CXL memory is NUMA node 1. So, what happens? If you look at the graph on the right-hand side: with DRAM alone, the scale-up factor is normalized to one—so many VMs are fired up and running. Using CXL plus DRAM, you're able to increase the overall memory utilization and run up to 70% more VMs, a 1.7x improvement. But now one can ask: what happens to overall performance, since CXL has higher latency? What we have seen in this benchmark is that the gray line is with DRAM only and the purple line is with CXL added. If you look at the overall performance when you add those additional VMs—up to 70% more—your average performance goes down by only 6%. The CXL latency has not significantly impacted performance, and CXL is able to provide the bandwidth these workloads actually need. So, that's point number one: we are able to increase VM density by up to 70%. The second point I want to bring up is that this is a scale-up story. If you want to run more VMs, each running database or data-centric workloads, you can either add a new system or scale up using CXL. In this model, at current price points, we added up to one terabyte of capacity with 256 GB CXL modules, and your performance per dollar goes up by 1.2x if you take the baseline server as one. You can push this further by adding more CXL—in some servers, you can add up to eight CXL devices—so you get more memory and more performance. That's one way to scale up without adding a new server. This is one usage model.
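As a rough illustration of the setup described above, the sketch below shows one common way to spread a workload's pages across the DRAM node and the CXL-backed node with numactl. The node numbers and the workload command are assumptions for illustration, not the exact demo configuration; on a real system, check `numactl --hardware` to see which node the CXL memory enumerates as.

```python
# Minimal sketch: launch a memory-hungry workload with its pages interleaved
# across the DRAM NUMA node (assumed node 0) and the CXL NUMA node (assumed
# node 1), so capacity from both pools is used.
import subprocess

DRAM_NODE = 0
CXL_NODE = 1

def run_interleaved(cmd: list[str]) -> None:
    """Run `cmd` with its memory interleaved across DRAM and CXL nodes."""
    subprocess.run(
        ["numactl", f"--interleave={DRAM_NODE},{CXL_NODE}"] + cmd,
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical TPC-H-style database run; replace with your own command.
    run_interleaved(["./run_tpch_query.sh", "--scale-factor", "30"])
```

Binding instead of interleaving (for example, `--membind` or `--preferred` toward the DRAM node) is the other common placement choice; which works best depends on how latency-sensitive the workload is.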
And this shows what server we used here. At the bottom you can see it's the Supermicro H13 petascale storage server. It has CXL slots in the front—you can populate up to four CXL modules. This is the storage server, and this is where the evaluation was done. Yeah.
And so, I'll go to the next one. This is RocksDB. If you look at the left-hand side, this is how RocksDB has traditionally been run: you have storage where you put the database, and you have two VMs running on top, with the same database being read by the RocksDB application in each. What happens in this case is that VM1 and VM2 each have locally attached DRAM, so they hold duplicate copies of the same database. That's the traditional, DRAM-only system. In the system on the right, we have CXL. The DRAM is still there, but it's not used for the database—I'll explain what I mean. This is where Micron is introducing a concept called Famfs, which stands for Fabric Attached Memory File System. The idea takes the DAX concept—direct access, where the application can directly access the memory—and merges it with file sharing into Famfs. That way, the entire database is loaded onto the CXL modules—a total of one terabyte here—and each application can access it directly. This is, again, read-only, not read-write. The applications use it directly without buffering copies in local DRAM, and with that you get much better performance when the databases are large. The other important point is that the duplication is eliminated, so you save there as well. We have demonstrated this with four CXL E3.S modules in a server. One can also enable this across multiple nodes, where you have a switch and CXL 2.0 devices up to 5.5 terabytes, and you can share that memory across the nodes—that's another huge advantage. I'm going to show you the benchmarks we have done. Famfs is something Micron has taken the leadership on; we have pushed the patches to the kernel, and our engineers are working on it. I cannot give an exact timeline, but in the future there will be a kernel available with Famfs that our customers can use.
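Here is a minimal sketch of the sharing idea, assuming a Famfs (or any DAX-capable) filesystem is already mounted at a hypothetical /mnt/famfs path with a read-only database image placed on it. Real RocksDB integration happens at the storage layer rather than through a hand-rolled mmap like this, so treat it purely as an illustration of why every reader can share a single copy instead of buffering its own in local DRAM.

```python
# Minimal sketch: map a shared, read-only database image that lives on a
# DAX-backed (e.g., Famfs over CXL) mount. Every process that maps it shares
# the same backing pages, so there are no per-VM duplicate copies.
import mmap
import os

DB_PATH = "/mnt/famfs/rocksdb_image.bin"  # hypothetical shared database image

def open_shared_readonly(path: str) -> mmap.mmap:
    """Map the shared image read-only; pages are served from the CXL-backed mount."""
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    return mmap.mmap(fd, size, prot=mmap.PROT_READ)

if __name__ == "__main__":
    db = open_shared_readonly(DB_PATH)
    print(f"Mapped {len(db)} bytes; first 16 bytes: {db[:16].hex()}")
```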
So, here I will show you the results of the benchmark we have done. On top you see operations per second—the overall performance we are looking at—and the x-axis is database size divided by DRAM size. The orange line is DRAM plus SSD. What we have done is keep increasing the database size. On that first system, the ops per second is really good, and then it suddenly goes for a toss once the database size grows beyond the DRAM size: your effective DRAM is small because you are duplicating the data, it's used as a cache, and then requests go to storage, so performance really drops. What we found interesting is the comparison: the blue line is with two CXL cards, where performance is lower, and the yellow line is Famfs with four CXL cards, whose performance is equivalent to DRAM in terms of operations per second. And since everything sits in a single pool of CXL shared memory, you can keep increasing the database size and still get the same sustained performance over time. The bottom graph is P99 latency—by the way, it's on a logarithmic scale. Initially, DRAM latencies are better at the first few points, when the database size is small, because DRAM is faster. But as the database size increases, the P99 latency for DRAM goes for a toss, while CXL stays consistent. That's the best part: you are not doing a buffer copy, everything is sitting on CXL, and many different RocksDB instances are able to use it. This does not require any changes to RocksDB, so it just works. The deduplication also saves overall system power—we haven't quantified that—but there are a lot of benefits, and this is possible today with direct attach as well as with multiple nodes.
So here we have collaborated with H3 Platform. As I said, it's one terabyte total in this setup, and it can go up to 5.5 terabytes overall: you can take 256 GB modules, attach them through the switch to many nodes, and many nodes can use the same database.
And this is the last, but not least, usage model I'm going to talk about. RAG stands for Retrieval-Augmented Generation, and it's becoming more prevalent in enterprises. The idea is that, with LLMs having been trained some time ago—last year, or a couple of years ago—everyone always wants the latest and greatest data, and RAG is useful in that scenario. The second scenario where RAG is used is when enterprises have their own datasets: they put the dataset into a RAG system, do a similarity search, and add the retrieved context to the prompt. That additional context is given to the LLM, which takes it and generates the answer. Assume the LLM is running on a GPU. What we are talking about here is how CXL can help with the RAG pipeline, because the overall performance is measured by TTFT—time to first token, one of the key metrics—and by total latency. If your RAG latency increases, your TTFT and overall latency increase, so it's important to optimize the RAG stage. That's point one. Now, this RAG stage today is typically run on CPUs, because it's not very compute-intensive, and on the CPU side we already have CXL. And as I said earlier, RAG requires large databases, because we are talking about a vector database: you take the existing dataset and convert it into embeddings and indexes, and it blows up in size. Some of our internal analysis indicates this could grow to six or eight terabytes, all of which sits in storage today. The more of it you can push into CXL memory, the better the performance. That's how you reduce the overall latency: when the user query comes in, the retrieval search adds latency, and the user gets impatient. So this is where we are looking at adding CXL to improve overall AI performance. The LLM part still runs out of HBM on the GPU, and we will look at that separately; the RAG part is what we are talking about here. Again, we haven't quantified exactly how much it improves; we are putting the data together, and hopefully by Supercomputing we'll be able to share some results. But this is another strong use case where we think our CZ122 Micron modules will be able to improve AI inference latency and response time when using RAG.
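As a sketch of where that CXL capacity plugs into the pipeline, the example below memory-maps a large corpus of embeddings from a hypothetical CXL-backed mount and does a brute-force top-k similarity search. The path, embedding dimension, and scoring are illustrative assumptions; a production RAG system would put a proper vector index (FAISS, Milvus, and the like) on top of that memory rather than scanning it by hand.

```python
# Minimal retrieval sketch: keep a multi-terabyte-class embedding matrix
# memory-mapped from CXL-backed memory instead of loading it from SSD into
# limited DRAM, then score a query against it.
import numpy as np

EMB_PATH = "/mnt/cxl/corpus_embeddings.f32"  # hypothetical location of the embeddings
DIM = 768                                    # assumed embedding dimension

def load_embeddings(path: str, dim: int) -> np.ndarray:
    """Memory-map the corpus embeddings; pages fault in from CXL memory on demand."""
    return np.memmap(path, dtype=np.float32, mode="r").reshape(-1, dim)

def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k highest dot-product scores against the corpus."""
    scores = corpus @ query
    return np.argsort(scores)[-k:][::-1]

if __name__ == "__main__":
    corpus = load_embeddings(EMB_PATH, DIM)
    query = np.random.rand(DIM).astype(np.float32)  # stand-in for an embedded user query
    print("Top documents:", top_k(query, corpus))
```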
All right, okay, now this is more of a call to action. If you're further interested in Micron CXL devices like the CZ122, we have a website—micron.com/cxl—and a program called TEP, the Technology Enablement Program. We suggest you sign up for that; we'll get you support, collateral, data sheets, and all that good stuff, and we'll work with you to enable CXL in your company. So that's all I have—almost 20 minutes. Can I take any questions?
Thank you, Vishnu, and also thank you for co-sponsoring this event.