All right. So I'm going to switch hats here, kind of like Jay did earlier. So I'm going to give a presentation instead of being the host this time. So I'm going to talk about CXL and NVMe collaborating for computation. It's going to tie a few things together that we've talked about today. But before I begin, I want to give credit to a number of folks. Certainly, I won't call everyone out by name, but there are many people, both in this room and likely watching the video, who have been working and meeting over the last couple of years to put together a lot of the ideas that I'll share with you. So these are not all my ideas; they're the work of groups, and I definitely want to give credit to those groups. You know who you are, so thank you.
All right. So briefly on the agenda, I thought I'd start with just level setting on computational storage. I know Gary gave us a good presentation earlier, but I'll just talk very briefly for anybody who isn't familiar with some of the architecture work that's been done in SNIA and also the implementation that exists in NVMe. Then jump in as quickly as possible into combining CXL and NVMe, and then take a look at use cases.
So maybe many of you have seen this. These are the three different architectures that the SNIA Computational Storage TWG has defined. On the left, we've got the computational storage processor. So this is a device that has computational storage resources, but no storage. In the middle, we have the computational storage drive. This has actual device storage in addition to those resources. On the right is a computational storage array, which is an array like you'd be familiar with. It has computational storage resources. It may even have drives that are computational storage drives in addition to regular drives. So one key point, and it may be difficult to see because it's pretty small even from up here, is this says fabric that's connecting the hosts to the device. The architecture doesn't specify what that fabric should be, which is great news because that allows us to use NVMe or CXL, or Ethernet, or whatever we want. The other thing that you'll probably notice is that all three of these diagrams have this teal-colored block with some sub-blocks. It says computational storage resources. I realize again in the back, it may be difficult to see.
Well, here's a blow-up of that block. So I won't touch on all these. I'm just going to hit a few of them just because they will come up in the rest of the presentation. So the first one I'll touch on is a computational storage function. So this is the function that we want to execute. It's abbreviated CSF. Computational storage engine, or CSE: this is the resource that can execute, or be programmed to execute, a sequence of commands. So an example of this would be a general purpose CPU. Certainly not the only example, but that is one example. I'll skip CSEE and jump down to FDM, function data memory. So your function needs memory, either to store the function itself that's executing, so the program, or to store the data that is being processed, either user data or the output from that. And we may have a large block of that memory, and we want to be able to partition it and allocate some of it to a specific function. So we've got the notion of allocated function data memory.
So with that, that's the architecture that SNIA has defined. The same people who are in SNIA took that to NVMe and then went forward with an actual implementation, with NVMe as the fabric. Obviously, we know NVMe is the interface of choice for SSDs these days. So this is a very, very high-level view of what NVMe then implemented. So in NVMe, there are two new command sets. The first one is the computational programs command set, and it introduced a compute namespace. And then the second one is the subsystem local memory command set, and that introduced a memory namespace. And so if we look at this block diagram that's shown, there are now three namespaces that exist in a computational storage drive that uses NVMe. So we still have our NVM namespace that has the storage media that we're all familiar with, and then we've got these two new ones. We've got the subsystem local memory namespace and the compute namespace. The compute namespace has the function or the program that we want to execute. It also has the ability to access the memory. So in this case, it's the subsystem local memory. It can do that on a DWORD basis, as you would expect a program to be able to do, and then perform operations on the data. So obviously, a few different names, but they mean the same thing. So we've got subsystem local memory in NVMe. It's the same as the function data memory that I shared on the previous slide. A computational storage engine is a compute engine in NVMe, same thing. The CSF is a program. FDM, I already touched on, and then the AFDM is a memory range set, which is a way to partition the memory in subsystem local memory. The presentation won't get into that, so I'm not going to dive into those details today. Same thing with device storage, which is the NVM namespace. So that was a whirlwind tour of computational storage. And so if this is new to you and you want to get more depth, feel free to grab me afterwards. I'll be glad to do that. But I didn't want to spend a lot of time on that.
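To make the command flow a little more concrete, here is a minimal, hedged sketch of how a host might issue an Execute Program command to a compute namespace through the Linux NVMe passthrough interface. The opcode value, the CDW field assignments, and the device node are placeholders I'm assuming for illustration; the real encodings come from the Computational Programs command set, so treat this as a sketch rather than the actual implementation.

```c
/*
 * Hypothetical sketch: issuing an Execute Program command to a compute
 * namespace through the Linux NVMe passthrough interface. The opcode,
 * the CDW field usage, and the device node are placeholders, not the
 * ratified encodings from the Computational Programs command set.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

#define ASSUMED_OPC_EXECUTE_PROGRAM 0x80        /* placeholder opcode */

/* Ask the compute namespace to run program `program_index` against the
 * memory range set `range_set_id` in subsystem local memory. */
static int execute_program(int fd, __u32 compute_nsid,
                           __u32 program_index, __u32 range_set_id)
{
    struct nvme_passthru_cmd cmd;

    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode = ASSUMED_OPC_EXECUTE_PROGRAM;
    cmd.nsid   = compute_nsid;                  /* compute namespace ID */
    cmd.cdw10  = program_index;                 /* assumed: which CSF/program */
    cmd.cdw11  = range_set_id;                  /* assumed: AFDM / range set */

    return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
}

int main(void)
{
    int fd = open("/dev/nvme0", O_RDWR);        /* assumed device node */
    if (fd < 0) { perror("open"); return 1; }

    if (execute_program(fd, /*compute_nsid=*/3, 0, 0) < 0)
        perror("execute program");
    return 0;
}
```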
So let's move into the main gist of it here: combining CXL and NVMe.
So Kevin earlier had kind of hinted at this in his presentation, that he's not certain there's a lot of value in it. He is one of the ones contributing, by the way. So he's very familiar with what we've been doing and what is in here. It's unfortunate that he had an urgent matter and had to take off. I'm definitely not going to give an overview of CXL. Kevin did a fantastic job of that earlier. But I did want to point out that he mentioned that type two devices would be a great choice for computational storage. Completely agree. But I think that's going to require more development. Type two devices are still very, very immature at this point. There's really no ecosystem. Some people would even say that there's more development needed. I just think it's going to need a lot more time. There's a lot of complexity there. Ecosystem development will have to happen. What I'm going to show you is everything using type three, which, as Kevin mentioned, is the more popular type that everybody is implementing today. And so for everything that we'll look at moving forward, please think type three. So that's CXL.mem. Yet we're still going to be using it with computational storage. So type two will come later.
So why bother combining CXL and NVMe? Well, a couple of thoughts that came to mind as I was putting this together. And actually, we heard it earlier. Jay mentioned it. I think a couple other people mentioned it. Memory and storage are converging. And so, you know, if CXL is kind of the popular way or the way moving forward to interface with memory, then shouldn't we also be considering how to use CXL in a storage-type environment? And I think the answer is yes. And I think computational storage is one great use case for that especially. And so let's get into that. Let's talk about that further.
So a few questions that have been in my mind, and in the group as a whole, as we've worked through some of this. What if the host could interact with the CSx using load store semantics? So instead of doing block transfers to the device, you could do load store right out of the CPU core. All right, let's think about that. What if the host could be coherent with the CSx memory? So we've got memory on our device, and we want it to actually be coherent with the host. So it isn't just private memory any longer that the device has. It's memory that is coherent. And as a matter of fact, what if that memory could be an extension of host memory? So now it's part of the host physical address space. And we're actually going to take advantage of that and use that to complete work, to do things that we need to get done without having to do data transfers.
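As a rough illustration of what load/store access to device memory could look like from the host side, here is a minimal sketch. It assumes the CXL-addressable portion of SLM is exposed to userspace as a devdax node; the device path and window size are made up for the example.

```c
/* Minimal sketch of host load/store access into device memory, assuming the
 * CXL-addressable portion of SLM is exposed to userspace as a devdax node
 * (the path and size here are illustrative assumptions, not from the talk). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    const size_t win = 2UL << 20;               /* assumed 2 MiB window */
    int fd = open("/dev/dax0.0", O_RDWR);       /* assumed devdax for the HDM range */
    if (fd < 0) { perror("open"); return 1; }

    uint8_t *slm = mmap(NULL, win, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (slm == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ordinary CPU stores and loads; CXL.mem plus the device's coherency
     * engine keep host caches and device memory consistent, so no DMA or
     * explicit flush is needed here. */
    memset(slm, 0xA5, 256);                     /* write one flit-sized chunk */
    uint8_t first = slm[0];                     /* read it straight back */
    printf("first byte: 0x%02x\n", first);

    munmap(slm, win);
    return 0;
}
```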
All right, so what does CXL bring to the table that we don't have in NVMe? So certainly the coherency is a big one without question, right? We have the ability to have the two different memories maintained coherently, automatically, by the protocol. CXL brings lower latency, fine-granularity accesses. I know Kevin said he wasn't sure about some of the performance. I think that definitely we might have to do some modeling. We might have to get with Adam over here and see if we can't do some modeling and figure that out. But I think that could be very interesting. I think there's potentially a lot of value. And as we know, CXL.mem allows someone to do a load store access directly to memory. And how is this different from CMB and PMR? Because you may be wondering that as well. Hey, we've already got that. Isn't that sufficient? Do we need any more? So I believe Kevin actually talked about it earlier. With CMB and PMR, you can only have load store access over PCIe using uncached MMIO space. So it's not cacheable. It's not going to be coherent. So we don't have those benefits that using CXL would provide. And then I have the last bullet here about it maybe being more efficient than PCIe. Kevin was questioning that. Certainly more work to be done to confirm that.
I have a very basic slide here on coherency. I'm not going to insult your intelligence. I think that we probably all are aware of what that provides us in terms of the same view of memory, the same view of shared data. We don't have to have as many copies or maybe not make any copies at all.
So what would we need to actually be able to do this? So I've kind of hinted at it already. We have to be able to address the memory that's down in the device. Currently, the memory that exists in drives today is private. It's not available to the host. So we want to be able to share that. And then we want to be able to use that memory with CXL.mem. So if we could do that, we could still use all the NVMe commands that we all know about for computational storage, including initiating compute using the computational programs commands.
All right, so let's move into use cases. Hopefully with the use cases, the vision here will become more clear.
So before I walk through this use case, let me explain what we've got going on with the diagrams that are here, because this picture gets repeated several times over the next few slides. So it'll be helpful to understand exactly what's going on. All right, so at the top, we've got the host. That's obviously our host system. It's got CPU cores, CPU cache, and our CXL interface, and host local memory here. Got a CXL fabric that's connecting the two together. And then we've got our computational storage drive down here at the bottom. So we've got a CXL.io interface with NVMe support. We've got a CXL.mem interface, a coherency engine, 'cause we're going to need to maintain coherency, of course, with whatever we're talking to using CXL. We've got our three namespaces that I spoke about earlier. We've got our NVM namespace, our SLM namespace, and our compute namespace. So this is an NVMe computational storage drive, exactly as NVMe currently envisions. And then I've split out here the SLM media. So this is the media for SLM, the subsystem local memory, split out in order to make it clear what's going on. All right, so then over here on the left, we've got a picture of the host coherent physical address space. So the blue portion up at the top, this is the host local memory. So that'd be this memory here that the host has. And then the light orange color is the SLM memory that's CXL addressable. So the colors of the arrows do have significance. A green arrow like this number one, that's a CXL.mem operation. A red dashed arrow is a CXL.io or NVMe type operation. And then the blue is internal to the drive, like these over here. All right, so that's the background on the picture. So all of the next few slides use this very similar looking picture. Hopefully that'll help. So in this first use case, we're gonna post-process data before writing it to storage. So the drive is gonna receive some data. We're gonna perform some kind of operation on it. And then we're gonna write it to the media. So the value proposition that this brings is avoiding copying data with DMA, and potentially lower latency with CXL. Our configuration for this case is we've got an input data buffer that's in the SLM memory that is CXL addressable. So this is our input data buffer here in the CXL portion. So that's this one over here. And then the output data buffer is just in traditional SLM. So it's not host addressable. It's just private to that device. And there should be no reason why we couldn't partition, right, to have that capability. So basically, if we walk through the example, then the application is gonna do a write to the input data buffer using CXL.mem. So at the end of that, some or all of the data may still be in the host cache. That's okay. Then when the host is done writing that data, it'll issue an NVMe execute program command to the compute namespace. And then the intent obviously is that the compute namespace is gonna operate on the data that's in the input buffer, write it to the output buffer, and perform that manipulation or operation that we want done before we write that data to storage.
Well, because the input data buffer is in the CXL addressable portion of our SLM, we can use a CXL back invalidate snoop that's part of the CXL 3.0 protocol to make sure that all of the data that could still be in the host cache gets flushed out to our device, gets sent to our device. And we've got the latest and greatest data, and now we can actually process through it. And so the computational program just sits there, churns away, does the computations, writes to the output buffer. When the execute program command completes, obviously the host gets notified. The host can then issue an NVMe copy command to copy that data from the SLM namespace to the NVM namespace, which writes it to the storage media. And that concludes the operation of writing the data, processing it, and then making certain it actually gets put into the storage.
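Putting use case one together, here is a hedged host-side sketch of the whole sequence: store the input through CXL.mem, kick off the program, then copy the result to the NVM namespace. The device paths, namespace IDs, opcodes, and CDW fields are all assumptions for illustration, not the ratified encodings from the Computational Programs or Subsystem Local Memory command sets.

```c
/*
 * Hedged host-side sketch of use case one: post-process data before writing
 * it to storage. Device paths, namespace IDs, opcodes, and CDW fields are
 * assumptions for illustration only.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/nvme_ioctl.h>

#define ASSUMED_OPC_EXECUTE_PROGRAM 0x80   /* placeholder opcodes */
#define ASSUMED_OPC_MEMORY_COPY     0x81

static int passthru(int fd, __u8 opcode, __u32 nsid, __u32 cdw10, __u32 cdw11)
{
    struct nvme_passthru_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode = opcode;
    cmd.nsid   = nsid;
    cmd.cdw10  = cdw10;
    cmd.cdw11  = cdw11;
    return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
}

int main(void)
{
    /* Step 1: write the input data into the CXL-addressable part of SLM with
     * plain CPU stores; some of it may linger in host caches, which is fine. */
    int dax = open("/dev/dax0.0", O_RDWR);             /* assumed HDM window */
    if (dax < 0) { perror("open dax"); return 1; }
    uint8_t *in = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                       MAP_SHARED, dax, 0);
    if (in == MAP_FAILED) { perror("mmap"); return 1; }
    memset(in, 0x5A, 4096);                            /* "application data" */

    /* Step 2: Execute Program on the compute namespace. The device uses a
     * back-invalidate snoop to pull any dirty lines out of the host cache
     * before it processes the input buffer. */
    int ctrl = open("/dev/nvme0", O_RDWR);             /* assumed device node */
    if (ctrl < 0) { perror("open nvme"); return 1; }
    if (passthru(ctrl, ASSUMED_OPC_EXECUTE_PROGRAM, /*compute nsid*/3,
                 /*program index*/0, /*range set*/0) < 0)
        perror("execute program");

    /* Step 3: copy the processed output from the SLM namespace to the NVM
     * namespace so it lands on the storage media. */
    if (passthru(ctrl, ASSUMED_OPC_MEMORY_COPY, /*SLM nsid*/2,
                 /*source range*/1, /*destination NVM nsid*/1) < 0)
        perror("copy to media");
    return 0;
}
```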
So use case number two, very similar, but kind of the opposite. So we're gonna do pre-processing of the data before sending it to the host. So in this use case, first of all, we have the same value propositions as before. Our configuration is flipped, of course, because now we've got our input data buffer just in ordinary SLM. And our output data buffer is in the CXL addressable portion of SLM. So we're gonna use an NVMe copy, and let me say it correctly, NVMe memory copy command to copy the data from our NVM namespace to the SLM namespace. So it reads the data out of the NVM namespace and writes it into our input buffer, which is just in ordinary SLM. When that completes, the host now knows that the data is available in that drive to be processed. So it can issue an execute program command to the compute namespace. The compute namespace will then begin to read through that data and write it to the output buffer. Since the output buffer is in the portion of the SLM namespace that is CXL addressable, the device has to make sure there isn't anything residual in the host cache, so it does a back invalidate up into the host to ensure there's nothing up there that would get out of sync, and then writes the data. And when that execute program command completes, at that point, the host can simply come and do a read of the output buffer because it's an extension of its own memory. It can just do a load operation directly from there without having to move it.
So case three is a combination of those two. So I'll try not to bore you with the same details too much. The one additional value proposition that this one brings is the ability to have general purpose compute offload. The picture shows a computational storage drive, but this could easily be a computational storage processor. And the next example is a computational storage processor. So our configuration is both our input and output data buffers are in the SLM memory that is CXL addressable. And so it's gonna be, like I said, a combination of the first two. The application is gonna write to the input data buffer using CXL.mem. We're gonna use an execute program command to perform an operation on the data, writing it to the output data buffer. Using the CXL 3.0 back invalidate snoop, we make sure that the input data buffer and the output data buffer are coherent with the host. And when the execute program command finishes, now the application can just come read the data directly out of that memory.
All right, let's take a look at use case number four. This one's maybe a little bit more interesting. So we've got data post-processing with a standard SSD. So we're gonna do something very similar to use case number one, where we're gonna post-process data that's been written to the drive. But then after we're done, we're gonna write it to a different drive. And in fact, as I mentioned in the diagram here, it's no longer a computational storage drive, it's a computational storage processor. Otherwise the internals are the same. So similar to the previous example, both our input and output data buffers are in the SLM namespace that's addressable by CXL. And so with that, we're gonna do something very similar to what we did before. The application is gonna write to the input data buffer. We're gonna issue an execute program command. Again, using the back invalidate snoop, we make sure the input data buffer and output data buffer are all consistent with what's happening in the host. We process the data, write it to the output data buffer, and the execute program command completes. When that execute program command completes, now the host is gonna issue an IO write to the SSD NVM namespace. So it's gonna issue a write over here to this drive that's sitting next to it. And it's gonna populate the data pointer and point it to the output buffer in SLM. So it's gonna point it to this output buffer, which it can access because it's just an extension of host memory. So at that point, it will perform the write operation. We are gonna use PCIe UIO. Kevin mentioned it earlier in his talk; I know he kind of glossed over it a little bit in the interest of time, but we will use that UIO capability to do essentially a peer-to-peer transfer from the HDM space in the output data buffer over to our SSD. And the huge value proposition that this example brings is that we were able to do a peer-to-peer operation completely bypassing the host. We didn't have to send it up through the host and back down, as Jay had mentioned in his talk about tromboning. We have avoided that. And that's exactly what we wanna do.
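As a last illustration, here is a hedged sketch of the final step in use case four: an ordinary NVMe write issued to the plain SSD, with the data pointer referencing the computational storage processor's output buffer in the CXL HDM window instead of a buffer in host DRAM. Whether the kernel will actually route this as a peer-to-peer transfer depends on platform and driver support for UIO and peer-to-peer DMA, so treat the mechanics as assumptions; the point is just that the source buffer lives in device memory.

```c
/*
 * Hedged sketch of use case four's final step: a standard NVMe Write to a
 * second, plain SSD whose data pointer targets the CSP's output buffer in
 * the CXL HDM window. Device paths, namespace IDs, and LBA layout are
 * assumptions; peer-to-peer routing depends on platform support.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    const size_t len = 4096;                       /* one 4 KiB block */

    /* Map the CSP's CXL-addressable output buffer (assumed devdax node). */
    int dax = open("/dev/dax0.0", O_RDWR);
    if (dax < 0) { perror("open dax"); return 1; }
    void *out_buf = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                         MAP_SHARED, dax, 0);
    if (out_buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Standard NVMe Write (opcode 0x01) to the plain SSD's NVM namespace,
     * with the source buffer living in the CSP's HDM rather than host DRAM. */
    int ssd = open("/dev/ng1n1", O_RDWR);          /* assumed device node */
    if (ssd < 0) { perror("open ssd"); return 1; }

    struct nvme_passthru_cmd wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode   = 0x01;                            /* NVM command set: Write */
    wr.nsid     = 1;
    wr.addr     = (uintptr_t)out_buf;              /* data pointer into HDM */
    wr.data_len = len;
    wr.cdw10    = 0;                               /* starting LBA (low) */
    wr.cdw11    = 0;                               /* starting LBA (high) */
    wr.cdw12    = (len / 512) - 1;                 /* NLB, zero-based */

    if (ioctl(ssd, NVME_IOCTL_IO_CMD, &wr) < 0)
        perror("nvme write");
    return 0;
}
```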
All right, so let me kind of conclude here. So I believe that CXL and NVMe can be used simultaneously. CXL brings load store access to SLM or could bring it. And it would also enable the host and the CSx to share data coherently. Meanwhile, we can still support the existing command sets, especially computational programs, all the computational storage work that's been done. Some of the benefits, so certainly coherency, potentially lower latency. We avoid copying. We didn't have to use buffers in DRAM in order to do all this data movement. And we bypassed the host for data movement. We were able to take advantage of a peer-to-peer operation. So I do think that these two technologies are on a path that will intersect. They will converge at some point. It probably is down the road a little bit further than where we are developing today, but hopefully we all start thinking in that direction. Certainly one step to enable that convergence and collaboration would be to have SLM support CXL. That doesn't exist today. We'll see if that actually happens or not. But that would certainly be one step to enable that as a capability. All right, so that's all I have. Thank you.
Very nice. This matches the Substrait thing almost perfectly, right? The Substrait plan gets pushed down. The query runs. Almost never are there enough resources to run the whole query in a single device or even in a CSP or a CSA. It's always distributed across a hundred or a thousand of these things in the real world, right? And so any sort or group by or any higher level operations that you have to do in a query can't be done at the device. There's no way to have that much resource or connectivity. So it's very natural to return a sub part, a reduced part, to a higher level, and the higher level ends up doing the sort for you. So if the host has access to the memory directly, you avoid a really big copy there. I mean, I think this matches almost exactly what the Apache ecosystem people are saying. So maybe one of your first demos could be with Voltron or somebody right in that space that pushes these concepts, 'cause it's sort of perfect for that.
Okay, great.
Hey Jason, that first sequence that you had in there, right, the load store to that input buffer using a CXL.mem, is this the standard load store or is this something else that's optimized for something other than like a cache line? Like can we, like typically for storage things, usually that buffer is at least a page and even larger. So from a performance perspective, is there a performance optimized load store for something other than a cache line?
I mean, it could be, but that isn't necessarily what's envisioned, because one of the benefits that CXL brings is the smaller transfers. What if we wanna deliberately do a computation on 256 bytes, which, as Kevin explained, is our flit size, without moving 4K, right? Now, today with the block transfer, you're probably gonna wanna move a lot more than 256 bytes. You don't have to, I suppose, but just from an overhead standpoint of setting all that up in NVMe, it'd be quite a lot. So I think that certainly if there were some new commands, then we could take advantage of that, but I don't know if we necessarily have to have that.
Okay, but I see the data is finally ending up in your storage media. Usually when we target the media, we do that in pretty large blocks.
Sure, you're absolutely right. And so don't disagree. Certainly one solution there is that while the host may be chunking in 256 byte flits, and the computation may happen in smaller increments, doesn't mean we can't aggregate more of it together in our SLM before writing it out.
Well, certainly, excuse me, slightly related to that: why would you have to write it out to the device and then have the device do the operations, so that you don't have to transfer it back to the host? Why wouldn't you just have done the operation on the host in the first place? I mean, is this an artificial example that you really wouldn't wanna do, or am I not following why you can't just do the processing first before you write it? Why do you have to write it, and then say, oh, now I don't have to read it back, I can process it in the device?
So I think that, I understand your question, and I think that obviously these are simplified examples. The goal is to offload the host, right? That's kind of the benefit that computational storage brings. So sure, absolutely, you could do that in the host, but that's a burden on the host. If the host can just say, this trivial operation needs to happen, I want it encrypted, let's say, thinking of our security guy over here. Sure, could the host do that? Absolutely, but what if the host just said, here, you go do it, you encrypt it, you do all that and take care of it, and then get it off to the media for me, and I'm gonna have more of a hands-off approach?
I was just wondering about the capability of the computational side of things, 'cause I know somebody presented CDN, and I know with CDN, content delivery for Netflix, Hulu and those, they do a lot of transcoding, and that piece of it consumes a lot of compute. And as somebody who works on the compute side, and on the VM side, they're selling cores, so they don't want a core to be doing background work. So anyway, is the computational side capable of doing something like video transcoding for CDN kind of applications?
So, yes, it depends on, of course, how you do it. Is a general purpose CPU that you would embed in a controller gonna do that? Maybe not, you might need a dedicated hardware engine, but somebody could absolutely do that.
So we encountered that same thing when we were trying to do deep compression using GZIP on streams, going to NVMe, and we just used a CSP, so we just slapped an FPGA in an NVMe slot, and told it it was in the game, and in the family, and it did the pipelined operations on the stream, so, I mean, that seems a reasonable thing to do. One thing about why couldn't the host do all this: it could, but in the case of a query, you've written all the data out years ago or something, and you don't know what you're gonna get asked for until the query runs years later or something, right? And so you don't have to move all that data very far if you push the query down. So there's a lot of cases that are trivial, and you just do it in the host while it's there, it's handy. And maybe you build the index as you write it because it's handy, but later on, it's not there anymore. It's set down on the storage, so moving it is the sin.
Yep. Yeah, absolutely.