This talk is a little bit about experience more than anything else, and how we can translate that into meaningful things as we move toward a fully composable memory architecture. My background, like Larry's, is storage, with a lot of time in networking as well. The case study we'll go into here is based on a fully transparent tiering product we built. I wrote about 50% of the code for it, so I know it well; it's very near and dear to me. We've had it out in the market for around 12 years, with about 14 million licenses and about a million simultaneous users at peak, so we learned a lot, and I lost most of my hair during that time, in case anyone's asking. You learn a lot from storage. The point is that scaling, high availability, redundancy, classification, RTO, RPO, all of these things start creeping in. For hyperscalers, maybe not so much, but certainly for enterprise, which is where I spent a lot of my time, they're going to be there. The other question is how we use all that knowledge to create effective tiering, software-defined versus hardware-defined. We touched earlier on hardware acceleration; that's going to become a very important thing for memory going forward.
Okay, with that, let's kick forward. Just as a reset, these are the CXL memory types we envisage. We actually made some of the first OpenCAPI/OMI-based devices out there, so I show a little picture of a CXL DIMM here; there are a few people we've had discussions with about that, but that's just to promote it. We all see the add-in card as the first real logical deployment in standard computing terms: a CXL controller with DIMM sockets on a card, and you even saw the Cisco/HPE proposal earlier with specific cassettes. It's all based on the same fundamental premise of PCIe/CXL out to multiple DIMMs. And this is real; I'll show you a picture at the end, because we actually have this now. The other piece is what people are calling the JBOM. I like the word BOM, because it can essentially blow up a lot of things if it goes off, so it has a double meaning. A JBOM chassis puts a lot of memory outside the server, and you attach it that way. And we've all seen the long-term picture, of course, of CXL 3.0 fabric switching. One of the demos we have in the Experience Center is switching through the XConn switch, for example, with our E3.S module.
So that's the big picture. This slide is designed to show all the potential tiers of memory that can end up in a fictitious system, and what kinds of latencies they have. You've got everything from HBM, near DDR, far DDR, NUMA hops, far memory, near CXL, and now switched CXL. One of the things we learned in tiering for storage was that you can model all of these, put them in a table, and assume they behave a certain way, but we decided not to do that. We decided to measure it. In other words, when you first power up your system, profile everything and make sure you understand that the far memory controller actually has X amount of latency. That's important because things drift and change over time: cables can come loose, error rates can go up, and things can happen that change the balance of your system. In tiering you can actually end up going in reverse; we had a situation where data was being promoted to the worst tier, not the best tier, because of changing conditions. In storage that can happen a lot more than it will in memory, but it's still something you have to be aware of. So lesson number one as you go forward with your architecture: measure, don't assume.
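To make the "measure, don't assume" point concrete, here is a minimal user-space sketch of boot-time latency profiling, assuming libnuma is available and that a simple dependent-load pointer chase is a good-enough probe; the buffer size, stride, and loop counts are illustrative only, not what any shipping product uses.

```c
/* Sketch: profile per-NUMA-node access latency at startup instead of
 * trusting static tables.  Assumes libnuma is installed.
 * Compile with:  gcc -O2 probe.c -lnuma                              */
#include <numa.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define BUF_BYTES (64UL << 20)   /* 64 MiB, larger than a typical LLC */
#define STRIDE    256            /* elements between chained pointers */
#define LOOPS     (1L << 22)

static double probe_node(int node)
{
    size_t n = BUF_BYTES / sizeof(void *);
    void **buf = numa_alloc_onnode(BUF_BYTES, node);
    if (!buf) return -1.0;

    /* Build a strided pointer chain so every load depends on the last. */
    for (size_t i = 0; i < n; i++)
        buf[i] = &buf[(i + STRIDE) % n];

    void **p = buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < LOOPS; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (p == (void **)buf)                 /* keep the chain live */
        fprintf(stderr, ".");

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    numa_free(buf, BUF_BYTES);
    return ns / LOOPS;                     /* avg dependent-load latency */
}

int main(void)
{
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    for (int node = 0; node <= numa_max_node(); node++) {
        if (!numa_bitmask_isbitset(numa_get_mems_allowed(), node))
            continue;
        double ns = probe_node(node);
        if (ns < 0) continue;
        printf("node %d: ~%.1f ns per dependent load\n", node, ns);
    }
    return 0;
}
```

Run periodically, the same probe also catches the drift described above, where a link that has degraded silently changes which node should count as the fast tier.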
Just a quick touch point on this. This was actually a product, this technology, that we distributed in my last life and now at SMART Modular. We initially put together an enterprise solution designed for general-purpose Linux servers, and Dell eventually distributed it as something called a virtual SSD. We also licensed it through AMD, and AMD distributed it, of all places, in the gaming market for SSD/hard-drive combinations. So we had everyone from expert users down to people who had no idea what an SSD was, just putting it into a system. It had to just work. That's the whole theme behind transparent tiering: it just needs to work. You don't want the application people getting involved; that's the ideal goal. There are various elements to tiering, and we've covered them before. Page virtualization: the operating system kind of does that for you; for our SSD product we had to invent our own, but you need that basic fundamental component. Hot page tracking, of course, which is a big theme of the whole CMS effort here at OCP. Cold page tracking: in this case we had to do both, because we had an architecture where we exchanged pages. We didn't just push one page up and push another down into a reservation; it was non-reservation based. We wanted to preserve every bit of storage presented to the application, so we didn't want to waste anything. Anything you pushed up, you pushed something down; it was a rotational exchange. Background migration. We didn't try to do caching; this is not caching. Caching and tiering, and you may have heard me preach about this in my prior life, are very, very different animals. Very similar effect, but very different animals, and tiering is a much more deliberate move. That leads me to the final point, policy management, which I'll get to a little later. It was important for us to create a set of policies that were automatic out of the box for most applications, but tunable for those who wanted to tune them. All of that led us to a whole stack with discovery elements, APIs, and virtual SSDs (we could present up to 15 or 16 of them), plus metadata functions we had to create. There are all kinds of things that go into making this a real product for deployment in the real world.
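As a rough illustration of that non-reservation, exchange-based model, here is a toy sketch in C; every structure and function name in it is hypothetical and only meant to show the promote/demote bookkeeping, not the actual product code.

```c
/* Toy sketch of the exchange model: every promotion of a hot page from
 * the slow tier is paired with the demotion of a cold page from the
 * fast tier, so the full capacity of both tiers stays visible to the
 * application.  All names here are hypothetical.                     */
#include <stdbool.h>
#include <stdint.h>

enum tier { TIER_FAST, TIER_SLOW };

struct vpage {
    uint64_t vaddr;        /* virtual page this entry maps             */
    uint64_t paddr;        /* current physical backing                 */
    enum tier tier;        /* which tier the backing lives in          */
    uint32_t heat;         /* decayed access counter (hot tracking)    */
    uint32_t idle_ticks;   /* ticks since last access (cold tracking)  */
    bool     pinned;       /* policy: never migrate                    */
};

/* Promote `hot` (slow tier) by swapping backings with `cold` (fast tier).
 * Real code would copy page contents and update the map atomically;
 * here we only swap the mapping to show the bookkeeping.              */
static bool exchange_pages(struct vpage *hot, struct vpage *cold)
{
    if (hot->pinned || cold->pinned) return false;
    if (hot->tier != TIER_SLOW || cold->tier != TIER_FAST) return false;

    uint64_t tmp = hot->paddr;
    hot->paddr  = cold->paddr;  hot->tier  = TIER_FAST;
    cold->paddr = tmp;          cold->tier = TIER_SLOW;
    return true;
}

int main(void)
{
    struct vpage hot  = { .vaddr = 0x1000, .paddr = 0x90000, .tier = TIER_SLOW,
                          .heat = 240, .idle_ticks = 0,   .pinned = false };
    struct vpage cold = { .vaddr = 0x2000, .paddr = 0x10000, .tier = TIER_FAST,
                          .heat = 2,   .idle_ticks = 500, .pinned = false };
    return exchange_pages(&hot, &cold) ? 0 : 1;
}
```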
So let me start with the result. Good transparent tiering involves little overhead. The storage picture we eventually achieved is shown here: the two lines, the red and the blue, are what happens when you go raw to the device versus through the tiering stack, and that difference needs to be near zero. Memory is going to be far more challenging to meet that goal if you're going software-based, unless you use hardware to help you in many cases. But that was the goal, and it took us a while to get there. You have some CPU utilization because we're software running on a CPU; it takes CPU cycles to manage and maintain all of these statistics, and we'll talk about what that looks like in a second. But the actual IOPS and latency need to see almost zero impact. That's very architecture-dependent, and it has to be finely tuned, but we finally got there with this solution. The same goal needs to be set for memory, otherwise people won't use it.
What would be the point? You might as well just buy more memory and directly map it, or figure out a way to use MRDIMMs instead of CXL, for example. A little bit on the architecture of this thing as it evolved. We started this whole journey 12 years ago, so it has evolved over time, but it has kept the same common theme. Going from left to right, a host I/O comes in on the left and goes through a virtual mapping function, and you keep as much as you can out of the data path; that's the whole point of putting everything top-heavy there, with an analysis and modification function running above. What we implemented was a very fast data path through to the NVMe device in this case, or, when we go to memory, think DDR5, or whatever the fastest CXL or NUMA node is that you identify as fast. Then you have these virtual remapping tables, what we call VMAP tables. They get pretty big as you go up toward the petabyte level; this architecture scales to about one or two petabytes overall, though we typically deployed about 120 terabytes at a time when we built it out. The point is you have this table and various functions around it. You have to maintain tables for virtualization, that's the VMAP table; you have to maintain counters to count page accesses; you have to virtualize and page things. A lot of this is built into the Linux kernel nowadays, so we've got to look for ways to augment the kernel to do it, and that work is going on. The stats table is very important. We put a RESTful JSON interface on it, plus proprietary interfaces, to be able to visually show what your loading looks like. You could have a live running heat map and literally watch it ebbing and flowing in real time. It was pretty cool to see; most people's first reaction was, "that shouldn't be running right now, I've got nothing going on, what's happening?" That kind of thing. It took a while to get to this point. The other key thing about this architecture was to push most of the decisions into the background, after the fact. It's very hard to steer things on the fly, to analyze and steer at the same time. So there was a two-second default tick where you look at your statistics tables, analyze them, rank things, and create a number of queues for where things should go; then you sit back and let it run. The net effect was that earlier picture: once you've mapped something to the fast tier, it runs at the fast-tier rate. That's the goal you've got to hit with tiering. If you don't pull that off and you end up with a 20% or 40% hit, it's like the first VMware; those who played with VMware when it first came out saw a 60% reduction in performance, and that's not good. You want to see things run at full speed. So that's very, very critical. And the other key thing is to keep it continually adapting without loading the system.
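Here is a simplified sketch of what that decoupled two-second analysis tick could look like in C; the table sizes, queue depths, and ranking heuristic are illustrative assumptions rather than the product's real algorithm.

```c
/* Sketch of the analysis tick: the data path only bumps per-page
 * counters; every ~2 s a background thread scans the stats table,
 * ranks pages, fills promote/demote queues, and a separate migrator
 * drains them out of the foreground path.  Sizes are illustrative.   */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define NPAGES       (1 << 20)
#define TICK_SECONDS 2
#define QUEUE_DEPTH  4096

struct page_stat { uint32_t heat; uint8_t tier; };   /* filled by data path */

static struct page_stat stats[NPAGES];
static uint32_t promote_q[QUEUE_DEPTH], demote_q[QUEUE_DEPTH];
static size_t promote_n, demote_n;

static int by_heat_desc(const void *a, const void *b)
{
    uint32_t ha = stats[*(const uint32_t *)a].heat;
    uint32_t hb = stats[*(const uint32_t *)b].heat;
    return (ha < hb) - (ha > hb);          /* hottest first */
}

static void *analysis_tick(void *arg)
{
    (void)arg;
    static uint32_t idx[NPAGES];
    for (uint32_t i = 0; i < NPAGES; i++) idx[i] = i;

    for (;;) {
        sleep(TICK_SECONDS);

        /* Rank all pages by heat. */
        qsort(idx, NPAGES, sizeof(idx[0]), by_heat_desc);

        /* Hot pages in the slow tier become promotion candidates,
         * paired one-for-one with cold pages in the fast tier.      */
        promote_n = demote_n = 0;
        for (uint32_t i = 0; i < NPAGES && promote_n < QUEUE_DEPTH; i++)
            if (stats[idx[i]].tier == 1 /* slow */)
                promote_q[promote_n++] = idx[i];
        for (uint32_t i = NPAGES; i-- > 0 && demote_n < promote_n; )
            if (stats[idx[i]].tier == 0 /* fast */)
                demote_q[demote_n++] = idx[i];

        /* Decay heat so old activity ages out between ticks. */
        for (uint32_t i = 0; i < NPAGES; i++) stats[i].heat >>= 1;
        /* A migrator thread would now drain promote_q/demote_q.     */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, analysis_tick, NULL);
    pthread_join(t, NULL);   /* runs until killed */
    return 0;
}
```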
So that's what it looks like. The other experience, and the fictitious picture here, is about what happens as we move forward to memory. You've now got a typical CXL node which, as that first picture showed, might be a NUMA hop plus a CXL hop, plus potentially a switch hop away, versus directly attached DDR5. So there's a latency difference, and the goal is to move stuff up. What we discovered in storage, even at the level of tens of microseconds, whereas memory is nanoseconds, was that where things got placed mattered. Because you're trying to be transparent and fit into the existing ecosystem, the kernel loads and places things for you, and what the CPU, the compilers, and the scheduler were doing mattered: the virtual mapping tables could end up on a CPU core that was distant from where you wanted the data handled, or from where the driver was. In storage that's almost never fatal; 100 or 200 nanoseconds of extra time is not so bad in an 80-microsecond world. But in memory it can easily double, triple, or quadruple your latency, and that's bad. Once you go to memory, you've got to get this right. It's what's driving a lot of the interest in hardware-based statistics and hardware-based acceleration functions, because you do not want to burn CPU time on these helper functions and then let the kernel rebalance them, because the only way you get throughput is by parallelizing your code, and the code gets distributed. At the same time, if you start applying CPU affinity and locking things down to a certain core, saying I want to make sure I run on core B, that might not be the best decision either, because you don't know what the application is trying to do. So that's one of the important lessons we learned: you've got to pay attention to the software and where it distributes your code as much as to what the hardware is doing.
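As a hedged example of keeping the helper work and its tables local, the sketch below allocates hypothetical VMAP tables on the fast node with libnuma and restricts a migration helper thread to that node's CPUs rather than one hard-coded core; node 0 is assumed to front the DDR5 tier purely for illustration, and a real implementation would pick it from the measured topology.

```c
/* Sketch: keep the helper thread and its mapping tables on the node
 * they serve, without pinning to a single core the application may
 * also want.  Compile with:  gcc -O2 affinity.c -lnuma -pthread      */
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>

#define VMAP_BYTES (256UL << 20)   /* illustrative mapping-table size */

static void *migration_helper(void *vmap_tables)
{
    (void)vmap_tables;
    /* ... drain promote/demote queues built by the analysis tick ... */
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) return 1;

    int fast_node = 0;   /* assumed DDR5 node; measure, don't assume  */

    /* Keep the VMAP tables on the same node as the fast data path.   */
    void *vmap = numa_alloc_onnode(VMAP_BYTES, fast_node);
    if (!vmap) return 1;

    /* Build a CPU set covering every CPU of the fast node.           */
    struct bitmask *node_cpus = numa_allocate_cpumask();
    numa_node_to_cpus(fast_node, node_cpus);
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned cpu = 0; cpu < node_cpus->size; cpu++)
        if (numa_bitmask_isbitset(node_cpus, cpu))
            CPU_SET(cpu, &set);

    /* Start the helper already confined to that node's CPUs.         */
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    pthread_t helper;
    pthread_create(&helper, &attr, migration_helper, vmap);
    pthread_join(helper, NULL);

    pthread_attr_destroy(&attr);
    numa_free_cpumask(node_cpus);
    numa_free(vmap, VMAP_BYTES);
    return 0;
}
```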
So what's going on in the Linux tiering world? We did all our work in closed source; we had to support Windows and Linux, and we had it running in KVM and all these different environments, so we had to stay closed. But looking forward, this has to be an open-source effort. What's going on today? Starting with Linux kernel 5.15 you have the beginnings of tiering with the kernel's NUMA-based tiered memory feature. What really drove a lot of that was Optane, rest in peace. I think it helped, though; it opened a window for us to start thinking about big memory with a very different latency picture than conventional memory. That's the one good thing Optane really did for us from a software standpoint. But now CXL adds whole new elements. Today we're talking about fixed CXL nodes, stuff that doesn't move because it's plugged into your system. Once you go to switching and things become dynamic, that changes, so, coming back to my earlier comment, you've got to keep track of those changes in your tiering software, because something might have moved to a different node or gone away entirely. You've got to be more dynamic in how you think about it, and there are already efforts in Linux to make memory more dynamic; today you can't unplug memory without something pretty fatal happening to your kernel, but that's changing, which is good news. NUMA-to-NUMA tiering already addresses some of this, but it needs to be enhanced to treat CXL as one of its hops. Topologies discovered only at startup no longer work in a dynamic environment; you've got to keep discovering, or get an interrupt to tell you something has been unplugged. Then there's hotness tracking in slower tiers without impacting performance. For us, tracking the slow tiers turned out to be more intensive than tracking the hot tiers, and that's the one that hurt more, because you were trying to move stuff up and essentially getting in your own way, so you have to be very careful how you do it. The other question is where to initially place data. In the end we gave the user the ability to specify memory mappings so that certain applications were automatically placed in the fast tier and others defaulted to the slow tiers. Starting out that way short-circuits a lot of learning and a lot of data movement if you can figure it out as the system comes up. Pinning was also a very useful tool we learned: tagging certain pages to say "don't move," or "you can move only after a certain amount of time," or "you're stuck in the slow tier, sorry, you can't make it to the fast tier," and mapping that to operating system elements or applications. That's one of the key themes there.
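For a flavor of what user-space "pinning" can look like today, here is a small sketch that binds a region to an assumed fast node with mbind(); the kernel-side demotion and balancing knobs are only mentioned in comments because their availability depends on kernel version and configuration, and node 0 as the fast tier is an assumption for illustration.

```c
/* Sketch: constrain a region to a given NUMA node set with mbind(),
 * so reclaim/demotion cannot move it off the fast tier, while the
 * kernel's tiering handles everything else.
 * Compile with:  gcc -O2 pin.c -lnuma                                */
#include <numaif.h>     /* mbind, MPOL_BIND */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 20;                       /* 64 MiB region */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Bind the region to node 0 (assumed fast DDR tier).  MPOL_MF_MOVE
     * migrates any pages that were already faulted in elsewhere.      */
    unsigned long fast_nodes = 1UL << 0;
    if (mbind(buf, len, MPOL_BIND, &fast_nodes, sizeof(fast_nodes) * 8,
              MPOL_MF_MOVE | MPOL_MF_STRICT) != 0)
        perror("mbind");

    memset(buf, 0, len);        /* fault pages in on the bound node */
    puts("region pinned to the fast tier");

    /* Kernel-side knobs (root required, availability varies by kernel),
     * shown only as comments:
     *   echo 1 > /sys/kernel/mm/numa/demotion_enabled
     *   sysctl kernel.numa_balancing=2   # memory-tiering mode        */
    munmap(buf, len);
    return 0;
}
```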
One of the other things. How much time have I got left? About two minutes. So, getting to one of the last messages here: I think promote/demote is going to be a bit of a challenge. Maybe this picture makes it look more complicated than it is, just to illustrate the point, but look at the picture on the bottom: the first step is NUMA tiering, and once you go to CXL you now have this new layer underneath. So you've got NUMA tiers and you've got CXL tiers, and where do you start demoting? When do you start pushing stuff down to the CXL tier versus keeping it where it is? You're not necessarily going to tier between NUMA nodes, but you've got to start worrying about where things tier to, and you can see how it easily gets pretty complicated. That's where policies come in. We can define APIs and policies, and they've got to be simple. Here's my default policy: a waterfall policy, for example, where I just go down the tiers one at a time, and push stuff up the tiers one at a time, and nothing can ever jump the queue. That's one way to do it. Another is to say, no, I always want to demote straight to the slowest tier; no matter how the page was behaving, it goes down to the bottom. Just putting three or four basic APIs in place to set a policy is going to be an important step as we evolve this whole tiering effort. Promote/demote was a challenge for us, and we eventually figured something out, and it did involve pinning. Pinning, in the end, associated with certain applications, gave the application layer an API and an extra control to say, no, I don't want that to move, I don't want it to ever move.
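Here is a deliberately tiny sketch of what those three or four policy choices might reduce to in code; the policy names and tier numbering are hypothetical illustrations, not a proposed API.

```c
/* Sketch of simple, settable demotion policies: "waterfall" steps one
 * tier down at a time, "to slowest" drops a cold page straight to the
 * bottom.  Tiers are ordered fastest (0) to slowest (num_tiers - 1).  */
#include <stdio.h>

enum demote_policy { POLICY_WATERFALL, POLICY_TO_SLOWEST };

static int next_demotion_tier(enum demote_policy p, int cur_tier, int num_tiers)
{
    int slowest = num_tiers - 1;
    if (cur_tier >= slowest)
        return cur_tier;                          /* already at the bottom */
    switch (p) {
    case POLICY_WATERFALL:  return cur_tier + 1;  /* one step down         */
    case POLICY_TO_SLOWEST: return slowest;       /* straight to the bottom */
    }
    return cur_tier;
}

/* Promotion under the waterfall policy is the mirror image: one tier up,
 * never jumping the queue straight to DDR.                              */
static int next_promotion_tier(int cur_tier)
{
    return cur_tier > 0 ? cur_tier - 1 : 0;
}

int main(void)
{
    /* A cold page in tier 1 of a 4-tier system (DDR, near CXL, far CXL, switched). */
    printf("waterfall:  -> tier %d\n", next_demotion_tier(POLICY_WATERFALL, 1, 4));
    printf("to slowest: -> tier %d\n", next_demotion_tier(POLICY_TO_SLOWEST, 1, 4));
    printf("promote:    -> tier %d\n", next_promotion_tier(1));
    return 0;
}
```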
And finally, I wouldn't be a storage guy if I didn't talk about HA and blast radius; we covered this in one of the sessions. Blast radius, obviously, with a JBOM; I really meant that earlier. One shared node can now take out 16 computers at once, instead of one memory module taking down one node. So we're going to have to start thinking carefully about this. Maybe it's not as complicated as this slide, but put the enterprise hat on: a lot of things we could get away with in the hyperscale world were possible because you can replicate data, and replicating huge amounts of memory becomes problematic. So we're back to how you stop this from blowing up your system when something goes out. This is how HA storage used to work, just to be clear; I'm not saying this is how we'll do it in CXL. But we have already had internal discussions with our sister compute group on how we handle failure, and that's a very, very important aspect of switching. You double up on your switches, you double up on your controllers. At what point do you do that? SMART Modular is part of SGH, along with Penguin Computing and Stratus, so we have a mix of folks who understand this problem pretty well and have been living with it for a long time.
OK, so that's it. SMART Modular has a demonstration with XConn running in the Experience Center. It's not physically running in the booth, but they're logged into a live demo, and that live demo is running our E3.S module. And just fresh out of the fab is this 8-DIMM board that's coming up now; we've got it talking to the DIMMs and talking over CXL. That one is aimed at a capacity play. So those are the kinds of products we've got running there.
And the call to action: get involved with OCP CMS. These two guys here have been doing a fantastic job of coordinating everybody. Check out our demo at the Experience Center and get involved. Download the specs; they've been an invaluable resource for folks like me who just want to see what everyone is doing, where we are, and what the state of the industry and the state of the art look like. So get involved. And hotness tracking is becoming a hot topic, I should say. All right, that's it. Any questions? Thank you.