YouTube: https://www.youtube.com/watch?v=AZkNAxFJ9XM
Text:
Hi, this is Gary Ruggles. I'm the product manager at Synopsys for the PCI Express and CXL controller IP, and I also run the solutions core team for those products at Synopsys. And today, I'd like to talk to you about enabling new memory applications using CXL IP.
So the basic flow is I'll show a couple of use cases for CXL, including some new things that I think are pretty exciting. Then I'll mostly talk after that about the Synopsys CXL IP solutions that are available for you to implement this technology in your next SoC. And that'll include some discussion of proof points and interoperability that we've done, and then a brief summary.
So first, I think it's helpful to understand the landscape and some of the things that are driving the need for CXL and the adoption of CXL. If you look on the left-hand side here, we've traditionally had memory tied to individual chips. Here, I'm showing it as what we're calling XPUs, or CPUs, DPUs, GPUs; you can take your pick. These chips would typically have memory either on the same SoC or, in some cases, in multi-chip modules where you might have HBM memory in the same package. But the point is the memory is tightly coupled to the chips themselves. And in the past, before CXL, a lot of this chip-to-chip communication was via something like PCIe. So with this as kind of the backdrop, we have this center box showing the rise in CPU core count. We're getting way more computing capability; this is going up like 3x from 2012 to 2020. And at the same time, the memory bandwidth per core is dropping. In other words, we haven't been able to fit as much memory per core into these systems to keep up with how many more cores we can cram in, so this creates kind of a memory bottleneck. And one of the great things about CXL is its ability not only to expand memory, which a lot of people have heard of, but also, as I'll talk about a little, to enable memory sharing and memory pooling and to improve memory utilization. On the right-hand side, we're just looking at the hierarchy of response times, the access times, for different kinds of memory, and seeing that CXL falls somewhere in the middle. And of course, this depends on your system architecture and what you're using CXL for. If we're talking about attached memory on one of these CXL memory add-in cards or something, then this is probably a pretty good indication. If you're talking about CXL that you're accessing maybe in a multi-chip module, it'd probably be quite a bit shorter than this.
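To make the "bandwidth per core" squeeze concrete, here is a minimal Python sketch of the arithmetic. The roughly 3x core-count growth is the figure from the talk; the bandwidth numbers are purely hypothetical placeholders chosen for illustration, not measured data.

```python
# Toy illustration of the "memory bandwidth per core" squeeze described above.
# The ~3x core-count growth is the figure cited in the talk; the bandwidth
# numbers are hypothetical placeholders, not measured data.

cores_2012, cores_2020 = 16, 48                   # roughly 3x more cores per socket
mem_bw_2012_gbs, mem_bw_2020_gbs = 80.0, 160.0    # hypothetical total DRAM bandwidth per socket

bw_per_core_2012 = mem_bw_2012_gbs / cores_2012   # 5.0 GB/s per core
bw_per_core_2020 = mem_bw_2020_gbs / cores_2020   # ~3.3 GB/s per core

print(f"2012: {bw_per_core_2012:.1f} GB/s per core")
print(f"2020: {bw_per_core_2020:.1f} GB/s per core "
      f"({bw_per_core_2020 / bw_per_core_2012:.0%} of the 2012 figure)")
```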
Also, to put this in context, I think it's worth looking at some data that we have here at Synopsys that tells you how quickly CXL is being adopted. What we're looking at here is the number of cumulative CXL licenses that we have done at Synopsys over the years. This includes PHYs and controllers, and it includes CXL 2 and CXL 3, so it's adding everything together. And if you look at what's happening here, this is actually superlinear growth. So we're in a phase where not only is CXL use expanding, but it's expanding non-linearly; it's accelerating. So that's pretty exciting, particularly for an IP provider like Synopsys. But it does mean these applications are really starting to hit the mainstream. As of today, we now have over 170 CXL controllers and PHYs that we've licensed, including more than 40 already for CXL 3. And we've also got over 35 with IDE. So even the encryption and security functions that are relatively new to the PCI Express and CXL standards are already supported for CXL 3 here at Synopsys.
I wanted to share this; you've probably seen it. If you're familiar with CXL, you've seen this slide showing the kind of waterfall of features as they're added into the different CXL standards. CXL 1.1, frankly, wasn't all that useful; it was basically device-to-host only. The main thing CXL 2 did was add switching, which is what I have up here at the top. We could switch only a single level, and we could pretty much only switch one type of device. When we moved to CXL 3, we added multi-level switching and multiple device types, so you could now switch CXL.cache, CXL.mem, and CXL.io, and some peer-to-peer accesses were enhanced. And then also, we added the capability to operate at 64 gigatransfers per second, which most of you know as PCIe Gen 6. So we had the ability to go faster, we had lots more switching capability, and we added memory sharing. I'll talk a little bit about that so it doesn't get confused with memory pooling, which was already supported. There was some enhanced coherency, things like Back-Invalidate, things that were designed to give a degree of symmetric coherency to a standard that was essentially host-managed device coherency. And then as we moved to CXL 3.1, which is the latest released iteration of the spec that came out back in November of 2023, it's really focused now on fabric. So we kind of went from switching, to multi-level switching and enhanced coherency, and now to fabric. So there are all these features in there that are really intended to enable this vision of all these CXL devices interconnected into a fabric where all the memory on every device can be shared by anyone in the system that wants it. And that's a pretty powerful concept.
So if we look at this slide, it's just showing one kind of conceptual use case. And this has now been productized quite a bit, as recently as FMS 2024, which just happened. There were at least four vendors that I could see who were showcasing memory expansion cards or devices at that show. That included Marvell, Micron using Microchip controllers, Samsung, and SK Hynix. And this is an illustration here of one of the benefits of doing that. You might just say, hey, I have an SoC, I'm used to using DDR, why don't I just put more DDR controllers on here? I can always get more memory that way. And yes, you can, but there are some inherent limitations. One, these are very wide interfaces for DDR. They also tend to need to be the same kind, so you usually can't have a mixture on the same ASIC of DDR5, DDR4, and LPDDR, whatever you might want. And again, it's a wide interface, so it's a lot of wires, and it gets a little hard to manage. So if you move to something that's a CXL-enabled memory or storage device, which can also be an option, you open up a lot of possibilities. So instead of all these DDR controllers, you include these CXL interfaces on your ASIC. And you can see here I've still got the DDR memory, but instead of putting more, I've added all these CXL interfaces. And now this is just a standard serial interface. So if I have a controller on the other side that understands CXL, it can be anything I want. It could be DDR, it could be some kind of flash memory, it could be any kind of persistent storage, today or from the future, and these can communicate over CXL. And so it gives you this capability. And it even gives you another neat capability that some of our customers have leveraged: if you have these CXL interfaces on your ASIC, yes, you could connect to a memory extender, but you could also connect to another device using CXL to go from an ASIC to an ASIC. So you could multipurpose these connections. So this is very powerful and gives us a lot of new and exciting memory options.
So I mentioned that I'd look at this briefly. You've probably seen a slide like this; it's trying to show the concept of the memory sharing and memory pooling that was introduced in CXL 2 and 3. And here we're showing a CXL 2 or CXL 3 switch. I've adopted the nomenclature of 3.x. Already we've had 3.0 and 3.1, and there are lots of ECRs that are going to become ECNs that will probably roll into a CXL 3.2, so from this standpoint I'm calling it CXL 3.x. But what you can see here is we've got multiple hosts across the top of the switch, and then below the switch, all these different devices. And these could be any kind of devices, and they've all got memory on board. And the neat thing that CXL enables here is two kinds of memory efficiency improvements. One is what we call memory pooling, and you can look at that just as the color coding here. Device one, for example, has got some of its memory that's being used by host one, and some of it here is allocated to a couple of other hosts. Device two has got a mixture: it's got allocations to host one and allocations to host three, and so on. So the point is, with pooled memory you've taken your whole pool of memory and you've allocated some of it to this host and some of it to that host. The shared memory, in contrast, is where I've actually taken memory that's on one of these devices and said, you know what, this memory is going to be shared coherently among multiple hosts. So this host in the picture, host three, has got the shared-one copy from here and the shared-two copy from here. Host N over here has also got the shared-one copy from here and the shared-three copy. These are made-up examples, but you can see this not only allows us to have memory pools, but also to share the memory. So when you put these concepts together with the multi-level switching and the fabric, which includes the Global Fabric Manager and the ability to have all these different devices connected, not just in a switch, but in these fabric arrays, you have the capability now to be highly efficient in your deployment of memory. So not only can I put more memory on my SoC or my device, but I can theoretically get much closer to 100% utilization of that memory. Some companies have done studies where they've looked at this problem and found a shocking lack of use of memory in some of the servers. Because, as I said on one of the earlier slides, when you have all the memory tightly coupled and allocated to a chip or to a module, that memory is only available to that device. And so if that device is only using 20% of that memory, guess what? 80% of that memory is going unutilized in the system. So this kind of sharing and pooling allows us to take memory wherever it's available and use it wherever it's needed. So it becomes very efficient.
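As a rough illustration of the utilization argument, here is a toy Python model comparing memory hard-wired to each host against a CXL-style pool allocated on demand. All capacities and demands are made-up numbers; this is not a model of any particular device or study.

```python
# Toy model of the utilization argument above: fixed per-host memory vs. a
# CXL-style memory pool allocated on demand. Capacities and demands are
# invented for illustration only.

hosts_demand_gb = [20, 35, 80, 10]        # what each host actually needs right now
fixed_per_host_gb = 100                   # memory hard-wired to each host (e.g. local DDR)

fixed_capacity = fixed_per_host_gb * len(hosts_demand_gb)
fixed_used = sum(min(d, fixed_per_host_gb) for d in hosts_demand_gb)
print(f"Fixed:  {fixed_used}/{fixed_capacity} GB in use "
      f"-> {fixed_used / fixed_capacity:.0%} utilization")

# With pooling, capacity sits behind a CXL switch and any host can be given
# any free region, so the pool only has to cover the sum of the demands
# (plus headroom), not each host's worst case individually.
pool_capacity_gb = 200
pool_used = sum(hosts_demand_gb)
print(f"Pooled: {pool_used}/{pool_capacity_gb} GB in use "
      f"-> {pool_used / pool_capacity_gb:.0%} utilization")
```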
Now, I wanted to share another view here that I think is going to be really interesting, and it's getting a lot of press. You've certainly heard about all the excitement and buzz around AI. Maybe you're fortunate enough to have bought NVIDIA stock a while ago and you've made a ton of money. They've got their proprietary NVLink, and there's another thing that's been announced here called UALink, which is Ultra Accelerator Link. The point is, these are standards to allow accelerators to talk to each other in a very efficient way. So typically, in these systems, we'll have multiple CPUs, and we'll have switches or direct CPU-to-accelerator connections. These will typically be PCIe; they could be CXL. But they connect to the accelerators. And then we've got some kind of connection among all these accelerators, through switches or directly, and that's through something like UALink or, in the case of NVIDIA, NVLink. But now, the really exciting thing is that even this has some limitations with respect to memory, because these accelerator chips have the same problem that I alluded to early in this discussion: they have a certain amount of memory. Usually it's HBM memory, it's probably in a multi-chip module, and it's associated with this accelerator. So this accelerator has a certain amount of memory, this accelerator has a certain amount of memory, and so on. So what if you can use CXL, the way I showed in the previous slide, to basically come off of these accelerator chips using a CXL interface to some kind of a CXL memory device? I get two benefits. I get more memory per accelerator. And now, because of the way CXL does memory sharing, memory pooling, multi-level switching, and all the other kinds of things that are possible here, I potentially get much better utilization of available memory. So in these accelerators, if I've got memory that's just fixed with each accelerator, I have the same problem: I may be utilizing only a small percentage of it instead of nearly 100%. So this enables a much more efficient utilization of memory.
So now, segueing into how we can help enable customers to build some of these exciting devices: this is an old slide. I just put it in here to kind of put a benchmark on when we started this. At Synopsys, we started our CXL IP development with an announcement in 2019 of a complete CXL IP solution, which meant PHY, controller, verification IP, and even an IPK, which is our IP Prototyping Kit. So all of this was available; you can see the press release we did here. We were able to get this kind of jumpstart on CXL because of a close relationship that we have with Intel. We started doing this when it was called Intel Accelerator Link, and it was only at version 0.7. So it allowed us to get a head start over our competitors, and we developed a complete solution here. This was at the time of CXL 1.1, which quickly went to CXL 2.0.
So if you look at the Synopsys offering for CXL, we now have 2.0 and 3.0. I often call this 3.x, again, because of 3.1 and the probably upcoming 3.2. So we're compliant with the CXL 3.1, 2.0, and, of course, the backward-compatible CXL 1.1 specifications, with the cache coherency and all the features that are in there. And we also support IDE. I have one slide to show this in a little more detail, but we've got this as a block in here, so if you're building something and you need security, we can support that. We support device, host, and switch port modes. That includes a dual mode, which allows you to switch between device and host within the same ASIC. So some of our customers will build a chip and then use that chip for DUT-to-DUT testing, where one side is programmed as host and the other side as device, and they can create a complete CXL link there. This was built around our CXL.io, which is our PCI Express 5 controller or, in the case of CXL 3, our PCI Express 6 controller, so we're leveraging proven technology there. We've added a 1,024-bit data path version so that we can support a 64 GT/s PIPE with 16 lanes, which requires 1,024 bits. This was introduced for the Gen 6 version, which is CXL 3, the highest speed. We have configurable device type support for type 1, 2, or 3, and this is configurable at compile time or at runtime. So you can build a type 2 device, and you can make it a type 1, type 2, or type 3 depending on how you advertise it at reset. We have some interface flexibility that I'll show in another slide, using our Synopsys native interface, which just implements the channel interfaces for CXL, or an AMBA interface on the .io side and a CXS interface on the .cache/mem side. That supports things like CCIX over CXL; some of our customers that used to be CCIX customers wanted to retain some of the symmetric coherency when moving to CXL, and this allows them to do that. So I'll show what that looks like. It uses our silicon-proven PHYs. For PCIe 5, I'm not sure of the exact number of foundries we're supporting; it's 15 or 16 or so. For PCIe 6, it's rapidly growing; we've got multiple foundries supported for the PCIe 6 PHYs already. And we've got major host CPUs, including Intel Sapphire Rapids, that we've already proven CXL and PCI Express interoperability with, and we've also taken it to the workshops, et cetera. We also have some unique features that some of our customers that are building ARM-based designs are really excited about. This includes supporting their Local Translation Interface (LTI), which is a way to move the translations out of the data path so that they don't slow down the overall throughput, and the MSI-GIC, which, similarly, moves the interrupts to a separate interface that connects to the ARM Generic Interrupt Controller. And then we also support, more recently, this whole concept of the ARM Trusted Execution Environment, which is the host side for a TDISP-like implementation, which is the ARM Confidential Compute Architecture, or CCA as they call it. And we've done some of the industry's first two-party interops and public compliance demos; I'll show that on an upcoming slide. We also have the verification IP that we ship along with the controller, and the CXL IP prototyping kit.
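As a quick sanity check on the 1,024-bit data path mentioned above, here is a small back-of-the-envelope calculation in Python: raw line rate per lane times lane count, divided by the controller clock, gives the bits the controller has to move every cycle. It ignores encoding, framing, and FEC overheads, and assumes the roughly 1 GHz controller clock mentioned later in the talk.

```python
# Back-of-the-envelope check of the datapath widths mentioned above: raw line
# rate per lane times lane count, divided by the controller clock, gives the
# bits the application interface must move per cycle (overheads ignored).

def datapath_bits(gt_per_s: float, lanes: int, core_clock_hz: float) -> float:
    raw_bits_per_s = gt_per_s * 1e9 * lanes
    return raw_bits_per_s / core_clock_hz

print(datapath_bits(32, 16, 1e9))   # 512.0  -> 512-bit datapath for 32 GT/s x16
print(datapath_bits(64, 16, 1e9))   # 1024.0 -> 1,024-bit datapath for 64 GT/s x16
print(datapath_bits(64, 8, 1e9))    # 512.0  -> narrower links need proportionally less
```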
So I'm not going to go into all the details on this, and frankly, I'm not the security expert; we have a team at Synopsys who is. But we have a fully integrated and verified security capability here that implements the CXL 2 or CXL 3 IDE security, and we do that in a way that's flexible and programmable. So you get the IDE module and you set it up the way you want. You can say, hey, I only want to encrypt the CXL.io traffic, for example; you can do that. Or you can say, I want to encrypt only the CXL.cache/mem traffic; you can do that. Or you can encrypt all of it, and you can choose to go that way. It includes FIPS 140-3 certification. There's in-order bypass mode support and efficient key refresh, and then some of these other things I won't go into, but containment and skid modes are supported, in lots of different data bus widths to match the controller data bus width, which is important. OK. So what we do with this is we pre-verify it: we take the controller, we take the IDE module, we put them together, we build the verification environment, and we verify everything. And that happens before it's delivered to you. Specifically for CXL, and you'll see this in an upcoming slide, we have support for the CXL version of what's called TDISP in PCI Express; for CXL it's called TSP.
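To show the kind of per-traffic-class choice being described, here is a minimal sketch in Python. The enum, names, and function are hypothetical, invented for this example; they are not the Synopsys IDE module's actual configuration or programming interface.

```python
# Illustrative sketch of the per-stream IDE choice described above: protect
# only CXL.io, only CXL.cache/mem, or all traffic. Names are hypothetical,
# not the real programming interface.

from enum import Flag

class IdeScope(Flag):
    NONE = 0
    CXL_IO = 1         # encrypt/authenticate CXL.io (PCIe-style) traffic
    CXL_CACHEMEM = 2   # encrypt/authenticate CXL.cache and CXL.mem flits
    ALL = CXL_IO | CXL_CACHEMEM

def describe(scope: IdeScope) -> str:
    if scope is IdeScope.NONE:
        return "IDE disabled: all traffic in the clear"
    parts = [name for name, bit in [("CXL.io", IdeScope.CXL_IO),
                                    ("CXL.cache/mem", IdeScope.CXL_CACHEMEM)]
             if bit in scope]
    return "IDE protecting: " + " + ".join(parts)

print(describe(IdeScope.CXL_IO))
print(describe(IdeScope.ALL))
```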
OK, I mentioned the interfaces we support; here's a diagram showing what that looks like. For the native interface, over here we're implementing our standard PCI Express interface, which is now called CXL.io, and that's described in our data book; some people listening to this may be familiar with that. We also implement CXL.cache and CXL.mem as channel interfaces. So if you look at the spec, there are all these different channels defined, and we implement them pretty much right as they're defined in the spec. And then, of course, down on the bottom, we connect to a Synopsys Gen 5 or Gen 6 PHY through the PIPE interface. On the right-hand side, we're showing a different implementation. This originally was targeted towards customers using something like an ARM coherent mesh network, CMN-700 or something like that, where they use a CXS interface on the cache/mem side and, typically, AXI on the CXL.io, or PCI Express, side. So this combination of CXS and AXI plugs directly into one of these CMN blocks, and then the conversion inside the coherent mesh network is made to support CHI, et cetera. This is a pretty cool implementation, and a lot of our customers like it if they're using ARM. We've had some customers decide to use this even if they're not using ARM, and the reason is that CXS is a generic, credited streaming interface. So when it's implemented this way, we kind of don't care what's coming across this interface anymore; we're no longer doing the flit packing and unpacking, which happens either here in the CMN or up in the customer's logic. So it makes the application more difficult if you're implementing your own logic to do this using CXS, but you do get this simplified interface, and you do get a little bit lower latency in the controller, although you have to make up that latency up here. Now, we have a couple of other capabilities with the right-hand side here. I mentioned CCIX over CXL. Since this is just a streaming interface, we don't really care what's in these flits. You can take the data in the form of the CCIX protocol and pack it into these flits, and it comes across here; we just receive it and pass it along, and as long as the other end of the link understands what you're doing there, this works fine. We've had customers implement this for something like CCIX 2.0, or CCIX over CXL. We also allow mixing and matching here, so these are not the only two options. You can take the AXI bridge on the CXL.io side together with the native cache/mem interface, which looks like the left half of the left picture, or you can flip that around: you can take CXS on the cache/mem side directly with the native interface on the PCIe side. So that allows some additional flexibility, depending on what customers are actually trying to implement.
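To summarize the mix-and-match options just described in one place, here is a small sketch. The labels are informal shorthand for this example only, not product configuration parameter names.

```python
# Quick summary, as code, of the interface mix-and-match options described
# above. Labels are informal shorthand, not product configuration names.

SUPPORTED_COMBOS = [
    # (CXL.io side,        CXL.cache/.mem side)
    ("native (data book)", "native channel interfaces"),  # left-hand picture
    ("AXI bridge",         "CXS streaming"),              # right-hand picture, e.g. into a CMN
    ("AXI bridge",         "native channel interfaces"),  # mix: AMBA on .io, native cache/mem
    ("native (data book)", "CXS streaming"),              # mix: native .io, CXS cache/mem
]

for io_side, cachemem_side in SUPPORTED_COMBOS:
    print(f"CXL.io: {io_side:<20} CXL.cache/.mem: {cachemem_side}")
```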
I wanted to mention a couple of the features we're supporting for advanced ARM designs too, because I think they're unique to Synopsys. For ARM-based systems, there are things like SBSA, the Server Base System Architecture, and there's ECAM, the Enhanced Configuration Access Mechanism, defined in there, so we have a special implementation for that. We support a couple of features that were defined in the AMBA AXI spec, one called Unique ID and one called Read Data Chunking. Also, I already mentioned the LTI and MSI-GIC. All right. So these are already supported in both PCI Express and in CXL. For CXL, it's really part of the AXI bridge, so it's part of the CXL.io side, but it is supported there, and we have customers using that in the CXL-based controllers. And then we also support the ARM Confidential Compute Architecture, which has a lot of features. I won't go into these, but it's a bunch of advances in LTI and DTI and AXI in order to support the Confidential Compute Architecture, which is their trusted execution environment. So this is, again, on the CXL.io side. It's also in our PCIe 6 and PCIe 7 controllers, and it's here for CXL 3 or CXL 2 as well.
So one of the things we focus on a lot at Synopsys, because we're very early to the market with our solutions, is doing interop testing very comprehensively. Well before we can take something to the PCI-SIG integrators list, before we can take it to the compliance workshops, we're interoperating with anyone we can. We've done this several times through the generations: we've done it for Gen 4, we've done it for Gen 5. Now for Gen 6, if you're following what's happening, so far there's only been a pre-FYI compliance event for PCIe 6.0. That means it's not even FYI yet, which is the stage where we're working out the kinks with the test methodology, the test equipment, and the actual IP, and it's not the real one that gets you on the integrators list. So there is no 6.0 integrators list. So far today, what we've done at Synopsys is we've taken our controllers, our root complex and endpoint, and we've taken our PHYs, and we've gone through PCIe 5.0 compliance using the 6.0 solution to get on the integrators list. So we've done that already. We've also gone to the pre-FYI compliance events. We're not allowed to talk about details of who we've interoperated with or anything, but I can say this: so far at the pre-FYI compliance events, Synopsys is the only PCIe 6.x host that's been available. So everybody that comes there trying to work out the kinks and all that is plugging into the Synopsys host. And this is something that we typically see every generation, because we're first with the host; we're one of the first with the actual standard compliance. And the reason I'm mentioning this, even though it's a PCI Express thing, is because this is an inherent part of CXL: CXL has to support PCI Express. In the case of CXL 2, it has to support PCIe 5, and CXL 3 has to support PCIe 6. We are the only PCIe 6 system that we're aware of that's interoperated with the Intel Gen 6 test chip; we did that at the Intel Innovation Forum last year. And, as I mentioned, we've passed PCIe 5 compliance, and we've included our CXL 2 solution in that. So if you think about CXL, people have heard of the CXL integrators list, which is a new thing. But the CXL integrators list has a series of tests that test the CXL protocol; they don't test PCI Express at all. So even though you have to support PCI Express, that still requires you to be on the PCI Express integrators list. So we're on both.
And in terms of the compliance interop we've done, we have this system; I'm showing a block diagram here. We've talked to Intel Sapphire Rapids, and we've done protocol negotiation, cache compliance algorithm 1A, and a bunch of other things. And we've done this as our own end-to-end link with all Synopsys IP. And then we went beyond that a couple of years ago at SC22, the Supercomputing show, in November. We took the same kind of system, but with a LeCroy exerciser and analyzer. So there we could not only exercise the compliance that's built into the spec, but also look at the link layer compliance tests as confirmed by the LeCroy.
Here I'm just showing the results of this for CXL. There's not much on the integrators list yet, but I'm showing our two entries here. We're the only one that's a host and type 2 from an IP provider. And as I mentioned, we're already on the 5.0 integrators list. So this is an important proof point for customers implementing these systems.
We also did the first 64-gig link-up with a third party. As I mentioned, we do these things early because there's very little IP out there, and one of the first things we could do was link up with Keysight. Then over here on the right, we linked up with Intel recently with our PCIe 6 chip. And then I have in the middle here a CXL proof point, which is cool: it's the CXL 2 switch chip from XConn. They're a customer of ours, and they've announced that they're also going with us for their complete CXL 3 solution as well. So they're providing some of this infrastructure that I showed in that previous slide that allows you to build these really cool systems.
Then on the PCIe 6 side, we've got a snapshot here of some of the integrators list entries. We've got four more now that we're on there for PCIe 6.0.1. PCIe 6 has evolved as well, as you may know: 6.0, 6.0.1, now 6.1 and 6.2. So we're compliant all the way up to 6.2 with the various errata, et cetera.
OK, so I want to just say something about CXL 3 and 3.1 feature support at Synopsys, and then I'll wrap up with the summary.
So there are a lot of features introduced in CXL 3. We're supporting architectures that allow us to go from native x16 to x8, x4, and x2. As I mentioned, we have the native CXL interface support and the AXI-plus-CXS interface support, standard flit support, and latency-optimized flit support, which, if you follow the spec, is kind of the renaming of what was originally going to be a 128-byte flit; they decided instead to put two 128-byte halves into a 256-byte flit and call it the 256-byte latency-optimized flit. There's device, host, and dual-mode support, and switch support. For CXL, switch support gets more complicated because there's hierarchy-based routing, which is basically the multiple levels of switches (we saw a single-level switch in an earlier diagram), and then there's port-based routing, which is really what you need for these fabrics, where everybody gets a number and you can just send these flits wherever they need to go. We support the Back-Invalidate and cache-scaling ECNs and multiple logical devices, and now we're supporting the low-power mode L0p, which is a PCIe 6 construct, in CXL 3 mode. We also have PBR support for the endpoint, which is required for the Global Fabric Manager, and IDE for CXL 3 on the native interface as well as on the CXS interface, so you can get the IDE support for either one of those. I want to mention this last one because it can get really confusing. There is something called UIO Direct Peer-to-Peer to host-managed device memory, defined in 3.0, that requires UIO support for it to work; that's the way it was defined. So this means that when you're licensing IP from Synopsys or anyone else, you have to make sure the IP supports Unordered IO, which was an ECN to the PCIe 6 spec, or you won't be able to do this function for CXL 3.
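For a feel of why the latency-optimized flit helps, here is a rough Python calculation of how long it takes to accumulate a full 256-byte flit versus a 128-byte half at 64 GT/s on a x16 link. It ignores encoding, framing, FEC, and CRC overheads, so treat it as an order-of-magnitude sketch only.

```python
# Rough arithmetic behind the latency-optimized flit mentioned above: at a
# given raw link rate, a receiver can start consuming a 128-byte half-flit
# (with its own CRC) in about half the time it takes to accumulate a full
# 256-byte flit. Encoding, framing, FEC, and CRC overheads are ignored.

def flit_accumulation_ns(flit_bytes: int, gt_per_s: float, lanes: int) -> float:
    raw_bits_per_ns = gt_per_s * lanes  # GT/s * lanes = raw bits per nanosecond
    return flit_bytes * 8 / raw_bits_per_ns

for label, size in [("standard 256B flit", 256), ("latency-optimized 128B half", 128)]:
    t = flit_accumulation_ns(size, 64, 16)   # CXL 3 at 64 GT/s, x16
    print(f"{label}: {t:.2f} ns to accumulate on the wire")
```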
Then in CXL 3.1, there's a lot of stuff defined in there. Most of it doesn't really touch the controller at the RTL level; it's more up in the application or even in the firmware and software. So what we've implemented is support for the applicable errata; we have to implement that so we can say that we're 3.1, and for 3.2, when that comes out, we'll do the same. There's something called Direct Peer-to-Peer to CXL.mem. This is a peer-to-peer mechanism from memory to memory.
Whereas the other one that I mentioned is for the CXL.io side, which is why it requires UIO.
Then this TSP that I mentioned is similar to TDISP. It was defined in CXL 3.1, and it's already supported in our controller. And then there are two more features being supported in our controller: extended metadata trailers and header logging. And just one other note, kind of a quirk here: when they went from 3.0 to 3.1, the port-based routing support now specifically calls out that it must support UIO traffic. In 3.0, it didn't say that, so you could say it was ambiguous, but you could technically have PBR support in your solution without having support for UIO traffic. Now, as defined in CXL 3.1, you do have to have that specifically, and we do support that.
So now looking at a summary: we have the broadest CXL IP portfolio available in the industry, with best-in-class performance. This includes complete solutions: controller, verification IP, IP prototyping kits, and IDE security solutions for both CXL 2.0 and 3.x. We've got the PHYs now for 64 gig and 32 gig, and those are proven in silicon. We were an early contributor to CXL, giving us a jumpstart, which is why we're so far ahead here. So if you look at the number of licenses: 170 CXL licenses, over 450 for PCIe 5, and we're now actually already at 90 PCIe 6 licenses, including 40 for CXL 3.0 and 3.1. So this is really an indication of how quickly this stuff is rolling out, and in this number here, I'm counting the CXL 3.0 licenses, since they have to support PCIe 6. We've got multiple licenses for CXL 3.0, including IDE. We've got lots of customer proof points now for our one-gigahertz timing closure at 32 gig, which is a 32-bit PIPE, and at 64 gig using a 64-bit PIPE. I mentioned that we have switch port support, both for hierarchy-based routing and port-based routing. There's an optimal data path architecture, as I mentioned, from 128 bits all the way up to 1,024 bits. And then we have the coreConsultant tool; if you're familiar with Synopsys cores, you know you have this ability to go to your desktop, configure the core, try things out, simulate, get results, do rapid comparisons of what works best, and make tweaks to your configuration. We've got very advanced debug, error injection, and statistics features that have been part of our controllers for a long time, and we've added CXL extensions to those to make them even more useful. There are the low-latency interfaces, the native interface as well as AMBA, and then the advanced ARM features that I mentioned.
So that about wraps it up. If you want to find out more about the Synopsys CXL IP solutions, please contact your Synopsys sales rep or FAE. And if you're not sure who that is, or if you'd like to contact me, you're welcome to. My name again is Gary Ruggles, and you can reach me at ruggles at synopsys.com. So that's it. Thank you very much for watching.