Very good. Good to be here. A wonderful gathering at the OCP Summit. So Peipei talked about algorithms and the tricks of the trade for using heterogeneous computing, and perhaps managing execution time for communication versus computation.
Let's talk about some of the systems that we can build to accommodate that. All in all, we are here at OCP. I've been part of the CXL Consortium for a while, so I'm here to tell you that, hey, CXL is a good technology, but OCP is a method to realize those things in systems. And we can take advantage of photonics as an emerging interconnect for dense systems.
There are a lot of discussions going on today, and I'm not going to be able to cover all of them. But what we're going to talk about today is the fact that systems are getting complex. Individual compute elements cannot grow as much as we want anymore, therefore we need to have many of them. Since we have many of them, interconnecting them is the key element that we need to optimize for. So all in all, data movement, and the energy that's consumed to do that data movement, is important. Sustainability and reducing carbon footprint are important, so reducing the power for moving data is important as well.
The systems that we are building, we need to be able to manage them. It just happens that when we have specialized compute, that specialized compute needs to be flexible; we need to provide some flexibility for it. Whereas when we have general-purpose compute, the general-purpose compute needs to be balanced so that it can support multiple workloads. So those are the main topics here.
A little pictogram here shows that if we are using general-purpose compute, the volume of such a thing can be very large, but it cannot support all types of workloads and the special needs that we have. Throughout yesterday and today, you have seen that in artificial intelligence, different models targeting different workloads require different capabilities. Some of them require more bandwidth and memory, some require more computation and bandwidth, some require more bandwidth to storage. Since these systems are very expensive, as a system architect it is better if we can use one particular technique and apply it in multiple places, so a semblance of general-purposeness will help us here as well. We need to be flexible where things are specialized, but we can be general-purpose where things are high volume.
So we have talked about compute; I don't need to dwell on that anymore. But memory is very important. Peipei told us that we don't use all of the compute that we buy, and a lot of the time that's because we don't have enough memory. So one of the things we'll talk about today is a method to use a little bit more memory. And as we have disaggregated compute, we need to have interconnects. Infrastructure is also, of course, important. So we'll talk about some of those things.
So for scale, what is scale? Scale is distance. It is power and cooling. Things are in different chassis, and management, power and cooling, fault modes, and power zones are all important aspects of that. These are all challenges that we need to deal with. All in all, if we reduce the energy consumed for data movement, we reduce time and we reduce power, and that's good for everybody.
So it is simple to say that we need computing elements, we need memory, we need storage; the rest of it is interconnect. I keep saying that because interconnect is the important aspect of what we're doing today. For the interconnect, point-to-point links, topologies, and congestion in moving data through switches are all important. And if we have simple switches, circuit switches versus packet switches, the time involved in moving data is important.
So a challenge to us has been: build a system that can deliver an exaFLOPS of compute. The first question is, what is a FLOPS in that regard? Is it, in fact, double-precision floating point? Certain applications do need that, but a lot of AI does not, so people have come up with clever ways of packing many floating-point values into the same number of bits. The resolution, accuracy, and dynamic range of such numbers need to be applicable to the application. So if we want to do one exaFLOPS, we can come up with an element that can do four petaFLOPS, but then we need 256 of them. How do we build a system with 256 XPUs, GPUs, TPUs, IPUs? Well, one organization that seems to be common is to do things eight at a time. OCP OAI, Open Accelerator Infrastructure, talks about a UBB, a universal base board, that has eight compute elements on it. So let's say each one of those can do four petaFLOPS. In that case, if we have 32 nodes of eight XPUs each, we can put them all into several racks and call that done. So 256 GPUs. How do we connect them?
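As a back-of-the-envelope check of that arithmetic, here is a minimal sketch in Python; the four petaFLOPS per XPU and the eight-wide board grouping are just the illustrative figures from above, not a specific product.

```python
# Back-of-the-envelope sizing for a ~1 exaFLOPS system.
# The 4 petaFLOPS per XPU and 8 XPUs per board are illustrative assumptions.
TARGET_FLOPS = 1e18      # one exaFLOPS (precision left unspecified on purpose)
XPU_FLOPS = 4e15         # assume ~4 petaFLOPS per XPU at the chosen precision
XPUS_PER_UBB = 8         # one OAI universal base board carries 8 compute elements

xpus_needed = TARGET_FLOPS / XPU_FLOPS     # 250; round up to a power of two
xpus_deployed = 256
ubb_nodes = xpus_deployed // XPUS_PER_UBB  # 32 nodes of 8 XPUs each

print(f"XPUs strictly needed: {xpus_needed:.0f}, deployed: {xpus_deployed}")
print(f"UBB nodes of {XPUS_PER_UBB} XPUs: {ubb_nodes}")
```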
So that was compute, and compute is not complete without memory. How much memory can we have locally? A lot of these XPUs have local SRAM for caching, but then they use high bandwidth memory for a little bit more capacity and a lot of bandwidth. That's still not enough capacity, so they need locally connected DRAM as well, and the DDR bus is a good approach for that. So, ballpark: if we start with the kinds of things that are available in the technology marketplace, using 24-gigabit memory technology, one can build eight-high HBM stacks and provide 96 gigabytes of memory locally attached to an XPU or GPU. Then, with not much effort, one can build three quarters of a terabyte of DRAM connected to DDR buses, and there are multiple ways of doing that: LPDDR is a good model, DRAM on DDR5 is another good model. So a number of DIMMs or a number of packages will give you that much memory. Another topic that I will highlight today is somewhat new and is enabled through CXL: we can have a much larger pool of memory, so that one XPU or GPU can have access to two terabytes of extra memory. Not only that, that kind of memory can be shared among other XPUs and GPUs as well. So those are the types of topics that we'll go into.
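A minimal sketch of those per-XPU memory tiers, using the capacities mentioned above; how the HBM is split across stacks and how the DDR tier is populated are assumptions for illustration only.

```python
# Illustrative per-XPU memory tiers from the talk. Only the tier capacities
# come from the numbers above; the stack/DIMM breakdown is not specified.
GB = 10**9

tiers = {
    "HBM (local, highest bandwidth)":  96 * GB,
    "DDR5/LPDDR (locally attached)":  768 * GB,   # ~three quarters of a terabyte
    "CXL-attached pool (shareable)": 2000 * GB,   # ~2 TB reachable over CXL
}

for name, capacity in tiers.items():
    print(f"{name:35s} {capacity / GB:7.0f} GB")
print(f"{'Total reachable per XPU':35s} {sum(tiers.values()) / GB:7.0f} GB")
```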
So, the enabling technologies: for the DDR5 bus, as you know, it is 64 bits plus the check bits for ECC. CXL can go over UCIe, or CXL can go over PCIe; those are protocols that can run on those physical layers. HBM is wonderful: it's very wide, over 1,000 bits, 1024 bits, and at 8 gigabits per second per pin you can get 1 terabyte per second of bandwidth through HBM. And of course, InfiniBand and NVLink are serial versions of interconnect that are available to us.
Now, this is an eye chart. It just maps the amount of bandwidth that we can get out of these technologies, and I'd like to highlight several points. One, some of these technologies are half-duplex: you can read or write, but not both at the same time. The DDR bus is one of those examples; UCIe is another example. Some of these interconnect technologies are full duplex: you can transmit and receive in parallel. So in what we do, sometimes we have some marketing numbers in our technical data. We talk about peak data rates, we talk about aggregate read/write or transmit/receive, and sometimes we talk about only read. So we need to be careful about how we position bandwidth here. These numbers are all peak bandwidths, not sustained bandwidth, not goodput, not throughput, just available peak bandwidth, and they're just good enough for comparing.

So you see, HBM can give us 1 terabyte per second of data. If I wanted to do that with UCIe, we can use advanced packaging and match the bandwidth that HBM might need. If I wanted to do that with PCIe or CXL, I need to be clever. We're engineers: we don't say it cannot be done, we say it can be done, but it needs some tricks. So what are the things that we can do with CXL to provide the same 1 terabyte per second of bandwidth? Well, CXL allows multiple lanes to be ganged up, so we can have a x16 link, and using CXL 3.0 we can run it at 64 gigabits per second per lane. With CXL we can also interleave: we can grab four channels of x16 and interleave them together, and that's how you get to 1 terabyte per second of bandwidth for connecting to memory. And the memory could be HBM; if you wish to have HBM off of CXL, you can do that.

As another example, everybody is familiar with NVLink. NVLink uses a higher bit rate per lane; in one example, if we have 18 of those running at 100 gigabits per second, we can get to 900 gigabytes per second. So all of these numbers are similar, all around 1 terabyte per second. But I also wanted to highlight the fact that one DDR5 bus by itself can give us shy of 50 gigabytes per second, so you need very many DDR buses to get to 1 terabyte per second of bandwidth.
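Here is a minimal sketch of that peak-bandwidth comparison; the widths and signaling rates are the ballpark figures quoted above, the DDR5-5600 data rate is my assumption for "shy of 50 gigabytes per second," and encoding overhead is ignored throughout.

```python
# Peak-bandwidth comparison for the interconnects above. These are raw peak
# numbers for comparison only, not sustained goodput.

def GBps(width_lanes_or_bits, gbps_per_lane):
    """Peak bandwidth in GB/s from a width and a per-lane (or per-pin) rate."""
    return width_lanes_or_bits * gbps_per_lane / 8

hbm = GBps(1024, 8)        # 1024-bit HBM interface at 8 Gb/s per pin -> ~1 TB/s
cxl_x16 = GBps(16, 64)     # CXL 3.0 x16 at 64 GT/s -> ~128 GB/s per direction
cxl_4way = 4 * cxl_x16     # four x16 channels interleaved -> ~512 GB/s per
                           # direction (~1 TB/s if TX and RX are summed)
ddr5 = GBps(64, 5.6)       # one 64-bit DDR5-5600 channel, read or write

print(f"HBM stack:                 {hbm:6.0f} GB/s")
print(f"CXL x16 @ 64 GT/s:         {cxl_x16:6.0f} GB/s per direction")
print(f"4-way interleaved CXL:     {cxl_4way:6.0f} GB/s per direction")
print(f"One DDR5-5600 channel:     {ddr5:6.1f} GB/s")
print(f"DDR5 channels for 1 TB/s:  {1000 / ddr5:.0f}")
```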
So Peipei talked about communication and computation, and the fact that the software folks can be very clever: if you can line up the time it takes to compute with the time it takes to move data, you can hide some of these latency bubbles. So just-in-time delivery can be a technique to increase utilization. If the GPU has to wait on memory, that's time you're not using all of the expensive GPU that you have. So if you move data at the right time through the right interconnect, you get there.
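A minimal sketch of that just-in-time idea, assuming a simple double-buffered pipeline; the fetch and compute functions are placeholders, not a particular runtime's API.

```python
# Prefetch the next tile of data while the current tile is being computed on,
# so the accelerator is not idle waiting on memory.
from concurrent.futures import ThreadPoolExecutor

def fetch(tile_id: int) -> str:
    """Stand-in for a DMA or CXL read of one tile of input data."""
    return f"tile{tile_id}"

def compute(tile: str) -> str:
    """Stand-in for the kernel that consumes one tile."""
    return f"result({tile})"

def pipeline(num_tiles: int) -> list:
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch, 0)                 # prefetch the first tile
        for i in range(num_tiles):
            tile = pending.result()                   # wait only if the fetch is slower
            if i + 1 < num_tiles:
                pending = io.submit(fetch, i + 1)     # overlap next fetch with compute
            results.append(compute(tile))             # compute on the current tile
    return results

print(pipeline(4))
```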
Other attributes of connectivity, or interconnect, are: how many places can I go? Do I need a fat pipe with lots of bandwidth? Or does my application require that I go to multiple devices, so that they can be heterogeneous: GPU, FPGA, CPU, memory, storage, and such? Or can I use a fat pipe going between two switches? Those are the topics. Throughput, latency, and the physical attributes, such as signal integrity, are the important things. How many of these connections can I have on the edge of a die, the shoreline? How many can I have on the edge of a package? How many can I have on a PCIe card, for example? How many of these connections can I have on the faceplate of a chassis? These are all physical challenges of the demand for interconnect. And when we're talking about latency, there is hop count: we can do some of this through switches, but hop count counts against our latency goals.
On the throughput side, we can increase our capability in the signaling itself. There are pulse amplitude modulation techniques, and there is space-division multiplexing, which is what I talked about: you could have a x16 link, so you space out multiple lanes to get to bigger bandwidth. You can also increase the bit rate, or transmit rate, through multiple capabilities. Space-division multiplexing is also what the photonics folks are doing these days. The number of ports per package is also important. Again, I might not need lots of bandwidth going to multiple places. When I go from one switch to another switch, I need a fat pipe. But if I want to go from one switch or one device to multiple devices, each individual link need not have the full bandwidth. The aggregate matters, because I want to reduce the number of hops, but I need to have a point-to-point connection to every one of those devices.
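A small sketch of that trade-off under an assumed fixed shoreline budget: the same number of lanes can be spent on a few fat ports or on many thinner ports that reach more neighbors directly.

```python
# Throughput levers: lanes per port (space division), symbol rate, and bits
# per symbol (e.g. PAM4 carries 2 bits per symbol). The 64-lane shoreline
# budget is an illustrative assumption.

def port_GBps(lanes: int, gbaud: float, bits_per_symbol: int) -> float:
    """Peak one-direction bandwidth of a port, ignoring coding overhead."""
    return lanes * gbaud * bits_per_symbol / 8

shoreline_lanes = 64
fat = port_GBps(16, 32, 2)    # one x16 PAM4 port at 32 GBaud -> ~128 GB/s
thin = port_GBps(4, 32, 2)    # one x4 port, same signaling    -> ~32 GB/s

print(f"x16 ports: {shoreline_lanes // 16} ports x {fat:.0f} GB/s (few, fat pipes)")
print(f"x4 ports:  {shoreline_lanes // 4} ports x {thin:.0f} GB/s (more direct neighbors)")
```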
This is a simple pictogram showing that we could have tightly connected things in the form of a cluster, have multiple of those clusters, and connect them through a fat pipe. So this is a conversation about the different kinds of interconnect we could use: a fat pipe versus many tightly connected links. And it's not sufficient to talk about bandwidth; latency is a very important aspect. In latency, we have distance: there's the travel time for light. Light going through fiber takes about 5 nanoseconds for every meter, so if I'm talking about 20 meters, that's a lot of nanoseconds. If I go through a repeater, that's at least 10 nanoseconds in each direction. If I go through a switch, 35 to 50 nanoseconds in each direction. Every one of those counts against the things that you want to do; it reduces the utilization of your expensive GPU or XPU just because we are waiting. On the switches, packet switches are very flexible: traffic can go all over the place, so packet switches can do the job. A circuit switch, on the other hand, is just like a patch panel: it's not as flexible, but you do not incur the latency associated with packet switching.
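Those figures make it easy to add up a latency budget; here is a minimal sketch using the approximate numbers above.

```python
# Latency-budget sketch using the rough figures from the talk: ~5 ns per
# meter of fiber, ~10 ns per repeater crossing, ~35-50 ns per packet-switch
# hop, each counted one way.

def one_way_ns(meters: float, repeaters: int = 0, switch_hops: int = 0,
               ns_per_switch: float = 50.0) -> float:
    return 5.0 * meters + 10.0 * repeaters + ns_per_switch * switch_hops

# Example: 20 m of fiber, one repeater, and two switch hops in each direction.
one_way = one_way_ns(meters=20, repeaters=1, switch_hops=2)
print(f"one way: {one_way:.0f} ns, round trip: {2 * one_way:.0f} ns")
```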
So we did talk about heterogeneous computing. You can think about a number of devices, and these devices could be individual chips, or they could be individual chiplets in the same package. Either way, we need interconnect. The interconnect could be die to die, it could be on the package, it could be chip to chip, or it could be at the edge of a chassis. Each one of those could be copper. Copper normally is somewhat bulky, and if it is external, you have to be careful about noise immunity, robustness, and serviceability; because of all those things, copper interconnect gets bulky. Photonics interconnect can be longer, it can be smaller in structure, and you can have many of them. So those are the attributes that we can have. As part of the enabling technologies, you have a number of capabilities: UCIe, CXL, NVMe. All of those are good things, and within OCP we have a number of workgroups and work streams that are actively working on these topics. ODSA, Open Domain-Specific Architecture, is a die-to-die, on-package concept. DC-MHS, the Data Center Modular Hardware System, as part of several projects, is addressing capabilities for partitioning systems, interconnecting them through standard methods, and defining standard connectors and module dimensions. Open Accelerator Infrastructure is another workgroup that is defining larger systems, eight GPUs at a time or such, and the cooling and power aspects of that. And Extended Connectivity is the work stream, within several projects, that defines the requirements and then makes recommendations to CXL, NVMe, and SNIA on what those technologies need to do to support the use cases that the OCP teams have defined.
We recognize that for networking, we have public networks that need to collect data from the outside world, and a lot of times those need to be secure. Then we have private, back-end networks that can be a little bit more relaxed on certain aspects, such as noise immunity and robustness. Topologies are also an issue: point to point, star topologies. A torus is a good model for when you combine several compute elements, get the result, and send it on to the next stage; that can be done with some of these layered architectures, where each layer is mapped to a set of compute elements before data moves to the next element.
This is an example drawn from what OAI is centered around. We have multiple GPUs or FPGAs or compute elements. Individually, they can be connected to each other to form a fully connected mesh, but they can also be connected to a layer of switches so they can be expanded outward. This is the model that UBB and OAI present.
Once you build a chassis with, for example, eight GPUs, you can have multiple of these chassis connected to each other and build larger and larger systems. That's basically how you can get to 256 GPUs in one ensemble; depending on the number of switches you would like to have, you have a certain number of hops. This is an example where, using photonics, you could have basically two layers of switches, one layer of switches per module. Going from each GPU to any other GPU, you need to go through only two switch hops to get there.
And this is a pictogram trying to show you that, yes, we need a lot of interconnects. We need a lot of cables or connectors to be able to interconnect, eventually, 256 GPUs, eight GPUs at a time, or maybe 16 GPUs in a chassis at a time. But there are topology diagrams that can articulate that.
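A rough sketch of those counts, assuming eight GPUs per chassis and a full mesh between the chassis-level switches; the meshing choice is an assumption made only to show how quickly the cable count grows.

```python
# 256 GPUs grouped 8 to a chassis, each chassis with its own switch layer,
# any GPU reaching any other GPU in at most two switch hops.
TOTAL_GPUS = 256
GPUS_PER_CHASSIS = 8

chassis = TOTAL_GPUS // GPUS_PER_CHASSIS             # 32 chassis
max_switch_hops = 2                                  # local switch -> remote switch
inter_chassis_links = chassis * (chassis - 1) // 2   # 496 links for a full mesh

print(f"chassis: {chassis}, worst-case switch hops: {max_switch_hops}")
print(f"chassis-to-chassis cables (full mesh): {inter_chassis_links}")
```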
To summarize these things: we need a balanced architecture for connectivity, throughput, and latency, but deployment is very important, maintenance is important, and flexibility is important. We can have tightly connected clusters, connected to each other through fat pipes.