Good afternoon, everyone. Thank you so much for joining. Before I start, I would like to thank the moderators for giving me the best slot possible: right after lunch, when people are having a hard time digesting their food. I hope to make the content digestible as well. First, a quick introduction: I'm Pratik Mishra from AMD, and I talk storage. Today we will talk about the storage challenges in large-scale computing infrastructure. As the talks over the past couple of days have shown, storage is not just a storage issue. Even today, David's talk made the point well that it is not just a storage problem, it is also a compute problem.
And that is the intention of this talk: it is designed to be digestible for everyone. If you want to go into more detail, it would be great if you could catch up with me after the session or in the coming days. The talk is really meant for anyone to take away what the challenges are at the infrastructure level and for storage. So first we talk about the problem, and then about the different phases of the AI pipeline from a storage perspective: where the data lives, and what the compute and network challenges are in every phase, be it data ingestion, pre-processing, training, or emerging AI inference. I'll focus more on training than on inference, because my colleague Suresh will be talking about the storage challenges of emerging AI inference. Finally, we'll see how it all boils down to the end devices. I'll appreciate your questions, and please feel free to ask me easy ones, because you will see through the presentation that the view I present is highly simplified. Easy questions and complex questions are both welcome. So, let's go forward with it.
So, the problem with AI is usually presented not as the individual challenges of compute, network, and storage, but as a bunch of complex graphs; everyone does it that way, so I'll do the same going forward. As we all know, data is growing in all the V's of the world. We are ingesting massive volumes at high velocity: petabytes of data ingested at very high speed. There is text, audio, video, and sensor data from different sources and different data centers being ingested and processed at the same time, and a lot of accuracy-driven compute is needed to determine its value. AI is becoming more multi-modal, with all these different forms of data being ingested and processed together to extract business value, and the data is also becoming sparser. When I say sparser, I mean more difficult to compute on. The figure on your right is a theoretical calculation of the memory footprint just for holding the model states and activations, without doing any data processing. With the Llama 3 405B model we are at around the order of tens of terabytes just to hold these massive model states, parameters, and activations. Beyond that, the challenge of fitting the data in systems will get more and more intense, as you can see from the projections, and this is based on FP16-precision training. So the bigger point here is that models and data cannot fit in a single system, and no matter how many distributed technologies you come up with, there is a common notion that memory never outgrows the requirements of data, and that is where the challenges for storage come in. Throughout the deck, anything in bold red is the key message of the slide.
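To put a rough number on the model-state claim above, here is a back-of-the-envelope sketch. The 16-bytes-per-parameter figure is my illustrative assumption, borrowed from the commonly cited accounting for FP16 training with an Adam-style optimizer (FP16 weights and gradients plus FP32 master weights and two optimizer moments); it is not the exact model behind the slide, and activations come on top of it.

```python
# Rough estimate of training memory footprint for model states only.
# Assumes ~16 bytes/parameter (FP16 weights + gradients, FP32 master
# weights + two Adam moments); activations are NOT included.

def model_state_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """Approximate bytes needed just to hold model states."""
    return num_params * bytes_per_param

for name, params in [("Llama-3 405B", 405e9), ("1T params", 1e12), ("10T params", 1e13)]:
    tib = model_state_bytes(int(params)) / 2**40
    print(f"{name:>12}: ~{tib:,.1f} TiB of model states")
```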
So, what is the problem with building bigger systems while having less memory than we need? It's data movement. Why? First of all, to fit all that data, or rather just the models and activations, in memory, you need to build bigger and wider distributed systems, both scale-out and scale-up. Let's take an example. The figure is smaller than I expected, but consider dense AI servers with GPUs, NICs, and CPUs, connected to each other through a compute fabric; let's call it the back-end fabric, carrying the east-west traffic. The industry as a whole first tried to optimize data movement within the server, and then, rightfully so as we build these large clusters, there has been a lot of focus on optimizing the east-west traffic. That is: when you have data in memory across multiple distributed systems, how do you move it more efficiently? How do you do compute when a large amount of parallelism is involved, now that we have 3D and 4D parallelism? How do you do collective operations more efficiently over the compute network? How do you reduce the footprint through quantization and other techniques? This has been at the core of most of the development we see today. However, there is a different aspect which is often overlooked. What is the core of AI? Compute and data. And where does the data sit? In the storage servers, of course, and there is not a lot of focus on this, because people generally do not consider that the data is stored on remote network storage, with its own fabric and its own infrastructure. David from Hammerspace actually dissected this entire problem today, which made my job easier. So the point is: from your remote SSDs or hard drives, how do you pull the data to the GPU more efficiently, when the GPU needs it, and at the volume the GPU can process? This is easier said than done, because there is a lot of technology behind it. Data may be erasure-coded or replicated, and you have to reconstruct it and push it to your GPUs at the right time. The GPUs cannot be left waiting for data, because we all know that the more a GPU waits instead of doing actual processing, the smaller the business outcome, and of course the money. So the key message here is that despite storage being a key player in AI, it is often overlooked and the least talked about, and this is what we really want to change. With that, this talk will walk through the different phases of AI and how each impacts compute and network, primarily for loading data, and how you store and load data more efficiently.
So, let's understand the complete lifecycle of data in AI rather than going into all the minute details: a 50,000-foot view of what it really is. I've divided the entire AI data pipeline into three phases: how you get the data, how you extract value from the data, and how you turn that into a business outcome. First: how do you ingest this massive number of objects from multiple sources, data centers, and clouds? How do you ingest them efficiently, and can you handle that at petabyte scale? And it is not just that you need to store that data. You need to transform it using multiple ETL pipelines to create GPU-friendly training tensors. You have to index the data, you have to annotate it. A lot of filtering happens, and where there is filtering, I know the storage community will come up with something innovative, like computational storage. So a lot of work is done in the background to make training happen efficiently. Then comes the main part: what happens during training? You need to load all these parameters. You hear about 1 trillion, 10 trillion, 1,000 trillion parameter models; how do you load those models onto the GPUs from storage? And that is just the model; how do you then load the batches of data? We are talking about petabytes. How do you efficiently get them onto your training GPU nodes? How do you update the weights, because training is a learning algorithm, and how do you make that efficient? There is also one important use case, AI checkpointing, which we'll talk about in detail today. This is an iterative process; what is the proper storage infrastructure you need to build for it, given multiple layers and petabytes of data to process? Then comes the next phase: how do you get value out of the data? You have learned the right set of parameters for your models, but how do you extract value from them? That is where inference comes in: you load the trained model parameters and you need to process queries in real time with emerging AI applications such as vector databases and RAG, which Suresh is going to talk about today. The challenge for storage will be even bigger, because you have a larger corpus of data to process in real time; you are ingesting and indexing at the same time, and you need to process it on the fly. And finally, there are the challenges Ian spoke about today around archiving: how do you efficiently archive all the outputs, all the trained models, everything, at petabyte scale, and that feeds back into the data ingestion pipeline again. Those are the challenges. For infrastructure, what is the dollar question, the main focus? Maximizing GPU compute utilization by reducing the stalls due to storage; that is the question to answer.
Now, let's go into the first phase, data ingestion and pre-processing: what it really takes, and why it is important. There was an analysis from a hyperscaler. This chart shows the power consumed by the three most used models in a hyperscaler environment, and for all three major workloads, it is the ingestion and pre-processing, that is, getting the data in from different sources and running multiple ETL pipelines to turn it into training-amenable structures, that consumes more power than training itself. We have all heard everyone shouting about the footprint of training, but what about data ingestion and pre-processing, and why is it a challenge? That's what we will uncover now. First of all, the data comes in through multiple log streams and log-aggregation pipelines, such as Kafka streams, in different formats; it can sit on top of RocksDB or any other key-value store. So it is heavily metadata-heavy, and the data is mostly sparse. Sparse means more space, more time to compute, a bigger memory footprint to compute it, and networks that are heavily clogged because of it. So the challenge for storage is that this is at petabyte scale, coming out of different phases of the pipelines, and it is a highly concurrent, write-throughput-oriented workload with a very high queue depth. And of course, one key point is that you need to ensure data encryption and data safety in this phase, because different users have different requirements, and from a storage standpoint that is a challenge to meet. Then comes pre-processing. This is primarily a very compute-intensive workload where you continuously transform raw data into pre-processed tensors for the training jobs to consume, and it requires low latency and high throughput. It is not just one data format: you need to transform multiple kinds of data with multiple transform functions to feed different training jobs, and I'm not talking about one training job but a class of training jobs running at the data-center level. This involves a lot of decompression, a lot of reconstruction, and multiple format changes. From a workload point of view, pre-processing is a sequential write workload that requires advanced compression and decompression to move data across multiple formats, and it is extremely metadata-heavy. There are a lot of indexing requirements; think about the pressure that puts on your RocksDB.
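To make that ETL step concrete, here is a minimal, purely illustrative sketch of the pattern being described: raw records stream in, most are filtered out, and the survivors become fixed-size training samples. The filter rule and the character-level "tokenizer" are stand-ins of my own, not any real pipeline.

```python
# Hypothetical pre-processing sketch: filter raw JSON records and turn the
# survivors into fixed-size training samples (high reduction ratio).

import json

def tokenize(text: str, max_len: int = 16) -> list[int]:
    # Placeholder tokenizer: map characters to ids and pad/truncate.
    ids = [ord(c) % 256 for c in text][:max_len]
    return ids + [0] * (max_len - len(ids))

def preprocess(raw_lines):
    for line in raw_lines:
        record = json.loads(line)
        # Most ingested records never reach training (filtering step).
        if len(record.get("text", "")) < 8:
            continue
        yield {"input_ids": tokenize(record["text"]), "label": record.get("label", 0)}

raw = ['{"text": "hello world, storage!", "label": 1}', '{"text": "tiny"}']
print(list(preprocess(raw)))   # only the first record survives
```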
So what this actually means is that raw data is being transformed into different training samples, the data is geo-distributed, and it is usually very sparse. To put it very simply, these operations have a very high reduction ratio: a lot of filtering goes on in the background, and with filtering, a lot of data moves from the storage servers to the compute over the network while very little of it is actually used. That is a problem to solve. There was a study done by Stanford and Meta, a hyperscaler, where they thought: hey, we have a lot of very dense compute in our GPU clusters, so why not use it for pre-processing workloads? What they found was that the GPUs were stalling for data, because the storage servers could not feed the data at the right time, and because the front-end network resources were the bottleneck: the NICs are highly oversubscribed and running at line rate, and the CPUs cannot keep up with the demand. So it is largely a network problem as well. This motivated them to create separate data ingestion and pre-processing pipelines to cater to the needs of their AI workloads. It is therefore highly important for the industry to create efficient data ingestion and pre-processing pipelines that are optimized both for storing data and for retrieving it efficiently; in short, how can you do filtering more efficiently? Now let me focus on the key aspect of today's talk: AI training.
Training is an extremely heavy, storage-centric workload. I'll dissect it one by one. You will see a very complex figure, so brace yourself. This is, of course, a highly simplistic view of what goes on during training, both on the GPU side and on the storage side, which is somewhat ironic. The world here is broken into two parts: the dense AI GPU server with GPUs, CPUs, HBM, DRAM, and NICs, and, for simplicity, the storage server, which stores all kinds of data. Let me add some perspective: you are seeing one GPU, but the idea is many GPUs connected to the compute network, an array of storage connected to its own network, and the two connected to each other. We'll call the GPU-to-storage traffic north-south, and the traffic between GPU servers east-west. So what happens during training? First, you need to load the model into memory, with different kinds of parallelism in place, think tens of data pipelines and now 3D and 4D parallelism, so you are distributing the model parameters across GPU clusters, and then you load the model. Steps 1 to 10 are shown in the figure on the right. First, the CPU loads the training models and samples from the storage servers to the GPU server; that is just the model, and it is already terabytes. Then the model is loaded into GPU HBM. What next? You need to load the data, and we have not even talked about the data itself: it is in the range of high terabytes, reaching petabytes, but you load it in batches. The batch size depends on the GPU churn rate and how much the CPU can feed the GPU. So you load data from the storage server to the CPU and then to the GPU, which performs a forward pass: it runs a complex set of neural networks and calculates the loss, or error. Once that is calculated, it goes to a backward pass: you need to update the parameters based on your learning. Now comes the main part: after the backward pass, because the model is distributed, you have to communicate across all the GPUs in what is called a rank, a set of GPUs. Then you need to compute the optimizer states and the updated parameters, nowadays using the Adam optimizer, and store all that data back to storage. That process is called checkpointing. We'll get to why checkpointing is important; it is mostly for reliability, and you need to store it on reliable storage. That has a footprint of terabytes of data. So why is this a problem? For Llama 3 405B training on 16K GPUs, running multiple iterations of steps 1 to 10 led to a model FLOPs utilization of about 41% for this set of primitives. What that boils down to is that roughly 59% of the time, your compute is not doing what it is supposed to do. That is a big challenge for the industry to answer. And looking into it, it is pretty clear that this entire process has a very high memory footprint, be it for checkpoints, for training, or for model states.
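As a rough illustration of steps 1 to 10, here is a heavily simplified, single-device sketch in PyTorch: load a model, pull a batch, forward pass, backward pass, optimizer step, and a checkpoint write. It leaves out everything that makes this hard at scale (tensor/pipeline parallelism, collectives, remote storage), and the model, sizes, and file names are placeholders of my own.

```python
# Minimal stand-in for one training iteration (steps 1-10 above), not the
# real distributed stack. Requires PyTorch.

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)                     # stand-in for the loaded model
optimizer = torch.optim.Adam(model.parameters())  # optimizer states add to the footprint
loss_fn = nn.MSELoss()

for step in range(3):
    batch = torch.randn(32, 1024)                 # stand-in for a batch pulled from storage
    target = torch.randn(32, 1024)
    loss = loss_fn(model(batch), target)          # forward pass: compute the loss
    loss.backward()                               # backward pass: compute gradients
    optimizer.step()                              # update the weights
    optimizer.zero_grad()
    # Checkpoint: serialize model + optimizer states to (ideally remote) storage.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optim": optimizer.state_dict()}, f"ckpt_{step}.pt")
```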
Because you are talking inter-GPU, intra-GPU, and to the storage servers, the NICs are highly oversubscribed and you are working at line rate. So the NIC technology has to keep up with that kind of traffic, and this traffic is both the east-west collective traffic and the north-south loads and stores to the storage servers. Now let's get to the two big problems for storage: model and data load, and checkpointing. I'll focus more on checkpointing. Let me go to the next slide.
So, the first problem for training, from a storage perspective, is the model and data load. In the figure on your right, you see the steps 1 to 10 from my previous slide. That was just one iteration, but there are multiple iterations, because you keep going until the entire training dataset has been read, and there are multiple epochs: one epoch comprises multiple iterations, and training runs through multiple epochs to reach accuracy against your objective function. In the figure there are two color codes: blue is the time it takes to load the model and data, and red is the time spent on training computation, including checkpointing. This happens over multiple replays, so you can call it a highly iterative workload. Dissecting the model load further: it is significant, in terabytes, but it happens once per epoch. The challenge is how you load the training samples to the training nodes, and since the data is sampled, this is a random-IO problem; the sketch after this paragraph illustrates the access pattern. As a back-of-the-envelope calculation, a 1-trillion-parameter model would require a minimum of around 800 terabytes of data for high-end training, so this requires very high read throughput, and it is a large number of small-file reads. We all know the problem with a large number of small files: we know it from Lustre, we know it from the other talks. A large number of small files is not just an IO problem, it is also a metadata problem, so this is a very metadata-heavy workload. The next part is that data preparation, the data load, incurs a very high data-center tax, because, as we have discussed, the host resources and the network are highly oversubscribed given the amount, volume, and velocity of the data. However, when that data was sampled for DLRM datasets, it showed a high degree of duplication, so can the storage community do something about it? Can you reduce the data-center tax, the IO tax of moving data from the storage servers to the dense GPU cluster, with advanced deduplication? Then pre-processing again has diverse functions, mostly filtering and decompression, and I think we in the storage community understand the value of compression, decompression, and filtering more than others realize. Why is that a challenge? This kind of workload is highly latency-sensitive, so what can we do with near-storage computation, and how could that be used to reduce the significant GPU stalls during model and data load? That is one challenge. The next one is going to be heavy: the AI checkpointing problem.
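Here is the small sketch referenced above, showing why the data load looks like random small-file IO to the storage backend: samples live in many small files, and every step asks for a shuffled subset of them. The paths, directory name, and batch size are hypothetical.

```python
# Illustrative only: random sampling over many small per-sample files.

import os
import random

def load_sample(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()

def random_batches(sample_paths, batch_size):
    order = sample_paths[:]
    random.shuffle(order)                  # sampling makes the read pattern random
    for i in range(0, len(order), batch_size):
        yield [load_sample(p) for p in order[i:i + batch_size]]

# Usage (assuming a directory full of small pre-processed sample files):
# paths = [os.path.join("samples", f) for f in os.listdir("samples")]
# for batch in random_batches(paths, batch_size=32):
#     ...  # feed the batch to the training step
```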
Over the past couple of days, people have been talking about checkpointing, so I believe I should focus more on it. I'll spend around another 10 minutes on checkpointing, and then hopefully we will understand its challenges. First: what is checkpointing? Before that, note that these training jobs do not run for hours; they typically run for weeks and months. Earlier it was years; now it is weeks and months. Checkpointing is a mechanism that has been used in the HPC context for a very long time to save snapshots of memory and all the vital information, so that you can restore from where you last checkpointed and do not have to redo that computation. Let's see what checkpointing is through an example. Figure A shows all your training steps with no checkpointing at all: when a failure happens, you lose your progress and need to start from the beginning. This is extremely expensive, and I'll give you some numbers on how expensive, not just in dollars but in time, money, power, and all the resource allocation and deallocation issues. Checkpointing is the process by which, at certain intervals, you store all the vital information on reliable storage so that when a failure happens, you can resume from the last checkpointed location. A very simple concept. But even when you checkpoint regularly, and this is the number you see for a hyperscaler customer, checkpointing every hour on what we nowadays call a very small 3K-GPU cluster, that hourly rollback still costs their customers around $30K. Think about what happens when 300K or 400K GPUs are training, say, a trillion-parameter model at that scale: you will lose money in the millions even with checkpointing. The bigger question is that we are building these large clusters with ever-lower MTBF, mean time between failures, so you have to assume there is a failure all the time. The checkpointing frequency will have to increase, and the footprint will pretty much quadruple, and this is a challenge for us to solve as an industry. I have talked about checkpointing mostly as a reliability problem, but it is also used for hardware refreshes, because these GPUs run on VMs, GPU VMs, and you may need to do resource rebalancing; you may need to fine-tune; and you may need an early kill if error rates go up, so that you end up with a near-perfect trained model. So it is used for that as well. Now let's go deeper into why this is a storage problem. We all know the data problem, but why is it a compute problem, or rather, what is the rippling effect on compute?
So, here is what we have projected, and what really happens during checkpointing. Checkpointing as a whole is a simple process. First, serialization: you create GPU-amenable, tensor-file-compatible structures, you quantize them, and you augment them with metadata so that you can reconstruct them efficiently; that is the serialization job. Then persistence: you write all these serialized tensor files to remote persistent storage for scalability and high availability. It all depends on the kind of parallelism you have in your data center and in your training jobs, and it is a sequence of writes to a file or to multiple files. On your right you see a projection, again, just a projection, of the overall aggregate checkpointing footprint. Today, at the 405B-model range, a checkpoint is around six to seven terabytes. But it will pretty much quadruple given the number of parameters that will be used. The checkpoint also contains the optimizer states and extremely rich metadata: the data size, the reader states, the rank. Think about why we do all this: you need to load the checkpoint back efficiently to a particular GPU, and nowadays, with the kinds of parallelism we see in data centers, you need to load it to the appropriate GPU based on its rank, so the checkpoint has to embed all that information. And this overall compute, network, and storage footprint of checkpointing will grow exponentially with the frequency, and the persist and restore paths will get more distributed and complicated.
Now bear with me for another couple of minutes, and we will see what checkpointing really is internally. As we know, training is done over multiple epochs, and if we dissect an epoch, it has multiple iterations. Solid green represents when the GPU is active; red represents when it is checkpointing. If you break down an iteration, what happens is, remember the steps 1 to 10 I showed, you do the forward pass, backward pass, and optimizer, which are steps 5 to 8. Then you checkpoint layer by layer: a sequence of writes to files. The problem is that this is a synchronous job. People have been developing asynchronous checkpointing, but a lot still needs to be done. Why is this synchrony an issue? Look at GPU 2: it sits idle until GPU 1 finishes its checkpointing, because it cannot start the next iteration. Training is synchronized, and it is paused by checkpointing. Breaking it down to the simplest form: you are constrained by the slowest training-node-to-storage path. This is highly inefficient because of the synchronous way of processing data, and it wastes data-center resources. It increases training time, which is very important for us to bring down, and it adds to the data-center tax, wasting resources all across the data center. So this is what needs to be solved: if training is paused and GPUs across the ranks have to wait for the data to persist, that wait time needs to come down. That is the challenge for us to solve as an industry.
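To illustrate the direction asynchronous checkpointing takes, below is a minimal sketch under my own simplifying assumptions: snapshot the states quickly in memory, then persist them on a background thread so the next iteration is not blocked on the slowest node-to-storage path. Real systems additionally coordinate across ranks, stage the snapshot in host DRAM, and bound how many checkpoints are in flight; this is not any production mechanism.

```python
# Illustrative async checkpointing: fast in-memory snapshot, slow persist in
# the background. Requires PyTorch.

import copy
import threading
import torch
import torch.nn as nn

def async_checkpoint(model, optimizer, path):
    # 1) Serialize fast: take an in-memory snapshot of the states.
    snapshot = {"model": copy.deepcopy(model.state_dict()),
                "optim": copy.deepcopy(optimizer.state_dict())}
    # 2) Persist slowly in the background: write the snapshot to durable storage.
    writer = threading.Thread(target=torch.save, args=(snapshot, path))
    writer.start()
    return writer          # join() later to bound in-flight checkpoints

model = nn.Linear(64, 64)
optimizer = torch.optim.Adam(model.parameters())
writer = async_checkpoint(model, optimizer, "ckpt_async.pt")
# ... training would continue here instead of stalling ...
writer.join()
```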
Now, let's look at the impact on infrastructure. We spoke about the aggregate footprint: you saw that chart showing the terabytes to petabytes of memory required for checkpointing, and you have multiple such files, multiple versions, being stored every few iterations. So what is the impact per GPU? With growing model size, the per-GPU impact grows. A small estimate for a hyperscaler puts it at around 30 gigabytes per GPU. Think about it: with eight GPUs, that is 240 gigabytes, and your NICs are already highly oversubscribed at the server end. For Llama 3 training, checkpoint writes range from one megabyte to four gigabytes, depending on the parallelism involved. Persistence and restore are getting tougher over time, with complex interactions between all these different kinds of parallelism, because you need to embed that information. And distributing your model across GPUs does not mean the footprint shrinks by the same factor, because you need to store more metadata with it. For the storage servers and the network, this means you need to persist thousands of these checkpoints concurrently, not from one model but from different models at the same time. So your NICs and DPUs are highly oversubscribed, and you cannot meet the SLA guarantees because of packet failures, because you are saturating and over-saturating your storage fabric. That is true even for Llama 3 405B training. As I've already mentioned, there is a need for data-path and control-path optimizations: efficient rate-limiting schedulers to reduce these sorts of failures, and efficient congestion-control methodologies. These need to be there. The goal for the storage ecosystem as a whole, be it compute, network, or the storage subsystem, is to maximize GPU bandwidth utilization and minimize the time to load and store checkpoints. So that is all on checkpoints. Now, the easier part.
Let's move to the value part. I spoke about the five V's; now the point is about value. Suresh will talk about emerging AI inference, what the challenges are there, and how to optimize it. But inference is highly latency-sensitive, and from a storage perspective you need reliable and fast deployment of all these trained models with minimum time to deploy. Generically, inference is characterized by small read IOs, and you need to be more performant because thousands of queries arrive at the same time. There are requirements for scale-up and scale-out storage with high availability and performance. The main takeaway is that you need high-performance storage with low latency and high read bandwidth to keep all your inference-node GPUs saturated with data.
However, the key use case, and don't quote me on this, is retrieval-augmented generation and vector databases. Large language models are trained at one point in time, but by the time they are deployed, that information can have become stale. Retrieval-augmented generation is a use case where you augment the model with external information and user queries to retrieve the most relevant information, through a technique called top_k, which we'll touch on briefly, so that you can generate the most relevant and accurate response. The footprint of this is huge, for two reasons. First, you are going highly multi-modal, with images, text, and videos, objects, let's call them objects, being processed. Second, there is continuous data ingestion via Kafka pipelines with heavy indexing requirements: on-the-fly indexing and utilization. You cannot fit all of that in memory, and again, coming back to the theme, the data cannot fit in the GPU memory hierarchy.
I will not spend a lot of time here because I'm running a bit behind. The main point, on your right: data is ingested, and this is a highly sequential write workload, very heavy on the NICs. Then you need the embedding models: moving the already-trained models from back-end storage to front-end compute. That is a one-time effort, but it requires high read bandwidth and low latency for fast deployment. There is also a continuous indexing requirement, because you need to create the embeddings and the vectors. The main point, though, is the vector databases. Given the queries arriving, and having embedded each query with your embedding model, how do you get the most relevant set of chunks or documents for that particular query? Vector databases are the answer: they do continuous data ingestion and indexing, and they do context-aware retrieval, which is search and filtering through a process called vector similarity match. Given a query, which chunks of data are most appropriate for it? That is the process called top_k. The result is then fed into the large language model, which gives you an accurate answer. Cutting it short: you need to transfer a large number of files between the storage servers and the dense GPU cluster, and you need to do vector similarity match, which is mostly a collaborative search algorithm where the GPU and CPU search over the embedding space and give you the most appropriate top context, essentially a merge tree. This requires very high storage-to-CPU, CPU-to-GPU, and GPU-to-GPU bandwidth for data copies and reduction operations. So there is a need to increase the whole storage read bandwidth for faster real-time response. That covers most of it.
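To make the top_k step concrete, here is a minimal brute-force sketch of vector similarity match: score the embedded query against stored chunk embeddings and return the k best. Real vector databases use approximate indexes (IVF, HNSW) rather than a full scan, and the embedding dimensions and corpus here are made up.

```python
# Illustrative top_k retrieval by cosine similarity over stored embeddings.

import numpy as np

def top_k(query_vec: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    # Cosine similarity between the query and every stored chunk embedding.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)[:k]         # indices of the k most relevant chunks

corpus = np.random.rand(10_000, 384).astype(np.float32)   # stand-in chunk embeddings
query = np.random.rand(384).astype(np.float32)            # stand-in embedded query
print("top-k chunk ids:", top_k(query, corpus, k=3))
```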
And finally, this is the most generic slide you will see in any presentation. We have spoken about the different phases of these applications, but we have only discussed a single generative AI pipeline. Do you think it is economically feasible to run just one pipeline? I believe it is not. In the cloud or in the data center, you will have multiple users and multiple AI pipelines running at the same time, so from the storage end you will see a mixed profile, and we really need to provide performance isolation to these clients. Performance isolation is a very easy term to say and the most difficult thing to do, and all of us know that. Think about it: even as an industry we are still coming together to solve the congestion problem, priority-based congestion control. How do you do it per client? How do you solve this for every phase of a generative AI pipeline, while dealing with a huge number of file operations? So to maximize GPU utilization, you need to provide the performance isolation and SLA guarantees for next-generation pipelines, and this will have a very large IO-blender effect.
Now, let's go to the final thoughts, well, not final yet. We have already spoken about the different challenges, and what we realize from the entire presentation is that the requirements are everything and anything. You can see them on your right: availability, capacity per dollar, block or directly addressable access, low latency, high throughput, high IOPS. You also need to solve the thermal problem, the carbon footprint problem, and the host resource consumption problem. There is, and will be, a lot of data, required by these dense clusters as fast as possible, and there will be failures all the time. So we need a unified storage platform that embeds all these different aspects of the generative AI pipeline and captures the different needs of its different phases. That is the requirement.
Next, I'll leave some food for thought: we need to rethink and redesign the end-to-end GPU-optimized infrastructure for AI. When I say GPU-optimized, I also mean DPU-, NIC-, and CPU-optimized infrastructure, where all the computation can be done more efficiently. There is a lot of talk, and also a lot of work, on GPU-to-GPU interactions using UALink. And is Ultra Ethernet the answer for an efficient transport mechanism? How do you optimize GPU-storage interactions, and how do you make direct RDMA-like services, or rather microservices, work at scale? And for storage, how can you make these accelerator interfaces transparent to all kinds of storage? Then there are the domain-specific and programmable hardware-software co-design requirements. I have this notion I call the compute-everywhere, anywhere paradigm: how can you enable CPU offloads, in-network computation in conjunction with those offloads, and near-storage computation using the DPUs and CPUs inside the storage servers themselves? Finally, what is the proper storage topology: converged or not? Today we have talked all about storage foreground processing, but what about the background traffic, the data reconstruction and rebalancing in the storage servers, which consumes around 20% or more of the entire traffic? How do you balance foreground and background processing? These are some problems to solve. How do you define the appropriate transport to do so, and is it an NVMe fabric? What is the fabric we need to redesign?
And final thoughts. I will end, well, not end, there is one more slide for you, with this: we need to reduce the data-center entropy tax, and you need to maximize the utilization of your compute, network, and storage resources. This is not just a problem for today; it will keep growing tomorrow.
Finally, I'll leave you with this thought: storage, despite being a key player in AI, is often overlooked and the least talked about. We have to change that as an industry, and we need to really think out of the box about how we could do it. And that's it. I will open the floor for questions and answers.
So yeah.
One thing that I see missing, and it's still kind of missing, is a data-flow or streaming kind of view of the architecture, synchronous versus asynchronous. Is anybody looking at that?
So, as a whole, people generally want things to be asynchronous.
I have a question.
Oh, sorry, let me make sure I understand. Is your question about how you can make it asynchronous, or about a missing feature set?
Synchronous versus asynchronous is all right.
OK.
There's a boundary to synchronous systems, and the only way we ever scale up is by combining them into asynchronous systems.
Yeah.
So that's one side of it. I'm not a data guy, but I'm thinking through the data. From what I understood, you're moving all this data, but is there a flow way of looking at it, a data-flow view? Say you moved a gigabyte but only looked at one byte out of that gigabyte. As long as I get one byte per nanosecond, I may still be OK; it's not about how much data I move, right? So I'm wondering whether a data-flow architecture would be slightly better. Is there anything like streaming processing that could be the final architecture we end up at, at some point?
Actually, that's a very interesting question, thank you. If I understand it, you're asking what the appropriate data-flow architecture is for solving this problem. There are multiple moving parts here. One of them is that you really don't know your reduction ratio, that is, how much of that large corpus of data is actually relevant. And you already suggested one answer yourself: can you do stream processing, on-the-fly processing, so that the data is reduced at every step on its way from storage to compute? Does that answer your question? Or we can talk later. Yeah? Anyone?
So you're saying you need models and data to train them, you need to checkpoint along the way, and then if you do RAG or a vector DB, you need even more. Are there any efficiencies in there that help reduce the amount of storage you need at any one time?
So, if I understand your question correctly, it's about...
Or would this maybe have... You need petabytes of data.
The problem really is that there are a lot of algorithmic refinements that can narrow down the output you require. One of them is clustering: when you index the data, you cluster it appropriately so that you don't have to ship that large amount of data across the network. You send not petabytes but terabytes or even less, depending on how you store the data, how you organize the metadata, and how you cluster it together. I think Suresh will cover that today in his talk on vector databases. Yeah, please, go ahead.
Just a quick comment on the first question. At OneEngineer, we're exploring the Homa networking technology around GPU utilization. They described it to us as a tidal power plant: these waves come in and go out over a week, huge amounts of data going one way and then coming back. So think of the data flow as a swinging system that swings at petabyte scale over a week. There is some form of streaming in there; it's just hard to predict. What they're trying to do is make sure not everybody swings in synchrony, one in, one out, and so forth, so they try to balance things out for efficiency reasons. My question is: what is your ballpark feeling for the sparsity of the data that is stored, read, and written?
Data sparsity will tend to increase because of the different forms in which you ingest the data and the different dimensions in which you store it. However, there is a large amount of work on reducing the sparsity effect with quantization and with different kinds of reduction techniques. To give you a ballpark: data is already sparse today, and based on analysis from both academia and industry, sparsity will grow at a rate of at least 5 to 10x in the years to come. But don't hold me to that; it's just an estimate. Any more questions? Oh, yeah. Can I take one more? Okay.
Oh, it's a question. So in training, I guess you call it training, but in practice is it training from scratch, or pre-training plus fine-tuning? Fine-tuning is sometimes applied on smaller-scale systems. Do you have any comments on the storage implications of that pre-training versus fine-tuning split?
Yeah. I didn't touch on fine-tuning today. But fine-tuning is not as intensive as training itself, because it is not that heavy on the storage networks. I would rather use a file system, a network file system, for fine-tuning than go through an object store. And fine-tuning mostly runs on a large number of small systems; that is the main challenge there. It is more latency-sensitive than throughput-sensitive. Okay. Thank you.