Good afternoon, everybody. I hope you had a good lunch, and thanks for coming back for our talk. This is Navneeth Kankani, a platform architect representing Uber's infrastructure team, and with me is my colleague, Bo Ling, a software engineer on the ML platform team. We are both here to delve into the challenges and opportunities we see as we navigate the path of effectively scaling Uber's infrastructure for AI/ML use cases. Our talk will focus on meeting the demands of fast-emerging and dynamic generative AI use cases. We'd like to invite you all to join us in exploring hardware-software co-design solutions and to engage with the OCP community, especially around CXL enablement for AI/ML workloads. We believe that together we can drive this innovation and share our learnings on how to efficiently scale infrastructure.
So with that, I would like to delve into how ML is deeply integrated into every facet of our business. Taking a look at the Rider app, ML is used behind user authentication, ranking and recommendation of products, ETA predictions, pricing strategies, driver/rider matching, and even in our efforts to prevent payment fraud. These are just a few examples of how ML pervades and enhances the overall Uber experience.
Now let's delve into the Uber ML lifecycle, starting from model development and going all the way to productization. At Uber, the data is stored in our data lakes, managed by our data infrastructure team. To kickstart the process, we have a set of Spark jobs that load, prepare, and transform features from this data, and these transformed features are then stored in the feature store. In parallel, we harness a lot of streaming data from Kafka, which is used across various models and stored in either the real-time or the batch feature store. Moving on, we perform multi-node training jobs using our training module, and these models are then stored in the internal model store for checkpointing, operational management, et cetera. And finally, these models are served through real-time or batch prediction services.
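To make the Spark feature-preparation step concrete, here is a minimal, hypothetical sketch of what such a job can look like; the paths, tables, and columns are illustrative assumptions, not Uber's actual schemas.

```python
# Hypothetical Spark feature-transformation job: read raw trip data from the
# data lake, derive a couple of rider features, and write them out so they can
# be loaded into the feature store. All names and paths are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-prep").getOrCreate()

trips = spark.read.parquet("s3://datalake/trips/")  # placeholder location

features = (
    trips
    .withColumn(
        "trip_minutes",
        (F.unix_timestamp("dropoff_ts") - F.unix_timestamp("pickup_ts")) / 60,
    )
    .groupBy("rider_id")
    .agg(
        F.avg("trip_minutes").alias("avg_trip_minutes"),
        F.count("*").alias("trip_count"),
    )
)

features.write.mode("overwrite").parquet("s3://feature-store/rider_features/")
```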
Now, throughout this journey, we face a multitude of infrastructure challenges that impact our operations. To start with, we deal with a lot of I/O bottlenecks while reading data from the data lake. During the transformation process in the Spark jobs, we face substantial compute and memory capacity challenges. Moving on to the training phase, we not only face another round of I/O challenges, but also require high teraflops and run into network bandwidth constraints. And finally, in serving, we require high memory bandwidth to meet acceptable serving latencies. All in all, these challenges, in one way or another, slow down our training process and also limit the volume of data that we can train with.
To further understand the scaling challenges, let's take a look at the evolution of ML models at Uber, divided into three distinct phases. The first phase started back in 2016, when our ETA and pricing strategy teams started to look at the feasibility of moving some complex rule-based systems to machine learning models. At that time, our CPU-centric infrastructure was capable enough to host these models. Back in 2019, we embarked on the second phase, where most of our business-critical use cases were using ML in their core flow. Until then, our models were mostly tree-based, like XGBoost, and deep learning was just starting to take off. Fast forward to now, and 60% of our tier 1 use cases are using deep learning models. These models require GPUs, and we are also looking at heterogeneous CPU/GPU architectures to enhance infrastructure efficiency. The third phase started recently with the advent of GenAI. We are exploring the end use cases for GenAI at Uber, and this invariably requires substantial GPU capacity, on the order of hundreds to thousands of GPUs, in order to meet reasonable training times and acceptable serving latencies.
Now let's delve into the GenAI workloads a bit further. At Uber, the GenAI efforts are centered around three primary implementation options. On one end of the spectrum is prompt engineering, where we use enterprise vendors' large language models straight out of the box through their APIs. This gives us the fastest time to market, but also comes at a higher cost. On the opposite end of the spectrum is pre-training foundational models on massive data sets. This is the slowest approach, as it can result in training times spanning days, weeks, even months, and requires substantial GPU investments, and hence comes at a very high total cost of ownership. In between these two approaches lies fine-tuning, where we take models like Meta's Llama 2 and tailor them to Uber's specific needs using our own data sets. This is quicker in terms of training times and comes at a lower cost of ownership because we host it on our on-prem or cloud infrastructure. Fine-tuning can further be divided into two distinct strategies: full fine-tuning, where we adapt all the model parameters for a specific task, and parameter-efficient fine-tuning, where we only adapt a subset of parameters. From the infrastructure standpoint, our strategy is to use the existing ML infrastructure and get the maximum out of our current resources for parameter-efficient fine-tuning. Full fine-tuning and pre-training, on the other hand, necessitate both scale-up and scale-out upgrades. This means technologies like full-mesh NVLink across all GPUs, upgraded network link bandwidths, congestion control management, dedicated racks, and optimized network topologies, to name a few.
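As an illustration of the parameter-efficient path, LoRA-style fine-tuning with the Hugging Face peft library looks roughly like the sketch below; the checkpoint name, target modules, and ranks are placeholder assumptions rather than Uber's production settings.

```python
# Minimal sketch of parameter-efficient fine-tuning (LoRA) on a Llama 2
# checkpoint. Only the small adapter matrices are trained; the base weights
# stay frozen. Hyperparameters below are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a small fraction of weights train
```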
Now let's move forward by defining some specific key results and metrics that we need in order to meet our goal of optimizing existing infrastructure and building new infrastructure for GenAI use cases. The first one is feasibility and reliability. Users engaged in ML training expect their models to converge without errors within a reasonable amount of time, meaning days, sometimes weeks. Moreover, they also expect high training stability, north of 90% to 95%, to maintain consistent and reliable results. The second is efficiency. We constantly prioritize and benchmark various GPU system configurations, accelerators, and CPUs in order to get better price-performance ratios for specific workloads. We measure efficiency in terms of teraflops, model FLOPs utilization, and training times. Our goal is to make sure the GPUs are not idle and to maximize the utilization of GPU resources. And lastly, developer velocity. We measure developer velocity by the number of experiments our engineers can run in a period of time and the time to productionize a model. We also enhance developer velocity by prioritizing maturity of the ecosystem, like CUDA compilers, Python frameworks, and so on.
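For reference, model FLOPs utilization can be estimated with a simple back-of-the-envelope calculation, sketched below under the common assumption of roughly 6 FLOPs per parameter per token for a dense transformer forward-plus-backward pass; the throughput and hardware numbers are illustrative, not measured values.

```python
# Back-of-the-envelope MFU estimate for dense transformer training.
# Assumes ~6 FLOPs per parameter per token (forward + backward).
def estimate_mfu(n_params, tokens_per_sec, peak_flops_per_gpu, n_gpus):
    achieved = 6.0 * n_params * tokens_per_sec  # training FLOPs actually done
    peak = peak_flops_per_gpu * n_gpus          # theoretical hardware peak
    return achieved / peak

# Illustrative example: a 70B-parameter model on 32 A100s
# (~312 TFLOPS peak bf16 each) at an assumed 2,000 tokens/s overall.
print(f"MFU ~ {estimate_mfu(70e9, 2_000, 312e12, 32):.1%}")
```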
In meeting these goals, we do encounter a few challenges. It is hard to produce an efficient AI/ML system design for the diverse workload requirements of the AI landscape, as we saw going from XGBoost to deep learning to large language models. For example, large language models require high compute density and teraflops, whereas deep learning models can be memory-bound. When sharing GPU capacity across these different applications, we sometimes allocate resources conservatively in order to ensure fairness and equal distribution across applications, and this can result in lower utilization levels. Another challenge is integrating open-source libraries and optimizers into Uber's production containers. We have Uber's ML framework, Michelangelo, and we are looking at developing end-to-end LLM support within that framework to enhance developer velocity; that is work in progress. Memory expansion, both in capacity and bandwidth, has not kept pace with the scale at which model parameters have grown, and that's a big challenge; you have probably heard that many times in this conference so far. And last, but probably the most important, is scaling network bandwidth in both scale-out and scale-up setups, which is needed to enhance training efficiency.
Now we'll delve into a couple of case studies that illustrate these challenges. The chart on the left depicts the Uber ML GPU fleet's reservations and allocations within the past month. What's evident from this chart is that our allocation rates are very modest, much lower than where we would like them to be. The reason for this is that newer workloads require more and more system memory per GPU worker than what is currently available on the systems. The inherent limitation on the number of DIMM channels on the server prevents us from scaling up these allocations and unlocking their efficiency. This is precisely where CXL memory expansion can help in unlocking further GPU allocations, increasing the memory, and further improving our efficiency.
Here we show the summary of our results for a large language model use case and the significant impact of network improvements, both in bandwidth and in congestion control management, on getting better training efficiency. What we see here is a twofold improvement in training throughput and significantly reduced training times when we use NVIDIA's DGX setup, which is equipped with InfiniBand congestion control management and even higher link bandwidth, compared to our on-prem machines with Ethernet and lower link bandwidth. When we do multi-node training, there is a lot of duplication of data across nodes, which puts pressure on local memory and even on network I/O. We believe this is where CXL pooled-memory systems can help us scale memory across various systems.
Now we look at our design framework and experimentation approach for offloading optimizer states for a large language model, like Meta's Llama 2, from GPU memory to CPU memory and NVMe devices. Our objective is to assess the impact of offloading parameters and optimizer states from GPU to NVMe devices and CPU memory on GPU scalability, training efficiency, and some other system metrics. For this setup, we have used our on-prem ML machines, each equipped with four NVIDIA A100s and Intel Ice Lake CPUs. We have a cluster of 32 GPUs across eight nodes that uses an Ethernet fabric. From here on, my colleague Bo will go over the LLM training architecture in detail and describe our experimentation results. So over to you, Bo. Take it away.
Hi, everyone. Now I will talk about our large language model training and our efforts to optimize training efficiency. First of all, let me sketch our large language model training architecture. At the top level, we use Ray, specifically the Ray Train trainer, to do the distributed training, because Ray provides an easy API for distributed training and can configure all the distributed process groups for us. At the middle level, we are using Hugging Face Transformers and Microsoft DeepSpeed. Hugging Face Transformers has a very nice API to download models, run the training loop, and update the gradients, and DeepSpeed is a very nice library for model parallelism; it provides ZeRO-Infinity with several stages. At the bottom level, we are using PyTorch for the distributed training, because it supports most of the state-of-the-art natural language models and large language models, and we use NCCL to perform the multi-node, multi-GPU network communication.
Now let me talk about our training pipeline design, which is actually very simple: basically, it reads data from storage, does the distributed training, and then handles model checkpoint management. For distributed training, as I said, we use Ray to manage multiple workers. In each worker, we use Hugging Face Transformers, together with Hugging Face Accelerate and Microsoft DeepSpeed, to do the gradient updates and also to manage the partitioned states. Partitioned states are basically the model states each worker holds after you apply model parallelism.
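To make this stack concrete, here is a minimal sketch of how a Ray Train worker loop can drive DeepSpeed; it is an illustration only, with a placeholder model name, an assumed load_training_batches helper for data loading, and a simplified config rather than our production setup.

```python
# Illustrative sketch: Ray Train launching a DeepSpeed ZeRO-3 training loop.
# load_training_batches is an assumed helper that yields tokenized batches
# (with labels); model name, paths, and config values are placeholders.
import deepspeed
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

DS_CONFIG = {
    "train_micro_batch_size_per_gpu": 2,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},  # shard params, grads, optimizer states
}

def train_loop_per_worker(config):
    tokenizer = AutoTokenizer.from_pretrained(config["model_name"])
    model = AutoModelForCausalLM.from_pretrained(config["model_name"])

    # DeepSpeed wraps the model and builds the sharded optimizer for us.
    engine, _, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=DS_CONFIG
    )

    for step, batch in enumerate(load_training_batches(tokenizer)):  # assumed helper
        outputs = engine(**batch)          # forward pass on this worker's shard
        engine.backward(outputs.loss)      # DeepSpeed-managed backward
        engine.step()                      # optimizer step on partitioned states
        if step % 500 == 0:
            engine.save_checkpoint("/mnt/checkpoints")  # checkpoint management

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"model_name": "meta-llama/Llama-2-7b-hf"},
    scaling_config=ScalingConfig(num_workers=32, use_gpu=True),
)
trainer.fit()
```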
Let's talk about why we need model parallelism for large language model training. For distributed training, there are basically four scenarios you can use: plain data parallelism, plus the three stages of DeepSpeed ZeRO-Infinity. The choice depends on how much of the model states you want to replicate across the different GPUs. Without model parallelism, you can do a simple calculation: each billion parameters requires roughly 16 gigabytes of memory to train. So if you want to train a 3B model, a sequence-to-sequence model like T5-3B, you need about 48 GB of memory. And even if you add more GPUs, with 80 GB of memory per GPU, the largest model you can train is only about 5 billion parameters. Instead, if you enable model parallelism, the ZeRO-Infinity stages can shard your model states, including model parameters, gradients, and optimizer states, across the different GPUs. So it can scale to models with an arbitrary number of billions of parameters, as long as you have enough GPUs.
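To spell out the arithmetic behind that rule of thumb, here is a small sketch of the usual mixed-precision Adam accounting (as in the ZeRO papers); it ignores activations and temporary buffers, so real requirements are somewhat higher.

```python
# Rough model-state memory accounting for mixed-precision Adam training.
# 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 master weights, momentum,
# variance) = 16 bytes per parameter, i.e. ~16 GB per billion parameters.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

def model_state_gb(billions_of_params):
    return billions_of_params * BYTES_PER_PARAM  # GB, since 1e9 params * bytes

print(model_state_gb(3))    # ~48 GB   -> T5-3B without sharding
print(model_state_gb(5))    # ~80 GB   -> about the limit of one 80 GB GPU
print(model_state_gb(70))   # ~1120 GB -> Llama 2 70B must be sharded/offloaded
```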
Given that, I want to talk about our experiments training the Llama 2 70-billion-parameter model, our optimizations, and our findings. As I said, for large language model training you definitely need model parallelism, and we find that the offloading provided by Microsoft DeepSpeed is also very important. Here I have several graphs; in each, the left panel is training using GPU memory only, and the right panel is with CPU or NVMe offload. The first graph is about CPU memory usage. We can see that if you train purely on GPU, your CPU memory usage is actually very low, which is not a good thing, because you are wasting those host resources. But if you use CPU offload, your CPU memory is well utilized.
If you check GPU memory, training purely on GPU almost hits the capacity of the GPU memory. What we find is that most large language model training is actually bounded by GPU memory, because you cannot increase your batch size much; otherwise, you will get a GPU out-of-memory error. But with CPU offload, you can reduce your GPU memory usage. Here we even increased the batch size from 2 to 6, and GPU memory usage with offload is still 34% less than when training purely on GPU.
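For reference, the kind of DeepSpeed ZeRO-3 configuration that turns on this offload looks roughly like the sketch below; the batch size and NVMe path are illustrative and not our exact production settings.

```python
# Illustrative DeepSpeed ZeRO stage-3 config with CPU offload of optimizer
# states and parameters. Swap the device to "nvme" (with an nvme_path) to
# offload to local NVMe instead of host memory.
ds_offload_config = {
    "train_micro_batch_size_per_gpu": 6,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # shard parameters, gradients, and optimizer states
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        # e.g. {"device": "nvme", "nvme_path": "/local/nvme"} for NVMe offload
    },
}
```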
Another thing I want to show is GPU utilization. With CPU offload, we do observe that GPU utilization has some dips. We suspect this is related to the CPU-GPU communication introduced by offload, and also to the larger batch size, but that needs further investigation into how to tune DeepSpeed.
Most importantly, one thing we find is that if you use CPU offload, your network throughput requirements are actually lower, which can reduce the inter-node communication overhead during large language model training.
Here is our result summary. First of all, we are able to train larger models that otherwise cannot fit on a limited number of GPUs, especially with offload turned on. Offload can also result in better training efficiency, because we can use a larger batch size; otherwise, the batch size is limited by GPU memory. As we said, GPU memory usage is 34% lower, but the MFU, which is the metric we use to measure training efficiency, is actually more than two times higher. We also find that the network throughput is lower when we turn on CPU offload. So a high-bandwidth, low-latency CXL device could perhaps further improve training times.
OK. Here are some next steps in our ongoing effort. First of all, we need to improve training stability when using DeepSpeed offload. The DeepSpeed library is very rich and has a lot of offload options, and we are still trying to figure out the best way to improve training efficiency. We also need to continue trying larger models. Right now we are training Llama 2 70B, but these days there is another open-source large model, Falcon 180B, so we want to try that one. We also need to try larger cluster sizes to see whether our assumptions about offload still hold. After model training, we need to run batch inference, so right now we are trying to optimize batch inference with offload as well, because if the model grows a lot and cannot fit in a single node, say 320 gigabytes of total GPU memory in one node, then you definitely need model parallelism for inference. Separately, we are already getting some exciting results when we try reinforcement learning. And next, we need to evaluate training on CXL-attached memory to see whether we can get further improvements in training efficiency. With all that, I will hand it back to my colleague, Nick, for the conclusions.
Thank you, Bo. To wrap up, we'd like to leave you with three key insights. It is hard to do efficient AI/ML system design in the face of the rapid evolution of models and applications; hence, scaling infrastructure efficiently requires hardware-software co-design and innovation across different layers of the stack. Through these experiments, we have demonstrated the impact of scaling up memory and network resources, and even of memory offloading, on enhancing training efficiency for generative AI models like Meta's Llama 2. And lastly, we'd like to extend a call to action and urge all of us to actively participate in CXL enablement for AI/ML workloads, foster industry partnerships, and leverage open-source optimizations. Together, we can drive innovation and tackle the evolving demands of GenAI use cases. With that, thank you for your time, and enjoy the rest of the conference.