Performance bottlenecks in deploying LLMs—a primer for ML researchers

11 min readMay 10, 2023

by Lucia Mocz, Ph.D.

This is the first post in a series to help researchers understand the systems-level design choices involved in deploying LLMs.

With the rise of Large Language Models (LLMs) come new challenges that must be addressed, especially when it comes to deploying these models at scale. Unlike traditional convolutional neural networks (CNNs) or recurrent neural networks (RNNs), LLMs require an enormous amount of computational power, memory, and data to train effectively. Learning to reason about systems-level design choices to accelerate the speed and efficiency of training is therefore critical for an ML researcher entering this space.

In this post, we explore problems involved in LLM deployment, from GPU shortages to bottlenecks in model performance. These problems have inspired recent developments in distributed training frameworks commonly used to train LLMs, notably ZeRO-Offload. Here we give an overview of ZeRO-Offload, and in future posts we describe its benefits in depth.

Summary of performance bottlenecks found in deploying LLMs, paired with solutions offered by ZeRO-Offload.

The State of the World: A Global GPU Shortage

Training an LLM requires a lot of computational power, and likely a cluster of high-end GPUs or specialized hardware like Google’s TPU. The exact hardware requirements will depend on the size of the model and the size of your dataset, though for both training and inference you may be looking at a cluster of NVIDIA A100s. (NVIDIA A100s are newer and more expensive than the NVIDIA V100s, but have twice as many TensorCores and significantly larger memory—we will do precise computations with both of these in a bit.) Unfortunately, there is a global scarcity of GPU resources, a problem which started as early as 2020 and has accelerated due to an increased demand for machine learning efforts [1]. With increased demands for compute resources, you might wait months for access to appropriate resources.

To get a very rough estimate of the computational resources required for deploying an LLM, let’s first consider the task of doing inference (i.e., serving predictions) on a 175B parameter LLM, such as GPT-3. Without any systems-level optimizations, such as quantization, the memory requirements to store the parameters alone can be estimated to be:

700 GB = 175 billion parameters × 4 bytes per parameter

To do inference requires that we further compute and store intermediate values, which take up roughly the same amount of memory as the model parameters even if your batch size is small. Assuming that we have access to NVIDIA A100s with 80 GB memory on each, we would still require nearly 18 high-end GPUs to do inference alone!

Now suppose that instead we want to train a 175B parameter LLM with 300B tokens. OpenAI estimates on the order of 3.14 × 10²³ floating-point operations (FLOPs) for training their model [2], though with checkpointing we would estimate on the order of 4.2 × 10²³ FLOPs. The peak throughput for an NVIDIA A100 at single-precision is 1.56 × 10¹⁴ FLOPs per second for those computations formatted for a TensorCore, and 1.95 × 10¹³ FLOPs per second otherwise [3]. Following these specs, without any optimizations, we may (generously) estimate an average throughput of 5.0 × 10¹³ to 1.0 × 10¹⁴ FLOPs per second for actual performance. Assuming that we have an entire month (31 days) to wait for a training task, we would still require at least 1500–3000 high-end GPUs to train the model!¹

The global GPU shortage is unlikely to be solved in the near future, but even if it were, GPUs are still expensive, carry an enormous carbon footprint, and carry their own security concerns, all of which limit the impact of an ML researcher’s ability to advance the field and determine the full capability of LLMs. Optimal distribution strategies for LLM training can improve the problem significantly. For instance, the largest model that can be trained on a single NVIDIA V100 without optimizations has 1.2B parameters and can be trained at a throughput of 3.0 × 10¹³ FLOPs per second, while using a SOTA optimization technique such as ZeRO-Offload allows one to train on the same GPU a model with 13B parameters at a throughput of 4.0 × 10¹³ FLOPs per second [4]. This is a significant improvement!

To understand more effectively what is going on under the hood to achieve these performance gains, we need to start by looking at the source of potential bottlenecks in systems-level performance and how general strategies, such as mixed-precision training or CPU offloading, can be introduced to reduce the computational costs and improve the performance of these models.


¹ In reality, OpenAI trained their model on a cluster of 10000 NVIDIA V100s [5].

A Short Primer on Model Performance Bottlenecks

In the context of training and deploying deep learning models (such as LLMs), bottlenecks refer to limitations or constraints that hinder the performance and scalability of the training process. They are often not possible to overcome by improving on algorithmic performance alone. We can roughly divide potential sources of bottlenecks into three categories: compute, bandwidth, and overhead [6]. We will look at each of these bottlenecks in the context of the 65B parameter LLaMa model [7].

Compute bottlenecks

Compute bottlenecks refer to limitations in the computational resources, such as GPUs, that are required for training LLMs. LLMs are typically massive models with billions of parameters, and, as we have already noted, training them requires a significant amount of computation. Compute bottlenecks can occur when the available computational resources, such as GPUs, are not sufficient to handle the computational demands for their training. This results in slower training times, longer training iterations, and ultimately lower overall performance.

The rate of compute (FLOPs per second) for a GPU is determined by the number of processing cores. A high-end GPU such as an NVIDIA A100 has ~6000 TensorCores. An LLM on the other hand typically has billions of parameters and billions if not more than a trillion tokens, yielding orders of magnitude difference in compute.

How to compute: To estimate the compute bottlenecks, you need to calculate the number of floating-point operations (FLOPs) required to train your model. For an LLM, this can be calculated with the following formula [8]:

Note: If we include checkpointing, the factor of 6 should be changed to 8. Most very large models will require checkpointing.

Once you have computed the FLOPs, you estimate the computation bottlenecks by dividing this value by the compute capacity of your system, which is typically measured in FLOPs per second. If the number of FLOPs required for training is greater than the compute capacity of your system, you may experience compute bottlenecks.

Example: The 65B parameter LLaMa model was trained on 1.4 trillion tokens. Using the formula, we can thus estimate the FLOPs for the 65B parameter LLaMa model to be:

Note: This would be closer to 7.28 × 10²³ if we include checkpointing.

If trained on a cluster of 2048 NVIDIA A100 GPUs, each having mixed-precision performance peaking between 3.12 × 10¹⁴ and 6.24 × 10¹⁴ FLOPs per second, we would estimate at least 5 to 10 days of training, which is a considerable compute bottleneck.

Bandwidth bottlenecks

Bandwidth bottlenecks refer to limitations in the data transfer rate or network bandwidth that is used for communication between different components of a distributed LLM training setup. In distributed training, gradients and model updates need to be exchanged frequently between GPUs or nodes, which can result in high communication overhead. Bandwidth bottlenecks can occur when the available network bandwidth is not sufficient to handle the large amount of data that needs to be transferred during training. These lead to increased communication time, slower convergence, and reduced scalability of the training process.

Bandwidth (represented by the arrows) is measured by the amount of data transferred per unit of time. The available bandwidth varies across different parts of a model (represented by the width of the arrows). It is important to carefully design the communication strategy between GPUs, i.e. model synchronization, to ensure efficient use of available bandwidth.

How to Compute: To estimate bandwidth bottlenecks, you need to calculate the total amount of data that needs to be transferred between different nodes. For each update, this can be calculated with the following formula:

Once you have computed the data transfer, you estimate the bandwidth bottlenecks by dividing this value by the bandwidth of your system, which is typically measured in bytes per second. If the data transfer is greater than the available bandwidth, you may experience bandwidth bottlenecks.

Example: The reported batch size for the 65B LLaMa model was 4M tokens, which results in 350,000 updates, since the model was trained on 1.4 trillion tokens. Assuming that the gradients are stored as 16-bit floating point values, the total size of an update can roughly be estimated as (2 bytes) × (6.5 × 10¹⁰) = 1.3 × 10¹¹ bytes.² Finally, the 65B parameter LLaMa model was trained on a cluster of 2048 NVIDIA A100 GPUs. Using the formula, we can thus estimate the data transfer for the 65B parameter LLaMa model across 2048 nodes to be:

Assuming we have a very high-speed interconnect of 2 × 10¹¹ bytes per second, we would still expect at least 5 days spent on data transfer (of the gradients) alone, which is a considerable bandwidth bottleneck.


² If we are using a momentum-based optimizer such as Adam, then we should estimate this as 3.72 x 10¹¹ bytes since two momentum updates need to be transferred in addition to the gradient update.

Overhead bottlenecks

Overhead bottlenecks refer to limitations in the overhead or additional computational costs associated with synchronization, coordination, and communication during LLM training that are not directly related to transferring data or performing computations. These can include time spent in the Python interpreter, the Pytorch framework, or launching CUDA kernels without executing them. Modern GPUs are extremely fast, while Python and other overhead sources are relatively slow, which means that in the time it takes Python to perform one FLOP, a GPU could have performed millions or even billions of FLOPs. This can introduce delays and additional computational costs, impacting the overall training performance.

Rough estimation of the proportional performance breakdown for iteration in PyTorch. Computation is the proportion of time to actually do vectorized addition, Python is the proportion of time to parse Python objects, Dispatcher is the proportion of time to determine execution, and Operator Setup is everything remaining, such as error checking, data type checking, output allocation, etc. [9]

How to Compute: To estimate overhead bottlenecks, you need to calculate the total execution time that is not used directly for training. This can be calculated with the following formula:

Unlike the memory and bandwidth bottlenecks, overhead bottlenecks typically do not scale with the model size, so you estimate the overhead bottlenecks by varying the size of the dataset and experimentally measuring performance. The estimated training time can roughly be found by adding the estimations for the memory and bandwidth bottlenecks. If the ratio of overhead time to training time is too high, you may experience overhead bottlenecks.

Example: The reported execution time for training the 65B LLaMa model was 21 days. With our lower-bound rough estimates for the memory and bandwidth bottlenecks, we have a (very) rough upper bound estimate on the overhead time bottleneck to be between 6 to 11 days (an overestimation). We can improve this estimate by further detailing estimates on the memory and bandwidth bottlenecks, but the most accurate measurement for the overhead bottleneck is obtained by comparing estimations at varying sizes of the dataset.

To reduce the overall cost and to improve the performance of LLMs and allow us to further scale these models, it is crucial to address these bottlenecks. We now introduce high-level engineering strategies in the context of ZeRO-Offload to address these challenges.

ZeRO-Offload: An Optimal Solution

ZeRO-Offload (ZeRO short for “Zero Redundancy Optimizer’’) is a distributed training technique developed by Microsoft that aims to improve the performance and scalability of LLM training. It is the current SOTA for typical large model training [4]. At a high level, it addresses the challenges we have seen in the following ways.

Improve Compute Efficiency: ZeRO-Offload uses a technique called “model parallelism’’ to efficiently divide the LLM across multiple GPUs and across a GPU and CPU. This allows each processing unit to compute on a smaller portion of the model independently and reduces the memory footprint per processing unit. Breaking up the training into smaller chunks further enables the training of larger models, which may not fit in the memory of a single GPU or may benefit from CPU computation. It additionally allows effective use of quantization, such as in mixed-precision training, which further reduces the total memory footprint. Finally, ZeRO-Offload minimizes redundant computation by only computing gradients for the parts of the model that are updated, which reduces unnecessary computations and speeds up the training process.

Reduce Communication Overhead: Traditional distributed training techniques often suffer from high communication overhead due to the need to exchange gradients and model updates frequently. ZeRO-Offload addresses this by offloading the optimizer’s state to a dedicated CPU, which then communicates only the necessary updates to the GPUs. This reduces the amount of data that needs to be transferred over the network, which alleviates bandwidth constraints and improves the scalability of LLM training across multiple GPUs and nodes.

Alleviate Overhead: ZeRO-Offload minimizes the overhead associated with synchronization and coordination by asynchronously updating the optimizer’s state on the CPU. This allows the GPUs to continue to perform computations without waiting for global synchronization and reduces the incurred overhead time during communication and coordination, resulting in faster and more efficient training.

In the next installments, we will delve into each of these points to understand on a deeper technical level how these strategies are implemented and to quantify the performance gains.


[1] The Information. AI Developers Stymied by Server Shortage at AWS, Microsoft, Google. The Information, April 2022.

[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners, 2020. arXiV:2005.14165.

[3] NVIDIA. NVIDIA A100 Datasheet., May 2020.

[4] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing Billion-Scale Model Training, 2021. arXiV:2101.06840.

[5] Microsoft. Microsoft and OpenAI Build New Azure AI Supercomputer., 2020. Accessed: May 5, 2023.

[6] Horace He. Making Deep Learning Go Brrrr from First Principles., 2022.

[7] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMa: Open and Efficient Foundation Language Models, 2023. arXiV:2302.13971.

[8] Dzmitry Bahdanau. The Flops Calculus of Language Model Training., August 2020.

[9] PyTorch Community. Comparing the Performance of 0.4.1 and master., 2018. Accessed on May 9, 2023.

About the author

Lucia Mocz, Ph.D. is an Applied Scientist at Preemo.
Technical editor: Jess Lin