The rapid evolution of large language models (LLMs) has ushered in a new era of AI-powered applications, from conversational agents to advanced content generation. However, deploying these massive models at scale for real-time inference presents significant challenges, particularly in achieving high throughput, low latency, and efficient resource utilization.
Our primary goal is to optimize key performance metrics such as time-to-first-token (TTFT), which captures prefill latency, and time-to-incremental-token (TTIT), which captures per-token decode latency, alongside overall throughput.
These metrics highlight the distinct computational demands of LLM inference: prefill is compute-intensive, while decoding is memory-bandwidth-intensive. To address these challenges and enable the deployment of large models, we have developed and implemented advanced parallelism techniques.
A typical LLM generative-inference task unfolds in two stages: prefill, which processes the entire input prompt in parallel to produce the first output token, and decode, which then generates output tokens one at a time, with each step attending to all tokens produced so far.
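To make the contrast concrete, here is a back-of-envelope sketch (not from the original post) comparing the arithmetic intensity of the two stages against a GPU's compute-to-bandwidth ratio. The approximation of roughly two FLOPs per weight per token and the H100 figures used are illustrative assumptions.

```python
# Back-of-envelope arithmetic-intensity comparison of prefill vs. decode.
# All model and hardware numbers are illustrative assumptions, not measurements.
ridge_flops_per_byte = 989e12 / 3.35e12   # approx. H100 BF16 FLOPs / HBM bandwidth

def matmul_intensity(tokens, bytes_per_weight=2):
    # A weight matrix applied to `tokens` rows does ~2*tokens FLOPs per parameter,
    # while each parameter is read from memory once (activation traffic ignored).
    flops_per_param = 2 * tokens
    bytes_per_param = bytes_per_weight
    return flops_per_param / bytes_per_param

for name, tokens in [("prefill (8K-token prompt)", 8192),
                     ("decode (batch of 8, 1 token each)", 8)]:
    ai = matmul_intensity(tokens)
    bound = "compute-bound" if ai > ridge_flops_per_byte else "bandwidth-bound"
    print(f"{name}: ~{ai:.0f} FLOPs/byte vs ridge ~{ridge_flops_per_byte:.0f} -> {bound}")
```

A prompt of thousands of tokens keeps the matrix multiplies well above the ridge point, while decoding a handful of tokens per step leaves the GPU waiting on weight reads, which is why the two stages are optimized so differently.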
To scale LLM inference effectively, especially for handling long contexts and massive models, we employ three main types of inference parallelism:
1. Tensor parallelism (TP), which makes it possible to fit large models across multiple GPUs and to achieve throughput that a single device cannot provide. It shards individual layers of the model, such as attention blocks and multi-layer perceptron (MLP) layers, into smaller blocks that can be executed on different devices, as sketched in the example below.
A key challenge in tensor parallelism is the "allreduce" communication operation, which can contribute up to 30% of end-to-end latency. To mitigate this, we developed direct data access (DDA) algorithms.
Our DDA solutions demonstrate significant speedups over baselines such as NCCL (NVIDIA Collective Communications Library) and RCCL (the ROCm Communication Collectives Library for AMD GPUs). For instance, on AMD MI300X we achieved overall performance parity with NVIDIA H100, with DDA outperforming the RCCL baseline by 10-50% for decode (small message sizes) and yielding a 10-30% speedup for prefill, resulting in an approximately 10% reduction in TTIT.
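As a rough illustration of the sharding described above, the following single-process NumPy sketch splits an MLP's weights column-wise and row-wise across simulated TP ranks; the final summation of partial results stands in for the allreduce that DDA accelerates. Shapes, rank counts, and names are illustrative assumptions, not Meta's implementation.

```python
# Minimal single-process simulation of tensor parallelism for an MLP block.
import numpy as np

rng = np.random.default_rng(0)
tp_ranks = 4                                # simulated tensor-parallel group size
d_model, d_ff, n_tokens = 64, 256, 8

x = rng.standard_normal((n_tokens, d_model))
w1 = rng.standard_normal((d_model, d_ff))   # first MLP projection
w2 = rng.standard_normal((d_ff, d_model))   # second MLP projection

def mlp_reference(x):
    # ReLU MLP as it would run on a single device.
    return np.maximum(x @ w1, 0.0) @ w2

def mlp_tensor_parallel(x):
    # Column-parallel shard of w1 and row-parallel shard of w2: each rank
    # holds a d_ff/tp_ranks slice and computes a partial output.
    w1_shards = np.split(w1, tp_ranks, axis=1)
    w2_shards = np.split(w2, tp_ranks, axis=0)
    partials = [np.maximum(x @ w1_s, 0.0) @ w2_s
                for w1_s, w2_s in zip(w1_shards, w2_shards)]
    # The sum over ranks is what the allreduce performs on real hardware;
    # this communication step is what DDA is designed to speed up.
    return np.sum(partials, axis=0)

assert np.allclose(mlp_reference(x), mlp_tensor_parallel(x))
```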
2. Context parallelism (CP), which makes it practical to manage and process extremely long contexts, such as the 1M/10M-token capabilities introduced with Llama 4. Long-context inference presents unique challenges: the key-value (KV) cache grows with context length and can exceed a single GPU's memory, while attention compute grows quadratically with sequence length, driving up prefill latency.
We have implemented two variants of context parallelism in the attention module, both based on the pattern often referred to as "ring attention," in which blocks of keys and values are passed around the CP ranks so that communication overlaps with attention compute (a simplified simulation appears below).
Our context-parallelism optimizations, combined with a fast attention kernel, have enabled remarkable long-context performance: less than one minute to process one million tokens on a single H100 host, and less than one minute for 10 million tokens using distributed inference across multiple H100 hosts (e.g., 32 hosts). With Llama 3 405B, we demonstrated near-linear scaling, achieving 128K-token prefill in 3.8 seconds with CP over 16 nodes, and 1M-token prefill in 77 seconds.
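The sketch below simulates the ring-attention pattern referenced above in a single NumPy process: each simulated CP rank keeps its query block and folds in one key/value block per ring step using an online-softmax merge, and the result is checked against full attention. Block sizes, rank counts, and function names are illustrative assumptions.

```python
# Single-process NumPy simulation of ring attention for context parallelism.
import numpy as np

rng = np.random.default_rng(0)
cp_ranks, seq_len, d_head = 4, 512, 32
q = rng.standard_normal((seq_len, d_head))
k = rng.standard_normal((seq_len, d_head))
v = rng.standard_normal((seq_len, d_head))

def reference_attention(q, k, v):
    # Full (non-causal) softmax attention on a single device.
    s = q @ k.T / np.sqrt(d_head)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def ring_attention(q, k, v):
    q_blocks = np.split(q, cp_ranks)                  # each rank keeps its queries
    kv_blocks = list(zip(np.split(k, cp_ranks), np.split(v, cp_ranks)))
    outputs = []
    for r, q_r in enumerate(q_blocks):                # outer loop simulates one rank
        row_max = np.full(q_r.shape[0], -np.inf)      # running row max
        denom = np.zeros(q_r.shape[0])                # running softmax denominator
        acc = np.zeros_like(q_r)                      # running unnormalized output
        for step in range(cp_ranks):                  # K/V blocks arrive around the ring
            k_b, v_b = kv_blocks[(r + step) % cp_ranks]
            s = q_r @ k_b.T / np.sqrt(d_head)
            m_b = s.max(axis=-1)
            p = np.exp(s - m_b[:, None])
            new_max = np.maximum(row_max, m_b)
            alpha, beta = np.exp(row_max - new_max), np.exp(m_b - new_max)
            denom = alpha * denom + beta * p.sum(axis=-1)
            acc = alpha[:, None] * acc + beta[:, None] * (p @ v_b)
            row_max = new_max
        outputs.append(acc / denom[:, None])
    return np.concatenate(outputs)

assert np.allclose(reference_attention(q, k, v), ring_attention(q, k, v))
```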
3. Expert parallelism (EP), which enables scaling mixture-of-experts (MoE) models, where the large number of "experts" (neural network modules) makes it impossible to fit the entire model onto a single host. In EP-based inference, we use a two-shot all-to-all communication pattern to exchange tokens between data-parallelism and expert-parallelism ranks based on routing decisions (a simplified simulation of this dispatch/combine pattern appears below).
This all-to-all communication can contribute 10-30% of end-to-end latency, especially for decode messages (100KB to 2MB), and we are actively exploring optimizations to reduce it.
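The following single-process sketch simulates the two-shot all-to-all pattern described above: a dispatch step routes each token to the rank hosting its expert, the expert runs locally, and a combine step returns outputs to the token's original rank. For simplicity, each simulated rank acts as both a DP rank (holding tokens) and an EP rank (holding experts); all names, sizes, and the top-1 routing are illustrative assumptions.

```python
# Single-process simulation of dispatch/combine all-to-all for expert parallelism.
import numpy as np

rng = np.random.default_rng(0)
ep_ranks, experts_per_rank, d_model, tokens_per_rank = 4, 2, 16, 6
n_experts = ep_ranks * experts_per_rank

# Each "expert" is a simple weight matrix for illustration.
expert_weights = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

# Each rank holds its own tokens and a top-1 routing decision per token.
tokens = [rng.standard_normal((tokens_per_rank, d_model)) for _ in range(ep_ranks)]
routes = [rng.integers(0, n_experts, size=tokens_per_rank) for _ in range(ep_ranks)]

# --- First all-to-all (dispatch): send each token to the rank owning its expert.
inbox = [[] for _ in range(ep_ranks)]   # inbox[dst] = list of (src, idx, expert, token)
for src in range(ep_ranks):
    for idx, expert in enumerate(routes[src]):
        dst = expert // experts_per_rank
        inbox[dst].append((src, idx, expert, tokens[src][idx]))

# --- Local expert compute on each EP rank.
outbox = [[] for _ in range(ep_ranks)]  # outbox[src] = list of (idx, expert output)
for dst in range(ep_ranks):
    for src, idx, expert, token in inbox[dst]:
        outbox[src].append((idx, token @ expert_weights[expert]))

# --- Second all-to-all (combine): outputs return to each token's original rank.
outputs = [np.zeros((tokens_per_rank, d_model)) for _ in range(ep_ranks)]
for src in range(ep_ranks):
    for idx, out in outbox[src]:
        outputs[src][idx] = out

# Check against applying each token's routed expert directly.
for r in range(ep_ranks):
    expected = np.stack([tokens[r][i] @ expert_weights[e]
                         for i, e in enumerate(routes[r])])
    assert np.allclose(outputs[r], expected)
```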
To further optimize LLM inference, we are moving toward N-D parallelism (CP, PP, EP, and TP across nodes, with separate DP) and toward disaggregating the prefill and decode tiers. This allows for better resource balancing and opens the door to heterogeneous hardware, where compute-heavy hardware is used for prefill and memory-bandwidth-heavy hardware for decoding. This multi-dimensional parallelism can help unblock the serving and evaluation of colossal models.
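As a purely hypothetical sketch of how such a layout might be expressed, the configuration below pairs an N-D parallelism description with disaggregated prefill and decode tiers mapped to different hardware pools. Every field name, degree, and hardware label is an assumption for illustration, not a description of Meta's serving stack.

```python
# Hypothetical configuration sketch for N-D parallelism with disaggregated tiers.
from dataclasses import dataclass

@dataclass
class ParallelismLayout:
    tp: int   # tensor parallelism (within a node)
    cp: int   # context parallelism (long prompts)
    pp: int   # pipeline parallelism (across nodes)
    ep: int   # expert parallelism (MoE layers)
    dp: int   # separate data parallelism for replica scaling

@dataclass
class DisaggregatedDeployment:
    # Prefill is compute-bound and decode is memory-bandwidth-bound, so the two
    # tiers can run different layouts on different hardware pools.
    prefill_tier: ParallelismLayout
    decode_tier: ParallelismLayout
    prefill_hardware: str
    decode_hardware: str

deployment = DisaggregatedDeployment(
    prefill_tier=ParallelismLayout(tp=8, cp=8, pp=1, ep=8, dp=1),
    decode_tier=ParallelismLayout(tp=8, cp=1, pp=1, ep=8, dp=4),
    prefill_hardware="compute-optimized accelerators",
    decode_hardware="memory-bandwidth-optimized accelerators",
)
print(deployment)
```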
Open challenges remain in this space as models and context lengths continue to grow.
These parallelism and system-level advancements have helped enable the next generation of AI applications and push the boundaries of what LLMs can achieve. We are committed to continuous innovation to ensure efficient and scalable LLM inference for millions of users worldwide.