11-868 LLM Systems: Systems for Mixture-of-Experts Models
This lecture addresses the systems challenges behind Mixture-of-Experts (MoE) models: how sparse expert architectures scale model capacity without proportionally increasing compute, what communication patterns expert parallelism introduces, and how systems like GShard, DeepSpeed-MoE, and DeepSeek V3 solve routing, load balancing, and inference efficiency at scale.
Motivation: Why Sparse Models?
Dense models are expensive to scale. Training a 175B OPT model takes 992 A100 GPUs for 56 days; PaLM at 540B needs 6144 TPU v4 chips for 57 days. The compute cost grows roughly linearly with model size because every parameter participates in every forward pass.
| Model | Model Size | Hardware | Days to Train |
|---|---|---|---|
| Megatron-LM GPT-2 | 8.3B | 512 V100 GPU | 9.2 days |
| OPT | 175B | 992 A100 GPU | 56 days |
| MT-NLG | 530B | 2200 A100 GPU | 60 days |
| PaLM | 540B | 6144 TPU v4 | 57 days |
Sparse models offer a different scaling path: increase total parameters while keeping per-token compute constant. Mixture-of-Experts is the dominant sparse architecture for LLMs. MoE pretraining is much faster than dense models of equivalent quality, and MoE inference is faster than a dense model with the same total parameter count because only a subset of parameters activates per token.
Transformer Mixture-of-Experts Architecture
The core idea of Transformer MoE is simple: replace the single dense FFN in each Transformer block with multiple small expert FFNs and a gating network (router) that selects which expert(s) to activate for each input token.
This is distinct from the classical Mixture-of-Experts learning algorithm, which learns a weighted average of predictor models. In Transformer MoE, the experts are sub-networks within the model itself, and the router makes a hard or soft selection per token.
Switch Transformer
The Switch Transformer (Fedus et al., JMLR 2022) is the canonical MoE architecture. Each token is routed to exactly one expert FFN based on the router’s output:
- The gating network computes router logits \(H(x) = x \cdot W_g\) and normalizes them into a probability distribution over experts:
\[G(x) = \text{Softmax}(H(x))\]
- The output is a weighted combination of expert outputs:
\[y = \sum_{i=1}^{n} G(x)_i E_i(x)\]
- Top-k gating keeps only the top-\(k\) expert scores and sets the rest to \(-\infty\):
\[\text{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}\]
\[G(x) = \text{Softmax}(\text{KeepTopK}(H(x), k))\]
In the Switch Transformer specifically, \(k=1\): each token goes to exactly one expert. This simplifies routing and reduces communication overhead compared to top-2 designs.
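The KeepTopK-then-softmax gating above can be sketched in a few lines of plain Python (toy scalar logits for illustration; real implementations operate on batched tensors):

```python
import math

def top_k_gating(logits, k):
    """KeepTopK then softmax: keep the top-k logits, mask the rest to -inf."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    masked = [v if i in top else float("-inf") for i, v in enumerate(logits)]
    m = max(masked)
    exps = [math.exp(v - m) for v in masked]   # exp(-inf) -> 0.0 routing weight
    z = sum(exps)
    return [e / z for e in exps]

# k=1 (Switch Transformer): all routing weight goes to the single best expert
weights = top_k_gating([0.2, 1.5, -0.3, 0.9], k=1)
# -> [0.0, 1.0, 0.0, 0.0]
```

With \(k=2\) the two surviving logits share the probability mass, which is exactly the top-2 design the Switch Transformer simplifies away.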
Shared vs. Routed Experts
A refinement introduced in DeepSpeed-MoE and later adopted by DeepSeek is the shared-routed expert design:
| Expert Type | Behavior | Purpose |
|---|---|---|
| Shared Expert | Always activated for every token | Captures common knowledge shared across all inputs |
| Routed Expert | Conditionally activated by the router | Captures token-specific, specialized knowledge |
The shared expert acts as a fixed FFN that every token passes through, while the routed experts provide conditional capacity. The final output is the sum of the shared expert output and the weighted routed expert outputs:
\[\text{output} = \text{SharedFFN}(h) + \sum_{i \in \text{TopK}} g_i \cdot \text{Expert}_i(h)\]
where the routing weights are computed as \(\text{Softmax}(\text{TopK}(h_t \cdot W))\).
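A minimal sketch of the shared-plus-routed combination, with scalar toy “experts” standing in for FFNs (the function and argument names are illustrative):

```python
import math

def moe_layer(h, shared_ffn, experts, logits, k):
    """output = SharedFFN(h) + sum over top-k routed experts of g_i * Expert_i(h).
    Toy scalar version: h is a number and each expert is a plain function."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}   # softmax over selected experts only
    z = sum(exps.values())
    routed = sum((exps[i] / z) * experts[i](h) for i in top)
    return shared_ffn(h) + routed

out = moe_layer(2.0,
                shared_ffn=lambda x: 0.5 * x,                          # always active
                experts=[lambda x: x, lambda x: -x, lambda x: 2 * x],  # routed
                logits=[0.0, 0.0, 0.0], k=2)
```

Note the shared expert contributes unconditionally, while the routed contribution depends entirely on the gate.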
What Experts Learn
Encoder experts tend to specialize in token groups or shallow concepts: punctuation, proper nouns, sentinel tokens, verbs, visual descriptions, counting and numbers. Decoder experts exhibit less specialization. In multilingual setups, experts do not specialize by language — load balancing and token routing prevent that kind of clustering.
Activated Experts Differ Across Layers
Different tokens activate different experts at each layer. Token “red” might route to Expert 3 at layer 1 and Expert 1 at layer 2, while “fox” routes to Expert 2 and Expert N at the same layers. The routing pattern is dynamic and learned end-to-end.
Parameters of MoE
MoE models have fewer total parameters than the naive multiplication suggests. Mixtral 8x7B has 47B parameters, not 56B, because only the FFN layers are replicated as experts — attention layers, embeddings, and layer norms are shared across all experts.
Overfitting Risk
MoE models are prone to overfitting on small datasets. With many more parameters than a comparable dense model, the sparse model can memorize small training sets while the dense counterpart generalizes better. On larger tasks, MoE performs well. This is a key consideration when fine-tuning MoE models on downstream tasks.
Training and Inference for MoE
Expert Parallelism (GShard)
Expert parallelism (Lepikhin et al., ICLR 2021) is the standard approach to distributing MoE models across devices:
- Each expert lives on one worker device — experts are split across GPUs
- All other components (attention, embedding, layer norm) are replicated on every device
- All-to-all communication dispatches tokens to the correct expert device and collects results
The forward pass follows four steps:
- Route: each device runs its local router to determine which expert each token should go to
- All-to-All Dispatch: tokens are sent to the device hosting their assigned expert
- Expert Compute: each device runs its local expert FFN on the received tokens
- All-to-All Combine: expert outputs are sent back to the originating device
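The four steps can be illustrated with a single-process simulation, where Python lists stand in for the all-to-all collectives (the helper name below is hypothetical, not GShard’s API):

```python
def expert_parallel_forward(tokens, assignments, experts):
    """Single-process stand-in for route -> dispatch -> compute -> combine.
    In a real deployment, dispatch and combine are all-to-all collectives
    across expert-parallel GPUs; here in-memory buckets play that role."""
    buckets = [[] for _ in experts]
    for t, e in enumerate(assignments):      # dispatch: group tokens by expert
        buckets[e].append(t)
    outputs = [None] * len(tokens)
    for e, bucket in enumerate(buckets):     # expert compute on each "device"
        for t in bucket:
            outputs[t] = experts[e](tokens[t])
    return outputs                           # combine: results back in token order

out = expert_parallel_forward([1.0, 2.0, 3.0],
                              assignments=[0, 1, 0],  # router output (precomputed)
                              experts=[lambda x: x + 10, lambda x: x * 2])
# -> [11.0, 4.0, 13.0]
```

The routing step itself is elided here (assignments are given); in practice each device runs its local router before the dispatch.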
GShard also introduces interleaving: MoE layers alternate with standard dense FFN layers (every other layer uses MoE). This balances the communication overhead with the capacity benefits of sparse routing.
Token Computation Path in MoE
During inference with expert parallelism, each token’s computation path crosses device boundaries at the MoE layers. The token starts on its “home” device for embedding and attention, gets dispatched to a remote device for expert computation, then returns. Each device maintains its own KV-caches for the attention layers, but expert computation requires cross-device communication at every MoE layer.
Load Balancing in MoE Training
Without intervention, the router can collapse into sending all tokens to a few popular experts, leaving most experts undertrained. The load balancing loss prevents this:
\[L_{ExpBal} = \alpha_1 M \sum_{i=1}^{\text{\#experts}} f_i P_i\]
where:
- \(f_i = \frac{\text{\#tokens to expert } i}{\text{\#tokens}}\) is the fraction of tokens routed to expert \(i\)
- \(P_i = \frac{1}{\text{\#tokens}} \sum_{t=1}^{\text{\#tokens}} s_{i,t}\) is the average routing weight for expert \(i\)
- \(M\) is the number of experts
- \(\alpha_1\) is a hyperparameter controlling the balance loss strength
The product \(f_i \cdot P_i\) penalizes experts that receive both a high fraction of tokens and high routing probability. This loss is differentiable through \(P_i\) (the routing weights), encouraging the router to spread tokens more evenly.
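This loss can be sketched directly from the definitions, assuming hard top-1 assignments and precomputed softmax routing weights (the function name and \(\alpha\) value are illustrative):

```python
def load_balance_loss(routing_weights, assignments, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * M * sum_i f_i * P_i.
    routing_weights[t][i] is token t's softmax weight for expert i;
    assignments[t] is the expert that token t was routed to (top-1)."""
    T = len(assignments)
    f = [assignments.count(i) / T for i in range(num_experts)]                # token fraction
    P = [sum(w[i] for w in routing_weights) / T for i in range(num_experts)]  # mean weight
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

uniform = load_balance_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1], num_experts=2, alpha=1.0)
skewed  = load_balance_loss([[0.9, 0.1], [0.9, 0.1]], [0, 0], num_experts=2, alpha=1.0)
# uniform -> 1.0 (the minimum); skewed -> 1.8
```

A perfectly uniform router attains the minimum value, so gradient descent on this term pushes routing toward balance.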
MoE Inference Challenges
MoE inference performance depends on three factors:
- Overall model size — determines total memory footprint
- Number of activated experts — determines per-token compute
- Overall memory bandwidth — since only a subset of experts activate, inference is often memory-bandwidth-bound
The default implementation keeps all experts in GPU memory, which requires large memory capacity even though only a few experts are active per token.
Optimizing MoE Inference
The system design goal for MoE inference is to minimize the critical data path per device and maximize aggregate memory bandwidth. Key optimization strategies:
- Group and route tokens with the same critical data path together to reduce data access per device and achieve maximum aggregate bandwidth
- Optimize communication scheduling with parallelism coordination
- Optimize transformer and MoE-related kernels to improve per-device performance
Expert Parallelism and Tensor Parallelism (DeepSpeed-MoE)
DeepSpeed-MoE (Rajbhandari et al., 2022) combines multiple parallelism strategies for MoE models:
- Expert Parallelism / Expert Slicing: group all tokens assigned to the same experts under the same critical data path, and parallelize processing of token groups with different critical paths across different devices
- Tensor Parallelism / Tensor Slicing: partition the non-expert parameters (attention) across devices, usually within a node
- Data Parallelism: replicate the above setup across additional device groups
This multi-dimensional parallelism approach handles expert and non-expert parameters differently, applying the most appropriate parallelism strategy to each.
Optimizing MoE Kernels
MoE-specific kernel optimizations include:
- Fusing the gating function into a single kernel — avoids multiple kernel launches for the softmax, top-k, and dispatch operations
- Dense token-to-expert mapping table — enables efficient batched computation
These optimizations achieve over 6x reduction in MoE kernel-related latency.
Optimizing All-to-All Communication
Expert parallelism requires all-to-all communication between all expert parallel devices, and the latency increases linearly with the number of devices. Two key optimizations:
- Hierarchical all-to-all communication: reduces communication hops by first communicating within a node, then across nodes
- Parallelism-coordinated communication: schedules communications based on the model’s parallelism strategy to minimize overhead
MoE Training and Inference Results
DeepSpeed-MoE demonstrates that a 1.3B+MoE-128 model (52B total parameters) achieves the same validation loss as a 6.7B dense model while training at 5x higher throughput (372 vs. 70 samples/sec on 128 A100 GPUs).
For inference, DeepSpeed achieves significantly lower latency and higher throughput compared to PyTorch baseline across all GPU counts (8, 16, 32, 64 GPUs), with throughput scaling nearly linearly with GPU count.
DeepSeek MoE
Fine-Grained Expert Design
DeepSeek MoE (Dai et al., 2024) introduces fine-grained experts: each original FFN is split into \(m\) smaller experts, yielding \(mN\) total experts (where \(N\) is the original number of experts). This finer granularity allows more flexible routing: the model activates \(mK\) experts per token (where \(K\) is the original top-\(k\) activation count), each expert being smaller than in a standard MoE design, so per-token compute stays the same.
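A quick combinatorial illustration of why finer granularity helps (the numbers here are illustrative, following the flexibility argument made in the DeepSeekMoE paper):

```python
from math import comb

# Same per-token compute, vastly more ways to combine experts:
coarse = comb(16, 2)  # 16 experts, activate top-2
fine = comb(64, 8)    # split each expert into 4: 64 experts, activate top-8
# coarse -> 120, fine -> 4426165368
```

The router can therefore compose far more specialized expert combinations per token at the same FLOP budget.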
The architecture combines shared experts with routed experts, and uses top-k weighted averaging of the activated routing experts.
DeepSeek V3 MoE (671B)
DeepSeek V3 is a 671B-parameter MoE model with the following configuration:
| Parameter | Value |
|---|---|
| Vocabulary | 129,280 |
| Hidden dimension | 7,168 |
| Number of layers | 61 |
| Dense layers (lowest) | 3 |
| Number of attention heads | 128 |
| FFN intermediate dimension | 18,432 |
| MoE dimension | 2,048 |
| Shared experts | 1 |
| Routed experts | 256 |
| Activated experts per token | 8 |
| Expert groups | 8 |
| Limited groups | 4 |
The architecture uses Multi-head Latent Attention (MLA) instead of standard multi-head attention. Each MoE layer has 1 shared expert FFN that processes every token, plus 256 routed experts of which 8 are activated per token. The routing uses \(\text{TopK}(\text{Softmax}(h_t \cdot W), 8)\).
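The table’s “Expert groups: 8, Limited groups: 4” entries refer to group-limited routing. One plausible reading, sketched under the assumption that a token may only select experts from the top-scoring groups (the helper below is illustrative, not DeepSeek’s actual implementation):

```python
def group_limited_topk(scores, n_groups, n_limited, k):
    """Pick top-k experts, but only from the n_limited groups whose best
    per-expert score is highest. Sketch of group-limited routing; the real
    implementation may differ in scoring and tie-breaking."""
    group_size = len(scores) // n_groups
    best = sorted(range(n_groups),
                  key=lambda g: max(scores[g * group_size:(g + 1) * group_size]),
                  reverse=True)[:n_limited]
    allowed = [i for g in best for i in range(g * group_size, (g + 1) * group_size)]
    return sorted(allowed, key=lambda i: scores[i], reverse=True)[:k]

# 4 experts in 2 groups; only the 1 best group is eligible
chosen = group_limited_topk([0.9, 0.1, 0.8, 0.7], n_groups=2, n_limited=1, k=2)
# -> [0, 1]
```

Limiting each token to a few groups caps how many devices its all-to-all traffic can touch, which is the systems motivation for grouping.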
Each expert uses the SwiGLU activation:
\[\text{FFN}_{SwiGLU}(x) = (\text{Swish}(x \cdot W_1) \odot (x \cdot W_2)) \cdot W_3\]
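A plain-Python sketch of this expert FFN for a single token vector (real kernels are fused and batched; the loop-based matmuls are for clarity only):

```python
import math

def swiglu_ffn(x, W1, W2, W3):
    """FFN_SwiGLU(x) = (Swish(x @ W1) * (x @ W2)) @ W3 for one token vector x."""
    def matvec(v, W):  # row vector v (len d) times matrix W (d x m)
        return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]
    def swish(v):      # Swish / SiLU: z * sigmoid(z)
        return [z / (1.0 + math.exp(-z)) for z in v]
    gated = [a * b for a, b in zip(swish(matvec(x, W1)), matvec(x, W2))]
    return matvec(gated, W3)

# 1-dimensional toy weights: output is Swish(1) * 1 = sigmoid(1) ≈ 0.731
y = swiglu_ffn([1.0], W1=[[1.0]], W2=[[1.0]], W3=[[1.0]])
```

The element-wise gate (the \(\odot\) in the formula) is what distinguishes SwiGLU from a plain two-layer FFN.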
The first 3 layers are dense (standard FFN), and the remaining 58 layers use MoE. This design lets the lowest layers build general representations before introducing sparse routing.
Load Balancing in DeepSeek MoE
DeepSeek V3 uses two levels of load balancing loss:
Expert-Level Balance Loss (to avoid routing collapse):
\[L_{ExpBal} = \alpha_1 \sum_{i=1}^{\text{\#experts}} f_i P_i\]
where \(f_i = \frac{\text{\#experts}}{\text{\#activated\_experts}} \cdot \frac{\text{\#tokens to expert } i}{\text{\#tokens}}\) and \(P_i = \frac{1}{\text{\#tokens}} \sum_{t=1}^{\text{\#tokens}} s_{i,t}\) (the average routing weight).
Device-Level Balance Loss (to balance computation across devices):
\[L_{DevBal} = \alpha_2 \sum_{j=1}^{\text{\#groups}} f_j P_j\]
where \(f_j\) is the average \(f\) across experts in group \(j\), and \(P_j\) is the sum of \(P\) across experts in group \(j\).
The two-level design ensures both individual expert utilization and cross-device computational balance.
DeepSeek Libraries for MoE
DeepSeek has open-sourced two libraries to accelerate MoE:
- DeepEP: a communication library tailored for MoE and expert parallelism, optimizing the all-to-all dispatch and combine operations
- EPLB (Expert Parallelism Load Balancer): handles dynamic load balancing across expert parallel devices
DeepSpeed MoE Code Example
DeepSpeed provides a simple API for adding MoE layers:
```python
import torch
import torch.nn as nn
import deepspeed

# Replace a dense FFN with a DeepSpeed MoE layer (dimensions illustrative)
hidden = 1024
expert = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.ReLU(),
                       nn.Linear(4 * hidden, hidden))
moe_layer = deepspeed.moe.layer.MoE(
    hidden_size=hidden,
    expert=expert,          # expert template, deep-copied per expert
    num_experts=8,          # total experts in this layer
    ep_size=2,              # expert parallel group size
    k=1,                    # top-1 (Switch-style) gating
    capacity_factor=1.0,
    drop_tokens=True,
)
```
The `MoE` class wraps an existing `nn.Module` (the expert template) and handles replication, gating, and expert-parallel communication. Key parameters include:
- `hidden_size`: model hidden dimension (input and output dimension)
- `expert`: the base expert module (e.g., an MLP or `torch.nn.Linear`)
- `num_experts`: total number of experts per layer
- `ep_size`: number of ranks in the expert parallel group
- `k`: top-k gating value (1 or 2)
- `capacity_factor`: controls expert capacity at training time
- `drop_tokens`: whether to drop tokens exceeding expert capacity
Internally, the `Experts` class deep-copies the expert module for each local expert, sets `param.allreduce = False` on expert parameters (since they use expert parallelism, not data parallelism), and chunks the input for per-expert processing:
```python
import copy
import torch
import torch.nn as nn

class Experts(nn.Module):
    """Holds the local experts of one expert-parallel rank (a simplified
    sketch of the DeepSpeed Experts module, following the description above)."""
    def __init__(self, expert, num_local_experts=1):
        super().__init__()
        self.num_local_experts = num_local_experts
        # Deep-copy the expert template once per local expert
        self.experts = nn.ModuleList(
            [copy.deepcopy(expert) for _ in range(num_local_experts)])
        for exp in self.experts:
            for param in exp.parameters():
                # Expert params use expert parallelism, not data parallelism
                param.allreduce = False

    def forward(self, inputs):
        # Chunk the dispatched tokens so each local expert gets its share
        chunks = inputs.chunk(self.num_local_experts, dim=1)
        outputs = [expert(chunk) for chunk, expert in zip(chunks, self.experts)]
        return torch.cat(outputs, dim=1)
```
Lec16 Takeaways
- MoE scales capacity without scaling compute: by activating only a subset of experts per token, MoE models can have trillions of parameters while maintaining the per-token FLOPS of a much smaller dense model
- Expert parallelism is the natural distribution strategy: split experts across devices, replicate everything else, and use all-to-all communication to dispatch tokens — but all-to-all becomes the bottleneck at scale
- Load balancing is critical: without auxiliary losses, routers collapse to a few popular experts, wasting capacity and creating compute imbalance across devices
- DeepSeek V3 pushes fine-grained design: 256 routed experts with only 8 activated, a shared expert for common knowledge, two-level balance losses, and MLA attention in a 671B model
- Multi-dimensional parallelism (EP + TP + DP) is necessary at scale, with each strategy applied to the part of the model it fits best
Final Summary
| Topic | Key Idea |
|---|---|
| Sparse vs. Dense | MoE increases model capacity without proportionally increasing per-token compute |
| Switch Transformer | routes each token to one expert via top-1 gating with softmax over expert scores |
| Shared vs. Routed Experts | shared experts capture common knowledge; routed experts specialize per token |
| Expert Parallelism (GShard) | split experts across devices, replicate non-expert params, use all-to-all communication |
| Load Balancing Loss | auxiliary loss penalizing uneven token distribution across experts to prevent routing collapse |
| MoE Inference | memory-bandwidth-bound; optimize by grouping tokens, fusing kernels, hierarchical all-to-all |
| DeepSpeed-MoE | combines expert parallelism, tensor parallelism, and data parallelism with optimized kernels |
| DeepSeek V3 (671B) | fine-grained experts (256 routed, 8 activated), shared expert, two-level balance loss, SwiGLU |
| DeepEP / EPLB | DeepSeek libraries for optimized MoE communication and dynamic expert load balancing |
Key takeaway: MoE is the dominant approach to scaling LLMs beyond the compute budget of dense models. The systems challenge shifts from parallelizing a uniform computation to orchestrating dynamic, token-dependent routing across devices — requiring expert parallelism with all-to-all communication, load balancing to prevent routing collapse, and multi-dimensional parallelism to handle both expert and non-expert parameters efficiently.
References
- Fedus et al. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” JMLR 2022.
- Lepikhin et al. “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.” ICLR 2021.
- Rajbhandari et al. “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.” 2022.
- Dai, D. et al. “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models.” 2024.
- DeepSeek-V3 inference code: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py
- DeepEP: https://github.com/deepseek-ai/DeepEP
- EPLB: https://github.com/deepseek-ai/EPLB
This post is based on lecture materials from CMU 11-868 LLM Systems by Lei Li (System for MOE Models, Lec16).