MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism¶

🎯 Pitch¶

MegaScale-Infer introduces a novel system architecture for serving massive Mixture-of-Experts (MoE) large language models by decoupling attention and expert modules within each model layer, allowing them to be independently scaled and deployed on heterogeneous hardware. Through a tailored ping-pong pipeline parallelism and a custom many-to-many (M2N) communication library, the system dramatically boosts GPU utilization and slashes inference costs—achieving up to 1.9× higher throughput and 1.5–2.0× lower serving cost compared to state-of-the-art systems. This innovation addresses a key bottleneck in practical MoE deployment, unlocking significant efficiency gains vital for large-scale, cost-sensitive AI services.

1. Executive Summary¶

MegaScale-Infer is a system for serving Mixture-of-Experts (MoE) large language models that splits each model layer into two independently scaled parts—attention and experts (the FFN sub-networks)—and connects them with a high-performance many-to-many (M2N) network layer. It adds a “ping-pong” micro-batch pipeline that overlaps computation with communication and introduces a specialized M2N communication library to make token routing fast and stable. Across 132B–317B MoE models, MegaScale-Infer delivers up to 1.9× higher per-GPU decoding throughput than state-of-the-art serving systems and up to 1.86× higher throughput per unit cost under heterogeneous GPUs (Figures 8(a) and 9(a), Abstract).

2. Context and Motivation¶

Problem addressed
During inference, MoE sparsity routes each token to a small subset of experts (top-k). This reduces per-expert batch sizes during decoding, making the expert FFNs memory-bound and under-utilized on GPUs despite MoE’s compute savings (Section 2.3; Figure 1(b)).
Attention during decoding is inherently memory-bound (due to key–value cache lookups over all past tokens), while FFNs are compute-efficient only when batch size is large enough to amortize weight loads (Section 2.1; Figure 1(a)).
With realistic latency targets and KV-cache memory limits, the total batch size cannot be made arbitrarily large, so per-expert batches shrink as the number of experts grows—further depressing utilization (Section 2.3; Roofline analysis and the Mixtral 8x22B example).
Why it matters
Low GPU utilization inflates inference costs for large-scale MoE services. The paper targets cost and energy efficiency at production scale (nearly 10,000 GPUs; Section 8), with demonstrated 1.5–2.0× cost reductions in deployment.
Limitations of prior approaches
Traditional model parallelism (tensor parallelism and pipeline parallelism) does not address MoE-induced per-expert batch fragmentation and can add communication overhead (Section 2.2).
Expert parallelism improves expert GEMM shapes but requires expensive all-to-all token dispatch per MoE layer (Figure 2(b)).
Disaggregation for long-context dense models (e.g., Infinite-LLM) replicates attention to grow batch size, but it does not tackle MoE’s dynamic token routing and top-k sparsity, and assumes simpler communication (Section 2.4).
Positioning
MegaScale-Infer goes beyond prefill/decoding disaggregation used in prior systems by disaggregating within the layer: attention on one set of GPUs and experts on another (Figure 3). It then:
- Replicates attention (data parallelism) to aggregate requests.
- Scales experts (expert parallelism) to keep expert GEMMs compute-bound.
- Introduces a ping-pong micro-batch pipeline and a purpose-built M2N library to hide and stabilize token routing costs (Sections 3–5).

3. Technical Approach¶

The system’s core is “disaggregated expert parallelism”: separate attention and experts—each with its own parallelism strategy and hardware—then add a pipeline and communication layer that make this split efficient (Figure 3).

Architecture and parallelism choices
Attention nodes
- Replicate attention parameters and store the KV cache (data parallelism). Use tensor parallelism inside a node to exploit high intra-node bandwidth (NVLink) (Section 3).
- Rationale: decoding attention is memory-bound and benefits from large memory capacity/bandwidth and cheap replication.
Expert nodes
- Each node hosts parameters for one expert; all expert nodes form an expert-parallel group (Figure 3). Use tensor parallelism inside a node (Section 3).
- Rationale: experts should get as many aggregated tokens as possible, making their GEMMs compute-bound and efficient.
Ping-pong micro-batch pipeline (Figure 4; Section 4.1)
Problem: After disaggregation, attention and experts would be idle while waiting for each other or for network transfers.
Solution: Split the global batch into m micro-batches and shuttle them in a ping-pong fashion through attention → experts → attention for each MoE layer, twice per layer (A2E and E2A).
Key conditions to keep GPUs busy and hide communication (Equations (1)–(3)):
- Balance compute: Ta ≈ Te (attention vs expert time per micro-batch).
- Communication faster than compute: Tc < Tf, where Tf = max(Ta, Te).
- Enough micro-batches to fill the pipeline and cover two communications per layer:
- m × Tf ≥ 2 × (Tf + Tc) ⇒ m ≥ 2 × (1 + Tc/Tf).
- With fast links (Tc < 0.5 Tf), m ≥ 3 is sufficient; slower links need m ≥ 4 (Section 4.1).
Latency model (Equations (4)–(5)):
- Per-micro-batch iteration latency bounded by (Ta + Te + 2Tc) + m Tf (L − 1) ≤ Titer ≤ m Tf L.
- Total iteration latency for the global batch: Ttotal = (Ta + Te + 2Tc) + Tf (mL − 1).
Deployment plan search with a performance model (Algorithm 1; Sections 4.1–4.2)
Search space: tensor parallel sizes tpa (attention) and tpe (experts), number of attention nodes na, number of micro-batches m, and global batch size B (subject to a latency SLO).
Modeling compute times using GEMM arithmetic intensity and measured constants (Table 2; Section 4.2):
- Attention time per micro-batch Ta ≈ k1 ba + k2, with ba the per-attention micro-batch size; includes memory-bound KV-cache reads proportional to ba × s and TP sync overhead.
- Expert time per micro-batch Te ≈ k3 be + k4, with be the per-expert micro-batch size.
- Relationship: ba × m × na = be × m × E/K = B (total tokens conserved), so we set na ≈ (k1 E)/(k3 K) to balance Ta and Te (Constraint 1).
Modeling communication (Equation (6)):
- For A2E and E2A, time is the slower of send and receive:
- Tc = max{ (ba h K / tpa) / (Wa × Util(ba h K / tpa)), (be h / tpe) / (We × Util(be h / tpe)) },
- where Wa, We are per-GPU link bandwidths and Util(msg_size) is the measured bandwidth utilization curve vs. message size.
Constraints:
- Pipeline constraints (Equations (1)–(3)).
- SLO on time-between-tokens: Titer ≤ SLO (Equation (7)); SLO is set to 150 ms in evaluations (Section 7.1).
- GPU memory capacity for attention: 4 m ba s h L / g + 2 Pa < tpa Ca (Equation (8); GQA groups g; bfloat16; Pa is attention parameter size; Ca per-GPU memory).
Objective: maximize throughput per unit cost = (B / Ttotal) / (tpa na Costa + tpe E Coste) by simulating plans and picking the best (Algorithm 1; Section 4.2).
Practical bounds: Nm (max micro-batches) is 4—too many micro-batches degrade expert GEMM efficiency—and GPU-per-node choices are typically {1,2,4,8}, keeping search tractable (Algorithm 1 commentary).
High-performance M2N communication library (Section 5; Figures 6–7; Figure 5)
Why NCCL struggles for MoE-style M2N token dispatch:
- Extra GPU→CPU proxy copies (issue #852), group op batching limits (max 8 ops), and general per-group setup overhead inflate latency; tail latency spikes with more receivers (Figure 5).
- GPU-side synchronization and memory access add instability at high percentiles (Section 5; references [41,83]).
Design: CPU-driven RDMA with stream-aware blocking, avoiding GPU-to-CPU copies and GPU synchronizations.
- Sender flow (Figure 6): wait on CUDA event (previous kernel), block the CUDA stream using cuStreamWaitValue32, do RDMA write-with-immediate from pre-registered GPU buffers, poll completion queue, then unblock the CUDA stream via a shared flag.
- Receiver flow (Figure 7): ensure target buffer is free, block stream, poll CQ to ensure arrival, perform a GDRCopy-based flush for GPU visibility, then unblock the stream.
- Traffic tuning: prioritize ACKs on separate high-priority queues to prevent head-of-line blocking; adjust congestion control for unbalanced traffic (Section 5).
Comparison to DeepEP (GPU-to-GPU comms):
- GPU kernels can push higher packet rates for tiny messages but consume SMs and need intricate low-level tuning (e.g., PTX, L2 usage). In MegaScale-Infer’s regime (hundreds of KB per sender–receiver pair), a CPU thread saturates NIC bandwidth while keeping GPUs fully available for compute (Section 5).
Additional implementation details
Fused kernels: fuse all-gather with subsequent GEMM using Flux to overlap intra-node TP comms with compute; fuse gating, top-k selection, token scatter preparation, and related memory-bound steps (Section 6).
Expert load balancing: on-device redundancy for “hot” experts using a greedy approximation that minimizes the max per-node cost max_j Cj with allocation fractions x_{i,j} and expert activity costs a_i (Section 6).
Code footprint: about 4.9k C/C++ and 5k Python LOC for the M2N library (Section 6).
Heterogeneous deployment (Section 4.3; Table 3)
Map attention to GPUs with high per-cost memory capacity and bandwidth (e.g., H20), and experts to GPUs with high per-cost compute (e.g., L40S). Table 3 quantifies GB/GB/s/TFLOPS per dollar.
Also improves throughput per watt because H20 and L40S respectively offer efficient bandwidth and compute per power (Section 4.3; Figure 10).

4. Key Insights and Innovations¶

Disaggregated expert parallelism is a new within-layer split for MoE serving.
What’s new: previous work disaggregates phases (prefill vs decoding); MegaScale-Infer disaggregates modules within each layer (attention vs experts) and scales them independently (Figure 3).
Why it matters: it converts sparse, memory-bound expert computation into compute-bound GEMMs by aggregating tokens from many attention replicas, unlocking FFN efficiency (Sections 2.3 and 3).
Ping-pong pipeline that provably hides communication
What’s new: a micro-batch pipeline that enforces concrete conditions (Equations (1)–(3)) to overlap two bidirectional communications per MoE layer with compute (Figure 4).
Why it matters: it keeps both sides busy despite per-layer token routing, which would otherwise create frequent bubbles.
A purpose-built, CPU-driven M2N library for token routing
What’s new: a stream-aware, RDMA write-with-immediate design that avoids GPU-to-CPU copies, NCCL group overheads, and GPU synchronization; adds traffic-aware ACK prioritization and congestion control tuning (Section 5; Figures 6–7).
Why it matters: drastically lowers both median and tail latencies and improves throughput for the large-message regime typical in MoE routing (Figures 11–12), enabling communication to be fully hidden by the ping-pong pipeline.
Heterogeneous deployment that matches hardware strengths to module characteristics
What’s new: formalizes attention-on-memory-rich GPUs and experts-on-compute-efficient GPUs, then evaluates end-to-end cost and power (Section 4.3; Table 3; Figures 9–10).
Why it matters: yields up to 3.24× higher decoding throughput per cost versus strong baselines on H20 and 1.80× higher decoding throughput per watt (Figures 9(a) and 10(a)).
A practical, search-based deployment planner grounded in a simple, profile-calibrated performance model
What’s new: closed-form constraints and simple linear models for compute plus an empirical link model for communication enable a fast search of tpa, tpe, na, m, B under SLO and memory constraints (Algorithm 1; Section 4.2).
Why it matters: finds configurations where Ta ≈ Te and m suffices to hide communication, maximizing throughput per dollar given real hardware and workloads.

5. Experimental Analysis¶

Setup (Section 7.1)
Hardware
- Homogeneous: 8 nodes with 8× 80GB Ampere GPUs each, NVLink (400 GB/s intra-node), 8× 200 Gbps NICs per node.
- Heterogeneous: H20 nodes (900 GB/s NVLink, 4× 400 Gbps NICs) and L40S nodes (PCIe intra-node, 2× 400 Gbps NICs).
Models (Table 4)
- Mixtral-8×22B (141B params, 8 experts, top-2), DBRX (132B params, 16 experts, top-4), Scaled-MoE (317B params, 32 experts, top-4).
Workload: in-house production traces; median input length 571 tokens, output length 159; bfloat16 weights/activations/KV (Section 7.1).
Metrics
- Primary: decoding throughput (tokens/s) per GPU for homogeneous, and per unit cost for heterogeneous; latency SLO is TBT ≤ 150 ms (Section 7.1).
- Also: end-to-end throughput including prefill; throughput per unit power; M2N microbench latencies/throughput (Section 7.1).
Baselines: vLLM and TensorRT-LLM (both with TP/PP; TRT-LLM also supports EP). Prefill/decoding are evaluated separately for fairness (Section 7.1).
Main results
Homogeneous decoding throughput (Figure 8(a))
- MegaScale-Infer vs baselines:
- Mixtral-8×22B and DBRX: up to 2.56× higher per-GPU decoding throughput over vLLM and 1.28× over TensorRT-LLM.
- Scaled-MoE (multi-node): 7.11× over vLLM and 1.90× over TensorRT-LLM.
- Interpretation: disaggregation and ping-pong overlap sustain FFN utilization even at scale, while baselines suffer from inter-node overhead and per-expert batch shrinkage.
Latency (TBT) (Figure 8(b))
- Despite adding cross-node comms per layer, mean TBT is comparable to baselines, indicating communication is largely hidden by the pipeline and M2N efficiencies.
End-to-end throughput (prefill + decoding) on homogeneous GPUs (Figure 8(c))
- Gains are smaller (up to 1.18×) because prefill is compute-bound and not improved by the decoding-focused design; still shows net benefits.
Heterogeneous decoding throughput per cost (Figure 9(a))
- With attention on H20 and experts on L40S, MegaScale-Infer achieves up to 3.24× (vs vLLM on H20) and 1.86× (vs TensorRT-LLM on H20) higher throughput per dollar.
- Mean TBT remains comparable or slightly better than L40S-only baselines (Figure 9(b)).
Heterogeneous end-to-end throughput per cost (Figure 9(c))
- Offloading expert compute to L40S (cheaper compute) yields up to 1.66× end-to-end throughput per cost improvement versus H20 baselines.
Throughput per watt (Figure 10)
- MegaScale-Infer achieves 1.80× (decoding) and 1.72× (end-to-end) higher throughput per unit power due to matching module characteristics to energy-efficient hardware.
M2N microbenchmarks (Section 7.3; Figures 11–12)
Varying message sizes (2 KB–8 MB), with M=N=8:
- Median latency reduced by up to 80.8% and P99 by up to 96.2% vs NCCL; throughput improves by up to 9.9× (Figure 11).
- For the typical 256 KB size, median latency −68.2%, P99 −92.9%, throughput +4.2× (Figure 11).
Varying number of senders/receivers (M=N=4–32) at 256 KB:
- Tail latency consistently lower (−54.7% to −96.9%), throughput +3.3× to +5.8× (Figure 12).
Takeaway: the library’s design choices and traffic tuning materially stabilize and accelerate token dispatch at the scales and sizes relevant to MoE inference.
Ablations and diagnostics
Value of disaggregation and M2N (Figure 13)
- Disaggregation alone (with NCCL) yields up to 4.66× over a colocated baseline by aggregating tokens across attention replicas.
- Replacing NCCL with the custom M2N adds up to another 1.53× by hiding comms fully (meeting Tc < Tf).
Effect of micro-batch count m (Figure 14)
- m=1 (no pipeline) under-utilizes GPUs; m=2 gives ~1.9× throughput; m=3 allows overlap of comm and compute, adding 1.10×/1.28×/1.38× more for Mixtral/DBRX/Scaled-MoE; larger m shows diminishing returns in a high-bandwidth testbed.
Choosing the right attention replication (Figure 15)
- For DBRX, increasing attention DP from 1→8 shifts the bottleneck from attention to experts, maximizing normalized throughput without raising TBT. Further DP increases hurt by idling attention while experts compute—evidence for the “Ta ≈ Te” balance rule (Constraint 1).
Deployment evidence (Section 8; Figure 16)
Production-scale deployment (∼10k GPUs) reduces cost by 1.5–2.0×.
Real traffic shows large expert load skew (Figure 16(a)); decoding expert loads are stable over time while prefill is more volatile (Figures 16(b)–(c)), motivating static/periodic balancing for decoding and more frequent adjustments for prefill.
Attention load imbalance arises from variable sequence lengths; they batch to a target per-node compute time using profiled operator runtime curves (Section 8).
Overall assessment
The experimental design isolates decoding (where the method’s benefits accrue) and provides ablations that tie observed gains to the proposed mechanisms (pipeline fill, comm-latency reduction, balance of Ta and Te).
Results are consistent across models, scales, and hardware, and the microbenchmarks validate the communication substrate that underpins the pipeline-overlap claim.

6. Limitations and Trade-offs¶

Balance and pipeline assumptions
The ping-pong pipeline relies on Ta ≈ Te and Tc < Tf. When compute balance or communication regimes change (e.g., very slow networks or very small messages due to extremely high expert counts), the overlap can break down or require m ≥ 4 (Equations (1)–(3); Section 4.1).
The planner depends on profiling-derived constants (k1..k4) and measured bandwidth utilization curves; workload drift or software updates can invalidate them, requiring periodic re-profiling (Section 4.2).
Specialized communication stack and hardware
The M2N library assumes RDMA, GPUDirect, and GDRCopy are available and well-tuned; not all deployments have these capabilities (Section 6).
CPU-driven communication wins at hundreds of KB per connection (their regime), but for very small messages and very high degrees, a GPU-driven approach like DeepEP could outperform it (Section 5, “Comparison with DeepEP”).
Scope focus on decoding
The largest gains come during decoding. Prefill benefits mainly via heterogeneous cost savings, not raw speedups (Figures 8(c), 9(c)).
Memory footprint and replication
Attention replication across na nodes increases total memory for attention parameters and KV caches (Equation (8)), potentially limiting max batch sizes on smaller-memory GPUs (Section 4.2).
Load-imbalance dynamics
Expert popularity skews change over time; the paper proposes on-device redundancy and periodic plans but does not detail an online reactive scheme (Section 6; Section 8, Figure 16).

7. Implications and Future Directions¶

Changing the design space of MoE serving
Moving from monolithic, layer-colocated serving to module-level disaggregation enables independent scaling and hardware specialization. This sets a template for disaggregating other layer components (e.g., attention variants, normalization, or MoE variants).
Practical deployment guidance
The conditions and model (Equations (1)–(8)) offer a practitioner’s checklist: balance Ta/Te, ensure Tc < Tf, pick m ≥ 3 or 4, and select na and tpa/tpe to satisfy memory and SLO constraints. Table 3 plus Figures 9–10 provide a concrete rationale for pairing memory-rich GPUs with attention and compute-efficient GPUs with experts.
Follow-on research opportunities
Adaptive runtime planning: close the loop by continuously re-estimating k1..k4, bandwidth utilization curves, expert hotness, and sequence length distributions to re-optimize na, m, B, tpa, tpe online.
Hybrid CPU/GPU communication: dynamically choose CPU- vs GPU-driven dispatch depending on message size and degree, potentially leveraging both DeepEP-like GPU queues and MegaScale’s CPU RDMA for different layers or traffic regimes (Section 5).
Scheduling and fairness: integrate sequence-length-aware packing and expert hotness into a global scheduler that co-optimizes pipeline utilization, latency SLOs, and multi-tenant isolation.
Extending to training or fine-tuning: the same disaggregation and M2N mechanisms could be adapted to optimize MoE training steps that involve frequent token re-partitioning.
Hardware co-design: given the success of ACK prioritization and congestion tuning (Section 5), NIC/driver features for token-routing collectives (M2N/N2M) could further reduce tail latencies and CPU overhead.
Applications
Cost-optimized production serving of very large MoE LLMs.
Cloud providers combining heterogeneous GPU fleets (e.g., H20 and L40S) to improve both cost and power efficiency (Figures 9–10).
Long-context serving alongside MoE (prefill/decoding and attention/expert disaggregation are compatible), and scenarios with bursty expert popularity where on-device redundancy helps.

Overall, MegaScale-Infer demonstrates that splitting attention and experts across independently scaled, possibly heterogeneous hardware—then carefully filling a micro-batch pipeline while stabilizing token routing—can transform MoE decoding from memory-bound inefficiency into high-utilization, cost-effective serving (Figures 1(c), 8–10, 11–12).