SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

🎯 Pitch

SonicMoE co-designs MoE computation and GPU kernels to cut activation memory and hide IO. It reorders the backward computation to avoid caching large activations, overlaps asynchronous memory IO with Tensor-Core GEMM, and introduces a tile-aware “token rounding” router that eliminates padding waste in grouped GEMMs. Together, these changes make fine-grained and highly sparse MoEs practical on Hopper/Blackwell GPUs: per-layer activation memory drops (e.g., ≈45% for a fine-grained 7B MoE), and end-to-end training throughput rises, which lowers cost and enables larger, higher-quality sparse models.

1. Executive Summary

SonicMoE accelerates Mixture-of-Experts (MoE) training on NVIDIA Hopper/Blackwell GPUs by (i) redesigning the MoE backward computation to avoid caching large activations, (ii) using IO-aware GPU kernel techniques to overlap memory movement with Tensor Core compute, and (iii) introducing a tile-aware routing variant (token rounding) that reduces padding waste in grouped GEMM. It targets the modern trend toward fine-grained and highly sparse MoEs, where IO and padding, rather than FLOPs, become the main throughput bottlenecks. Reported results include 45% lower per-layer activation memory for a fine-grained 7B MoE and 213 billion tokens/day on 64 H100s, compared with ScatterMoE’s 225 billion tokens/day on 96 H100s.

2. Context and Motivation

What problem/gap is addressed?
Modern MoEs increasingly use:
- Higher expert granularity (smaller expert intermediate dimension n relative to model hidden size d, i.e., larger G = d/n; Table 1).
- Higher sparsity (constant activated experts K but many more total experts E, i.e., smaller activation ratio ρ = K/E; Table 1).
These trends improve model quality per FLOP (motivated via MoE scaling laws cited in the text) but create hardware inefficiency:
1. Activation memory footprint grows with granularity in common kernels because they cache O(T K d)-sized tensors for backward (Section 3.2, Figure 1 left).
2. Arithmetic intensity drops as granularity increases and/or MoE becomes sparser, pushing kernels into a memory-bandwidth-bound regime (Eq. (4), Figure 3).
3. Grouped GEMM padding waste (“tile quantization”) grows for sparse MoEs because per-expert token counts become small and irregular, forcing padding to tile multiples (Section 5.1, Figures 11–12).
Why is it important?
Training throughput and memory limits determine feasible model sizes and cluster efficiency. The paper highlights end-to-end training throughput on H100 clusters (Section 6.2.1) and per-layer activation memory (Figures 1, 13), directly impacting scale and cost.
What prior approaches exist and where do they fall short?
Existing kernels/frameworks discussed include ScatterMoE, MoMoE, MegaBlocks, Megatron, and DeepGEMM-based baselines (Sections 2.2, 4; Table 2; Appendix B).
Key limitations (as described in Table 2 / Appendix B):
- Insufficient IO-latency hiding (no MMA/IO overlap as used here).
- Backward paths that still require separate gathers and/or caching Y, increasing both IO and activation memory (e.g., computing dS as ⟨dO, Y⟩).
- Padding/tile waste not addressed in sparse regimes (Section 5).
How does SonicMoE position itself?
As a co-design of:
- A mathematically equivalent MoE backward formulation that minimizes cached activations (Section 3.2, Appendix C),
- GPU kernel scheduling and fusion strategies that explicitly target Hopper/Blackwell asynchronous memory pipelines (Section 4.2),
- And a routing method (token rounding) designed around grouped-GEMM tile structure (Section 5).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a set of GPU kernels + a compatible routing interface that implements the forward and backward passes of transformer MoE MLP blocks efficiently.
It solves training-time bottlenecks (activation memory, IO overhead, and sparse padding waste) via a redesigned computation graph and hardware-aware kernel/routing choices, and is evaluated primarily on H100 (Hopper) with discussion of Blackwell support.

3.2 Big-picture architecture (diagram in words)

Inputs: token activations X ∈ R^{T×d}, router scores S ∈ R^{T×E}, routing mask π ∈ {0,1}^{T×E}, expert weights W1 (up-proj) and W2 (down-proj).
Forward path (3 kernels):
- Up-proj kernel: gather tokens per expert → grouped GEMM (X_e W1,e) → fused activation (SwiGLU) → store H/A.
- Down-proj kernel: grouped GEMM (A_e W2,e) → store contiguous Y.
- Expert aggregation kernel: per token, gather its experts’ Y rows and combine using S and π to produce O.
Backward path (5 kernels):
- dH kernel (down-proj activation grad): gather dO per expert → GEMM with W2,e^T → fused activation backward and dS computation → produce dH, dS, and A′.
- dW2 kernel: gather dO per expert → varlen-K grouped GEMM to compute dW2.
- dX̃ kernel (up-proj activation grad): varlen-M grouped GEMM dH W1^T.
- dW1 kernel: gather X per expert → varlen-K grouped GEMM to compute dW1.
- dX aggregation kernel: per token, gather and sum its experts’ dX̃ contributions.

(These correspond to Figure 4 and Algorithms 2, 3, and 5.)
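The forward dataflow above can be sketched as a plain-Python reference with no GPU and no fusion. This is an illustrative sketch under assumptions, not SonicMoE’s kernels: the function names, the list-of-lists tensor layout, and the gate-first split of H inside `swiglu` are hypothetical choices made here.

```python
import math


def matmul(A, B):
    """Dense matrix product for list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def swiglu(row):
    """SwiGLU over one row of H (width 2n). Assumed layout: gate half first."""
    n = len(row) // 2
    return [(g / (1.0 + math.exp(-g))) * u for g, u in zip(row[:n], row[n:])]


def moe_forward(X, S, topk, W1, W2):
    """X: T x d, S: T x E scores, topk[t]: expert ids chosen by token t,
    W1[e]: d x 2n, W2[e]: n x d. Returns O: T x d."""
    T, d = len(X), len(X[0])
    Y, slot = [], {}  # contiguous Y buffer and (token, expert) -> row index
    for e in range(len(W1)):
        tokens = [t for t in range(T) if e in topk[t]]
        if not tokens:
            continue
        X_e = [X[t] for t in tokens]        # gather rows routed to expert e
        H_e = matmul(X_e, W1[e])            # up-projection
        A_e = [swiglu(h) for h in H_e]      # activation
        Y_e = matmul(A_e, W2[e])            # down-projection, stored contiguously
        for t, y in zip(tokens, Y_e):
            slot[(t, e)] = len(Y)
            Y.append(y)
    # Expert aggregation: per token, gather its Y rows and take the weighted sum.
    O = []
    for t in range(T):
        acc = [0.0] * d
        for e in topk[t]:
            y = Y[slot[(t, e)]]
            acc = [a + S[t][e] * v for a, v in zip(acc, y)]
        O.append(acc)
    return O
```

Storing Y contiguously per expert and doing the weighted gather afterwards mirrors the "store contiguous Y, then aggregate" split described above; a real implementation would fuse the gather and activation into the grouped GEMM.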

3.3 Roadmap for the deep dive

First, define the baseline MoE computation as grouped GEMM and why IO/padding dominate (Section 2.1–2.2, Eq. (4)).
Next, explain the activation-memory reduction via an alternate backward path that avoids caching Y/dY (Section 3.2, Appendix C).
Then, detail the IO-aware kernel design (Section 4.1–4.2; Figures 5–9):
- gather fusion,
- epilogue fusion,
- MMA/IO overlap.
Finally, explain token rounding routing and how it removes tile padding waste with bounded deviation (Section 5; Algorithm 4; Figures 11–12).

3.4 Detailed, sentence-based technical breakdown

This is primarily an empirical systems + algorithmic kernel paper: it keeps MoE math equivalent (for the core computation) while changing how gradients and routing metadata are computed and how GPU work is scheduled to reduce memory/IO bottlenecks.

Baseline MoE computation and where inefficiency comes from

A standard transformer MoE MLP replaces a dense MLP with E experts, and for each token activates only K experts using router scores S and a routing mask π (Section 1, Algorithm 1).
Computation per expert uses two matrix multiplies with a nonlinearity (SwiGLU example in Section 2.2):
- Up-projection: H_e = X_e W1,e, where X_e is the subset of tokens routed to expert e.
- Activation: A_e = SwiGLU(H_e).
- Down-projection: Y_e = A_e W2,e.
Because each expert receives a variable number of tokens T_e, the GPU implementation uses varlen-M grouped GEMM for the forward and activation-gradient passes (the GEMM M dimension equals T_e, while the N and K dimensions are fixed by the weight shapes) and varlen-K grouped GEMM for weight gradients, where the reduction dimension is T_e (Section 2.1). In this GEMM context, K denotes the reduction dimension, not the number of activated experts.
The paper analyzes arithmetic intensity (FLOPs per byte moved) and derives that it decreases in either of the following cases, which pushes MoE kernels toward being memory-bandwidth bound (Eq. (4), Figure 3):
- granularity increases (G = d/n grows), or
- sparsity increases (ρ = K/E decreases).
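The trend can be reproduced with a simplified roofline-style estimate. The sketch below is not the paper’s Eq. (4): it assumes perfectly uniform routing (T·K/E tokens per expert), BF16 operands, an up-projection of shape (T_e × d)·(d × 2n), and it counts only the three GEMM operands, ignoring gather indices and epilogue traffic. The function names are hypothetical.

```python
def gemm_intensity(m, n_out, k_red, bytes_per_elem=2):
    """FLOPs per byte for one (m x k_red) @ (k_red x n_out) GEMM,
    counting one read of each input and one write of the output."""
    flops = 2 * m * n_out * k_red
    bytes_moved = bytes_per_elem * (m * k_red + k_red * n_out + m * n_out)
    return flops / bytes_moved


def up_proj_intensity(T, d, n, E, K):
    """Arithmetic intensity of one expert's up-projection under uniform routing."""
    tokens_per_expert = T * K / E  # assumption: perfectly balanced experts
    return gemm_intensity(tokens_per_expert, 2 * n, d)


# Finer granularity at fixed n*K and fixed rho (7B benchmark shapes):
coarse = up_proj_intensity(24576, 1536, 1024, 32, 2)
fine = up_proj_intensity(24576, 1536, 256, 128, 8)

# Higher sparsity at fixed n and K (more experts, fewer tokens each):
sparser = up_proj_intensity(24576, 1536, 256, 512, 8)
```

Under these assumptions `fine` is lower than `coarse`, and `sparser` is lower still, which matches the direction of the argument above even though the absolute values are only indicative.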

(1) Memory-efficient backward via avoiding large activation caches

Under iso-FLOPs training, keeping overall MoE compute constant requires keeping nK constant; increasing granularity (smaller n) implies larger K (Section 3.2). Therefore, any cached activation scaling like O(T K d) grows linearly with granularity and becomes problematic.
The paper identifies two main large intermediates that common implementations cache (Section 3.2):
- Y (down-proj output), of size T K d,
- gathered inputs X_e, also effectively of size T K d if materialized.
SonicMoE reduces peak activation memory by:
- Fusing gathers with HBM loads so X and dO are not materialized as gathered tensors in HBM (Section 3.2; also discussed as gather fusion in Section 4.1.1).
- Computing router-score gradients dS without needing Y or dY, by rewriting the gradient algebra (Section 3.2; Appendix C).
Concretely, in the backward down-proj activation-gradient kernel (dH kernel, Algorithm 3), SonicMoE:
- gathers dO for expert e,
- computes dA′_e = dO_e W2,e^T,
- multiplies by the expert score s_e (broadcast) to obtain dA_e,
- recomputes the forward activation output A_e from the cached H_e and computes the activation gradient dH_e in the same fused dAct epilogue (epilogue fusion described in Section 4.1.2),
- constructs A′_e = Broadcast(s_e) A_e as the input needed for dW2,
- computes dS_{e,t} = ⟨dA′_{e,t}, A_{e,t}⟩ (Algorithm 3; Appendix C / C.1).
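The dS rewrite rests on a simple identity: since Y_{e,t} = A_{e,t} W2,e, the inner product ⟨dO_t, Y_{e,t}⟩ equals ⟨dO_t W2,e^T, A_{e,t}⟩ = ⟨dA′_{e,t}, A_{e,t}⟩. A minimal numeric check for a single token and expert follows; the helper names and the tiny example values are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def vec_mat(v, M):
    """Row vector (length r) times matrix (r x c) -> row vector (length c)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def transpose(M):
    return [list(col) for col in zip(*M)]


A_t = [1.0, -2.0, 0.5]                        # activation row, length n = 3
W2 = [[1.0, 2.0], [0.0, 1.0], [4.0, -1.0]]    # n x d, with d = 2
dO_t = [3.0, -1.0]                            # upstream gradient row, length d

# Form that needs Y: dS = <dO_t, Y_t> with Y_t = A_t W2.
Y_t = vec_mat(A_t, W2)
dS_via_Y = dot(dO_t, Y_t)

# Form that avoids Y: dS = <dA'_t, A_t> with dA'_t = dO_t W2^T.
dA_prime_t = vec_mat(dO_t, transpose(W2))
dS_via_A = dot(dA_prime_t, A_t)
```

Both forms give the same scalar, but the second reuses dA′ and A, which the dH kernel already has on chip, so Y never has to be cached or reloaded.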
The claimed memory outcome is that SonicMoE only needs to cache X and H (plus negligible routing metadata), for a per-layer cached activation size of 2 T d + 4 T K n bytes (Section 3.2). This is consistent with BF16 storage: 2 bytes for each of the T d entries of X, and 2 bytes for each entry of H, which has T K rows of width 2n under SwiGLU. The paper argues this matches the minimum required without doing GEMM recomputation (Section 3.2, footnotes 8–9).
This design is reflected in Figure 1 (left) and Figure 13, where SonicMoE’s per-layer peak activation memory stays constant as granularity increases (nK is fixed under iso-FLOPs scaling), unlike baselines.
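As a worked example, the formula can be evaluated on the three 7B benchmark shapes and contrasted with the size of a cached Y. The sketch assumes BF16 (2 bytes per element) and that H has width 2n; the function names are hypothetical, and real peak memory includes terms not modeled here.

```python
def sonic_cached_bytes(T, d, n, K, bytes_per_elem=2):
    """Cached X (T x d) plus cached H (T*K rows, width 2n): 2Td + 4TKn bytes in BF16."""
    x_bytes = bytes_per_elem * T * d
    h_bytes = bytes_per_elem * T * K * 2 * n
    return x_bytes + h_bytes


def y_cache_bytes(T, d, K, bytes_per_elem=2):
    """Extra bytes if the down-proj output Y (T*K rows, width d) is also cached."""
    return bytes_per_elem * T * K * d


T, d = 24576, 1536
shapes = [(1024, 2), (512, 4), (256, 8)]  # (n, K) with n*K held at 2048

sonic_mib = [sonic_cached_bytes(T, d, n, K) / 2**20 for n, K in shapes]
y_mib = [y_cache_bytes(T, d, K) / 2**20 for _, K in shapes]
```

With nK fixed, `sonic_mib` is the same for all three shapes, while `y_mib` doubles each time K doubles, which is the scaling the Y-free backward avoids.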

(2) IO-aware grouped GEMM kernels (fusion + overlap)

Kernel set and dataflow

SonicMoE’s MoE computation launches 8 kernels total: 3 forward + 5 backward (Section 3.1; Figure 4).
These kernels are built from two modular building blocks (Section 3.1):
- an optimized grouped GEMM (varlen-M and varlen-K) with configurable fusion,
- an optimized expert aggregation kernel.

Gather fusion (HBM→SMEM) and why it matters

For varlen-M grouped GEMM, inputs often require gathering token rows from X based on routing indices (Figure 2). SonicMoE fuses the gather into the global-memory load path (Section 4.1.1).
On Hopper, this uses cp.async-based loading to SMEM (Section 4.1.1), with additional strategies to reduce index-fetch overhead (Figure 18, described in Appendix D).
The practical motivation is that an approach like DeepGEMM assumes inputs are already contiguously packed; packing/gathering then becomes a separate kernel with significant IO cost, which dominates in fine-grained regimes (Section 4.1.1, Figure 6).
For varlen-K grouped GEMM (weight gradients), SonicMoE also fuses gather; baselines like ScatterMoE and MoMoE are described as not fusing gather there (Figure 20 discussion; Table 2).

Epilogue fusion (do more while outputs are still “hot”)

SonicMoE uses the GEMM epilogue (the post-matmul stage) to fuse the following (Section 4.1.2):
- forward SwiGLU,
- backward dSwiGLU,
- computation of dS and dH in the down-proj activation-gradient kernel.
A key design choice is the formula for dS:
- SonicMoE computes dS_{e,t} = ⟨dA′_{e,t}, A_{e,t}⟩ (Algorithm 3; Appendix C.1).
- ScatterMoE/MoMoE compute dS as ⟨dO_t, Y_{e,t}⟩, which requires loading/caching Y (Appendix C.1) and adds 2 T K d bytes of HBM load (Section 4.1.2).
Figure 6 provides a kernel-time breakdown illustrating that SonicMoE’s fused dH kernel is much cheaper than baselines that separate dS/dSwiGLU/gathers.

Overlapping MMA with IO (Ping-Pong scheduling and async TMA)

Hopper GEMM uses a producer-consumer structure: producer warps load tiles, consumer warpgroups issue Tensor Core MMA (Section 2.1).
SonicMoE exploits this with Ping-Pong warpgroup scheduling (Figure 7; Section 4.2):
- one consumer warpgroup runs the MMA mainloop,
- while another performs epilogue and/or IO,
- then they swap roles, so compute overlaps with heavy epilogue/IO.
This matters most for kernels with heavy epilogues (Section 4.2):
- forward down-proj Y (large stores),
- backward down-proj activation-gradient dH (loads H, computes/stores dH, dS, A′).
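Why alternating warpgroups helps with heavy epilogues can be seen in a toy timeline model. This is a deliberately crude sketch, not a model of real Hopper scheduling: it assumes the Tensor Cores are a single exclusive resource, each tile costs a fixed MMA time and a fixed epilogue time, tiles alternate between warpgroups, and producer loads are free. All names are hypothetical.

```python
def serial_time(num_tiles, mma, epilogue):
    """One consumer does MMA then epilogue for every tile, with no overlap."""
    return num_tiles * (mma + epilogue)


def ping_pong_time(num_tiles, mma, epilogue, num_warpgroups=2):
    """Warpgroups alternate tiles; an epilogue may overlap another group's MMA."""
    tensor_core_free = 0.0
    wg_free = [0.0] * num_warpgroups
    for tile in range(num_tiles):
        wg = tile % num_warpgroups
        # MMA can start once both the Tensor Cores and this warpgroup are idle.
        start = max(tensor_core_free, wg_free[wg])
        tensor_core_free = start + mma
        wg_free[wg] = start + mma + epilogue  # epilogue runs off the Tensor Cores
    return max(wg_free)


baseline = serial_time(4, mma=4.0, epilogue=3.0)
overlapped = ping_pong_time(4, mma=4.0, epilogue=3.0)
```

In this model, when the epilogue is no longer than the MMA phase, all but the last epilogue is hidden behind the other warpgroup’s mainloop.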
SonicMoE also uses asynchronous TMA operations to overlap global↔shared movement (Section 4.2), including:
- async TMA load of H during the dH epilogue,
- async TMA stores for grouped GEMMs.
A consequential design choice is to avoid fusing scatter into the GEMM store on Hopper. Scatter requires synchronous st.global stores (Section 4.2, footnote 20) and repeated index work (footnote 19), which blocks overlap and reduces TFLOPS for heavy-store cases (Figure 8).
Instead, SonicMoE stores Y contiguously (TMA) and later does gather+sum in the expert aggregation kernel (Figure 9 left; Section 4.2).
This store/aggregation strategy is benchmarked in Figure 22, which shows SonicMoE’s “gemm + gather w. sum” strategy outperforming scatter-based designs.

Blackwell-specific note

For Blackwell, the paper describes architectural differences relevant to overlap and gather fusion:
- A 256KB per-SM TMEM accumulator and the UMMA instruction reduce register pressure and enable a two-stage accumulator pipeline for better epilogue/MMA overlap (Section 4.2).
- Gather fusion is harder in 2-CTA clusters: cp.async completion is CTA-local, requiring a relay warp and cluster-scope synchronization (Figure 5; Section 4.1.1).

(3) Token rounding routing to remove tile padding waste in sparse regimes

Sparse MoEs can waste substantial compute because grouped GEMM pads per-expert token counts T_e to a tile multiple Mtile (Section 5.1; Figures 11–12).
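The padding cost is easy to quantify for a given set of per-expert token counts. The sketch below assumes each expert’s rows are padded up to the next multiple of Mtile and that experts with zero tokens launch no tiles; the counts are invented for illustration and the function names are hypothetical.

```python
def padded_tokens(count, m_tile):
    """Round a per-expert token count up to the next multiple of m_tile."""
    return -(-count // m_tile) * m_tile


def padding_waste(counts, m_tile):
    """Fraction of padded GEMM rows that hold no real token."""
    real = sum(counts)
    padded = sum(padded_tokens(c, m_tile) for c in counts)
    return (padded - real) / padded if padded else 0.0


# Small, irregular counts typical of a sparse regime (illustrative values).
counts = [130, 5, 64, 1]
waste = padding_waste(counts, m_tile=128)
```

Here 200 real tokens occupy 640 padded rows, so over two thirds of the tile work is spent on padding; this is the waste that tile-aligned routing targets.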
Token rounding (TR) is a drop-in routing modification intended to make each expert’s token count a multiple of the GEMM tile size while deviating minimally from top-K token-choice routing (Section 5.2; Algorithm 4).
Algorithm 4 uses two sorting steps (the top-K selection and the per-expert sort) and proceeds as follows:
1. Token-choice top-K routing produces the top-K experts per token (StopK, ItopK).
2. For each expert e, compute its token frequency f_e and the adjacent multiples ⌊f_e⌋_{Mtile} and ⌈f_e⌉_{Mtile}.
3. Build a modified score matrix S′ that strongly prefers the original top-K selections by shifting non-top-K entries down (step (3) in Algorithm 4).
4. For each expert, sort tokens by S′, then round and sparsify: either drop some of the lowest-scoring tokens or pad in extra tokens so that the expert’s token count becomes a tile multiple.
The paper highlights a guarantee: for each expert, the routing deviation from the original token-choice result is bounded by at most one tile (Section 5.2).
The default rounding subroutine chooses the nearest multiple (“NR-f”: nearest rounding by expert frequency; Section 5.2, Appendix F.2), though other variants are tested (Table 6).
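The steps above can be sketched in plain Python. This is a simplified illustration rather than Algorithm 4 itself: it assumes scores lie in [0, 1] so that subtracting 2.0 pushes every non-top-K entry below every top-K entry, it rounds ties upward, and it caps each expert at the largest tile multiple not exceeding T. Those choices and the function name are assumptions made here.

```python
def token_rounding(S, K, m_tile):
    """S: T x E router scores. Returns, per expert, the sorted token ids kept."""
    T, E = len(S), len(S[0])
    # Step 1: token-choice top-K (each token picks its K best experts).
    topk = [set(sorted(range(E), key=lambda e: -S[t][e])[:K]) for t in range(T)]
    routed = []
    for e in range(E):
        # Step 2: expert frequency and its nearest tile multiple.
        f_e = sum(1 for t in range(T) if e in topk[t])
        lo = (f_e // m_tile) * m_tile
        hi = lo + m_tile
        target = lo if (f_e - lo) < (hi - f_e) else hi   # assumption: ties round up
        target = min(target, (T // m_tile) * m_tile)     # cannot exceed available tokens
        # Step 3: shift non-top-K scores down so original choices rank first.
        shifted = [S[t][e] if e in topk[t] else S[t][e] - 2.0 for t in range(T)]
        # Step 4: sort by shifted score and keep exactly `target` tokens.
        order = sorted(range(T), key=lambda t: -shifted[t])
        routed.append(sorted(order[:target]))
    return routed


scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4],
          [0.55, 0.45], [0.4, 0.6], [0.3, 0.7], [0.2, 0.8]]
assignment = token_rounding(scores, K=1, m_tile=4)
```

In this example top-1 routing gives expert 0 five tokens and expert 1 three. Rounding drops expert 0’s weakest token (token 4) and pads expert 1 with its best non-selected token, which is also token 4, so both experts end up with exactly one full tile.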

Configurations / “hyperparameters” explicitly provided in the content

Because this work is primarily a kernel + routing paper, the most relevant “hyperparameters” are MoE shapes, microbatch sizes, and routing settings:

- MoE tensor shapes and symbols: T (microbatch tokens), d (model hidden size), n (expert intermediate dim), E (experts), K (activated experts), ρ=K/E, G=d/n (Table 4).
- Benchmark configurations used throughout kernel experiments (Table 9b), e.g.:
  - 7B: T=24576, d=1536, with (n,E,K) ∈ {(256,128,8), (512,64,4), (1024,32,2)}.
  - 30B: T=32768, d=4096, with (n,E,K) ∈ {(256,256,16), (512,128,8), (1024,64,4)}.
- Hardware and datatypes: results focus on H100 (Hopper) with BF16 mentioned in the abstract and multiple figures; Blackwell support is discussed for kernel mechanisms (Sections 4.1.1, 4.2).
- Routing/top-K kernel constraints: SonicMoE provides a top-K kernel supporting E ≤ 4096 and K ≤ 16 (Section 4.3).
- LM training settings for routing-quality experiments (Appendix H, Table 10):
  - Context length 4096.
  - Dataset: deduplicated FineWeb-Edu (Appendix H).
  - Load-balancing auxiliary loss coefficient 0.01, no router Z-loss (Appendix H).
  - LR / WD / scheduler are listed per config in Table 10 (cosine with 10% warmup; WD 0.01; LR 6e-4 for 0.5B/1.8B configs and 4e-4 for 1.4B configs).
  - Optimizer is not specified in the provided text, so it cannot be reported precisely here.
- Tile size: many TR experiments use Mtile = 128 (Section 5.2; Figure 16 notes; Table 3 note 26), with ablations over Mtile in Table 8.
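The derived quantities for the benchmark shapes listed above can be tabulated with a few lines of Python. The dictionary layout and function name are hypothetical; the numbers are exactly the (T, d, n, E, K) values quoted in the list.

```python
CONFIGS = {
    "7B": {"T": 24576, "d": 1536, "shapes": [(256, 128, 8), (512, 64, 4), (1024, 32, 2)]},
    "30B": {"T": 32768, "d": 4096, "shapes": [(256, 256, 16), (512, 128, 8), (1024, 64, 4)]},
}


def derived(d, n, E, K):
    """Granularity G = d/n, activation ratio rho = K/E, and the n*K product."""
    return {"G": d / n, "rho": K / E, "nK": n * K}


table = {
    name: [derived(cfg["d"], n, E, K) for (n, E, K) in cfg["shapes"]]
    for name, cfg in CONFIGS.items()
}
```

Within each scale, nK and ρ stay fixed (2048 and 1/16 for 7B; 4096 and 1/16 for 30B) while G varies by 4×, so these sweeps isolate the effect of granularity at constant activated compute.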

4. Key Insights and Innovations

Activation-memory-minimal backward path without extra FLOPs
Novelty: The backward computation is reorganized so that dS and dH can be computed without caching Y/dY, avoiding O(T K d) cached activations that scale with granularity (Section 3.2; Appendix C.1).
Significance: Enables activation memory that is independent of expert granularity (Figure 1 left, Figure 13) and yields up to 45% per-layer activation memory reduction for a fine-grained 7B MoE compared to ScatterMoE (Section 6.1).
Systematic IO-hiding design for fine-grained MoEs
Novelty: The design combines gather fusion, heavy epilogue fusion, asynchronous TMA load/store, and Ping-Pong scheduling to explicitly overlap memory IO and compute in grouped GEMM kernels (Sections 4.1–4.2; Figures 7–8).
Significance: Targets the regime where arithmetic intensity collapses (Eq. (4), Figure 3) and achieves high fractions of an upper bound in forward throughput (Figure 1 right mentions ~88% of a cuBLAS-based upper bound for a 30B configuration).
Tile-aware token rounding routing
Novelty: A routing method that rounds each expert’s token count to GEMM tile multiples while preserving token-choice preferences, with a bounded per-expert deviation of ≤ 1 tile (Section 5.2).
Significance: Reduces compute wasted on padding (“tile quantization”), giving additional speedups in high-sparsity regimes (Figures 16–17) while maintaining similar downstream performance (Tables 3, 6–8).
Efficient router top-K kernel tailored to large T
Novelty: A custom top-K using bitonic sorting networks, register-level operations, and packing column indices into FP32 mantissa bits (Section 4.3; Figure 10).
Significance: Addresses a nontrivial router overhead: PyTorch topk can take ~40% of router time (Section 4.3), and the kernel outperforms multiple baselines in bandwidth benchmarks (Figure 23).
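The index-packing idea behind the top-K kernel can be illustrated on the CPU. The sketch below is an assumption-laden illustration, not the kernel’s actual bit layout: it assumes finite, non-negative scores, overwrites the lowest mantissa bits of each FP32 score with the column index, and uses Python’s `sorted` in place of a register-level bitonic network. All function names are hypothetical.

```python
import struct


def f32_to_bits(x):
    """Reinterpret a float as its IEEE-754 single-precision bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]


def pack(score, col, idx_bits):
    """Replace the low idx_bits mantissa bits of score with the column index."""
    mask = (1 << idx_bits) - 1
    return bits_to_f32((f32_to_bits(score) & ~mask & 0xFFFFFFFF) | col)


def unpack_col(packed, idx_bits):
    return f32_to_bits(packed) & ((1 << idx_bits) - 1)


def topk_columns(scores, k):
    """Top-k column ids for one row, sorting only the packed FP32 values."""
    idx_bits = max(1, (len(scores) - 1).bit_length())
    packed = [pack(s, c, idx_bits) for c, s in enumerate(scores)]
    return [unpack_col(p, idx_bits) for p in sorted(packed, reverse=True)[:k]]


best = topk_columns([0.1, 0.7, 0.3, 0.9], k=2)
```

Because the index travels inside the value being sorted, no separate index array has to be moved alongside the scores. The cost is a small loss of score precision: with E ≤ 4096, 12 of the 23 mantissa bits would be taken by the index in this layout.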

5. Experimental Analysis

Evaluation methodology (what is measured and how)

Kernel-level performance profiling on H100
Runtime breakdowns for forward/backward kernels and components (router, gathers, grouped GEMMs, activations, aggregation) are shown in Figure 6 for a 7B training configuration (T,d,n,E,K)=(24576,1536,256,128,8).
Throughput is reported as:
- TFLOPS for compute-heavy grouped GEMMs,
- TB/s for memory-bound components like gather, (d)SwiGLU, aggregation (Figure 6 annotations).
Activation memory footprint
Peak per-layer activation memory is compared across model scales 1.4B–120B and across multiple baselines (Figure 13), including configurations listed in Table 9b.
End-to-end training throughput
A multi-node experiment trains a 7B MoE with FSDP-2 in the lm-engine codebase and reports tokens/day (Section 6.2.1).
Token rounding quality
MoE models are trained with TR but evaluated using standard token-choice top-K routing (Section 6.3.1) to test “drop-in replacement” feasibility.
Quality metrics: train/validation perplexity and 11 downstream task accuracies (Table 3).
Ablations: rounding subroutines (Table 6), microbatch size T / average tokens per expert T̄_e (Table 7), and tile size Mtile (Table 8).
Token rounding throughput
Benchmarks compare model TFLOPS (defined using model FLOPs, not padded hardware FLOPs) for TR vs vanilla token-choice top-K, focusing on sparse regimes by scaling E at fixed K (Figure 16) and on real-world MoE shapes (Figure 17).

Main quantitative results (with specific numbers)

Activation memory

For the 7B configuration with n=256, SonicMoE reduces per-layer memory usage by 45% compared to ScatterMoE (Section 6.1; Figure 13).
SonicMoE’s activation memory is reported as:
independent of expert granularity (Figure 1 left; Section 3.2),
and substantially smaller than MoMoE at larger scales (e.g., > 3 GiB per layer saved at 120B vs MoMoE, Section 6.1).

Kernel/runtime performance

Figure 6 (7B config) shows reductions in total kernel time:
- Forward: SonicMoE 1.237 ms vs ScatterMoE 2.255 ms, MoMoE 2.347 ms, Megatron 3.476 ms, etc. (Figure 6 left), i.e., roughly 1.8× faster than ScatterMoE.
- Backward: SonicMoE 2.173 ms vs ScatterMoE 3.968 ms, MoMoE 4.678 ms (Figure 6 right), again roughly 1.8× faster than ScatterMoE.
The abstract reports a 1.86× compute throughput improvement on Hopper GPUs compared to ScatterMoE’s BF16 MoE kernel for a fine-grained 7B MoE.

Full-layer forward/backward TFLOPS across scales

Figure 14 shows SonicMoE has the highest TFLOPS across the 1.4B/7B/30B/120B benchmark configurations (Table 9b); the text summarizes this as a ~40% TFLOPS improvement over ScatterMoE and MoMoE in the 1.4B and 7B settings (Section 6.2.1).
On configurations derived from recent open-source MoEs, SonicMoE is reported to:
generally exceed 550 TFLOPS forward and backward (Figure 15 narrative),
and succeed on a DeepSeek-V3.2-Exp-like layer configuration where several baselines fail due to index overflow or OOM, achieving 534.8 TFLOPS forward and 480.1 TFLOPS backward on a single H100 (Section 6.2.1, Figure 15 narrative).

End-to-end multi-GPU training throughput

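Section 6.2.1 reports, for a 7B MoE model trained with FSDP-2 using lm-engine:
- SonicMoE on 64 H100s achieves 213 billion tokens/day,
- ScatterMoE on 96 H100s achieves 225 billion tokens/day.
Per GPU, this is about 3.3 vs 2.3 billion tokens/day, i.e., roughly 1.4× higher throughput per H100 for SonicMoE.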

Token rounding throughput and quality

Throughput:
Figure 16 reports that in sparse settings, TR improves model TFLOPS over token-choice top-K, with example improvements stated in Section 6.3.3:
- For E=128 (with n=1k in the referenced columns): +16.5% forward, +6.1% backward, +9.4% end-to-end.
- For E=256 (high sparsity, K/E=1/128 in the referenced example): +25.7% forward, +11.8% backward, +15.9% end-to-end.
Figure 17 shows similar gains for real MoE configs, e.g. for K/E = 10/512 (Qwen3-Next-80B-A3B-Thinking-like config): +19.6% forward and +7.9% backward (Section 6.3.3).
Quality:
Table 3 shows TR-trained models evaluated with TC top-K achieve similar validation perplexity and average task accuracy to TC baselines across multiple sparsity regimes (e.g., 0.5B 8/64, 0.5B 2/64, 1.8B 8/256, 1.4B 2/128).
TR is reported as robust to rounding subroutine choice (Table 6) and generally robust when T̄_e / Mtile ≥ 2 (Section 5.2; Tables 7–8).

Do the experiments support the claims?

Memory claim support: Yes; Figures 1 and 13 directly measure peak activation memory vs granularity and across scales, and Section 3.2 provides a concrete accounting argument (2Td + 4TKn bytes).
Throughput claim support: Largely yes; Figures 6, 14, 15, and the cluster tokens/day measurement in Section 6.2.1 triangulate improvements at kernel, layer, and end-to-end training levels.
Token rounding claim support: The combination of (i) padding waste visualization (Figures 11–12), (ii) TFLOPS improvements (Figures 16–17), and (iii) downstream task/perplexity comparisons (Table 3 plus ablations Tables 6–8) provides consistent evidence, with an explicit condition (T̄_e / Mtile ≥ 2) for robustness.

6. Limitations and Trade-offs

TR routing is not directly autoregressive-inference compatible
The paper notes TR “is not a token-choice routing method,” creating difficulty for autoregressive generation, and evaluates by switching back to standard token-choice top-K at validation time (Section 6.3.1, footnote 28). This implies TR is mainly proposed as a training-time routing optimization.
TR quality depends on having enough tokens per expert per microbatch
The paper reports degradation when T̄_e / Mtile = 1 (Tables 7–8 discussion), and recommends robustness when T̄_e / Mtile ≥ 2 (Section 5.2). This constrains TR’s applicability for very small microbatches or extreme sparsity.
Hardware specificity
Key performance mechanisms explicitly depend on Hopper/Blackwell features (cp.async, TMA, Hopper warpgroup scheduling, Blackwell TMEM/UMMA) (Sections 4.1–4.2). The provided text does not characterize performance on older GPU generations.
Top-K kernel parameter limits
SonicMoE’s custom top-K kernel supports E ≤ 4096 and K ≤ 16 (Section 4.3). Larger values would require different synchronization/buffering strategies as acknowledged there.
Scatter-vs-gather design trade-offs
SonicMoE avoids scatter-fused stores on Hopper because they block overlap (Section 4.2; Figure 8), but this introduces an explicit expert aggregation kernel (Figure 9 left). This is a deliberate trade-off: fewer synchronous stores and better overlap at the cost of an extra aggregation step.
Determinism / numerical issues with atomic alternatives
The paper discusses an alternative using atomic adds to avoid an aggregation kernel (Figure 9 right) but rejects it due to nondeterminism and BF16 atomic-add numerical concerns, plus incompatibility with expert-parallel collectives (Section 3.2 footnote 9; Figure 9 caption).
Training setup details are incomplete in the provided excerpt
While LR/WD/schedule/context length/dataset are specified for routing-quality experiments (Appendix H), the optimizer choice and some standard training hyperparameters (e.g., betas, gradient clipping) are not provided in the included text, limiting full reproduction from this excerpt alone.

7. Implications and Future Directions

Field impact
The work pushes MoE efficiency toward a regime where quality-per-FLOP scaling trends (more granular + sparser MoEs) can be pursued without collapsing hardware utilization due to IO and tile padding. It reframes “efficient MoE” as an IO/scheduling problem as much as a FLOP-count problem (Eq. (4), Figure 3; Sections 4–5).
Practical applications
Faster and more memory-efficient pretraining/fine-tuning of MoE LLMs on Hopper/Blackwell-class GPUs, especially when:
- experts are fine-grained (n small relative to d),
- MoE is sparse (K constant, large E),
- activation memory is a bottleneck (Figures 1, 13),
- grouped GEMM padding waste becomes visible (Figures 11–12).
Repro/Integration Guidance (when to use what)
Prefer SonicMoE’s kernel design when you are training MoEs where IO dominates, because SonicMoE explicitly targets gather IO, epilogue cost, and overlap (Sections 3–4; Figures 6, 14). IO tends to dominate with:
- high granularity (G=d/n large),
- and/or high sparsity (ρ=K/E small).
Consider enabling token rounding primarily when both of the following hold, since this is where TR both speeds up training and remains quality-robust (Section 5.2; Tables 7–8):
- you are in a highly sparse regime where per-expert token counts are near tile sizes,
- and you can keep T̄_e / Mtile ≥ 2.
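The second condition can be checked before training from the microbatch shape alone. The helper below is a hypothetical convenience, not part of SonicMoE: it takes the average tokens per expert as T̄_e = T·K/E (total assignments divided by experts) and defaults to Mtile = 128, the tile size used in many of the TR experiments.

```python
def avg_tokens_per_expert(T, K, E):
    """Average number of routed tokens per expert in one microbatch."""
    return T * K / E


def tr_ratio_ok(T, K, E, m_tile=128, min_ratio=2.0):
    """True if the average tokens per expert spans at least min_ratio tiles."""
    return avg_tokens_per_expert(T, K, E) / m_tile >= min_ratio


# 7B benchmark shape: 24576 * 8 / 128 = 1536 tokens per expert, i.e. 12 tiles.
large_microbatch = tr_ratio_ok(T=24576, K=8, E=128)

# Illustrative small, very sparse microbatch: 4096 * 2 / 256 = 32 tokens per expert.
small_microbatch = tr_ratio_ok(T=4096, K=2, E=256)
```

A failing check does not mean TR cannot run; it flags the regime in which the paper reports quality degradation, where a larger microbatch or a smaller tile would be worth considering.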
If your use case requires strict token-choice routing at inference (typical autoregressive decoding), the paper’s evaluation protocol suggests training with TR and evaluating/serving with token-choice top-K (Section 6.3.1), but TR itself is not positioned as an inference router.
Future directions named in the text
Extending to lower precision / microscaling formats such as FP8, MXFP8, MXFP4 for further memory savings (Conclusion).
Overlapping communication with computation in distributed expert-parallel settings (Conclusion), complementing the paper’s focus on IO/compute overlap inside a GPU.
Broader design lesson emphasized
The conclusion argues for designing architectures that optimize quality per compute hour (considering real hardware bottlenecks) rather than only quality per theoretical FLOP, aligning with the paper’s emphasis on IO-aware kernel and routing co-design.