Transformers without Normalization
ArXiv: 2503.10622
🎯 Pitch
This paper introduces Dynamic Tanh (DyT), a simple, element-wise alternative to traditional normalization layers in Transformers. By showing that DyT can match or exceed the performance of LayerNorm and RMSNorm across diverse vision, language, and generative tasks—without the need for costly per-token statistics—this work both streamlines model implementations and fundamentally challenges the long-held belief that normalization is indispensable in deep neural networks.
1. Executive Summary
This paper shows that standard normalization layers in Transformers (such as LayerNorm and RMSNorm) can be replaced by a simple element‑wise operation, Dynamic Tanh (DyT), with little or no loss in accuracy and often small gains. The key insight is that the input→output mappings of LayerNorm in trained Transformers look like an S‑shaped curve, which tanh(αx) replicates while avoiding the cost and complexity of computing per‑token statistics.
2. Context and Motivation
- Problem addressed
  - Modern Transformers almost always include normalization layers (LayerNorm/RMSNorm). These are widely believed to be essential for stable optimization and good generalization in deep, wide models.
  - The paper asks: Are normalization layers actually indispensable in Transformers, or can we achieve the same effects with something simpler?
- Why this matters
  - Practical: Normalization layers compute means/variances and require reduction operations. That adds kernel complexity and can be a bottleneck on some hardware. A drop‑in, element‑wise replacement would simplify implementations and may enable new fusions and optimizations (Appendix C).
  - Conceptual: If normalization is not strictly required, we gain a clearer understanding of what it really does in deep networks.
- Where prior approaches fall short
  - “No‑norm” methods existed but rely on carefully crafted initializations (Fixup, SkipInit) or weight reparameterizations/constraints (e.g., spectral reparametrization). These often need significant hyperparameter tuning and may underperform normalized baselines (Section 6.3; Table 9).
  - Some methods remove norms only after pretraining via fine‑tuning, rather than training from scratch without norms.
- How this paper positions itself
  - It proposes a tiny, drop‑in replacement called Dynamic Tanh (`DyT`) that:
    - Does not compute statistics.
    - Is element‑wise and therefore simple to implement and potentially easier to optimize.
    - Empirically matches or exceeds LayerNorm/RMSNorm across diverse tasks and scales, including large language models (Sections 5 and 7.2).
3. Technical Approach
Step-by-step overview:
- What normalization layers do (background and observation)
  - Standard formulation (Equation 1, Section 2): a normalization layer transforms input `x` by subtracting a mean `µ`, dividing by the standard deviation `σ`, then applying learnable per‑channel scale `γ` and shift `β`.
  - Empirical observation (Section 3, Figures 2–4):
    - When plotting the element‑wise input vs. output (before the learned affine `γ`/`β`) of LayerNorm in trained models (ViT, wav2vec 2.0, DiT), the mapping looks like an S‑curve, highly reminiscent of `tanh`.
    - Deeper LayerNorm layers show this effect most clearly (Figure 2). Earlier layers look more linear.
    - By coloring points by token (left panels of Figure 4), each token’s mapping is linear but with a different slope (because each token has different variance). Collectively these different lines form an S‑curve.
    - By coloring by channel (right panels of Figure 4), a few channels exhibit extreme input ranges; these are squashed the most by normalization.
  - Plain-language interpretation: LayerNorm isn’t globally linear over all elements. Across tokens with different statistics it collectively acts like a near‑linear mapping around zero, but it disproportionately squashes extreme values, just like a saturating nonlinearity.
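The per-token picture can be checked with a small, framework-free sketch in plain Python (this is not code from the paper). It assumes a LayerNorm without the affine `γ`/`β` step and uses two made-up tokens, `narrow` and `wide`, that differ only in scale. Each token is mapped by a straight line with slope `1/σ`, so tokens carrying large activations get a shallow slope; pooling many such lines is what produces the S-shaped cloud.

```python
import math

def layer_norm(token, eps=1e-6):
    """Normalize one token (a list of channel values) without the affine step."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in token], std

# Two hypothetical tokens: same pattern, but `wide` has 10x larger activations.
narrow = [-2.0, -1.0, 0.0, 1.0, 2.0]
wide = [10.0 * v for v in narrow]

narrow_out, narrow_std = layer_norm(narrow)
wide_out, wide_std = layer_norm(wide)

# Both tokens have zero mean, so output / input is the constant slope 1 / std.
narrow_slope = narrow_out[-1] / narrow[-1]
wide_slope = wide_out[-1] / wide[-1]

print(f"narrow token: std={narrow_std:.3f}, slope={narrow_slope:.3f}")
print(f"wide token:   std={wide_std:.3f}, slope={wide_slope:.3f}")
# Inputs that differ by 10x end up at (almost) the same normalized output.
print(f"input 2.0 -> {narrow_out[-1]:.3f}, input 20.0 -> {wide_out[-1]:.3f}")
```

The wide token's slope is about ten times smaller than the narrow token's, which is the extreme-value compression described above.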
- The proposed replacement: Dynamic Tanh (`DyT`)
  - Definition (Equation 2, Section 4): `DyT(x) = γ * tanh(α x) + β`
    - `α`: a single learnable scalar that rescales inputs so that `tanh` operates in the “right” part of its S‑curve.
    - `γ`, `β`: standard learnable per‑channel scale and shift, same shapes as in LayerNorm/RMSNorm.
  - Implementation (Algorithm 1, Section 4): a tiny module that applies `tanh(αx)`, then the affine scale/shift.
  - Where it is used (Figure 1): replace each normalization layer in attention blocks, MLP/FFN blocks, and the final normalization.
  - What it does mechanistically:
    - Near zero, `tanh` is approximately linear, so most activations pass almost unchanged (Figure 3 shows different slopes via different `α`).
    - Large-magnitude activations are squashed into a bounded range (−1 to 1), reproducing the key “extreme‑value suppression” observed in LayerNorm (Figures 2–4).
    - The scalar `α` adapts over training and closely tracks the inverse activation scale: Section 6.2 and Figure 8 show `α` correlates with `1/std` both during and after training.
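A framework-free sketch of the forward pass described above, written in plain Python for illustration; it is not a reproduction of the paper's Algorithm 1. The function name `dyt` and the example activations are assumptions. `alpha` is the single shared scalar, and `gamma`/`beta` are per-channel lists.

```python
import math

def dyt(x, alpha, gamma, beta):
    """Apply DyT element-wise to one token: gamma_c * tanh(alpha * x_c) + beta_c."""
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]

channels = 4
alpha = 0.5                       # default initialization mentioned in the text
gamma = [1.0] * channels          # per-channel scale, initialized to 1
beta = [0.0] * channels           # per-channel shift, initialized to 0

token = [0.01, -0.5, 3.0, 200.0]  # hypothetical activations, one of them extreme
out = dyt(token, alpha, gamma, beta)

for v, y in zip(token, out):
    print(f"x={v:>7.2f}  DyT(x)={y:.4f}  alpha*x={alpha * v:.4f}")
```

Small inputs come out close to `alpha * x`, while the extreme value is capped near 1 before the affine step. No mean or variance is computed anywhere.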
- Design choices and rationale
  - Why `tanh`? Section 6.1 and Figure 7 compare `tanh`, `hardtanh`, `sigmoid`, and an identity mapping:
    - Without squashing (identity), training diverges (Table 7).
    - With squashing, training is stable; `tanh` performs best among the tested functions (Table 7), likely due to smoothness and being zero-centered.
  - Why a single scalar `α` (instead of per-channel or per-token)?
    - Simplicity and stability. Empirically, a single `α` already learns to match global scale dynamics (Figure 8). Per-channel or per-token `α` is not explored here.
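To make the candidate functions concrete, here is a small plain-Python sketch of the four mappings compared in the ablation. The definitions are the standard textbook ones (`hardtanh` as a clamp to [-1, 1]); they are assumed here rather than taken from the paper's code. Only the identity is unbounded, and only `sigmoid` is not centered at zero.

```python
import math

def identity(x):
    # No squashing at all: the output grows without bound.
    return x

def hardtanh(x):
    # Piecewise-linear clamp to [-1, 1]; bounded but not smooth at the corners.
    return max(-1.0, min(1.0, x))

def sigmoid(x):
    # Smooth and bounded in (0, 1), but sigmoid(0) = 0.5, so not zero-centered.
    return 1.0 / (1.0 + math.exp(-x))

candidates = {
    "identity": identity,
    "tanh": math.tanh,
    "hardtanh": hardtanh,
    "sigmoid": sigmoid,
}

inputs = [-50.0, -1.0, 0.0, 1.0, 50.0]
for name, fn in candidates.items():
    outputs = [fn(v) for v in inputs]
    print(f"{name:>8}: at 0 -> {fn(0.0):.2f}, range on inputs -> "
          f"[{min(outputs):.3f}, {max(outputs):.3f}]")
```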
- Practicalities and initialization
  - Default initialization: `γ=1`, `β=0`, `α0=0.5` typically works without hyperparameter changes (Section 4; Section 7.1).
  - LLM exception: training large LLaMA models benefits from tuned `α0`, with different values in attention vs. other blocks, and smaller `α0` as width increases (Section 7.2; Table 10; Figure 11; Table 11).
  - LLM embedding scale: an extra learnable scalar right after the embedding, initialized to `√d`, is added so early activations aren’t too small (Appendix A, “Large Language Models”).
- How DyT differs from normalization in computation and behavior
  - No reduction: DyT is element‑wise; no means/variances are computed.
  - No per-token adaptation: LayerNorm normalizes each token separately; DyT uses a single global `α`. The nonlinearity of `tanh` supplies the extreme‑value squashing.
  - Affine re-scaling is retained via `γ`/`β`, preserving representational flexibility.
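Written side by side, the two layers differ only in the inner term. The form below follows the verbal description above; the small stability constant `\epsilon` and the index notation (token `t`, channel `c`, width `C`) are assumptions added for readability.

```latex
\begin{aligned}
\mu_t &= \frac{1}{C}\sum_{c=1}^{C} x_{t,c}, \qquad
\sigma_t^2 = \frac{1}{C}\sum_{c=1}^{C} \left(x_{t,c}-\mu_t\right)^2 \\
\mathrm{LN}(x)_{t,c} &= \gamma_c \,\frac{x_{t,c}-\mu_t}{\sqrt{\sigma_t^2+\epsilon}} + \beta_c \\
\mathrm{DyT}(x)_{t,c} &= \gamma_c \,\tanh\!\left(\alpha\, x_{t,c}\right) + \beta_c
\end{aligned}
```

The LayerNorm statistics `µ_t` and `σ_t` require a reduction over the channel dimension of every token, whereas DyT touches each element independently and shares one `α` across all of them.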
4. Key Insights and Innovations
- Empirical reinterpretation of LayerNorm’s role (fundamental insight)
  - Observation: LayerNorm’s aggregated input→output mapping across tokens is S‑shaped, strongly resembling `tanh` (Section 3; Figures 2–4).
  - Significance: Recasts normalization’s global effect not as “pure normalization,” but as “near‑linear around zero + outlier squashing,” clarifying why it stabilizes training.
- A minimalist, drop‑in alternative to normalization (`DyT`) (core contribution)
  - Element‑wise `tanh(αx)` plus standard affine parameters replaces LN/RMSNorm across Transformer blocks (Section 4; Figure 1; Equation 2).
  - No statistics, no reductions, simple kernel, yet comparable or better performance across many tasks (Section 5, Tables 1–6).
- Understanding and leveraging `α` as a learned scale controller (explanatory insight)
  - `α` tracks `1/std` of activations during training and correlates with it after training (Figure 8), showing that DyT learns an implicit “global normalization” scale.
  - Removing `α` hurts performance (Table 8), confirming its necessity.
- Practical training guidelines for LLMs (useful innovation)
  - Tuned `α0` improves LLM training; larger widths need smaller `α0`; attention blocks benefit from higher `α0` than MLP/final blocks (Section 7.2; Table 10; Figure 11; Table 11).
  - This yields stable 7B–70B LLaMA training matching RMSNorm in loss and zero‑shot accuracy (Table 4; Figure 6).
- Strong comparison to other “no‑norm” methods (evidence of significance)
  - DyT outperforms initialization‑based approaches (Fixup, SkipInit) and matches/exceeds σReparam in ViT/MAE settings (Table 9).
5. Experimental Analysis
- Evaluation setup (Section 5; Appendix A)
  - “Replace all LN/RMSNorm with DyT” and keep the rest of the architecture unchanged (Figure 1).
  - Hyperparameters: as close as possible to the original training recipes; in most vision/speech/DNA experiments no tuning is needed. For DiT, a small LR search is done on the LN baseline and reused for DyT (Appendix A). For LLMs, `α0` is tuned and an embedding scalar is added (Appendix A; Section 7.2).
- Datasets, tasks, metrics
  - Supervised Vision on ImageNet‑1K (top‑1 accuracy): ViT‑B/L and ConvNeXt‑B/L.
  - Self‑supervised Vision: MAE and DINO pretrain on ImageNet‑1K, then fine‑tune (top‑1 accuracy).
  - Diffusion Models (DiT) on ImageNet‑1K: Fréchet Inception Distance (FID; lower is better).
  - Large Language Models (LLaMA 7B/13B/34B/70B): trained on The Pile to 200B tokens; report pretraining loss and average zero‑shot score across 15 lm‑eval tasks (Table 4).
  - Speech (wav2vec 2.0 on LibriSpeech): validation loss.
  - DNA sequence modeling (HyenaDNA, Caduceus): average accuracy across GenomicBenchmarks datasets.
- Main quantitative results
  - Supervised Vision (Table 1):
    > ViT‑B: 82.3% (LN) → 82.5% (DyT); ViT‑L: 83.1% → 83.6%
    > ConvNeXt‑B: 83.7% → 83.7%; ConvNeXt‑L: 84.3% → 84.4%
    - Training losses are nearly identical (Figure 5), suggesting similar learning dynamics.
  - Self‑supervised Vision (Table 2):
    > MAE ViT‑B: 83.2% → 83.2%; MAE ViT‑L: 85.5% → 85.4%
    > DINO ViT‑B (p16): 83.2% → 83.4%; DINO ViT‑B (p8): 84.1% → 84.5%
  - Diffusion (Table 3):
    > DiT‑B FID: 64.9 → 63.9 (better); DiT‑L: 45.9 → 45.7 (better); DiT‑XL: 19.9 → 20.8 (worse)
    - Mostly comparable; one degradation at XL size.
  - LLMs (Table 4; Figure 6):
    > Zero‑shot average and final training loss match RMSNorm across 7B/13B/34B/70B, with at most ±0.01 difference in loss for smaller models.
  - Speech (Table 5):
    > Base: 1.95 → 1.95; Large: 1.92 → 1.91 (slightly better)
  - DNA (Table 6):
    > HyenaDNA: 85.2% → 85.2%; Caduceus: 86.9% → 86.9%
- Ablations, diagnostics, and analysis
  - Squashing is essential (Section 6.1; Table 7; Figure 7):
    > Replacing `tanh` with identity leads to divergence. Squashing with `hardtanh`/`sigmoid` trains, but underperforms `tanh`.
  - `α` is essential (Section 6.1; Table 8):
    > Removing `α` drops ViT‑B top‑1 from 82.5% to 81.1%.
  - `α` dynamics and interpretation (Section 6.2; Figure 8):
    > `α` tracks `1/std` during training; final `α` correlates with `1/std` across layers, supporting the “implicit scale normalization” view.
  - Comparison to other norm‑removal methods (Section 6.3; Table 9):
    > ViT‑B: Fixup 77.2%, SkipInit 74.1%, σReparam 82.5%, DyT 82.8% (LN is 82.3%).
    > MAE ViT‑L: Fixup 74.1%, SkipInit 74.0%, σReparam 85.4%, DyT 85.8% (LN is 85.5%).
  - Sensitivity to `α0`
    - Non‑LLM tasks: broad plateau; `α0` in [0.5, 1.2] usually works (Figure 9). Larger models or higher LRs need smaller `α0` to avoid instability; DyT with `α0=0.5` has stability similar to LN (Figure 10).
    - LLMs: best `α0` depends strongly on model width and block type (Section 7.2):
      > Optimal `α0` (attention / other):
      > 7B: 0.8 / 0.2; 13B: 0.6 / 0.15; 34B: 0.2 / 0.05; 70B: 0.2 / 0.05 (Table 10)
      > Width, not depth, primarily determines `α0` (Table 11).
  - Efficiency (Appendix C; Tables 14–15):
    > Without compilation, DyT substantially reduces the time spent in the norm layers and yields ≈8% end‑to‑end speedups on LLaMA‑7B. After `torch.compile`, DyT and RMSNorm have similar latency.
  - Failure case in ConvNets with BatchNorm (Appendix D; Table 16):
    > Replacing BN with DyT in ResNet‑50: 76.2% → 68.9%; VGG19: 72.7% → 71.0%. DyT is not a drop‑in replacement for BN.
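The LLM initialization values quoted in the `α0` sensitivity results above can be kept in a small lookup. This is only a sketch: the names `ALPHA0_BY_MODEL` and `dyt_init` and the block labels are made up for illustration, the numbers are the Table 10 values as listed in this summary, and the width of 4096 in the example is an assumed input rather than a value taken from the paper.

```python
import math

# alpha0 as (attention blocks, other blocks), as listed above for Table 10.
ALPHA0_BY_MODEL = {
    "7B": (0.8, 0.2),
    "13B": (0.6, 0.15),
    "34B": (0.2, 0.05),
    "70B": (0.2, 0.05),
}

def dyt_init(model_size, block, width):
    """Return initial values for one DyT layer plus the embedding scalar.

    block: "attention" selects the attention-block alpha0; any other label
           (for example "ffn" or "final") selects the second value.
    width: embedding dimension d, used for the post-embedding scalar sqrt(d).
    """
    attn_alpha0, other_alpha0 = ALPHA0_BY_MODEL[model_size]
    alpha0 = attn_alpha0 if block == "attention" else other_alpha0
    return {
        "alpha0": alpha0,
        "gamma": 1.0,                         # per-channel scale starts at 1
        "beta": 0.0,                          # per-channel shift starts at 0
        "embedding_scale": math.sqrt(width),  # learnable scalar after the embedding
    }

# Example with an assumed width of 4096.
print(dyt_init("7B", "attention", width=4096))
print(dyt_init("7B", "ffn", width=4096))
```

For the non-LLM experiments no such table is needed, since a single default of `α0 = 0.5` is reported to work.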
- Do the experiments support the claims?
  - Breadth: The method is tested across recognition (supervised), self‑supervised, generation (diffusion), speech, DNA, and LLMs, using standard public codebases and recipes (Section 5; Appendix A).
  - Strength: On Transformers with LN/RMSNorm, DyT consistently matches or slightly betters baselines, including at large LLM scales (Table 4).
  - Caveats:
    - Some adjustments exist: DiT learning rate search on the baseline and non‑zero init differences for DyT (Appendix A), and an extra embedding‑scale parameter for LLMs (Appendix A). These are documented and reasonable, but they mean “no‑tuning” has exceptions.
    - Not universal: Fails to replace BatchNorm in classic ConvNets (Appendix D).
6. Limitations and Trade-offs
- Scope limitation
  - The positive results primarily cover Transformers using LayerNorm or RMSNorm. DyT is not shown to replace BatchNorm in ConvNets effectively (Appendix D).
- Granularity of normalization effect
  - DyT uses a single scalar `α` shared across channels/tokens. It cannot reproduce per‑token standardization like LN. The outlier suppression comes from the nonlinearity rather than per‑token variance control. This works well empirically but might be suboptimal in settings where token‑wise normalization is crucial.
- Initialization sensitivity in LLMs
  - Large, wide LLMs require careful `α0` selection, differing between attention and other blocks (Section 7.2). This adds a small but non‑negligible tuning burden compared to off‑the‑shelf RMSNorm.
- Potential saturation
  - Because DyT relies on `tanh`, overly large `α` or extreme activations could push many values into saturation, diminishing gradients. The paper mitigates this by learning `α` and shows empirically that `α` tracks `1/std` (Figure 8), but no theoretical guarantees are provided.
- Efficiency gains are situational
  - After compiler optimizations (`torch.compile`), DyT and RMSNorm have similar latency (Appendix C, Table 15). The hoped‑for speedup depends on hardware/kernels and is not guaranteed.
- Theoretical underpinnings
  - The paper provides compelling empirical evidence and a mechanistic interpretation but no formal proof that `tanh(αx)` and LayerNorm are equivalent in any sense; this remains an open theoretical question.
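The saturation concern listed above can be made concrete with the derivative of the inner term, `d/dx tanh(αx) = α (1 − tanh²(αx))`. The plain-Python sketch below evaluates it for a few hand-picked values of `α` and `x` (both arbitrary, chosen only for illustration) to show how quickly the gradient vanishes once `αx` is large.

```python
import math

def dyt_grad(x, alpha):
    """Derivative of tanh(alpha * x) with respect to x."""
    return alpha * (1.0 - math.tanh(alpha * x) ** 2)

for alpha in (0.5, 2.0):
    for x in (0.0, 1.0, 5.0, 20.0):
        grad = dyt_grad(x, alpha)
        print(f"alpha={alpha:<4} x={x:>5.1f}  d/dx tanh(alpha*x) = {grad:.2e}")
```

At `x = 0` the gradient equals `α`; as `αx` grows it decays toward zero, which is why an `α` that is too large for the activation scale can stall learning.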
7. Implications and Future Directions
- How this changes the landscape
  - It reframes normalization in Transformers as largely “S‑curve squashing with scale adaptation,” not necessarily the computation of per‑token statistics. That opens a path to simpler, norm‑free architectures.
  - For practitioners, this provides a practical alternative when normalization computation is undesirable (e.g., custom accelerators where reductions are expensive) or when kernel simplicity helps deployment.
- Follow‑up research enabled/suggested
  - Theory: Formalize when and why S‑curve squashing plus a learned global scale can substitute for per‑token normalization; analyze gradient flow and sharpness effects under DyT.
  - Variants of DyT:
    - Per‑channel or per‑head `α`; gating `α` by layer depth; dynamic `α` conditioned on attention statistics.
    - Other smooth, zero‑centered squashing functions or learned S‑curves.
  - Beyond Transformers:
    - Investigate why DyT fails to replace BatchNorm in ConvNets (Appendix D) and whether architectural changes (e.g., fewer norm sites, different residual scaling) can make it viable there.
  - Optimization and systems:
    - Kernel fusion with adjacent matmuls; low‑precision and quantization friendliness; custom hardware implementations leveraging no‑reduction property.
  - Applications:
    - Test in reinforcement learning, multi‑modal models, retrieval‑augmented systems, and very long‑context models to probe limits of token‑wise vs. global scaling.
- Practical takeaways
  - If you use Transformers with LN/RMSNorm, you can likely swap in `DyT` (`γ * tanh(αx) + β`) with your existing hyperparameters. Use `α0 = 0.5` by default; for LLMs, start from Table 10/11 heuristics (higher `α0` in attention blocks, smaller with greater width) and add the post‑embedding scalar initialized to `√d` (Appendix A).
  - Expect accuracy parity and similar loss curves (Figures 5 and 6), with occasional small gains. Do not expect universal speedups after compilation (Appendix C).
Overall, this work offers a concise, empirically strong alternative to normalization in Transformers, backed by a compelling mechanistic insight—LayerNorm’s observed S‑curve behavior—which DyT emulates with a single learned scale and a saturating nonlinearity.