Stronger Normalization-Free Transformers

🎯 Pitch

The paper introduces Dynamic erf (Derf), a simple learnable point-wise replacement for normalization layers defined as Derf(x)=γ·erf(αx+s)+β, found via a principled search over function properties. Replacing LayerNorm/RMSNorm with Derf yields consistent gains across vision, generation, speech, DNA, and language tasks, reducing dependence on activation statistics and improving generalization while simplifying Transformer implementations.

1. Executive Summary

This paper develops a stronger “normalization-free” Transformer design by replacing normalization layers (e.g., LayerNorm, RMSNorm) with a simple elementwise (point-wise) mapping called Dynamic erf (Derf), defined as Derf(x) = γ · erf(αx + s) + β (Eq. (10)). Across multiple modalities and tasks (ImageNet classification with ViT, ImageNet diffusion generation with DiT, speech self-supervised learning with wav2vec 2.0, DNA sequence modeling with HyenaDNA and Caduceus, and GPT-2 language modeling), Derf matches or surpasses the LayerNorm baseline and improves over Dynamic Tanh (DyT) (Figure 1; Tables 8–12). The paper also argues these gains are driven primarily by improved generalization rather than better training-set fitting, using an “evaluation-mode training loss” analysis (Section 6.1, Table 13; Appendix E).

2. Context and Motivation

Problem / gap addressed
Transformers traditionally rely heavily on normalization layers (especially LayerNorm) to stabilize training by re-centering and re-scaling activations using statistics like mean and variance (Eq. (1); Section 2).
Normalization has drawbacks the paper highlights:
- Dependence on activation statistics introduces memory access and synchronization overhead (Section 1).
- Some normalization methods can be sensitive to batch size, causing instability under certain batch settings (Section 1).
Recent work Dynamic Tanh (DyT) shows that a statistics-free point-wise function can replace normalization and reach similar performance, but it remains unclear whether such functions can surpass normalization (Abstract; Section 1; Section 2).
Why it matters
If normalization can be replaced by simple elementwise functions, models may become simpler and potentially avoid normalization-specific overheads (Section 1).
The central practical goal is not merely “train without norms,” but to train better Transformers (higher accuracy / lower FID / lower validation loss) while remaining normalization-free (Abstract; Figure 1c; Tables 8–12).
Prior approaches and where they fall short (as positioned here)
Normalization variants (BN, LN, RMSNorm) stabilize training via statistic-based transformations (Eq. (1); Section 2).
Normalization-free methods exist (Section 8), and DyT in particular is a simple drop-in replacement:
- DyT(x) = γ · tanh(αx) + β (Eq. (3)).
- DyT “reaches normalization-level performance,” but the paper positions a gap: the design space and principles for stronger-than-normalization point-wise functions are not well understood (Abstract; Sections 1–2).
How this paper positions itself
It provides:
1. A controlled analysis of four intrinsic properties of point-wise functions that govern stability/performance (Section 3; Figure 2; Tables 1–6).
2. A search over candidate point-wise functions that satisfy these properties (Section 4; Figure 5; Table 7).
3. A resulting function choice, Derf, that empirically outperforms normalization and DyT across domains (Sections 5–6; Figure 1c; Tables 8–12).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a drop-in replacement for normalization layers in Transformer-like architectures, swapping each normalization operation with a simple elementwise nonlinearity with a few learnable parameters.
It solves “how to train stable, high-performing Transformers without normalization statistics” by designing a point-wise function with the right shape properties and validating it broadly across tasks and modalities (Sections 3–6).

3.2 Big-picture architecture (diagram in words)

Input activations inside a Transformer block (before attention, before FFN, and at the final normalization) normally pass through LayerNorm/RMSNorm.
The paper replaces each norm layer with a Derf layer:
Derf applies the same scalar mapping to every element (token × channel activation), using a global scalar scale α, scalar shift s, and per-channel affine parameters γ, β (Eq. (10); Section 5).
The rest of the model (attention, FFN, diffusion, etc.) remains unchanged; only normalization layers are replaced one-to-one (Section 5).
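
The placement described above can be sketched in plain Python. This is a minimal illustration rather than the paper's code: it assumes the standard pre-norm residual layout, uses scalar γ and β instead of per-channel vectors, and `toy_attn` and `toy_ffn` are hypothetical stand-ins for the real sublayers.

```python
import math

def derf(x, alpha=0.5, s=0.0, gamma=1.0, beta=0.0):
    """Element-wise Derf (Eq. (10)) over a list of floats; scalar gamma/beta for brevity."""
    return [gamma * math.erf(alpha * v + s) + beta for v in x]

def block(x, attn, ffn, norm1, norm2):
    """Pre-norm residual block (ASSUMED layout): norm -> sublayer -> residual add."""
    h = [a + b for a, b in zip(x, attn(norm1(x)))]
    return [a + b for a, b in zip(h, ffn(norm2(h)))]

# Hypothetical stand-ins so the sketch runs on its own; real sublayers mix tokens/channels.
toy_attn = lambda v: [0.5 * t for t in v]
toy_ffn = lambda v: [2.0 * t for t in v]

x = [-3.0, 0.0, 3.0]
h = block(x, toy_attn, toy_ffn, norm1=derf, norm2=derf)  # both norm slots use Derf
out = derf(h)                                            # final norm slot, also Derf
print(out)
```

Only the three `norm` slots change; the sublayers and residual additions are untouched, which is what makes the swap one-to-one.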

3.3 Roadmap for the deep dive

I explain:
What normalization does in this paper’s framing and how point-wise functions differ (Section 2).
The four function properties the paper isolates and how each is tested (Section 3; Figure 2; Tables 1–6).
How those principles constrain a candidate function set and how the search is run (Section 4; Figure 5; Table 7).
The final Derf formulation, initialization, and how it is inserted into Transformers (Section 5; Eq. (10)).
The evaluation protocols and what the results say about performance vs. fitting/generalization (Sections 6–7; Tables 8–16; Appendix E).

3.4 Detailed, sentence-based technical breakdown

Framing: This is an empirical algorithm-and-system design paper: it proposes a specific point-wise operator (Derf) as a replacement for normalization layers, motivates it by a property analysis and function search, and validates it across tasks (Sections 3–6).

3.4.1 Normalization vs. point-wise functions (what is being replaced)

A normalization layer transforms activations by computing group statistics (mean μ and variance σ²) and then re-centering and scaling, followed by an affine transform (Eq. (1)).
The generic form is:
- y = γ * (x − μ) / sqrt(σ² + ε) + β (Eq. (1)).
For LayerNorm in Transformers, the statistics are computed per token over the channel/features dimension (Eq. (2); Section 2).
A point-wise function avoids statistics entirely by applying the same mapping to each element independently:
The paper writes this as y = γ · f(αx) + β (Eq. (4)) for property studies, where α is a learnable scalar that rescales inputs, and γ, β act like the affine parameters in normalization layers.

Key conceptual difference (Figure 1b):
- LayerNorm uses relationships among channels/tokens via computed statistics, while DyT and Derf apply independent scalar mappings to each activation element (Figure 1b caption; Section 2).
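
A small sketch can make this difference concrete. It assumes γ=1 and β=0, an ε of 1e-5 (a common default, not a value taken from the paper), and tanh with α=0.5 as an arbitrary point-wise example; `layer_norm` and `pointwise` are illustrative helper names.

```python
import math

def layer_norm(x, eps=1e-5):
    """Eq. (1) with gamma=1, beta=0: statistics taken over one token's channels."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def pointwise(x, f=math.tanh, alpha=0.5):
    """Eq. (4) with gamma=1, beta=0: every element is mapped independently."""
    return [f(alpha * v) for v in x]

token = [1.0, 2.0, 3.0, 4.0]
perturbed = [1.0, 2.0, 3.0, 40.0]   # only the last channel changes

ln_a, ln_b = layer_norm(token), layer_norm(perturbed)
pw_a, pw_b = pointwise(token), pointwise(perturbed)

# LayerNorm: channel 0 moves because mu and sigma depend on every channel.
print(ln_a[0], ln_b[0])
# Point-wise: the first three outputs are identical in both cases.
print(pw_a[:3] == pw_b[:3])
```

Perturbing one channel shifts every LayerNorm output for that token, while the point-wise outputs for the unchanged channels stay exactly the same.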

3.4.2 Property analysis: what makes a point-wise function “work” as a normalization replacement?

The paper isolates four properties of point-wise functions (Figure 2; Section 3) and tests them in controlled modifications, using ViT-Base on ImageNet-1K with top-1 accuracy as the outcome (Section 3).

Zero-centeredness (Section 3.1; Figure 2; Table 1)
Meaning here: outputs are balanced around 0 (positive and negative values are symmetric).
The paper tests this by shifting functions horizontally or vertically:
- f_horiz(x) = f(x + λ_horiz), f_vert(x) = f(x) + λ_vert (Eq. (5)).
- λ ∈ {±0.5, ±1, ±2} (Section 3.1).
Result: small shifts (|λ| ≤ 0.5) are mostly tolerable, but large shifts degrade performance and can cause divergence at |λ| ≥ 2 (Table 1).
Boundedness (Section 3.2; Figure 2; Tables 2–4; Figure 3)
Meaning here: function outputs stay within a finite range, preventing activation/variance explosion.
Two testing methods:
- Clipping unbounded functions: y = clip(f_u(x), −λ_u, λ_u) (Eq. (6)) with λ_u ∈ {0.5, 0.8, 1.0, 2.0, 3.0, 5.0} (Section 3.2).
- Making bounded functions partially linear/unbounded: y = (1 − λ_b) f_b(x) + λ_b x, λ_b ∈ (0,1) with values {0.01, 0.1, 0.5} (Eq. (7); Section 3.2).
Results:
- Clipping improves performance relative to unbounded baselines (Table 2).
- Pushing bounded functions toward linear behavior reduces accuracy and can cause training failure at λ_b = 0.5 (Table 3).
- There is an “acceptable growth-rate limit” for unbounded functions; faster growth (e.g., linear(x), x^(2/3)) diverges (Table 4; Figure 3).
Center sensitivity (Section 3.3; Figure 2; Table 5)
Meaning here: the function responds strongly to small changes near 0 (where activations concentrate).
Test: introduce a symmetric “flat zone” around 0 where f(x)=0 for x ∈ [−λ, λ], with λ controlling how insensitive the function becomes near 0 (Section 3.3).
Result: best performance at λ=0 (no flat zone), and performance degrades as λ increases; divergence occurs at λ ≥ 3.0 (Table 5).
Monotonicity (Section 3.4; Figure 2; Figure 4; Table 6)
Meaning here: increasing input yields increasing (or decreasing) output, preserving ordering; non-monotonicity can flip gradients in regions (Section 3.4).
Tests:
- Compare monotonic increasing functions to their negated monotonic decreasing versions f_neg(x) = −f(x).
- Compare to non-monotonic hump-shaped and oscillatory functions (Figure 4).
- The paper rescales candidates so they match ranges and are aligned on other properties (Section 3.4).
Result: monotonic functions (increasing or decreasing) perform best; non-monotonic functions reduce accuracy (Table 6).
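
The controlled modifications above can be written as small function wrappers. This is a sketch under stated assumptions: Eqs. (5)–(7) are transcribed as summarized here, while the flat-zone construction outside [−λ, λ] is an assumption (only f(x)=0 inside the zone is stated), and all helper names are illustrative.

```python
import math

def shift_horiz(f, lam):       # Eq. (5): f(x + lam)
    return lambda x: f(x + lam)

def shift_vert(f, lam):        # Eq. (5): f(x) + lam
    return lambda x: f(x) + lam

def clip_output(f_u, lam_u):   # Eq. (6): clip(f_u(x), -lam_u, lam_u)
    return lambda x: max(-lam_u, min(lam_u, f_u(x)))

def blend_linear(f_b, lam_b):  # Eq. (7): (1 - lam_b) * f_b(x) + lam_b * x
    return lambda x: (1.0 - lam_b) * f_b(x) + lam_b * x

def flat_zone(f, lam):
    # ASSUMPTION: outside the zone the curve is f shifted outward by lam,
    # which keeps it continuous whenever f(0) = 0 (true for tanh and erf).
    def g(x):
        if abs(x) <= lam:
            return 0.0
        return f(x - math.copysign(lam, x))
    return g

print(shift_vert(math.tanh, 2.0)(0.0))      # output at 0 is no longer 0
print(clip_output(lambda x: x, 1.0)(5.0))   # identity becomes bounded
print(blend_linear(math.tanh, 0.5)(100.0))  # tanh becomes unbounded
print(flat_zone(math.tanh, 1.0)(0.5))       # insensitive near the origin
```

Each wrapper changes one property while leaving the base function otherwise intact, which mirrors the one-factor-at-a-time design of the study.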

Design principle synthesized (Section 4):
- Strong candidates should be near zero-centered, bounded, center-sensitive, and monotonic (Section 4, derived from Section 3 findings).
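
One way to make the four criteria operational is a numeric screen over a grid. The sketch below is a hypothetical heuristic, not the paper's filtering procedure: the grid, the probe points, and every threshold are assumptions chosen for illustration.

```python
import math

def check_properties(f, span=10.0, n=2001):
    """Heuristic numeric screen for the four properties (all thresholds are assumptions)."""
    xs = [-span + 2.0 * span * i / (n - 1) for i in range(n)]
    ys = [f(x) for x in xs]
    steps = [b - a for a, b in zip(ys, ys[1:])]
    slope_at_zero = (f(1e-4) - f(-1e-4)) / 2e-4
    return {
        # odd symmetry about the origin, so outputs balance around 0
        "zero_centered": all(abs(f(x) + f(-x)) < 1e-6 for x in xs),
        # output stops growing between two very large positive inputs
        "bounded": abs(f(1e12)) <= 1.01 * abs(f(1e6)) + 1e-9,
        # non-negligible slope at the origin
        "center_sensitive": abs(slope_at_zero) > 0.1,
        # never changes direction on the grid
        "monotonic": all(d >= -1e-12 for d in steps) or all(d <= 1e-12 for d in steps),
    }

print(check_properties(math.erf))                       # passes all four checks
print(check_properties(lambda x: x))                    # fails "bounded"
print(check_properties(lambda x: math.exp(-x * x)))     # hump: fails "monotonic"
```

A screen like this can only reject obvious violations; the paper's ranking among functions that pass still comes from training runs.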

3.4.3 Function search: how Derf is chosen

The paper constructs a candidate set from common scalar function families and CDF-like shapes (polynomial, rational, exponential, logarithmic, trigonometric, and CDFs), then applies transformations (translation, scaling, mirroring, rotation, clipping) and retains only those meeting the four properties (Section 4; Appendix B overview).
All candidates are instantiated in a unified parametric form:
y = γ * f(αx + s) + β (Eq. (8)),
where:
- α is a learnable input scale,
- s is a learnable input shift (added because it “improves final performance” and is ablated later in Section 7.1),
- γ, β are affine parameters analogous to normalization layers.
The search is run on:
ViT-Base (ImageNet-1K top-1 accuracy),
DiT-B/4 and DiT-L/4 (ImageNet generation, FID) (Section 4 “Setup”).
The paper reports a broad comparison of candidate functions (Table 7; Figure 5 visualization).
In that table, erf(x) is the best among the tested point-wise functions on both ViT accuracy and DiT FID.
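
The unified form in Eq. (8) can be sketched as a small factory that wraps any base function. The assumptions are worth stating: parameters are scalars for brevity, the default α=0.5 is borrowed from the Derf initialization in Section 5, and the `arctan_scaled` and `isru` entries are illustrative S-shaped functions that may or may not appear in Table 7.

```python
import math

# Illustrative base functions; only erf and tanh are confirmed by the text above.
CANDIDATES = {
    "erf": math.erf,
    "tanh": math.tanh,
    "arctan_scaled": lambda x: (2.0 / math.pi) * math.atan(x),
    "isru": lambda x: x / math.sqrt(1.0 + x * x),
}

def make_layer(f, alpha=0.5, s=0.0, gamma=1.0, beta=0.0):
    """Eq. (8): y = gamma * f(alpha * x + s) + beta (scalar parameters for brevity)."""
    return lambda x: gamma * f(alpha * x + s) + beta

probe_points = (0.5, 1.0, 2.0, 4.0)
rows = {
    name: [round(make_layer(f)(x), 4) for x in probe_points]
    for name, f in CANDIDATES.items()
}
for name, values in rows.items():
    print(f"{name:>14}: {values}")
```

All four curves share the same range and pass through the origin, yet they saturate at different rates, which is the kind of difference the search in Table 7 measures empirically.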

3.4.4 The proposed layer: Dynamic erf (Derf)

The error function is defined as:
erf(x) = (2/√π) ∫_0^x e^(−t²) dt (Eq. (9)).
Derf augments it with learnable shift and scale and affine per-channel parameters:
Derf(x) = γ · erf(αx + s) + β (Eq. (10)).
The paper specifies α and s as learnable scalars, while γ and β are learnable per-channel vectors (Section 5).
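
The integral definition in Eq. (9) can be checked numerically against the standard library. The sketch below uses composite Simpson's rule, which is a choice made here for illustration; `erf_simpson` is a hypothetical helper name.

```python
import math

def erf_simpson(x, n=200):
    """Approximate Eq. (9) with composite Simpson's rule (n must be even)."""
    if x == 0.0:
        return 0.0
    h = x / n                                  # signed step, so negative x works too
    total = 1.0 + math.exp(-x * x)             # integrand at t = 0 and t = x
    for i in range(1, n):
        t = i * h
        total += (4.0 if i % 2 else 2.0) * math.exp(-t * t)
    return (2.0 / math.sqrt(math.pi)) * total * h / 3.0

for x in (-2.0, 0.5, 1.0, 3.0):
    print(x, erf_simpson(x), math.erf(x))
```

In practice a library routine such as `math.erf` is the right tool; the quadrature is only there to connect the formula to the function being called.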
Where it is used in Transformers:
The paper replaces each normalization layer one-to-one with Derf:
- pre-attention norm, pre-FFN norm, and final norm (Section 5).
Initialization (Section 5):
γ initialized to all ones (vector),
β initialized to all zeros (vector),
α initialized to 0.5 (scalar),
s initialized to 0 (scalar).
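
A minimal Derf layer with this initialization might look as follows. It is a plain-Python sketch for inputs shaped `[tokens][channels]`, with no autograd; in a real framework α, s, γ, and β would be registered as trainable parameters, and the class layout here is an assumption rather than the authors' implementation.

```python
import math

class Derf:
    """Minimal Derf layer (Eq. (10)) for inputs shaped [tokens][channels]."""

    def __init__(self, num_channels, alpha_init=0.5, s_init=0.0):
        self.alpha = alpha_init             # learnable scalar, init 0.5
        self.s = s_init                     # learnable scalar, init 0
        self.gamma = [1.0] * num_channels   # per-channel scale, init ones
        self.beta = [0.0] * num_channels    # per-channel shift, init zeros

    def __call__(self, x):
        return [
            [g * math.erf(self.alpha * v + self.s) + b
             for v, g, b in zip(token, self.gamma, self.beta)]
            for token in x
        ]

layer = Derf(num_channels=3)
y = layer([[2.0, 0.0, -2.0], [100.0, -100.0, 0.5]])
print(y[0])   # first entry is erf(1.0), roughly 0.8427
print(y[1])   # large inputs saturate at +1 and -1
```

No statistic of the input is computed anywhere, so the output for one element never depends on its neighbours.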

Worked micro-example (illustrative, using Eq. (10)):
- Suppose one activation element is x = 2.0, with initial parameters α=0.5, s=0, γ=1, β=0.
- Then αx + s = 1.0, so the output is y = erf(1.0) ≈ 0.8427 (a standard value of the error function, not a number quoted from the paper).
- Since erf(·) is S-shaped and bounded within (−1, 1) (as implied by its use as a bounded candidate in Sections 3–4 and shown among “natural functions” in Appendix B), the output remains finite and saturates for large |x|, which aligns with the boundedness motivation (Section 3.2).

3.4.5 Training/evaluation pipelines and hyperparameters (as provided)

The paper evaluates Derf as a drop-in replacement across several model families; below are the explicit training configurations included in the provided content.

ViT (ImageNet-1K supervised classification)
- Optimizer: AdamW (Table 17).
- Base learning rate: 4e-3 (Table 17).
- Weight decay: 0.05 (Table 17).
- Momentum: ViT-B (β1=0.9, β2=0.999); ViT-L (β1=0.9, β2=0.95) (Table 17).
- Effective batch size: 4096 (Table 17).
- LR schedule: cosine decay, warmup 20 epochs, total 300 epochs (Table 17).
- Augmentations / regularizers: RandAugment-like setting (rand-m9-mstd0.5-inc1), label smoothing 0.1, mixup 0.8, cutmix 1.0, random erase 0.25, drop path 0.15 (B) / 0.5 (L), EMA 0.9999 (Table 17).

DiT (ImageNet diffusion generation)
- Optimizer: AdamW (Table 18).
- Base learning rate sweep: {1e-4, 2e-4, 4e-4}; the best result across these is reported (Table 18 and text below it).
- Weight decay: 0 (Table 18).
- Effective batch size: 256 (Table 18).
- LR schedule: constant; training epochs: 80; EMA: 0.9999 (Table 18).
- Implementation detail noted by the paper: “zero initialization” is retained for LN models but removed for point-wise-function models because it harmed Derf/other point-wise models (text under Table 18).

wav2vec 2.0 (LibriSpeech SSL pretraining)
- Optimizer: Adam (Table 19).
- Learning rate: 5e-4 (Base), 3e-4 (Large) (Table 19).
- Weight decay: 0.01; momentum (β1=0.9, β2=0.98) (Table 19).
- Max updates: 400,000 (Base), 250,000 (Large); warmup updates: 32,000 (Base), 20,000 (Large) (Table 19).
- Precision: the paper changes all models to fp32 instead of the default bf16 (Appendix C “Speech models” description).
- Retained normalization: the initial GroupNorm and a LayerNorm after the conv feature extractor are kept as “data normalization,” and the other normalization layers are replaced with DyT/Derf (Appendix C “Speech models”).

DNA models (HyenaDNA, Caduceus)
- Optimizer: AdamW (Table 20).
- HyenaDNA: learning rate 6e-4, sequence length 1024, effective batch size 1024, steps 10,000 (Table 20).
- Caduceus: learning rate 8e-3, sequence length 131,072, effective batch size 8, steps 50,000 (Table 20).
- Additional flags: RC augmentation true for HyenaDNA and false for Caduceus; MLM probability 0.0 (HyenaDNA) and 0.15 (Caduceus); bidirectional false (HyenaDNA) and true (Caduceus) (Table 20).

GPT-2 (124M, OpenWebText)
- Optimizer: AdamW; base LR 6e-4; weight decay 0.1; momentum (β1=0.9, β2=0.95) (Table 21).
- Gradient clipping: 1.0 (Table 21).
- Context length / block size: 1024 (Table 21).
- Training iterations: 300,000; warmup 2,000; LR schedule cosine decay (Table 21).
- Dropout 0.0; mixed precision bf16 (Table 21).
- The paper notes for GPT-2 that DyT/Derf require additional tuning of α initialization and reports the best validation loss over multiple initialization combinations (Appendix C “Language models”).
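
Both the ViT and GPT-2 recipes above combine a warmup phase with cosine decay. A minimal sketch of such a schedule follows, assuming linear warmup and decay to a floor of zero; neither endpoint is stated in the tables summarized here, and `lr_at` is a hypothetical helper.

```python
import math

def lr_at(step, base_lr, warmup, total, min_lr=0.0):
    """Linear warmup, then cosine decay to min_lr (endpoints are ASSUMPTIONS)."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# ViT recipe: units are epochs (warmup 20, total 300, base LR 4e-3).
vit_lrs = [lr_at(e, 4e-3, warmup=20, total=300) for e in (0, 19, 160, 300)]
# GPT-2 recipe: units are iterations (warmup 2,000, total 300,000, base LR 6e-4).
gpt_lrs = [lr_at(i, 6e-4, warmup=2_000, total=300_000) for i in (0, 1_999, 151_000, 300_000)]
print(vit_lrs)
print(gpt_lrs)
```

The DiT recipe uses a constant learning rate instead, so this schedule would not apply there.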

Architectural hyperparameters not provided in the excerpt
The provided content gives training hyperparameters and task/model identifiers, but it does not state:
- ViT/DiT architectural dimensions (layers, heads, hidden size).
- Tokenizer details and total training tokens/compute for GPT-2/OpenWebText.
- Compute budget (e.g., PF-days) or hardware for any experiment.
These specifics are therefore left unstated here rather than guessed.

4. Key Insights and Innovations

A principled shape-based recipe for replacing normalization with point-wise functions
Novelty: Instead of proposing one function outright, the paper isolates four function properties (zero-centeredness, boundedness, center sensitivity, monotonicity) and tests them independently via controlled modifications (Section 3; Figure 2; Tables 1–6).
Significance: This turns “try a few activations” into a more systematic design space, and it yields concrete failure modes (e.g., divergence at large shifts, unbounded growth) tied to specific properties (Tables 1–5).
A function search constrained by these properties that identifies erf as best-in-family
Novelty: The paper builds a broad candidate set via transformations and clipping while enforcing the four properties, then benchmarks candidates on both classification and diffusion generation (Section 4; Figure 5; Table 7).
Significance: Even among visually similar S-shaped functions, outcomes differ measurably; erf(x) is the strongest in their candidate set (Table 7).
Derf: adding learnable scale and shift to erf yields a strong normalization-free layer
Novelty: Derf(x) = γ · erf(αx + s) + β (Eq. (10)) extends the best base function from the search with learnable parameters in the same “affine wrapper” style as DyT and normalization layers.
Significance: It consistently improves performance over LayerNorm, RMSNorm, and DyT across multiple domains (Figure 1c; Tables 8–12; Appendix D).
Evidence that Derf’s advantage is primarily generalization, not better fitting
Novelty: The paper evaluates “fitting capacity” via evaluation-mode training loss computed after training with stochastic regularization and augmentations disabled (Section 6.1; Appendix E).
Significance: Training loss ranks Norm < Derf < DyT (Table 13), yet Derf’s downstream metrics are best in many settings (Tables 8–11), supporting the claim that point-wise functions act like an implicit regularizer (Section 6.1 Discussion).
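
The idea behind an evaluation-mode training loss can be illustrated with a toy example: the same training samples are scored once with stochastic regularization and augmentation switched on, and once with both switched off. Everything below is hypothetical (the one-weight `ToyModel`, its dropout, and the jitter augmentation); it only shows the measurement protocol, not the paper's models.

```python
import random

class ToyModel:
    """Hypothetical one-weight model with dropout, used only to show the two modes."""

    def __init__(self, w=2.0, drop_p=0.5, seed=0):
        self.w, self.drop_p = w, drop_p
        self.rng = random.Random(seed)

    def __call__(self, x, train):
        if train and self.rng.random() < self.drop_p:
            return 0.0                                   # unit dropped
        scale = 1.0 / (1.0 - self.drop_p) if train else 1.0
        return self.w * x * scale                        # inverted-dropout scaling

def mean_sq_loss(model, data, train, augment=None):
    total = 0.0
    for x, y in data:
        if train and augment is not None:
            x = augment(x)                               # augmentation only in train mode
        total += (model(x, train=train) - y) ** 2
    return total / len(data)

train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
model = ToyModel()
train_mode_loss = mean_sq_loss(model, train_set, train=True, augment=lambda x: x + 0.1)
eval_mode_train_loss = mean_sq_loss(model, train_set, train=False)
print(train_mode_loss, eval_mode_train_loss)
```

The train-mode number is inflated by the noise sources themselves, so only the evaluation-mode number reflects how well the model actually fits the training set.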

5. Experimental Analysis

Evaluation methodology (datasets, metrics, baselines, setup)

Property analysis experiments (Section 3)
Model: ViT-Base (Section 3).
Dataset/metric: ImageNet-1K top-1 accuracy (Section 3).
Controlled manipulations:
- Shifts (Eq. (5); Table 1),
- Boundedness clamping (Eq. (6); Table 2),
- Boundedness removal (Eq. (7); Table 3),
- Center sensitivity flat region (Table 5),
- Monotonic vs non-monotonic sets (Figure 4; Table 6).
Function search (Section 4)
Models: ViT-Base, DiT-B/4, DiT-L/4.
Metrics:
- ViT: ImageNet-1K top-1 accuracy.
- DiT: FID using ImageNet reference batch evaluation (Section 4).
Candidate functions: listed and compared in Table 7; visualized in Figure 5.
Main cross-domain evaluation (Section 6; Appendix D)
Vision classification: ViT-B, ViT-L on ImageNet-1K (Table 8; plus Table 22 includes RMSNorm/GN).
Image generation: DiT-B/4, DiT-L/4, DiT-XL/2 on ImageNet (Table 9; plus Table 23 includes RMSNorm).
Speech SSL: wav2vec 2.0 Base/Large on LibriSpeech, metric is validation loss (Table 10; plus Table 24 includes RMSNorm).
DNA: HyenaDNA and Caduceus pretrained on human reference genome (GRCh38) and evaluated on GenomicBenchmarks with average accuracy over subtasks (Table 11; plus Table 25 includes LN/RMSNorm).
Language: GPT-2 124M on OpenWebText, validation loss (Table 12; plus Table 26 includes RMSNorm).

Main quantitative results (with numbers)

Figure 1c summary table (also repeated in later tables) shows consistent gains across domains:

LN: ViT acc 82.3%, DiT FID 45.91, DNA acc 86.9%
DyT: ViT acc 82.5%, DiT FID 45.66, DNA acc 86.9%
Derf: ViT acc 82.8%, DiT FID 43.94, DNA acc 87.3%
(Figure 1c)

Vision Transformers (Table 8):

ViT-B: LN 82.3%, DyT 82.5%, Derf 82.8% (Δ vs LN +0.5%, vs DyT +0.3%)
ViT-L: LN 83.1%, DyT 83.6%, Derf 83.8% (Δ vs LN +0.7%, vs DyT +0.2%)

Diffusion Transformers (Table 9):

DiT-B/4: LN 64.93, DyT 63.94, Derf 63.23 (lower is better)
DiT-L/4: LN 45.91, DyT 45.66, Derf 43.94
DiT-XL/2: LN 19.94, DyT 20.83, Derf 18.92

Speech (Table 10):

wav2vec 2.0 Base: LN 1.95, DyT 1.95, Derf 1.93
wav2vec 2.0 Large: LN 1.92, DyT 1.91, Derf 1.90

DNA (Table 11):

Hyena: Norm 85.2%, DyT 85.2%, Derf 85.7%
Caduceus: Norm 86.9%, DyT 86.9%, Derf 87.3%

Language (Table 12):

GPT-2: LN 2.94, DyT 2.97, Derf 2.94
Derf matches LN and improves over DyT by 0.03 in validation loss (Table 12).
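
The per-row improvements can be recomputed directly from the numbers quoted above. The sketch below only rearranges those values; the sign convention (positive means Derf is better) and the dictionary layout are choices made here.

```python
# (normalization baseline, DyT, Derf, higher_is_better), copied from Tables 8-12 as quoted above
RESULTS = {
    "ViT-B acc": (82.3, 82.5, 82.8, True),
    "ViT-L acc": (83.1, 83.6, 83.8, True),
    "DiT-B/4 FID": (64.93, 63.94, 63.23, False),
    "DiT-L/4 FID": (45.91, 45.66, 43.94, False),
    "DiT-XL/2 FID": (19.94, 20.83, 18.92, False),
    "wav2vec Base loss": (1.95, 1.95, 1.93, False),
    "wav2vec Large loss": (1.92, 1.91, 1.90, False),
    "HyenaDNA acc": (85.2, 85.2, 85.7, True),
    "Caduceus acc": (86.9, 86.9, 87.3, True),
    "GPT-2 loss": (2.94, 2.97, 2.94, False),
}

def improvement(reference, derf, higher_is_better):
    """Positive means Derf is better than the reference."""
    delta = derf - reference if higher_is_better else reference - derf
    return round(delta, 2)

gains = {
    name: (improvement(base, derf, hib), improvement(dyt, derf, hib))
    for name, (base, dyt, derf, hib) in RESULTS.items()
}
for name, (vs_norm, vs_dyt) in gains.items():
    print(f"{name:>18}: vs norm {vs_norm:+.2f}, vs DyT {vs_dyt:+.2f}")
```

Laid out this way, the GPT-2 row stands out as the only one where the gain over the normalization baseline is zero.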

Additional normalization baselines (Appendix D)
- ViT (Table 22): Derf exceeds LN, DyT, RMSNorm, and GN (e.g., ViT-B Derf 82.8% vs RMSNorm 82.4%, GN 82.5%).
- DiT (Table 23): Derf beats RMSNorm as well (e.g., DiT-L/4 Derf 43.94 vs RMSNorm 45.02).
- wav2vec (Table 24): Derf yields lowest validation loss among LN/DyT/RMSNorm.
- GPT-2 (Table 26): Derf matches LN (2.94) and is slightly better than RMSNorm (2.95), while beating DyT (2.97).

Do experiments support the claims?

Claim: point-wise functions can outperform normalization layers.
Supported in multiple settings where Derf beats LN/RMSNorm:
- ViT-B/L accuracy gains (Table 8; Table 22).
- DiT FID improvements across sizes (Table 9; Table 23).
- wav2vec validation loss reductions (Table 10; Table 24).
- DNA accuracy gains (Table 11; Table 25).
For GPT-2, Derf matches LN rather than surpassing it (Table 12; Table 26). The paper’s abstract says “outperforms … across a wide range of domains,” and the provided table indicates at least one case where it is tied rather than strictly better.
Claim: Derf improves generalization rather than fitting capacity.
The evaluation-mode training loss analysis shows Norm < Derf < DyT across architectures (Table 13), meaning Derf fits the training set worse than normalization, yet often yields better downstream metrics (Tables 8–11).
This is consistent evidence for the proposed interpretation in Section 6.1 Discussion, though it remains a correlational argument (the paper does not present a mechanistic proof in the provided excerpt).

Ablations, robustness checks, and failure cases

Effect of adding learnable shift s (Section 7.1; Table 14)
Adding s improves results for multiple functions on both ViT and DiT metrics.
Example from Table 14:
- erf(x) ViT: 82.6% → 82.8% with s
- erf(x) DiT-B/4 FID: 63.39 → 63.23 with s
Scalar vs per-channel vector s (Table 15)
Little difference; scalar is adopted for efficiency/simplicity.
Approximating erf with a scaled tanh(εx) (Section 7.2; Eq. (11); Table 16)
Best-fit ε ≈ 1.205 (Section 7.2).
tanh(εx) improves slightly over tanh(x) but remains worse than erf(x):
- ViT-B: tanh(x) 82.6%, tanh(εx) 82.7%, erf(x) 82.8% (Table 16).
- DiT-L: tanh(x) 45.48, tanh(εx) 45.13, erf(x) 43.94 (Table 16).
Training failure regimes identified in property studies
Large shifts (|λ| ≥ 2) cause training failure (Table 1).
Removing boundedness too aggressively causes failure (λ_b = 0.5 leads to × for multiple functions, Table 3).
Too low center sensitivity (large flat region, λ ≥ 3) causes failure (Table 5).
Fast-growing unbounded functions diverge (Table 4).
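
The tanh(εx) approximation of erf from Section 7.2 can be reproduced with a short grid search. The fitting objective and range are assumptions here (sum of squared errors on [−3, 3]); the paper's Eq. (11) may define the fit differently, so agreement with the reported ε ≈ 1.205 should be read as approximate.

```python
import math

xs = [-3.0 + 0.01 * i for i in range(601)]          # ASSUMED fitting range [-3, 3]
target = [math.erf(x) for x in xs]

def sq_error(eps):
    """Sum of squared differences between tanh(eps * x) and erf(x) on the grid."""
    return sum((math.tanh(eps * x) - t) ** 2 for x, t in zip(xs, target))

candidates = [1.0 + 0.001 * i for i in range(501)]  # eps in [1.0, 1.5]
best_eps = min(candidates, key=sq_error)
max_gap = max(abs(math.tanh(best_eps * x) - t) for x, t in zip(xs, target))

print(f"best eps under these assumptions: {best_eps:.3f}")
print(f"largest remaining gap to erf:     {max_gap:.4f}")
```

Under these assumptions the fit lands close to 1.2, and a visible gap to erf remains even at the best ε, which is consistent with Table 16 showing tanh(εx) improving on tanh(x) without matching erf(x).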

6. Limitations and Trade-offs

Missing architectural/compute disclosure in the provided excerpt
The paper includes many optimizer and schedule hyperparameters (Tables 17–21), but the excerpt does not specify key architectural dimensions (layers/heads/hidden size) or compute budgets/tokens/hardware details for each experiment. This limits reproducibility detail from the provided content alone.
Performance is not uniformly “strictly better” everywhere
On GPT-2 (124M), Derf matches LN rather than exceeding it (Table 12; Table 26). This suggests improvements may be task/model dependent, and language modeling may require more careful initialization/tuning (Appendix C “Language models”).
Fitting-capacity trade-off
Derf (and DyT) show higher evaluation-mode training loss than normalization layers across tested architectures (Table 13). If a use case prioritizes maximum training-set fit (e.g., memorization or minimizing training loss), normalization may still be favored.
Implementation/training caveats for specific model families
For DiT, the paper notes that:
- default learning rate is “suboptimal” and they sweep learning rates for all models (Table 18 text),
- “zero initialization” harms Derf/point-wise models and is removed for those but retained for LN (Table 18 text).
This indicates point-wise replacements may require nontrivial training-protocol adjustments in some architectures.
Scope of the function search
The function search is empirical and constrained to the candidate families and transformations used (Section 4; Appendix B). It does not guarantee global optimality over all possible point-wise mappings.
No formal theory of why erf is best
The paper provides property-based intuition and empirical evidence, but in the provided content it does not present a formal derivation proving Derf should outperform normalization or DyT; it remains an empirical finding (Sections 3–7).

7. Implications and Future Directions

How this changes the landscape
The paper strengthens the case that normalization is not only removable but can be surpassed by carefully shaped elementwise functions, at least in the tested settings (Figure 1; Tables 8–11; Appendix D).
The “property lens” (zero-centered, bounded, center-sensitive, monotonic) provides a reusable framework for reasoning about normalization-free operators beyond just DyT/Derf (Section 3; Figure 2).
Follow-up research directions suggested by the results
Deeper mechanistic understanding of generalization gains: The paper’s evidence suggests point-wise functions behave as implicit regularizers because they do not adapt to batch/token/channel statistics (Section 6.1 Discussion; Table 13). A next step would be to more directly quantify or model this regularization effect.
Broader scaling studies: The excerpt includes GPT-2 at 124M and wav2vec Base/Large, ViT B/L, and DiT up to XL/2. Extending to larger LMs and additional regimes would clarify where Derf consistently wins vs. ties (Tables 8–12).
Better initialization and training recipes: The paper already indicates sensitivity to initialization choices in DiT (Table 18 text) and requires tuning of α initialization for GPT-2 (Appendix C). More systematic recipes could reduce “gotchas.”
Practical applications / downstream use cases
Vision: If you train ViTs on ImageNet-like classification, Derf provides measurable accuracy improvements over LN/DyT under the provided training recipe (Table 8; Table 17).
Generative modeling: For DiT-style diffusion Transformers, Derf improves FID across multiple model sizes, but you may need to adopt the learning-rate sweep and remove zero initialization as described (Table 9; Table 18 text).
Speech/DNA: Derf improves validation loss/accuracy for wav2vec 2.0 and for HyenaDNA/Caduceus (Tables 10–11), suggesting it can transfer beyond vision.
Repro/Integration Guidance (when to prefer Derf vs alternatives)
Prefer Derf when:
- You want a drop-in norm replacement in Transformer-like models and you can keep the rest of the architecture fixed (Section 5).
- You care about downstream performance and can accept slightly worse fitting-capacity metrics (Table 13).
Prefer classic normalization when:
- You require the lowest training loss / strongest fitting under the same optimization, since normalization achieves the lowest evaluation-mode training loss in every listed architecture (Table 13).
Implementation checklist from the paper:
- Replace each norm layer one-to-one (pre-attn, pre-FFN, final) with Derf (Section 5).
- Initialize α=0.5, s=0, γ=1, β=0 (Section 5).
- For DiT-like training, consider their reported training-protocol tweaks (LR sweep; avoid zero init for point-wise models) (Table 18 text).
- For GPT-2-like training, be prepared to tune α initialization (Appendix C “Language models”).