Stronger Normalization-Free Transformers

ArXiv: 2512.10938

🎯 Pitch

The paper introduces Dynamic erf (Derf), a simple learnable point-wise replacement for normalization layers defined as Derf(x)=γ·erf(αx+s)+β, found via a principled search over function properties. Replacing LayerNorm/RMSNorm with Derf matches or improves on the normalization baselines across vision, generation, speech, DNA, and language tasks (on GPT-2 it ties LayerNorm), removing the dependence on activation statistics and improving generalization while simplifying Transformer implementations.


1. Executive Summary

This paper develops a stronger “normalization-free” Transformer design by replacing normalization layers (e.g., LayerNorm, RMSNorm) with a simple elementwise (point-wise) mapping called Dynamic erf (Derf), defined as Derf(x) = γ · erf(αx + s) + β (Eq. (10)). Across multiple modalities and tasks—ImageNet classification (ViT), ImageNet diffusion generation (DiT), speech self-supervised learning (wav2vec 2.0), DNA sequence modeling (HyenaDNA, Caduceus), and GPT-2 language modeling—Derf consistently matches or surpasses normalization baselines and improves over Dynamic Tanh (DyT) (Figure 1; Tables 8–12). The paper also argues these gains are driven primarily by improved generalization rather than better training-set fitting, using an “evaluation-mode training loss” analysis (Section 6.1, Table 13; Appendix E).


2. Context and Motivation

  • Problem / gap addressed
    • Transformers traditionally rely heavily on normalization layers (especially LayerNorm) to stabilize training by re-centering and re-scaling activations using statistics like mean and variance (Eq. (1); Section 2).
    • Normalization has drawbacks the paper highlights:
      • Dependence on activation statistics introduces memory access and synchronization overhead (Section 1).
      • Some normalization methods can be sensitive to batch size, causing instability under certain batch settings (Section 1).
    • Recent work on Dynamic Tanh (DyT) shows that a statistics-free point-wise function can replace normalization and reach similar performance, but it remains unclear whether such functions can surpass normalization (Abstract; Section 1; Section 2).

  • Why it matters
    • If normalization can be replaced by simple elementwise functions, models may become simpler and potentially avoid normalization-specific overheads (Section 1).
    • The central practical goal is not merely “train without norms,” but to train better Transformers (higher accuracy / lower FID / lower validation loss) while remaining normalization-free (Abstract; Figure 1c; Tables 8–12).

  • Prior approaches and where they fall short (as positioned here)
    • Normalization variants (BN, LN, RMSNorm) stabilize training via statistic-based transformations (Eq. (1); Section 2).
    • Normalization-free methods exist (Section 8), and DyT in particular is a simple drop-in replacement:
      • DyT(x) = γ · tanh(αx) + β (Eq. (3)).
      • DyT “reaches normalization-level performance,” but the paper positions a gap: the design space and principles for stronger-than-normalization point-wise functions are not well understood (Abstract; Sections 1–2).

  • How this paper positions itself
    • It provides:
      1. A controlled analysis of four intrinsic properties of point-wise functions that govern stability/performance (Section 3; Figure 2; Tables 1–6).
      2. A search over candidate point-wise functions that satisfy these properties (Section 4; Figure 5; Table 7).
      3. A resulting function choice, Derf, that empirically matches or outperforms normalization and DyT across domains (Sections 5–6; Figure 1c; Tables 8–12).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a drop-in replacement for normalization layers in Transformer-like architectures, swapping each normalization operation with a simple elementwise nonlinearity with a few learnable parameters.
  • It solves “how to train stable, high-performing Transformers without normalization statistics” by designing a point-wise function with the right shape properties and validating it broadly across tasks and modalities (Sections 3–6).

3.2 Big-picture architecture (diagram in words)

  • Input activations inside a Transformer block (before attention, before FFN, and at the final normalization) normally pass through LayerNorm/RMSNorm.
  • The paper replaces each norm layer with a Derf layer:
    • Derf applies the same scalar mapping to every element (token × channel activation), using a global scalar scale α, scalar shift s, and per-channel affine parameters γ, β (Eq. (10); Section 5).
  • The rest of the model (attention, FFN, diffusion, etc.) remains unchanged; only normalization layers are replaced one-to-one (Section 5).

3.3 Roadmap for the deep dive

  • I explain:
    • What normalization does in this paper’s framing and how point-wise functions differ (Section 2).
    • The four function properties the paper isolates and how each is tested (Section 3; Figure 2; Tables 1–6).
    • How those principles constrain a candidate function set and how the search is run (Section 4; Figure 5; Table 7).
    • The final Derf formulation, initialization, and how it is inserted into Transformers (Section 5; Eq. (10)).
    • The evaluation protocols and what the results say about performance vs. fitting/generalization (Sections 6–7; Tables 8–16; Appendix E).

3.4 Detailed, sentence-based technical breakdown

Framing: This is an empirical algorithm-and-system design paper: it proposes a specific point-wise operator (Derf) as a replacement for normalization layers, motivates it by a property analysis and function search, and validates it across tasks (Sections 3–6).

3.4.1 Normalization vs. point-wise functions (what is being replaced)

  • A normalization layer transforms activations by computing group statistics (mean μ and variance σ²) and then re-centering and scaling, followed by an affine transform (Eq. (1)).
  • The generic form is:
    • y = γ · (x − μ) / sqrt(σ² + ε) + β (Eq. (1)).
  • For LayerNorm in Transformers, the statistics are computed per token over the channel/features dimension (Eq. (2); Section 2).
  • A point-wise function avoids statistics entirely by applying the same mapping to each element independently:
    • The paper writes this as y = γ · f(αx) + β (Eq. (4)) for property studies, where α is a learnable scalar that rescales inputs, and γ, β act like the affine parameters in normalization layers.
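
The contrast between Eq. (1) and Eq. (4) can be made concrete with a small sketch. The code below is an illustration written for this summary, not code from the paper: the helper names `layer_norm` and `pointwise`, the value `eps=1e-6`, and the example token are all assumptions. It perturbs a single channel and reports which outputs move under each operator.

```python
import math

def layer_norm(token, gamma, beta, eps=1e-6):
    """Eq. (1)-style normalization over the channels of one token (eps is assumed)."""
    mu = sum(token) / len(token)
    var = sum((x - mu) ** 2 for x in token) / len(token)
    return [g * (x - mu) / math.sqrt(var + eps) + b
            for x, g, b in zip(token, gamma, beta)]

def pointwise(token, f, alpha, gamma, beta):
    """Eq. (4)-style mapping: every element is transformed independently."""
    return [g * f(alpha * x) + b for x, g, b in zip(token, gamma, beta)]

token = [0.5, -1.0, 2.0, 0.1]          # hypothetical activations of one token
perturbed = token[:-1] + [10.0]        # change only the last channel
ones, zeros = [1.0] * 4, [0.0] * 4

ln_a = layer_norm(token, ones, zeros)
ln_b = layer_norm(perturbed, ones, zeros)
pw_a = pointwise(token, math.tanh, 0.5, ones, zeros)
pw_b = pointwise(perturbed, math.tanh, 0.5, ones, zeros)

# Which channels changed after the perturbation?
ln_changed = [abs(a - b) > 1e-9 for a, b in zip(ln_a, ln_b)]
pw_changed = [abs(a - b) > 1e-9 for a, b in zip(pw_a, pw_b)]
print(ln_changed)  # all channels move: mean and variance are shared
print(pw_changed)  # only the perturbed channel moves
```

Under normalization, one outlier channel rescales the whole token. Under a point-wise function it affects only itself, which is the statistics-free behavior the paper builds on.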

Key conceptual difference (Figure 1b): LayerNorm uses relationships among channels/tokens via computed statistics, while DyT and Derf apply independent scalar mappings to each activation element (Figure 1b caption; Section 2).

3.4.2 Property analysis: what makes a point-wise function “work” as a normalization replacement?

The paper isolates four properties of point-wise functions (Figure 2; Section 3) and tests them in controlled modifications, using ViT-Base on ImageNet-1K with top-1 accuracy as the outcome (Section 3).

  1. Zero-centeredness (Section 3.1; Figure 2; Table 1)
    • Meaning here: outputs are balanced around 0 (positive and negative values are symmetric).
    • Test: shift functions horizontally or vertically:
      • f_horiz(x) = f(x + λ_horiz), f_vert(x) = f(x) + λ_vert (Eq. (5)).
      • λ ∈ {±0.5, ±1, ±2} (Section 3.1).
    • Result: small shifts (|λ| ≤ 0.5) are mostly tolerable, but large shifts degrade performance and can cause divergence at |λ| ≥ 2 (Table 1).

  2. Boundedness (Section 3.2; Figure 2; Tables 2–4; Figure 3)
    • Meaning here: function outputs stay within a finite range, preventing activation/variance explosion.
    • Two testing methods:
      • Clipping unbounded functions: y = clip(f_u(x), −λ_u, λ_u) (Eq. (6)) with λ_u ∈ {0.5, 0.8, 1.0, 2.0, 3.0, 5.0} (Section 3.2).
      • Making bounded functions partially linear/unbounded: y = (1 − λ_b) f_b(x) + λ_b x, λ_b ∈ (0,1) with values {0.01, 0.1, 0.5} (Eq. (7); Section 3.2).
    • Results:
      • Clipping improves performance relative to unbounded baselines (Table 2).
      • Pushing bounded functions toward linear behavior reduces accuracy and can cause training failure at λ_b = 0.5 (Table 3).
      • There is an “acceptable growth-rate limit” for unbounded functions; faster growth (e.g., linear(x), x^(2/3)) diverges (Table 4; Figure 3).

  3. Center sensitivity (Section 3.3; Figure 2; Table 5)
    • Meaning here: the function responds strongly to small changes near 0 (where activations concentrate).
    • Test: introduce a symmetric “flat zone” around 0 where f(x)=0 for x ∈ [−λ, λ], with λ controlling how insensitive the function becomes near 0 (Section 3.3).
    • Result: best performance at λ=0 (no flat zone), and performance degrades as λ increases; divergence occurs at λ ≥ 3.0 (Table 5).

  4. Monotonicity (Section 3.4; Figure 2; Figure 4; Table 6)
    • Meaning here: increasing input yields increasing (or decreasing) output, preserving ordering; non-monotonicity can flip gradients in regions (Section 3.4).
    • Tests:
      • Compare monotonic increasing functions to their negated monotonic decreasing versions f_neg(x) = −f(x).
      • Compare to non-monotonic hump-shaped and oscillatory functions (Figure 4).
      • The paper rescales candidates so they match ranges and are aligned on other properties (Section 3.4).
    • Result: monotonic functions (increasing or decreasing) perform best; non-monotonic functions reduce accuracy (Table 6).
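
The controlled modifications in Eqs. (5)–(7) are simple function wrappers, and a minimal sketch may help make them concrete. The wrapper names are hypothetical. The `flat_zone` construction is also an assumption: the summary only states that f(x)=0 on [−λ, λ], so the sketch shifts f outward beyond the zone to keep the function continuous when f(0)=0.

```python
import math

def shift_horizontal(f, lam):
    """Eq. (5): f_horiz(x) = f(x + lam)."""
    return lambda x: f(x + lam)

def shift_vertical(f, lam):
    """Eq. (5): f_vert(x) = f(x) + lam."""
    return lambda x: f(x) + lam

def clip_output(f_u, lam_u):
    """Eq. (6): clamp an unbounded function to [-lam_u, lam_u]."""
    return lambda x: max(-lam_u, min(lam_u, f_u(x)))

def blend_with_linear(f_b, lam_b):
    """Eq. (7): mix a bounded function with the identity, removing boundedness."""
    return lambda x: (1.0 - lam_b) * f_b(x) + lam_b * x

def flat_zone(f, lam):
    """Assumed construction: zero on [-lam, lam], f shifted outward elsewhere."""
    def g(x):
        if abs(x) <= lam:
            return 0.0
        return f(x - lam) if x > 0 else f(x + lam)
    return g

shifted = shift_horizontal(math.tanh, 2.0)
clipped = clip_output(lambda x: x, 1.0)
blended = blend_with_linear(math.tanh, 0.5)
flattened = flat_zone(math.tanh, 1.0)

print(shifted(0.0))    # tanh(2.0): the function is no longer zero at the origin
print(clipped(5.0))    # 1.0: the identity is now bounded
print(blended(100.0))  # grows linearly: boundedness is lost
print(flattened(0.5))  # 0.0: insensitive near the origin
```

Each wrapper changes one property while leaving the base function's overall shape intact, which is what lets the paper attribute accuracy changes to a single property at a time.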

Design principle synthesized (Section 4): strong candidates should be near zero-centered, bounded, center-sensitive, and monotonic (derived from the Section 3 findings).
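
As a rough illustration of how the four properties could be screened numerically, the sketch below tests a scalar function on a grid. It is not the paper's filtering procedure: the function name `check_properties`, the odd-symmetry proxy for zero-centeredness, and the thresholds `bound=10.0` and `min_slope=0.1` are all assumptions chosen for illustration.

```python
import math

def check_properties(f, lo=-6.0, hi=6.0, n=1201, bound=10.0, min_slope=0.1):
    """Grid-based proxies for the four properties; thresholds are illustrative."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    diffs = [f(b) - f(a) for a, b in zip(xs, xs[1:])]
    h, tol = 1e-3, 1e-12
    return {
        # Odd symmetry as a proxy for outputs balanced around zero.
        "zero_centered": max(abs(f(x) + f(-x)) for x in xs) < 1e-9,
        # Outputs stay small even far from the origin.
        "bounded": all(abs(f(x)) <= bound for x in (-1e6, -1e3, 1e3, 1e6)),
        # Central-difference slope at 0 is clearly non-zero.
        "center_sensitive": abs(f(h) - f(-h)) / (2 * h) >= min_slope,
        # The function never changes direction on the grid.
        "monotonic": all(d >= -tol for d in diffs) or all(d <= tol for d in diffs),
    }

examples = {
    "erf": math.erf,
    "tanh": math.tanh,
    "sin": math.sin,                    # bounded but oscillatory
    "identity": lambda x: x,            # monotonic but unbounded
    "relu": lambda x: max(0.0, x),      # neither zero-centered nor bounded
}
report = {name: check_properties(f) for name, f in examples.items()}
for name, props in report.items():
    print(name, props)
```

Under these proxies erf and tanh pass all four checks, while sin fails monotonicity and the identity fails boundedness. That matches the qualitative picture from Section 3, but it says nothing about which passing function trains best; that requires the empirical search in Section 4.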

3.4.3 Function search: how Derf is chosen

  • The paper constructs a candidate set from common scalar function families and CDF-like shapes (polynomial, rational, exponential, logarithmic, trigonometric, and CDFs), then applies transformations (translation, scaling, mirroring, rotation, clipping) and retains only those meeting the four properties (Section 4; Appendix B overview).
  • All candidates are instantiated in a unified parametric form:
    • y = γ · f(αx + s) + β (Eq. (8)),
    • where:
      • α is a learnable input scale,
      • s is a learnable input shift (added because it “improves final performance” and is ablated later in Section 7.1),
      • γ, β are affine parameters analogous to normalization layers.
  • The search is run on:
    • ViT-Base (ImageNet-1K top-1 accuracy),
    • DiT-B/4 and DiT-L/4 (ImageNet generation, FID) (Section 4 “Setup”).
  • The paper reports a broad comparison of candidate functions (Table 7; Figure 5 visualization).
    • In that table, erf(x) is the best among the tested point-wise functions on both ViT accuracy and DiT FID.
3.4.4 The proposed layer: Dynamic erf (Derf)

  • The error function is defined as:
    • erf(x) = (2/√π) ∫_0^x e^(−t²) dt (Eq. (9)).
  • Derf augments it with learnable shift and scale and affine per-channel parameters:
    • Derf(x) = γ · erf(αx + s) + β (Eq. (10)).
    • The paper specifies α and s as learnable scalars, while γ and β are learnable per-channel vectors (Section 5).
  • Where it is used in Transformers:
    • The paper replaces each normalization layer one-to-one with Derf:
      • pre-attention norm, pre-FFN norm, and final norm (Section 5).
  • Initialization (Section 5):
    • γ initialized to all ones (vector),
    • β initialized to all zeros (vector),
    • α initialized to 0.5 (scalar),
    • s initialized to 0 (scalar).
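
A forward-only sketch of the layer described above, using only the Python standard library, might look as follows. It is not the authors' implementation: the class name `Derf`, the list-of-lists input layout, and the constructor arguments are assumptions, and the parameters are plain floats with no gradient tracking, so this shows the computation in Eq. (10) and the Section 5 initialization rather than a trainable module.

```python
import math

class Derf:
    """Forward pass of Derf(x) = gamma * erf(alpha * x + s) + beta (Eq. (10))."""

    def __init__(self, num_channels, alpha_init=0.5, s_init=0.0):
        self.alpha = alpha_init              # scalar input scale
        self.s = s_init                      # scalar input shift
        self.gamma = [1.0] * num_channels    # per-channel scale, initialized to ones
        self.beta = [0.0] * num_channels     # per-channel shift, initialized to zeros

    def __call__(self, tokens):
        """tokens: list of tokens, each a list of num_channels floats."""
        return [
            [g * math.erf(self.alpha * x + self.s) + b
             for x, g, b in zip(token, self.gamma, self.beta)]
            for token in tokens
        ]

layer = Derf(num_channels=3)
out = layer([[2.0, 0.0, -2.0], [100.0, 0.1, -100.0]])
print(out[0])  # first entry is erf(1.0), roughly 0.8427
print(out[1])  # very large inputs saturate at +1 and -1
```

In a deep-learning framework the same structure would typically hold α and s as scalar parameters and γ, β as per-channel parameter vectors, with the layer swapped in wherever a normalization layer previously sat.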

Worked micro-example (illustrative, using Eq. (10)):

  • Suppose one activation element is x = 2.0, with initial parameters α=0.5, s=0, γ=1, β=0.
  • Then αx + s = 1.0, so the output is y = erf(1.0) ≈ 0.8427 (a standard value of the error function, not a number reported in the paper).
  • Since erf(·) is S-shaped with outputs in (−1, 1) (as implied by its use as a bounded candidate in Sections 3–4 and shown among “natural functions” in Appendix B), the output remains finite and saturates for large |x|, which aligns with the boundedness motivation (Section 3.2).

3.4.5 Training/evaluation pipelines and hyperparameters (as provided)

The paper evaluates Derf as a drop-in replacement across several model families; below are the explicit training configurations included in the provided content.

ViT (ImageNet-1K supervised classification)

  • Optimizer: AdamW (Table 17).
  • Base learning rate: 4e-3 (Table 17).
  • Weight decay: 0.05 (Table 17).
  • Momentum:
    • ViT-B: (β1=0.9, β2=0.999) (Table 17).
    • ViT-L: (β1=0.9, β2=0.95) (Table 17).
  • Effective batch size: 4096 (Table 17).
  • LR schedule: cosine decay, warmup 20 epochs, total 300 epochs (Table 17).
  • Augmentations / regularizers: RandAugment-like setting (rand-m9-mstd0.5-inc1), label smoothing 0.1, mixup 0.8, cutmix 1.0, random erase 0.25, drop path 0.15 (B) / 0.5 (L), EMA 0.9999 (Table 17).
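
To make the ViT schedule concrete, here is a minimal sketch of a cosine-decay learning rate with warmup, using the Table 17 values (base LR 4e-3, 20 warmup epochs, 300 total epochs). Several details are assumptions because the summary does not state them: the warmup is taken to be linear, the floor `min_lr` is taken to be 0, the schedule is evaluated per epoch rather than per step, and 4e-3 is treated as the peak learning rate.

```python
import math

def lr_at_epoch(epoch, base_lr=4e-3, warmup_epochs=20, total_epochs=300, min_lr=0.0):
    """Cosine decay with linear warmup (warmup shape and min_lr are assumptions)."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr at the end of warmup.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 19, 20, 160, 299):
    print(epoch, f"{lr_at_epoch(epoch):.6f}")
```

The learning rate peaks when warmup ends and then decays smoothly, reaching half the peak midway through the cosine phase (epoch 160 under these assumptions).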

DiT (ImageNet diffusion generation)

  • Optimizer: AdamW (Table 18).
  • Base learning rate sweep: {1e-4, 2e-4, 4e-4}; the best result across these is reported (Table 18 and text below it).
  • Weight decay: 0 (Table 18).
  • Effective batch size: 256 (Table 18).
  • LR schedule: constant; training epochs: 80; EMA: 0.9999 (Table 18).
  • Important implementation detail the paper notes:
    • “Zero initialization” is retained for LN models but removed for point-wise-function models because it harmed Derf/other point-wise models (text under Table 18).

wav2vec 2.0 (LibriSpeech SSL pretraining)

  • Optimizer: Adam (Table 19).
  • Learning rate: 5e-4 (Base), 3e-4 (Large) (Table 19).
  • Weight decay: 0.01; momentum (β1=0.9, β2=0.98) (Table 19).
  • Max updates: 400,000 (Base), 250,000 (Large); warmup updates: 32,000 (Base), 20,000 (Large) (Table 19).
  • Precision: the paper trains all models in fp32 instead of the default bf16 (Appendix C “Speech models” description).
  • The initial GroupNorm and the LayerNorm after the conv feature extractor are retained as “data normalization”; the other normalization layers are replaced with DyT/Derf (Appendix C “Speech models”).

DNA models (HyenaDNA, Caduceus)

  • Optimizer: AdamW (Table 20).
  • HyenaDNA: learning rate 6e-4, sequence length 1024, effective batch size 1024, steps 10,000 (Table 20).
  • Caduceus: learning rate 8e-3, sequence length 131,072, effective batch size 8, steps 50,000 (Table 20).
  • Additional flags: RC augmentation true for HyenaDNA and false for Caduceus; MLM probability 0.0 (HyenaDNA) and 0.15 (Caduceus); bidirectional false (HyenaDNA) and true (Caduceus) (Table 20).

GPT-2 (124M, OpenWebText)

  • Optimizer: AdamW; base LR 6e-4; weight decay 0.1; momentum (β1=0.9, β2=0.95) (Table 21).
  • Gradient clipping: 1.0 (Table 21).
  • Context length / block size: 1024 (Table 21).
  • Training iterations: 300,000; warmup 2,000; LR schedule cosine decay (Table 21).
  • Dropout 0.0; mixed precision bf16 (Table 21).
  • The paper notes for GPT-2 that DyT/Derf require additional tuning of α initialization and reports the best validation loss over multiple initialization combinations (Appendix C “Language models”).

Architectural hyperparameters not provided in the excerpt

  • The provided content lists training hyperparameters and task/model identifiers, but it does not include:
    • ViT/DiT architectural dimensions (layers, heads, hidden size).
    • Tokenizer details and total training tokens/compute for GPT-2/OpenWebText.
    • Compute budget (e.g., PF-days) or hardware for any experiment.
  • Those specifics are therefore omitted here rather than guessed.


4. Key Insights and Innovations

  1. A principled shape-based recipe for replacing normalization with point-wise functions
    • Novelty: Instead of proposing one function outright, the paper isolates four function properties (zero-centeredness, boundedness, center sensitivity, monotonicity) and tests them independently via controlled modifications (Section 3; Figure 2; Tables 1–6).
    • Significance: This turns “try a few activations” into a more systematic design space, and it yields concrete failure modes (e.g., divergence at large shifts, unbounded growth) tied to specific properties (Tables 1–5).

  2. A function search constrained by these properties that identifies erf as best-in-family
    • Novelty: The paper builds a broad candidate set via transformations and clipping while enforcing the four properties, then benchmarks candidates on both classification and diffusion generation (Section 4; Figure 5; Table 7).
    • Significance: Even among visually similar S-shaped functions, outcomes differ measurably; erf(x) is the strongest in their candidate set (Table 7).

  3. Derf: adding learnable scale and shift to erf yields a strong normalization-free layer
    • Novelty: Derf(x) = γ · erf(αx + s) + β (Eq. (10)) extends the best base function from the search with learnable parameters in the same “affine wrapper” style as DyT and normalization layers.
    • Significance: It matches or improves on LayerNorm, RMSNorm, and DyT across multiple domains (Figure 1c; Tables 8–12; Appendix D).

  4. Evidence that Derf’s advantage is primarily generalization, not better fitting
    • Novelty: The paper evaluates “fitting capacity” via evaluation-mode training loss computed after training with stochastic regularization and augmentations disabled (Section 6.1; Appendix E).
    • Significance: Training loss ranks Norm < Derf < DyT (Table 13), yet Derf’s downstream metrics are best in many settings (Tables 8–11), supporting the claim that point-wise functions act like an implicit regularizer (Section 6.1 Discussion).

5. Experimental Analysis

Evaluation methodology (datasets, metrics, baselines, setup)

  • Property analysis experiments (Section 3)
    • Model: ViT-Base (Section 3).
    • Dataset/metric: ImageNet-1K top-1 accuracy (Section 3).
    • Controlled manipulations:
      • Shifts (Eq. (5); Table 1),
      • Boundedness clamping (Eq. (6); Table 2),
      • Boundedness removal (Eq. (7); Table 3),
      • Center sensitivity flat region (Table 5),
      • Monotonic vs non-monotonic sets (Figure 4; Table 6).

  • Function search (Section 4)
    • Models: ViT-Base, DiT-B/4, DiT-L/4.
    • Metrics:
      • ViT: ImageNet-1K top-1 accuracy.
      • DiT: FID using ImageNet reference batch evaluation (Section 4).
    • Candidate functions: listed and compared in Table 7; visualized in Figure 5.

  • Main cross-domain evaluation (Section 6; Appendix D)
    • Vision classification: ViT-B, ViT-L on ImageNet-1K (Table 8; plus Table 22 includes RMSNorm/GN).
    • Image generation: DiT-B/4, DiT-L/4, DiT-XL/2 on ImageNet (Table 9; plus Table 23 includes RMSNorm).
    • Speech SSL: wav2vec 2.0 Base/Large on LibriSpeech, metric is validation loss (Table 10; plus Table 24 includes RMSNorm).
    • DNA: HyenaDNA and Caduceus pretrained on human reference genome (GRCh38) and evaluated on GenomicBenchmarks with average accuracy over subtasks (Table 11; plus Table 25 includes LN/RMSNorm).
    • Language: GPT-2 124M on OpenWebText, validation loss (Table 12; plus Table 26 includes RMSNorm).

Main quantitative results (with numbers)

Figure 1c summary table (also repeated in later tables) shows consistent gains across domains:

LN: ViT acc 82.3%, DiT FID 45.91, DNA acc 86.9%
DyT: ViT acc 82.5%, DiT FID 45.66, DNA acc 86.9%
Derf: ViT acc 82.8%, DiT FID 43.94, DNA acc 87.3%
(Figure 1c)

Vision Transformers (Table 8):

ViT-B: LN 82.3%, DyT 82.5%, Derf 82.8% (Δ vs LN +0.5 points, vs DyT +0.3 points)
ViT-L: LN 83.1%, DyT 83.6%, Derf 83.8% (Δ vs LN +0.7 points, vs DyT +0.2 points)

Diffusion Transformers (Table 9):

DiT-B/4: LN 64.93, DyT 63.94, Derf 63.23 (lower is better)
DiT-L/4: LN 45.91, DyT 45.66, Derf 43.94
DiT-XL/2: LN 19.94, DyT 20.83, Derf 18.92

Speech (Table 10):

wav2vec 2.0 Base: LN 1.95, DyT 1.95, Derf 1.93
wav2vec 2.0 Large: LN 1.92, DyT 1.91, Derf 1.90

DNA (Table 11):

Hyena: Norm 85.2%, DyT 85.2%, Derf 85.7%
Caduceus: Norm 86.9%, DyT 86.9%, Derf 87.3%

Language (Table 12):

GPT-2: LN 2.94, DyT 2.97, Derf 2.94
Derf matches LN and improves over DyT by 0.03 in validation loss (Table 12).
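
The per-task margins can be recomputed directly from the numbers quoted above. The sketch below only does arithmetic on the values from Tables 8–12 as listed in this summary. The dictionary layout and the `gain` helper are illustrative, and a positive value means Derf is better, whether the metric is higher-is-better (accuracy) or lower-is-better (FID, loss).

```python
results = {
    # name: (LN/Norm, DyT, Derf, higher_is_better)
    "ViT-B top-1 (%)": (82.3, 82.5, 82.8, True),
    "ViT-L top-1 (%)": (83.1, 83.6, 83.8, True),
    "DiT-B/4 FID": (64.93, 63.94, 63.23, False),
    "DiT-L/4 FID": (45.91, 45.66, 43.94, False),
    "DiT-XL/2 FID": (19.94, 20.83, 18.92, False),
    "wav2vec 2.0 Base val loss": (1.95, 1.95, 1.93, False),
    "wav2vec 2.0 Large val loss": (1.92, 1.91, 1.90, False),
    "HyenaDNA acc (%)": (85.2, 85.2, 85.7, True),
    "Caduceus acc (%)": (86.9, 86.9, 87.3, True),
    "GPT-2 val loss": (2.94, 2.97, 2.94, False),
}

def gain(baseline, derf, higher_is_better):
    """Positive means Derf is better than the baseline."""
    delta = derf - baseline if higher_is_better else baseline - derf
    return round(delta, 2)

gains = {
    name: (gain(norm, derf, hib), gain(dyt, derf, hib))
    for name, (norm, dyt, derf, hib) in results.items()
}
for name, (vs_norm, vs_dyt) in gains.items():
    print(f"{name}: vs norm {vs_norm:+.2f}, vs DyT {vs_dyt:+.2f}")
```

Laid out this way, Derf is never worse than either baseline in the quoted results, and the only exact tie with normalization is GPT-2. Note also that DyT is worse than LN on DiT-XL/2 and GPT-2, while Derf is not.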

Additional normalization baselines (Appendix D)

  • ViT (Table 22): Derf exceeds LN, DyT, RMSNorm, and GN (e.g., ViT-B Derf 82.8% vs RMSNorm 82.4%, GN 82.5%).
  • DiT (Table 23): Derf beats RMSNorm as well (e.g., DiT-L/4 Derf 43.94 vs RMSNorm 45.02).
  • wav2vec (Table 24): Derf yields the lowest validation loss among LN/DyT/RMSNorm.
  • GPT-2 (Table 26): Derf matches LN (2.94) and is slightly better than RMSNorm (2.95), while beating DyT (2.97).

Do experiments support the claims?

  • Claim: point-wise functions can outperform normalization layers.
    • Supported in multiple settings where Derf beats LN/RMSNorm:
      • ViT-B/L accuracy gains (Table 8; Table 22).
      • DiT FID improvements across sizes (Table 9; Table 23).
      • wav2vec validation loss reductions (Table 10; Table 24).
      • DNA accuracy gains (Table 11; Table 25).
    • For GPT-2, Derf matches LN rather than surpassing it (Table 12; Table 26). The paper’s abstract says “outperforms … across a wide range of domains,” and the provided table indicates at least one case where it is tied rather than strictly better.

  • Claim: Derf improves generalization rather than fitting capacity.
    • The evaluation-mode training loss analysis shows Norm < Derf < DyT across architectures (Table 13), meaning Derf fits the training set worse than normalization, yet often yields better downstream metrics (Tables 8–11).
    • This is consistent evidence for the proposed interpretation in Section 6.1 Discussion, though it remains a correlational argument (the paper does not present a mechanistic proof in the provided excerpt).

Ablations, robustness checks, and failure cases

  • Effect of adding learnable shift s (Section 7.1; Table 14)
    • Adding s improves results for multiple functions on both ViT and DiT metrics.
    • Example from Table 14:
      • erf(x) ViT: 82.6% → 82.8% with s
      • erf(x) DiT-B/4 FID: 63.39 → 63.23 with s

  • Scalar vs per-channel vector s (Table 15)
    • Little difference; scalar is adopted for efficiency/simplicity.

  • Approximating erf with a scaled tanh(εx) (Section 7.2; Eq. (11); Table 16)
    • Best-fit ε ≈ 1.205 (Section 7.2).
    • tanh(εx) improves slightly over tanh(x) but remains worse than erf(x):
      • ViT-B: tanh(x) 82.6%, tanh(εx) 82.7%, erf(x) 82.8% (Table 16).
      • DiT-L: tanh(x) 45.48, tanh(εx) 45.13, erf(x) 43.94 (Table 16).

  • Training failure regimes identified in property studies
    • Large shifts (|λ| ≥ 2) cause training failure (Table 1).
    • Removing boundedness too aggressively causes failure (λ_b = 0.5 leads to × for multiple functions, Table 3).
    • Too low center sensitivity (large flat region, λ ≥ 3) causes failure (Table 5).
    • Fast-growing unbounded functions diverge (Table 4).
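
For the tanh(εx) approximation of erf discussed in the ablations above (Section 7.2), the idea of a best-fit ε can be illustrated with a small grid search. The objective used here is an assumption: squared error on a uniform grid over [−4, 4]. The paper's Eq. (11) may define the fit differently, so the value found need not equal the reported ε ≈ 1.205 exactly. The function name `fit_epsilon` and the search range [1.0, 1.5] are also illustrative.

```python
import math

def fit_epsilon(lo=-4.0, hi=4.0, n=801):
    """Grid-search eps minimizing sum of (erf(x) - tanh(eps * x))^2 (assumed objective)."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]

    def sse(eps):
        return sum((math.erf(x) - math.tanh(eps * x)) ** 2 for x in xs)

    grid = [1.0 + 0.001 * k for k in range(501)]  # candidate eps values in [1.0, 1.5]
    return min(grid, key=sse)

eps = fit_epsilon()
max_gap = max(abs(math.erf(x / 100) - math.tanh(eps * x / 100)) for x in range(-400, 401))
print(round(eps, 3), round(max_gap, 4))
```

Even at the best-fitting ε, a small but non-zero gap between the two curves remains. That fits the ablation's finding that the scaled tanh narrows the difference to erf without closing it (Table 16).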

6. Limitations and Trade-offs

  • Missing architectural/compute disclosure in the provided excerpt
    • The paper includes many optimizer and schedule hyperparameters (Tables 17–21), but the excerpt does not specify key architectural dimensions (layers/heads/hidden size) or compute budgets/tokens/hardware details for each experiment. This limits reproducibility detail from the provided content alone.

  • Performance is not uniformly “strictly better” everywhere
    • On GPT-2 (124M), Derf matches LN rather than exceeding it (Table 12; Table 26). This suggests improvements may be task/model dependent, and language modeling may require more careful initialization/tuning (Appendix C “Language models”).

  • Fitting-capacity trade-off
    • Derf (and DyT) show higher evaluation-mode training loss than normalization layers across tested architectures (Table 13). If a use case prioritizes maximum training-set fit (e.g., memorization or minimizing training loss), normalization may still be favored.

  • Implementation/training caveats for specific model families
    • For DiT, the paper notes that:
      • default learning rate is “suboptimal” and they sweep learning rates for all models (Table 18 text),
      • “zero initialization” harms Derf/point-wise models and is removed for those but retained for LN (Table 18 text).
    • This indicates point-wise replacements may require nontrivial training-protocol adjustments in some architectures.

  • Scope of the function search
    • The function search is empirical and constrained to the candidate families and transformations used (Section 4; Appendix B). It does not guarantee global optimality over all possible point-wise mappings.

  • No formal theory of why erf is best
    • The paper provides property-based intuition and empirical evidence, but in the provided content it does not present a formal derivation proving Derf should outperform normalization or DyT; it remains an empirical finding (Sections 3–7).

7. Implications and Future Directions

  • How this changes the landscape
    • The paper strengthens the case that normalization is not only removable but can be surpassed by carefully shaped elementwise functions, at least in the tested settings (Figure 1; Tables 8–11; Appendix D).
    • The “property lens” (zero-centered, bounded, center-sensitive, monotonic) provides a reusable framework for reasoning about normalization-free operators beyond just DyT/Derf (Section 3; Figure 2).

  • Follow-up research directions suggested by the results
    • Deeper mechanistic understanding of generalization gains: The paper’s evidence suggests point-wise functions behave as implicit regularizers because they do not adapt to batch/token/channel statistics (Section 6.1 Discussion; Table 13). A next step would be to more directly quantify or model this regularization effect.
    • Broader scaling studies: The excerpt includes GPT-2 at 124M and wav2vec Base/Large, ViT B/L, and DiT up to XL/2. Extending to larger LMs and additional regimes would clarify where Derf consistently wins vs. ties (Tables 8–12).
    • Better initialization and training recipes: The paper already indicates sensitivity to initialization choices in DiT (Table 18 text) and requires tuning of α initialization for GPT-2 (Appendix C). More systematic recipes could reduce “gotchas.”

  • Practical applications / downstream use cases
    • Vision: If you train ViTs on ImageNet-like classification, Derf provides measurable accuracy improvements over LN/DyT under the provided training recipe (Table 8; Table 17).
    • Generative modeling: For DiT-style diffusion Transformers, Derf improves FID across multiple model sizes, but you may need to adopt the learning-rate sweep and remove zero initialization as described (Table 9; Table 18 text).
    • Speech/DNA: Derf improves validation loss/accuracy for wav2vec 2.0 and for HyenaDNA/Caduceus (Tables 10–11), suggesting it can transfer beyond vision.

  • Repro/Integration Guidance (when to prefer Derf vs alternatives)
    • Prefer Derf when:
      • You want a drop-in norm replacement in Transformer-like models and you can keep the rest of the architecture fixed (Section 5).
      • You care about downstream performance and can accept slightly worse fitting-capacity metrics (Table 13).
    • Prefer classic normalization when:
      • You require the lowest training loss / strongest fitting under the same optimization, since normalization achieves the lowest evaluation-mode training loss in every listed architecture (Table 13).
    • Implementation checklist from the paper:
      • Replace each norm layer one-to-one (pre-attn, pre-FFN, final) with Derf (Section 5).
      • Initialize α=0.5, s=0, γ=1, β=0 (Section 5).
      • For DiT-like training, consider their reported training-protocol tweaks (LR sweep; avoid zero init for point-wise models) (Table 18 text).
      • For GPT-2-like training, be prepared to tune α initialization (Appendix C “Language models”).