INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

arXiv: 2510.25602

🎯 Pitch

This paper delivers a unified, theoretical and empirical comparison of fine-grained low-bit integer (INT) vs. floating-point (FP) quantization across block-wise formats (MX, NV), revealing a crossover: while FP wins at coarse granularity, MXINT8 often outperforms MXFP8 in both accuracy and hardware efficiency at block sizes used in practice. It also proposes practical fixes—e.g., symmetric clipping for INT training and Hadamard rotation for outlier mitigation—enabling nearly lossless MXINT8 training and compelling algorithm–hardware guidance that challenges the current FP-centric accelerator trend.


1. Executive Summary

This paper builds a unified, fine-grained comparison between low-bit integer (INT) and low-bit floating-point (FP) quantization formats for Large Language Models (LLMs), focusing on modern block-wise schemes like MX (block size 32) and NV (block size 16) (Table 1; §2.4). It identifies a “crossover” behavior: FP tends to be better under coarse granularity (large dynamic range needed for activation outliers), but at fine granularity INT—especially MXINT8—can be simultaneously more accurate and more hardware-efficient than FP alternatives (Figure 3; Table 3; Table 5). The paper also introduces a symmetric clipping fix for INT training that removes a gradient bias and enables nearly lossless MXINT8 training compared to BF16 (Figure 2; §3.2; Figure 5; Table 4; Table 10–11).


2. Context and Motivation

  • Problem / gap addressed
  • LLMs have large compute and memory demands, so quantization (storing/operating on fewer bits) is important for efficiency (§1).
  • A central challenge is activation outliers: rare but very large values in activation distributions that expand the required dynamic range, hurting low-bit quantization (§1).
  • Industry hardware trends increasingly support low-bit FP formats (e.g., FP8/FP4) because FP’s exponent gives larger dynamic range, which helps with outliers (§1; §2.4).

  • What was missing

  • The paper argues there is no unified comparison of FP vs INT across quantization granularities (per-channel vs block-wise, etc.), which complicates algorithm–hardware co-design decisions (§1).

  • Why granularity matters

  • Quantization granularity controls how scaling is applied:
    • Per-tensor: one scale for a whole tensor.
    • Per-channel: one scale per channel.
    • Block-k: one scale per contiguous block of k values (§2.3).
  • Block-wise quantization is now common for outlier mitigation, so the FP-vs-INT trade-off should be re-evaluated under these fine-grained regimes (§1; §2.3–§2.4).

  • How the paper positions itself

  • It frames FP vs INT as a function of: 1) granularity (block size), 2) bit-width (8/6/4-bit), 3) how outlier-heavy the data distribution is, summarized by the crest factor κ (Eq. (11)).
  • It introduces integer counterparts to hardware-relevant FP block formats so comparisons are “matched” (e.g., MXFP8 vs MXINT8, NVFP4 vs NVINT4) (Table 1; §2.4).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system studied is a set of low-bit number formats + scaling rules used to quantize tensors inside Transformer linear layers during inference and training.
  • It solves the problem of deciding when INT or FP is better under block-wise quantization, using a theoretical metric (QSNR) plus empirical measurements on real LLM tensors and downstream inference/training evaluations (Eq. (10); §4–§6).

3.2 Big-picture architecture (diagram in words)

  • (A) Quantization definitions
  • INT quantization with clipping and scaling (Eq. (1)).
  • FP quantization via nearest representable FP value after scaling (Eq. (3)), with the FP codebook defined by exponent/mantissa fields (Eq. (2); §2.2).
  • (B) Block-wise formats
  • MX formats: block size 32 with a shared per-block UE8M0 scale (Table 1; §2.4).
  • NV formats: block size 16 with a per-block E4M3 scale and an additional per-tensor scale (“Scale-2”) (Table 1; §2.4).
  • (C) Evaluation pipeline
  • A theoretical framework predicts quantization fidelity via QSNR as a function of crest factor and format parameters (Theorems 1–2; Eq. (13)–(14); Figure 3).
  • Empirical tensor capture + QSNR measurement validates theory on real tensors (Figure 1; §5.1; Table 2; Figure 4).
  • Downstream tests: direct-cast inference behavior change via KL divergence, and low-bit training loss/accuracy vs BF16 (Table 3; Table 12–15; Figure 5; Table 4).
  • (D) Hardware model
  • A gate-level cost model compares normalized area/energy for MAC-centric matrix-multiply units (MMUs) under INT vs FP formats (Sec. C; Table 5; Table 6–7).

3.3 Roadmap for the deep dive

  • First, define the quantization operations and the role of scales and granularity (§2–§3).
  • Next, define the fidelity metric QSNR and the key distribution summary crest factor κ (Eq. (10)–(11)).
  • Then, explain the two QSNR theorems—why INT depends strongly on κ and bit-width, and why FP is bounded by mantissa bits and subnormals (Theorems 1–2; Eq. (13)–(14)).
  • After that, connect theory to measured real-tensor crest factors and QSNR (Table 2; Figure 4).
  • Finally, cover the downstream inference/training results and hardware cost trade-offs (Table 3–5; Figure 5; Sec. C).

3.4 Detailed, sentence-based technical breakdown

  • Paper type and core idea.
  • This is a combined theoretical + empirical systems paper: it derives analytical approximations for quantization error (via QSNR) and validates them with real LLM tensor statistics, then links those findings to inference, training, and hardware cost comparisons (Sec. 4–6).

  • Quantization: what happens mathematically.

  • For INT quantization, the paper uses a standard “scale → round → clip → dequantize” mapping:
    • A high-precision tensor X is scaled by s, rounded to the nearest integer, clipped into the representable integer range, then scaled back (Eq. (1); §2.1).
  • For FP quantization, the tensor is similarly scaled, but the normalized value is mapped to the nearest value in CFP, the codebook of representable low-bit floating-point values defined by exponent/mantissa rules, then scaled back (Eq. (2)–(3); §2.2).
  • The paper emphasizes that both INT and FP quantizers can be written in a shared form (Eq. (3)) by swapping the “codebook” (CFP vs an integer codebook).
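As a minimal sketch of this shared form, the snippet below implements the INT path as scale, round, clip, dequantize, and the FP path as a nearest-codebook lookup. The function names (`int_quantize`, `e2m1_codebook`, `codebook_quantize`) are illustrative and not taken from the paper. The E2M1 grid is generated from the usual exponent/mantissa rule, assuming an exponent bias of 1 and no reserved infinity or NaN encodings.

```python
import numpy as np

def int_quantize(x, s, bits=8):
    """Scale, round, clip to a symmetric signed range, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / s), -qmax, qmax)  # np.round rounds halves to even
    return q * s

def e2m1_codebook():
    """Signed E2M1 values (assumed bias 1, no inf/NaN codes), sorted ascending."""
    vals = []
    for e in range(4):        # 2 exponent bits
        for m in range(2):    # 1 mantissa bit
            if e == 0:
                vals.append(m / 2)                          # subnormal
            else:
                vals.append(2.0 ** (e - 1) * (1 + m / 2))   # normal
    pos = np.array(sorted(set(vals)))
    return np.concatenate([-pos[:0:-1], pos])

def codebook_quantize(x, s, codebook):
    """Map x / s to the nearest codebook entry, then scale back."""
    z = np.asarray(x, dtype=np.float64) / s
    idx = np.abs(z[..., None] - codebook).argmin(axis=-1)  # ties pick the lower entry
    return codebook[idx] * s

x = np.array([0.1, -0.7, 2.4, 5.1])
absmax = np.abs(x).max()
x_int4 = int_quantize(x, absmax / 7, bits=4)
x_fp4 = codebook_quantize(x, absmax / 6, e2m1_codebook())
```

Swapping `e2m1_codebook()` for an integer grid in `codebook_quantize` reproduces the INT path, which is the sense in which the two quantizers differ only in their codebook.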

  • Granularity and why block-wise matters.

  • In block-k quantization, the tensor is partitioned into blocks (e.g., length 32 for MX), and each block gets its own scale factor (§2.3).
  • The key intuition the paper uses is: smaller blocks reduce the local dynamic range, which can reduce outlier impact within each block and change whether FP’s exponent advantage is still necessary (§1; §4.2).

  • Block formats compared (what exactly are MX vs NV).

  • MX formats (from OCP microscaling) use:
    • Block size 32.
    • A shared per-block UE8M0 scale factor (8-bit unsigned float with exponent bits only) (Table 1; §2.4; footnote on UE8M0).
    • The data in the block is stored in a low-bit type like E4M3 (MXFP8), E2M3 (MXFP6), or E2M1 (MXFP4) (Table 1; §2.4).
  • NVFP4 modifies MXFP4 by:

    • Using block size 16.
    • Using an E4M3 scale and adding a second-level per-tensor scale to avoid overflow (Table 1; §2.4).
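To make the storage cost of per-block scales concrete, the sketch below records the element width, block size, and scale width for the formats named above and computes the effective bits per element. The table layout and function name are illustrative. The 8-bit width of the E4M3 scale is inferred from its name (1 sign, 4 exponent, 3 mantissa bits), and the second-level per-tensor scale of the NV format is assumed to be negligible.

```python
FORMATS = {
    # name: (element bits, block size, scale bits)
    "MXFP8": (8, 32, 8),   # E4M3 elements, UE8M0 scale
    "MXFP6": (6, 32, 8),   # E2M3 elements, UE8M0 scale
    "MXFP4": (4, 32, 8),   # E2M1 elements, UE8M0 scale
    "NVFP4": (4, 16, 8),   # E2M1 elements, E4M3 scale
}

def bits_per_element(name):
    """Element bits plus the per-block scale amortized over the block."""
    elem_bits, block, scale_bits = FORMATS[name]
    return elem_bits + scale_bits / block

for name in FORMATS:
    print(name, bits_per_element(name))
```

Under these assumptions the matched integer formats (MXINT8, MXINT6, MXINT4, NVINT4) have the same footprint as their FP counterparts, since only the element encoding changes.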
  • Scale factor computation and its overhead.

  • Scales start from an AbsMax strategy:
    • Compute s = AbsMax(X)/Qmax where AbsMax(X) is the maximum absolute value in the sharing group and Qmax is the maximum representable value of the quantized type (Eq. (7); §3.2).
  • For MX, scales are quantized into UE8M0. The paper notes that rounding the scale down can introduce extra clipping error (Eq. (8); §3.2), so it uses rounding up:
    • s′ = 2^{clip(ceil(log2(s)), -127, 127)} (Eq. (9); §3.2).
  • In the theory, this “scale quantization” is abstracted as an overhead factor ρ where s′ = ρ s (Eq. (12)), with ρ ∈ [1,2) for UE8M0, and approximately ρ = 1 for higher-precision scales like E4M3 (Sec. 4.1 assumptions).
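The scale rules above can be sketched as follows. The helper names `ue8m0_round_up` and `block_scales` are hypothetical, and the code assumes every block contains at least one nonzero value so that the logarithm is finite.

```python
import numpy as np

def ue8m0_round_up(s):
    """Quantize a positive scale to a power of two, rounding the exponent up."""
    e = np.clip(np.ceil(np.log2(s)), -127, 127)
    return 2.0 ** e

def block_scales(x, block=32, qmax=127):
    """Per-block AbsMax scale s, its UE8M0 version s', and the overhead rho = s'/s."""
    blocks = x.reshape(-1, block)
    s = np.abs(blocks).max(axis=1) / qmax
    s_q = ue8m0_round_up(s)
    return s, s_q, s_q / s

rng = np.random.default_rng(0)
s, s_q, rho = block_scales(rng.standard_normal(4 * 32))
print(rho)   # each entry lies in [1, 2)
```

Rounding up guarantees s′ ≥ s, so no value in the block is clipped; the price is a step size up to almost twice as coarse as the unquantized scale would give.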

  • Training/inference compute flow: where quantization is applied.

  • The paper explicitly models quantization inside a linear layer’s GEMMs for forward and backward propagation (Figure 1; §3.1).
  • Forward pass:
    • Quantize activations X and weights W, then GEMM to produce Y (Eq. (4); Figure 1 marks this as operations ① and ②).
  • Backward pass:
    • Compute dX from quantized dY and quantized W^T (Eq. (5); Figure 1 operations ③ and ④).
    • Compute dW from quantized X^T and quantized dY^T (Eq. (6); Figure 1 operations ⑤ and ⑥).
  • A practical hardware constraint is noted: block-wise quantization should align with the GEMM reduction dimension to gain hardware benefits (§3.1).
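A hedged sketch of the six quantization points follows, using simulated ("fake") quantization in NumPy. The helper `fake_quant` and the layer functions are illustrative names. The sketch assumes X has shape (n, d_in), W has shape (d_in, d_out), exact (unquantized) AbsMax scales, a symmetric INT8 range, and dimensions divisible by the block size. Each operand is blocked along the GEMM reduction axis, matching the alignment constraint noted above.

```python
import numpy as np

def fake_quant(x, block=32, qmax=127):
    """Block-wise symmetric INT fake-quantization along the last axis."""
    b = x.reshape(-1, block)
    s = np.abs(b).max(axis=1, keepdims=True) / qmax
    s = np.where(s == 0, 1.0, s)   # avoid 0/0 on all-zero blocks
    return (np.clip(np.round(b / s), -qmax, qmax) * s).reshape(x.shape)

def linear_forward(X, W):
    # Y = Q(X) Q(W): both operands are blocked along the reduction axis d_in.
    return fake_quant(X) @ fake_quant(W.T).T

def linear_backward(X, W, dY):
    # dX = Q(dY) Q(W^T): reduction over d_out.
    dX = fake_quant(dY) @ fake_quant(W).T
    # dW = Q(X^T) Q(dY^T)^T: reduction over the token axis n.
    dW = fake_quant(X.T) @ fake_quant(dY.T).T
    return dX, dW

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 64))
W = rng.standard_normal((64, 32))
dY = rng.standard_normal((64, 32))
Y = linear_forward(X, W)
dX, dW = linear_backward(X, W, dY)
```

Note that in this sketch the same tensor (for example W) is quantized along different axes in the forward and backward GEMMs, because the reduction axis differs between them.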

  • Symmetric clipping to fix INT training bias (paper’s algorithmic fix).

  • Two’s complement signed INT ranges are asymmetric: for b-bit signed INT, there is one extra negative value (e.g., [-128, 127] for INT8) (§3.2).
  • The paper finds this usually does not harm inference but degrades INT training due to a persistent negative gradient bias, especially for finer-grained quantization where more blocks exist and more values can hit the extra negative endpoint −2^{b−1} (Figure 2; §3.2).
  • The proposed fix is to enforce a symmetric INT range:
    • Qmin = -(2^{b-1} - 1), Qmax = 2^{b-1} - 1 (stated in §3.2; also reflected in Table 1’s “MXINT8 127” range and the mention of [-127, 127] in Figure 2 legend).
  • The appendix provides an ablation showing training loss differences under [-128,127] vs [-127,127] across block sizes and scale types (Table 10), and attributes part of the problem to limited precision in BF16 scale computations (Algorithm 1; Table 11; §D.2).
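The bias mechanism can be illustrated with a contrived example. Here the scale is deliberately set slightly too small, as a stand-in for the limited-precision scale arithmetic described above; this is an assumption made for illustration and is not the paper's Algorithm 1. The function name `int_fake_quant` is hypothetical.

```python
import numpy as np

def int_fake_quant(x, s, bits=8, symmetric=True):
    """Fake-quantize with either the symmetric or the two's-complement range."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax if symmetric else -qmax - 1
    return np.clip(np.round(x / s), qmin, qmax) * s

x = np.array([1.0, -1.0] * 16)   # zero-mean block with AbsMax = 1
s = 1.0 / 127.6                   # contrived: scale slightly below AbsMax / 127

asym = int_fake_quant(x, s, symmetric=False)  # +1 clips to 127, -1 reaches -128
sym = int_fake_quant(x, s, symmetric=True)    # both signs clip to +/-127

print(asym.mean(), sym.mean())
```

With the asymmetric range the dequantized block has a negative mean even though the input mean is zero; with the symmetric range the clipping error cancels between the two signs.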

  • Theoretical framework: QSNR and crest factor.

  • The paper uses QSNR (Quantization Signal-to-Noise Ratio) to quantify numerical fidelity:
    • QSNR = -10 log10( ||X - Xq||^2 / ||X||^2 ) in dB (Eq. (10); §4.1).
    • Higher QSNR means less quantization noise relative to signal energy.
  • It models per-block data X ∈ R^k (k is the block size, written g in the theorem statements below) with i.i.d. Gaussian entries and summarizes outlier strength using the crest factor:

    • κ = max(|X|)/σ where σ is the block RMS (Eq. (11); §4.1).
    • Intuitively, larger κ means “more extreme outliers relative to typical scale.”
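Both quantities are short to compute. The sketch below follows the definitions above; the function names `qsnr_db` and `crest_factor` are illustrative.

```python
import numpy as np

def qsnr_db(x, x_q):
    """Quantization signal-to-noise ratio in dB (higher is better)."""
    noise = np.sum((x - x_q) ** 2)
    signal = np.sum(x ** 2)
    return -10.0 * np.log10(noise / signal)

def crest_factor(block):
    """Peak magnitude divided by the block RMS."""
    rms = np.sqrt(np.mean(block ** 2))
    return np.max(np.abs(block)) / rms

flat = np.array([1.0, -1.0, 1.0, -1.0])    # no outlier: kappa = 1
spiky = np.array([0.0, 0.0, 0.0, 2.0])     # single spike: kappa = sqrt(4) = 2
print(crest_factor(flat), crest_factor(spiky))
```

For any block of g values the crest factor lies between 1 and √g, with the upper bound reached by a single nonzero entry. This is one way to see why smaller blocks cap κ: at most √32 ≈ 5.66 for block size 32 and 4 for block size 16.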
  • Theorem 1: INT QSNR scaling law (what it says and why).

  • Theorem 1 approximates INT QSNR as (Eq. (13)):
    • For UE8M0-scaled MX-like settings: QSNR_INT ≈ 4.78 + 6.02 b − 20 log10(ρ) − 20 log10(κ).
    • For E4M3-scaled NV-like settings: similar but without ρ, plus a block-size factor 10 log10(g/(g−1)) (Eq. (13)).
  • Mechanistic interpretation given in §4.1:

    • Each additional bit gives ~6.02 dB, because halving the quantization step halves the error amplitude (20 log10(2) ≈ 6.02 dB).
    • Scale rounding overhead ρ costs 20 log10(ρ) dB, which stays below 6.02 dB because ρ < 2.
    • Larger crest factor κ reduces QSNR; smaller blocks tend to reduce κ.
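As a rough numerical check of the UE8M0-style expression, the sketch below compares it with a direct measurement on synthetic Gaussian data. The function names are illustrative, and one large Gaussian vector is used in place of a 32-element block purely to reduce sampling noise, which is an assumption of this example. The measurement uses the symmetric range with Qmax = 2^{b−1} − 1.

```python
import numpy as np

def int_qsnr_theory(bits, kappa, rho=1.0):
    """Approximate INT QSNR in dB from bit-width, crest factor, and scale overhead."""
    return 4.78 + 6.02 * bits - 20 * np.log10(rho) - 20 * np.log10(kappa)

def int_qsnr_measured(x, bits, rho=1.0):
    """Quantize x with an AbsMax scale inflated by rho and measure QSNR directly."""
    qmax = 2 ** (bits - 1) - 1
    s = rho * np.abs(x).max() / qmax
    x_q = np.clip(np.round(x / s), -qmax, qmax) * s
    return -10 * np.log10(np.sum((x - x_q) ** 2) / np.sum(x ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
kappa = np.abs(x).max() / np.sqrt(np.mean(x ** 2))

for rho in (1.0, 1.5):
    print(rho, int_qsnr_theory(8, kappa, rho), int_qsnr_measured(x, 8, rho))
```

The two numbers should agree closely for 8-bit data, because the rounding error is then well approximated by uniform noise with variance s′²/12, the standard assumption behind expressions of this kind.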
  • Theorem 2: FP QSNR structure (mantissa-bound + subnormal effects).

  • Theorem 2 approximates FP QSNR as a function of:
    • A mantissa-resolution term α_M (depends on mantissa bit-width M),
    • Energy fraction in the normal region w_norm,
    • Probability of being subnormal p_sub,
    • Crest factor κ, scale overhead ρ, and block size g (Eq. (14); §4.1).
  • Key interpretation (§4.1):

    • With enough dynamic range so almost everything is “normal” (i.e., w_norm ≈ 1 and p_sub ≈ 0), FP QSNR is bounded by mantissa bits:
    • QSNR ≈ 13.80 + 6.02 M dB, largely independent of block size and distribution.
    • As crest factor grows, more values fall into subnormals and QSNR drops.
    • Finer blocks help by reducing κ (reducing subnormals) but FP still faces a mantissa-limited ceiling.
  • Crossover prediction and its operational meaning.

  • Combining Theorem 1 and 2, the paper plots theoretical QSNR vs crest factor for multiple formats and identifies crossover crest factors where INT becomes better than FP (Figure 3; §4.2).
  • Reported crossover points in §4.2 / Figure 3 include:
    • MXINT8 beats MXFP8 when κ < 7.55.
    • MXINT6 beats MXFP6 only when κ < 1.96.
    • MXINT4 beats MXFP4 only when κ < 2.04.
    • NVINT4 beats NVFP4 when κ < 2.39.
  • This sets up the empirical strategy: measure real-tensor κ under different block sizes and see which side of the crossover typical tensors fall on (§5.1; Table 2).
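A simplified version of the 8-bit crossover can be reproduced from the two expressions quoted earlier. The sketch assumes that MXFP8 (E4M3, so M = 3) sits at its mantissa-limited ceiling, and it picks ρ = 1.5 as a representative UE8M0 overhead within [1, 2). Both are assumptions of this example, not settings confirmed by the summary, and the function names are illustrative.

```python
import math

def int_qsnr(bits, kappa, rho):
    """INT QSNR approximation for UE8M0-scaled formats, in dB."""
    return 4.78 + 6.02 * bits - 20 * math.log10(rho) - 20 * math.log10(kappa)

def fp_ceiling(mantissa_bits):
    """Mantissa-limited FP QSNR ceiling, in dB."""
    return 13.80 + 6.02 * mantissa_bits

def crossover_kappa(bits, mantissa_bits, rho):
    """Crest factor at which the INT estimate equals the FP ceiling."""
    gap_db = 4.78 + 6.02 * bits - 20 * math.log10(rho) - fp_ceiling(mantissa_bits)
    return 10 ** (gap_db / 20)

print(round(crossover_kappa(bits=8, mantissa_bits=3, rho=1.5), 2))  # ≈ 7.55
```

Under these assumptions the result lands on the reported 7.55, which suggests (but does not confirm) that the figure uses a comparable setting. The 6-bit and 4-bit crossovers do not follow from this shortcut, since formats with fewer exponent bits are not at the mantissa ceiling and need the subnormal terms of Eq. (14).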

  • Worked micro-example (single block) to illustrate the mechanism (constructed from the paper’s definitions).

  • Suppose you quantize a block of g = 32 values with AbsMax scaling (Eq. (7)) using an INT4 codebook [-7, 7] (Table 1).
  • Step 1: compute AbsMax(X) over the 32 values, then s = AbsMax(X)/7 (Eq. (7)).
  • Step 2: (MX case) quantize the scale to a power-of-two s′ via UE8M0 rounding-up (Eq. (9)), which introduces overhead ρ where s′ = ρ s (Eq. (12)).
  • Step 3: for each value x in the block, compute q = round(x/s′), clamp q into [-7,7], and dequantize x_q = q s′ (Eq. (1) / Eq. (19) in Appendix B.2).
  • If the block has a large outlier (large κ), AbsMax(X) grows, s′ grows, and the step size grows—so most non-outlier values get coarsely quantized, lowering QSNR (captured by the -20 log10(κ) term in Eq. (13)).
  • If you reduce block size (e.g., from per-channel to 32 or 16), the maximum within each block tends to be smaller relative to RMS (lower κ), so INT quantization becomes more competitive (Table 2; §5.1).
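The three steps can be run directly. In the sketch below, `mxint4_block` is a hypothetical helper that follows Steps 1 to 3, and the injected outlier value of 20 is an arbitrary choice for illustration. The error is measured on the 31 non-outlier values to isolate how the outlier coarsens everything else in the block.

```python
import numpy as np

def mxint4_block(x):
    """Quantize one block: AbsMax scale, UE8M0 round-up, INT4 codes in [-7, 7]."""
    s = np.abs(x).max() / 7                                  # Step 1
    s_q = 2.0 ** np.clip(np.ceil(np.log2(s)), -127, 127)     # Step 2
    q = np.clip(np.round(x / s_q), -7, 7)                    # Step 3
    return q * s_q, s_q

def rms_error(x, x_q):
    return np.sqrt(np.mean((x - x_q) ** 2))

rng = np.random.default_rng(0)
block = rng.standard_normal(32)
outlier_block = block.copy()
outlier_block[0] = 20.0          # inject one large outlier

x_q_clean, step_clean = mxint4_block(block)
x_q_out, step_out = mxint4_block(outlier_block)

err_clean = rms_error(block[1:], x_q_clean[1:])
err_out = rms_error(outlier_block[1:], x_q_out[1:])
print(step_clean, step_out, err_clean, err_out)
```

With the outlier present the step size becomes 4 (20/7 rounded up to the next power of two), so most of the remaining values round to zero, whereas the clean block keeps a step well below 1.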

4. Key Insights and Innovations

  • 1) “Crossover” framing: INT vs FP depends on crest factor and granularity, not just “FP is better for outliers.”
  • Novelty: The paper ties FP-vs-INT comparison to a single interpretable quantity, the crest factor κ, and shows how smaller blocks reduce κ, shifting the trade-off (Eq. (11); Figure 3; Table 2).
  • Significance: This provides a concrete design knob (block size) to decide whether INT’s uniform precision beats FP’s dynamic range for a given workload (§1; §4.2; §5.1).

  • 2) Theoretical QSNR framework for both INT and FP under block-wise scaling.

  • Novelty: The paper provides explicit approximations for QSNR under INT (Theorem 1; Eq. (13)) and FP (Theorem 2; Eq. (14)) that incorporate:
    • bit-width b (INT),
    • mantissa bits M (FP),
    • scale overhead ρ,
    • block size g,
    • crest factor κ.
  • Significance: This enables format-level predictions like “MXINT8 should beat MXFP8 for most blocks if κ is typically below 7.55” (Figure 3; §4.2), which the paper later corroborates empirically (Figure 4; Table 3).

  • 3) Empirical finding: MXINT8 dominates MXFP8 in fine-grained settings for both inference and training.

  • Inference: MXINT8 wins over MXFP8 on all 12 evaluated models in direct-cast inference, with or without random Hadamard rotation (Table 3; per-model KL in Table 12–13; perplexity in Table 14–15).
  • Training: MXINT8 achieves nearly BF16-matching loss and task accuracy, and slightly lower loss than MXFP8 in their runs (Figure 5; Table 4).

  • 4) Algorithmic training fix: symmetric clipping to remove INT gradient bias under fine granularity.

  • Novelty: The paper isolates an INT-training-specific failure mode from two’s-complement asymmetry and BF16 scale precision, and fixes it by enforcing [-(2^{b-1}-1), 2^{b-1}-1] instead of the default INT range (§3.2; Figure 2; Table 10–11).
  • Significance: This turns INT8 training from “degraded” to “nearly lossless” in their reported experiments (Figure 5; Table 4; Table 10).

  • 5) Hardware cost analysis suggesting INT block formats are materially cheaper at matched throughput.

  • Novelty: A gate-level MAC-centric model (Sec. C) is used to compare normalized energy/area across formats, including mixed-format chips with a 1:2 throughput ratio for 8-bit vs 4-bit (Table 5; Table 7).
  • Significance: Reported normalized costs show substantial energy/area advantages for INT pipelines (Table 5), challenging an FP-only hardware trajectory (§6; §7).

5. Experimental Analysis

Evaluation methodology

  • Tensor-wise QSNR & crest factor measurement (§5.1)
  • Model: Llama-3.1-8B.
  • Data: 8 WikiText2 sequences, length 4096 (§5.1).
  • Procedure:

    • Run forward+backward in BF16 and capture six intermediate tensors per linear layer corresponding to Figure 1’s quantization points ①–⑥ (Eq. (4)–(6)).
    • Across 224 linear layers × 6 tensors × 8 sequences = 10,752 tensors (§5.1).
    • Compute block-wise crest factor κ and tensor-wise QSNR under different formats and block sizes.
    • Also test random Hadamard rotation (dimension 32×32) as an outlier suppression technique (§5.1).
  • Direct-cast inference (§5.2)

  • Formats compared: MXFP8/MXINT8, MXFP6/MXINT6, MXFP4/MXINT4, NVFP4/NVINT4 (Table 1; §5.2).
  • Models: 12 LLMs, spanning dense and MoE, from 0.6B up to 235B parameters (§5.2; Table 8 lists Hugging Face IDs).
  • Quantization scope: quantize all forward GEMMs (“direct-cast inference from a pretrained BF16 model”) (§5.2).
  • Metric:
    • KL divergence on WikiText2 between quantized model output distribution and BF16 counterpart, computed on the top-25 BF16 logits (to reduce noise), averaged over tokens and scaled by 10^6 in tables (Table 12–13; §5.2).
  • Also evaluated with random Hadamard rotation by quantizing XR and R^T W with h equal to block size (32 for MX, 16 for NV) (§5.2).
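The rotation trick relies on R being orthogonal, so that (XR)(R^T W) = XW before any quantization is applied. The sketch below builds one common form of random Hadamard matrix (a Sylvester-construction Hadamard matrix with random column signs, scaled to be orthonormal) and checks both the invariance and the effect on the crest factor. The construction and the function names are assumptions of this example; the summary does not specify how the paper generates R.

```python
import numpy as np

def random_hadamard(h, rng):
    """Orthonormal h x h Hadamard matrix with random column signs (h a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < h:
        H = np.block([[H, H], [H, -H]])
    signs = rng.choice([-1.0, 1.0], size=h)
    return (H * signs) / np.sqrt(h)

def crest_factor(v):
    return np.abs(v).max() / np.sqrt(np.mean(v ** 2))

rng = np.random.default_rng(0)
R = random_hadamard(32, rng)

X = rng.standard_normal((4, 32))
X[:, 0] = 25.0                     # one outlier channel
W = rng.standard_normal((32, 8))

Y_ref = X @ W
Y_rot = (X @ R) @ (R.T @ W)        # rotation cancels in exact arithmetic

kappa_before = [crest_factor(row) for row in X]
kappa_after = [crest_factor(row) for row in X @ R]
print(kappa_before, kappa_after)
```

The rotation spreads the outlier's energy evenly across all 32 positions while preserving the block norm, which lowers the peak and therefore κ.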

  • Low-bit training (§5.3)

  • Focus: only 8-bit “nearly lossless” training; compare MXINT8 vs MXFP8 because prior work already shows near-lossless FP8 training (as referenced by the paper) (§5.3).
  • Models and data:
    • 1B and 3B Llama-3-style models (§D.1), trained on the OLMo2-Mix-1124 dataset with 100B and 200B tokens, respectively (§5.3).
  • Metrics:
    • Training loss curves (EMA smoothing coefficient 0.9) (Figure 5; §5.3).
    • 5-shot evaluation accuracy via lm_eval on six tasks (Arc_Easy, Arc_Challenge, HellaSwag, OpenbookQA, PIQA, WinoGrande) (Table 4; §5.3).

Main quantitative results (with numbers)

  • Crest factor statistics show granularity shifts κ into INT-favorable regimes (§5.1).
  • Table 2 (Q3 = 75% quantile) shows:
    • Per-channel (block size “-1”): Q3 crest factor 11.97 (very high, above many crossovers).
    • Block 32: Q3 2.96.
    • Block 16: Q3 2.39.
    • With Hadamard rotation, Q3 drops further (e.g., block 16 Q3 2.11) (Table 2).
  • The paper interprets this against Figure 3 crossovers:

    • 2.96 < 7.55 ⇒ MXINT8 should beat MXFP8 “in most cases.”
    • But 2.96 > 1.96 and > 2.04 ⇒ MXINT6 and MXINT4 should lose vs FP counterparts (Table 2; §5.1; §4.2).
  • Measured QSNR over 10,752 real tensors matches theoretical predictions (Figure 4; §5.1).

  • Without rotation (Figure 4a):
    • MXINT8 average QSNR 40.35 dB vs MXFP8 31.50 dB; INT win-rate 100% for this dataset of tensors (Figure 4a).
    • MXINT6 average 28.57 dB vs MXFP6 30.39 dB; FP wins 99.9% (Figure 4a).
    • MXINT4 average 15.89 dB vs MXFP4 17.98 dB; FP wins 99.7% (Figure 4a).
    • NVINT4 average 20.55 dB vs NVFP4 20.60 dB; INT win-rate 64.3% but slightly lower average (Figure 4a).
  • With Hadamard rotation (Figure 4b):

    • NVINT4 average 21.65 dB vs NVFP4 20.35 dB; INT wins 99.3% (Figure 4b).
    • The paper notes NVFP4 QSNR can decrease when κ decreases below ~4 due to its normal vs subnormal error trade-off (consistent with Figure 3 and Eq. (14)) (§5.1).
  • Direct-cast inference: consistent wins/losses by bit-width and rotation (Table 3; Table 12–13).

  • Summary win counts across 12 models (Table 3):
    • MXINT8 vs MXFP8: INT wins 12/12 (both original and with rotation).
    • MXINT6 vs MXFP6: FP wins 12/12 originally; with rotation INT wins 1/12.
    • MXINT4 vs MXFP4: FP wins 12/12 (original and with rotation).
    • NVINT4 vs NVFP4: FP wins 12/12 originally; with rotation INT wins 12/12.
  • Per-model KL divergence tables (Table 12–13) concretely reflect these patterns; e.g., for Qwen3-8B:
    • Original: NVINT4 5120 vs NVFP4 3430 (FP better).
    • With rotation: NVINT4 4026 vs NVFP4 5912 (INT better) (Table 12–13).
  • Perplexity on WikiText2 (Table 14–15) is broadly consistent with KL trends, and shows MXINT8 is very close to BF16 perplexity in many cases (e.g., Qwen3-8B BF16 6.5135 vs MXINT8 6.5174) (Table 14).

  • Low-bit training: MXINT8 is nearly lossless and comparable to BF16, slightly better loss than MXFP8 (Figure 5; Table 4).

  • Loss curves (Figure 5): BF16, MXFP8, and MXINT8 curves “almost overlap,” with MXINT8 lower by ~0.001 loss in the zoomed view (§5.3).
  • Final reported losses and task accuracies (Table 4):

    • 1B / 100B tokens: BF16 loss 2.6727, MXFP8 2.6767, MXINT8 2.6758.
    • 3B / 200B tokens: BF16 loss 2.4794, MXFP8 2.4821, MXINT8 2.4812.
    • Average task accuracy: 1B BF16 56.89, MXFP8 56.86, MXINT8 57.02; 3B BF16 64.45, MXFP8 64.05, MXINT8 64.30 (Table 4).
  • Symmetric clipping ablation: asymmetric INT8 range hurts training (Figure 2; Table 10; §D.2).

  • Figure 2 shows higher training loss when using [-128,127] compared to [-127,127] on a 145M model trained with 20B tokens, particularly as granularity becomes finer (block size 32).
  • Table 10 quantifies this across granularities and scale types:
    • For BF16 scale at block 32: [-128,127] loss 3.1354 vs [-127,127] 3.1251.
    • Similar but smaller effects under UE8M0 scales (Table 10).
  • Algorithm 1 / Table 11 further support that BF16 arithmetic can cause unintended mapping to -128 even when scaling is “theoretically” set to avoid it (Table 11 reports 16.82% ratio for BF16 vs 0 for FP32).

  • Hardware cost analysis: normalized energy/area favors INT (Sec. 6; Table 5).

  • At matched throughput, normalized costs (Table 5):
    • Single-format: MXINT8 energy 0.63× and area 0.79× relative to MXFP8 baseline; NVINT4 energy 0.34× and area 0.38× relative to MXFP8 baseline (Table 5).
    • Mixed-format (8-bit + 4-bit at throughput ratio 1:2):
    • MXINT8+NVINT4 energy 0.75× and area 0.66× relative to MXFP8+NVFP4 baseline (Table 5).
  • The model scope explicitly excludes quantizer cost and focuses on MAC + dequantizer + FP32 accumulator (Sec. C), so the conclusions are “within that model.”

Do the experiments support the claims?

  • The paper’s central empirical claims are strongly aligned across:
  • theory (Figure 3),
  • real-tensor crest factor distributions (Table 2),
  • real-tensor QSNR measurements (Figure 4),
  • model-level inference metrics across 12 LLMs (Table 3; Table 12–15),
  • and low-bit training comparisons for 8-bit (Figure 5; Table 4),
  • plus a hardware cost model (Table 5; Sec. C).
  • The most conditional claim is the 4-bit NV result:
  • Without rotation, NVFP4 wins 12/12; with Hadamard rotation, NVINT4 wins 12/12 (Table 3).
  • This is consistent with the paper’s framing that format choice depends on outlier mitigation changing crest factor (κ) (Table 2; Figure 4b).

6. Limitations and Trade-offs

  • Theory relies on stylized distribution assumptions.
  • Theorem derivations assume i.i.d. Gaussian block entries (Xi ~ N(0, σ^2)) (§4.1; Appendix B.1).
  • Real LLM tensors may deviate (heavy tails, correlations, structured outliers). The paper partially addresses this by measuring real tensors (Table 2; Figure 4), but the theoretical accuracy outside these assumptions is not guaranteed.

  • Training experiments cover only 8-bit and only MX formats.

  • Low-bit training evaluation is restricted to MXINT8 vs MXFP8 (no FP6/FP4 training comparisons in the provided content) (§5.3).
  • If the goal is to decide “FP4 vs INT4 for training,” this paper’s training section does not directly settle that.

  • Outlier mitigation dependence (Hadamard rotation) is a significant condition for NVINT4.

  • NVINT4 only becomes consistently better than NVFP4 after random Hadamard rotation (Table 3; Figure 4b).
  • This introduces algorithmic complexity and may carry practical constraints that are not fully quantified here, such as where rotation is inserted, its overhead, and its interaction with model architecture. The provided text evaluates the effect on κ/QSNR and inference metrics but does not give an end-to-end cost accounting for rotation.

  • Hardware model scope is intentionally partial.

  • The cost model excludes the quantizer and ignores sequential elements, placement/routing, and interconnect (Sec. C).
  • It models combinational logic and assumes equal toggle rates (τ_g = τ) (Sec. C).
  • Therefore, the normalized area/energy results (Table 5) are best interpreted as relative trends under the model, not as full-chip power predictions.

  • Reproduction details are incomplete in the provided excerpt.

  • Training hyperparameters are given for model sizes (layers, hidden size, heads, batch size, LR schedule, optimizer settings, seq length) (Table 9).
  • However, some items required for a fully specified training recipe are not present in the provided content, including:
    • tokenizer type/vocabulary,
    • whether any context-window setting beyond the sequence length of 2048 was used during training (Table 9),
    • exact hardware (GPU type/count), total compute budget, and wall-clock training time.
  • The paper points to Sec. D for more reproduction details and code link, but only portions are shown here.

  • Metric choice for inference focuses on distributional change rather than downstream task accuracy.

  • Direct-cast inference primarily uses KL divergence to BF16 outputs on WikiText2 (top-25 logits) (§5.2).
  • This is a valid “behavioral change” proxy (as motivated in §5.2), but it does not directly measure task accuracy across benchmarks for the quantized inference models in the provided main results.

7. Implications and Future Directions

  • Implication for hardware design: “FP everywhere” is not optimal under fine-grained quantization.
  • The paper’s combined results argue that when block-wise scaling is fine (e.g., block 32 MX), MXINT8 can be more accurate (Table 3; Table 12–15) and more hardware-efficient (Table 5) than MXFP8.
  • This directly challenges the trajectory described in the introduction where hardware shifts toward FP to handle outliers (§1; §7).

  • Implication for algorithm–hardware co-design: optimize for crest factor, not just bit-width.

  • The paper elevates crest factor κ (Eq. (11)) as a practical “format-selection statistic,” with explicit crossover thresholds (Figure 3; §4.2).
  • This suggests a design loop: 1) measure tensor crest factor distributions under your chosen granularity (Table 2; §5.1), 2) compare to theoretical crossovers (Figure 3), 3) decide FP vs INT and whether to add outlier mitigation (e.g., Hadamard rotation) to move κ into an INT-favorable region (Figure 4b; Table 3).
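One way to sketch this loop in code is shown below. The crossover values are the ones quoted from Figure 3 and §4.2 earlier in this summary. Using the 75th-percentile crest factor as the decision statistic mirrors how Table 2 is read above, but it is an illustrative choice rather than a rule stated by the paper, and the function names are hypothetical.

```python
import numpy as np

# Crossover crest factors as quoted in this summary (Figure 3; §4.2).
CROSSOVER = {"MXINT8": 7.55, "MXINT6": 1.96, "MXINT4": 2.04, "NVINT4": 2.39}

def q3_crest_factor(x, block):
    """75th percentile of per-block crest factors (assumes no all-zero blocks)."""
    b = x.reshape(-1, block)
    kappa = np.abs(b).max(axis=1) / np.sqrt(np.mean(b ** 2, axis=1))
    return np.quantile(kappa, 0.75)

def prefer_int(int_format, q3_kappa):
    """True if the measured crest factor sits below the INT-vs-FP crossover."""
    return q3_kappa < CROSSOVER[int_format]

# Values reported in Table 2: block 32 -> 2.96, block 16 -> 2.39, rotated block 16 -> 2.11.
print(prefer_int("MXINT8", 2.96))   # True
print(prefer_int("MXINT6", 2.96))   # False
print(prefer_int("NVINT4", 2.39))   # False (not strictly below the crossover)
print(prefer_int("NVINT4", 2.11))   # True
```

Applied to the Table 2 quantiles, this simple rule reproduces the direction of the Table 3 outcomes: INT at 8 bits, FP at 6 and 4 bits without rotation, and NVINT4 once rotation lowers κ.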

  • Practical application guidance (from the paper’s evidence)

  • If you want nearly lossless 8-bit block-wise quantized training: prefer MXINT8 with symmetric clipping, since it matches BF16 closely and slightly outperforms MXFP8 loss in the reported runs (Figure 5; Table 4; §3.2).
  • If you want 8-bit block-wise inference with MX formats: prefer MXINT8 over MXFP8 based on consistent wins across 12 models in their direct-cast setting (Table 3; Tables 12–15).
  • If you want 4-bit inference:

    • Without outlier suppression, FP4 variants (MXFP4, NVFP4) tend to win (Table 3).
    • With random Hadamard rotation, NVINT4 can surpass NVFP4 across all tested models (Table 3; Figure 4b), indicating that algorithmic outlier mitigation can flip the preferred format.
  • Follow-up research directions suggested by the presented results

  • Extend “nearly lossless” training comparisons beyond 8-bit:
    • The paper’s framework predicts why INT4 often loses without additional κ reduction (Figure 3; Table 2), but training experiments for FP4/INT4 are not provided here.
  • Jointly optimize:
    • block size (granularity),
    • outlier mitigation (e.g., rotations),
    • and format choice (INT vs FP) using measured κ distributions and QSNR-based predictions (Eq. (10)–(14); Figure 3–4).
  • Improve hardware cost accounting by integrating omitted components (quantizer, memory/interconnect, sequential timing) to validate whether the combinational MAC-centric advantage in Table 5 persists in full accelerator designs (Sec. C).

  • Repro/Integration Guidance (what the paper provides)

  • The paper provides:
    • a public code link (https://github.com/ChenMnZ/INT_vs_FP, shown in the abstract),
    • model IDs for inference evaluation (Table 8),
    • and detailed training hyperparameters for multiple Llama-3-style sizes (Table 9), including optimizer AdamW (β1=0.9, β2=0.95), weight decay 0.1, cosine LR schedule with warmup 500 steps, grad clip norm 1.0, sequence length 2048, and batch sizes (Table 9).
  • Missing from the provided content (and thus unclear here) are full system details like hardware used and compute budget, and some tokenizer/data processing specifics for training; those would matter for exact reproduction.