INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
ArXiv: 2510.25602
🎯 Pitch
This paper delivers a unified theoretical and empirical comparison of fine-grained low-bit integer (INT) vs. floating-point (FP) quantization across block-wise formats (MX, NV), revealing a crossover: FP wins at coarse granularity, but MXINT8 often outperforms MXFP8 in both accuracy and hardware efficiency at the block sizes used in practice. It also proposes practical fixes (symmetric clipping for INT training, Hadamard rotation for outlier mitigation) that enable nearly lossless MXINT8 training, and it offers algorithm–hardware guidance that challenges the current FP-centric accelerator trend.
1. Executive Summary
This paper builds a unified, fine-grained comparison between low-bit integer (INT) and low-bit floating-point (FP) quantization formats for Large Language Models (LLMs), focusing on modern block-wise schemes like MX (block size 32) and NV (block size 16) (Table 1; §2.4). It identifies a “crossover” behavior: FP tends to be better under coarse granularity (large dynamic range needed for activation outliers), but at fine granularity INT—especially MXINT8—can be simultaneously more accurate and more hardware-efficient than FP alternatives (Figure 3; Table 3; Table 5). The paper also introduces a symmetric clipping fix for INT training that removes a gradient bias and enables nearly lossless MXINT8 training compared to BF16 (Figure 2; §3.2; Figure 5; Table 4; Table 10–11).
2. Context and Motivation
- Problem / gap addressed
  - LLMs have large compute and memory demands, so quantization (storing/operating on fewer bits) is important for efficiency (§1).
  - A central challenge is activation outliers: rare but very large values in activation distributions that expand the required dynamic range, hurting low-bit quantization (§1).
  - Industry hardware trends increasingly support low-bit FP formats (e.g., FP8/FP4) because FP’s exponent gives larger dynamic range, which helps with outliers (§1; §2.4).
- What was missing
  - The paper argues there is no unified comparison of FP vs INT across quantization granularities (per-channel vs block-wise, etc.), which complicates algorithm–hardware co-design decisions (§1).
- Why granularity matters
  - Quantization granularity controls how scaling is applied:
    - Per-tensor: one scale for a whole tensor.
    - Per-channel: one scale per channel.
    - Block-k: one scale per contiguous block of `k` values (§2.3).
  - Block-wise quantization is now common for outlier mitigation, so the FP-vs-INT trade-off should be re-evaluated under these fine-grained regimes (§1; §2.3–§2.4).
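The three granularities differ only in which elements share one AbsMax-derived scale. A minimal NumPy sketch, assuming a 2-D tensor whose rows are channels and whose column count is divisible by the block size; the helper name `absmax_per_group` and the synthetic outlier are illustrative and not taken from the paper:

```python
import numpy as np

def absmax_per_group(x: np.ndarray, granularity: str, k: int = 32) -> np.ndarray:
    """Return the AbsMax of every scale-sharing group of a 2-D tensor."""
    if granularity == "per_tensor":
        return np.abs(x).max(keepdims=True)            # shape (1, 1)
    if granularity == "per_channel":
        return np.abs(x).max(axis=1, keepdims=True)    # shape (rows, 1)
    if granularity == "block":
        rows, cols = x.shape
        assert cols % k == 0, "sketch assumes cols divisible by k"
        blocks = x.reshape(rows, cols // k, k)
        return np.abs(blocks).max(axis=2)              # shape (rows, cols // k)
    raise ValueError(f"unknown granularity: {granularity}")

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
x[0, 3] = 50.0  # one synthetic outlier

per_tensor = absmax_per_group(x, "per_tensor")
per_channel = absmax_per_group(x, "per_channel")
per_block = absmax_per_group(x, "block", k=32)
print(per_tensor.shape, per_channel.shape, per_block.shape)
```

Under per-tensor scaling the single outlier sets the scale for every element; under block-32 scaling it inflates the scale of only one of the eight blocks.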
- How the paper positions itself
  - It frames FP vs INT as a function of:
    1) granularity (block size),
    2) bit-width (8/6/4-bit),
    3) data distribution outlier-ness, summarized by crest factor `κ` (Eq. (11)).
  - It introduces integer counterparts to hardware-relevant FP block formats so comparisons are “matched” (e.g., `MXFP8` vs `MXINT8`, `NVFP4` vs `NVINT4`) (Table 1; §2.4).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system studied is a set of low-bit number formats + scaling rules used to quantize tensors inside Transformer linear layers during inference and training.
- It solves the problem of deciding when INT or FP is better under block-wise quantization, using a theoretical metric (`QSNR`) plus empirical measurements on real LLM tensors and downstream inference/training evaluations (Eq. (10); §4–§6).
3.2 Big-picture architecture (diagram in words)
- (A) Quantization definitions
  - INT quantization with clipping and scaling (Eq. (1)).
  - FP quantization via nearest representable FP value after scaling (Eq. (3)), with the FP codebook defined by exponent/mantissa fields (Eq. (2); §2.2).
- (B) Block-wise formats
  - `MX` formats: block size 32 with a shared per-block `UE8M0` scale (Table 1; §2.4).
  - `NV` formats: block size 16 with a per-block `E4M3` scale and an additional per-tensor scale (“Scale-2”) (Table 1; §2.4).
- (C) Evaluation pipeline
  - A theoretical framework predicts quantization fidelity via `QSNR` as a function of crest factor and format parameters (Theorems 1–2; Eq. (13)–(14); Figure 3).
  - Empirical tensor capture + QSNR measurement validates theory on real tensors (Figure 1; §5.1; Table 2; Figure 4).
  - Downstream tests: direct-cast inference behavior change via KL divergence, and low-bit training loss/accuracy vs BF16 (Table 3; Table 12–15; Figure 5; Table 4).
- (D) Hardware model
  - A gate-level cost model compares normalized area/energy for MAC-centric MMUs under INT vs FP formats (Sec. C; Table 5; Table 6–7).
3.3 Roadmap for the deep dive
- First, define the quantization operations and the role of scales and granularity (§2–§3).
- Next, define the fidelity metric QSNR and the key distribution summary, the crest factor `κ` (Eq. (10)–(11)).
- Then, explain the two QSNR theorems: why INT depends strongly on `κ` and bit-width, and why FP is bounded by mantissa bits and subnormals (Theorems 1–2; Eq. (13)–(14)).
- After that, connect theory to measured real-tensor crest factors and QSNR (Table 2; Figure 4).
- Finally, cover the downstream inference/training results and hardware cost trade-offs (Table 3–5; Figure 5; Sec. C).
3.4 Detailed, sentence-based technical breakdown
- Paper type and core idea.
  - This is a combined theoretical + empirical systems paper: it derives analytical approximations for quantization error (via `QSNR`) and validates them with real LLM tensor statistics, then links those findings to inference, training, and hardware cost comparisons (Sec. 4–6).
- Quantization: what happens mathematically.
  - For INT quantization, the paper uses a standard “scale → round → clip → dequantize” mapping:
    - A high-precision tensor `X` is scaled by `s`, rounded to the nearest integer, clipped into the representable integer range, then scaled back (Eq. (1); §2.1).
  - For FP quantization, the tensor is similarly scaled, but the normalized value is mapped to the nearest representable low-bit floating-point value (`C_FP`) defined by exponent/mantissa rules, then scaled back (Eq. (2)–(3); §2.2).
  - The paper emphasizes that both INT and FP quantizers can be written in a shared form (Eq. (3)) by swapping the “codebook” (`C_FP` vs an integer codebook).
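The shared "swap the codebook" view can be sketched as a single nearest-entry quantizer. This is an illustrative NumPy sketch, not the paper's implementation: `quantize_to_codebook` is a hypothetical helper, the scale is treated as a step size (the value is divided by it before snapping), and ties are broken toward the lower codebook entry rather than by round-half-to-even. The E2M1 magnitudes listed are the standard FP4 values (two exponent bits, one mantissa bit).

```python
import numpy as np

def quantize_to_codebook(x, scale, codebook):
    """Divide by the scale, snap to the nearest codebook entry, scale back.

    Values outside the codebook range snap to its endpoints, which plays
    the role of the clip step in the INT mapping.
    """
    codebook = np.asarray(codebook, dtype=np.float64)
    normalized = np.asarray(x, dtype=np.float64) / scale
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    return codebook[idx] * scale

# Symmetric INT4 codebook: the integers in [-7, 7].
INT4 = np.arange(-7, 8)

# E2M1 (FP4) magnitudes: subnormals {0, 0.5}, then normals {1, 1.5, 2, 3, 4, 6}.
E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1 = np.concatenate([-E2M1_POS[:0:-1], E2M1_POS])

x = np.array([0.26, 5.2, -9.0])
x_int4 = quantize_to_codebook(x, 1.0, INT4)
x_fp4 = quantize_to_codebook(x, 1.0, E2M1)
print(x_int4, x_fp4)
```

The INT codebook is evenly spaced, while the FP codebook is dense near zero and sparse near its maximum, which is the uniform-precision vs dynamic-range trade-off the paper analyzes.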
- Granularity and why block-wise matters.
  - In `block-k` quantization, the tensor is partitioned into blocks (e.g., length 32 for MX), and each block gets its own scale factor (§2.3).
  - The key intuition the paper uses is: smaller blocks reduce the local dynamic range, which can reduce outlier impact within each block and change whether FP’s exponent advantage is still necessary (§1; §4.2).
- Block formats compared (what exactly are MX vs NV).
  - `MX` formats (from OCP microscaling) use:
    - Block size `32`.
    - A shared per-block `UE8M0` scale factor (8-bit unsigned float with exponent bits only) (Table 1; §2.4; footnote on UE8M0).
    - The data in the block is stored in a low-bit type like `E4M3` (MXFP8), `E2M3` (MXFP6), or `E2M1` (MXFP4) (Table 1; §2.4).
  - `NVFP4` modifies MXFP4 by:
    - Using block size `16`.
    - Using an `E4M3` scale and adding a second-level per-tensor scale to avoid overflow (Table 1; §2.4).
- Scale factor computation and its overhead.
  - Scales start from an `AbsMax` strategy:
    - Compute `s = AbsMax(X)/Qmax` where `AbsMax(X)` is the maximum absolute value in the sharing group and `Qmax` is the maximum representable value of the quantized type (Eq. (7); §3.2).
  - For MX, scales are quantized into `UE8M0`. The paper notes that rounding the scale down can introduce extra clipping error (Eq. (8); §3.2), so it uses rounding up:
    - `s′ = 2^{clip(ceil(log2(s)), -127, 127)}` (Eq. (9); §3.2).
  - In the theory, this “scale quantization” is abstracted as an overhead factor `ρ` where `s′ = ρ s` (Eq. (12)), with `ρ ∈ [1,2)` for UE8M0, and approximately `ρ = 1` for higher-precision scales like E4M3 (Sec. 4.1 assumptions).
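The AbsMax scale, the UE8M0 round-up, and the resulting overhead factor can be traced with a few lines of arithmetic. A minimal sketch, assuming a strictly positive scale (an all-zero block would need separate handling); the function names and the example AbsMax of 3.0 are illustrative:

```python
import math

def absmax_scale(absmax: float, qmax: float) -> float:
    """Eq. (7): s = AbsMax(X) / Qmax."""
    return absmax / qmax

def ue8m0_round_up(s: float) -> float:
    """Eq. (9): round the scale up to a power of two, exponent clipped to [-127, 127]."""
    exponent = math.ceil(math.log2(s))
    exponent = max(-127, min(127, exponent))
    return 2.0 ** exponent

s = absmax_scale(3.0, 127)      # symmetric INT8, Qmax = 127
s_q = ue8m0_round_up(s)         # power-of-two scale actually stored
rho = s_q / s                   # overhead factor from Eq. (12)
print(f"s={s:.5f}  s'={s_q}  rho={rho:.3f}")
```

Because the scale is rounded up, the largest element never exceeds `Qmax` after division, at the price of a step size up to (but not including) twice the ideal one.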
- Training/inference compute flow: where quantization is applied.
  - The paper explicitly models quantization inside a linear layer’s GEMMs for forward and backward propagation (Figure 1; §3.1).
  - Forward pass:
    - Quantize activations `X` and weights `W`, then GEMM to produce `Y` (Eq. (4); Figure 1 marks this as operations ① and ②).
  - Backward pass:
    - Compute `dX` from quantized `dY` and quantized `W^T` (Eq. (5); Figure 1 operations ③ and ④).
    - Compute `dW` from quantized `X^T` and quantized `dY^T` (Eq. (6); Figure 1 operations ⑤ and ⑥).
  - A practical hardware constraint is noted: block-wise quantization should align with the GEMM reduction dimension to gain hardware benefits (§3.1).
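The six quantization points and the reduction-dimension alignment can be sketched with a fake quantizer (quantize, then immediately dequantize). Everything below is an assumption made for illustration: the shape convention `X: (tokens, d_in)`, `W: (d_in, d_out)`, the helper `fake_quant_int8`, and the use of an unrounded AbsMax scale instead of UE8M0.

```python
import numpy as np

def fake_quant_int8(x: np.ndarray, axis: int, k: int = 32) -> np.ndarray:
    """Block-k symmetric INT8 fake quantization with blocks laid out along `axis`."""
    moved = np.moveaxis(x, axis, -1)
    shape = moved.shape
    blocks = moved.reshape(*shape[:-1], shape[-1] // k, k)
    absmax = np.abs(blocks).max(axis=-1, keepdims=True)
    s = np.where(absmax > 0, absmax / 127.0, 1.0)       # AbsMax scale per block
    q = np.clip(np.rint(blocks / s), -127, 127)         # symmetric range
    return np.moveaxis((q * s).reshape(shape), -1, axis)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 128))    # (tokens, d_in)
W = rng.standard_normal((128, 96))    # (d_in, d_out)
dY = rng.standard_normal((64, 96))    # (tokens, d_out)

# Each operand is blocked along the dimension that the GEMM sums over.
Y = fake_quant_int8(X, axis=1) @ fake_quant_int8(W, axis=0)         # reduce over d_in
dX = fake_quant_int8(dY, axis=1) @ fake_quant_int8(W.T, axis=0)     # reduce over d_out
dW = fake_quant_int8(X.T, axis=1) @ fake_quant_int8(dY, axis=0)     # reduce over tokens

rel_err = np.linalg.norm(Y - X @ W) / np.linalg.norm(X @ W)
print(Y.shape, dX.shape, dW.shape, f"forward relative error {rel_err:.4f}")
```

Note that `X` is blocked along `d_in` in the forward pass but along the token dimension for `dW`, so the same tensor is quantized twice with different block orientations.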
- Symmetric clipping to fix INT training bias (paper’s algorithmic fix).
  - Two’s complement signed INT ranges are asymmetric: for b-bit signed INT, there is one extra negative value (e.g., `[-128, 127]` for INT8) (§3.2).
  - The paper finds this usually does not harm inference but degrades INT training due to a persistent negative gradient bias, especially for finer-grained quantization where more blocks exist and more values can hit the unique negative endpoint (Figure 2; §3.2).
  - The proposed fix is to enforce a symmetric INT range:
    - `Qmin = -(2^{b-1} - 1)`, `Qmax = 2^{b-1} - 1` (stated in §3.2; also reflected in Table 1’s “MXINT8 127” range and the mention of `[-127, 127]` in Figure 2 legend).
  - The appendix provides an ablation showing training loss differences under `[-128,127]` vs `[-127,127]` across block sizes and scale types (Table 10), and attributes part of the problem to limited precision in BF16 scale computations (Algorithm 1; Table 11; §D.2).
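The two clipping ranges are easy to compare directly. The sketch below only illustrates the asymmetry of the two's-complement codebook; it does not reproduce the paper's gradient-bias measurement or its BF16 analysis, and the helper names are hypothetical.

```python
def int_range(bits: int, symmetric: bool):
    """Return (Qmin, Qmax) for a signed integer format."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax if symmetric else -qmax - 1
    return qmin, qmax

def codebook_mean(bits: int, symmetric: bool) -> float:
    """Mean of all representable integers; nonzero means the grid is off-center."""
    qmin, qmax = int_range(bits, symmetric)
    return sum(range(qmin, qmax + 1)) / (qmax - qmin + 1)

def clamp(q: int, bits: int, symmetric: bool) -> int:
    qmin, qmax = int_range(bits, symmetric)
    return max(qmin, min(qmax, q))

print(int_range(8, symmetric=False), codebook_mean(8, symmetric=False))
print(int_range(8, symmetric=True), codebook_mean(8, symmetric=True))
# A rounded value that lands on -128 is pulled back to -127 by symmetric clipping.
print(clamp(-128, 8, symmetric=True))
```

With the symmetric range, every representable negative value has a positive mirror image, so the extra negative endpoint can no longer be reached regardless of how the scale was rounded.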
- Theoretical framework: QSNR and crest factor.
  - The paper uses `QSNR` (Quantization Signal-to-Noise Ratio) to quantify numerical fidelity:
    - `QSNR = -10 log10( ||X - Xq||^2 / ||X||^2 )` in dB (Eq. (10); §4.1).
    - Higher QSNR means less quantization noise relative to signal energy.
  - It models per-block data `X ∈ R^k` with i.i.d. Gaussian entries and summarizes outlier strength using the crest factor:
    - `κ = max(|X|)/σ` where `σ` is the block RMS (Eq. (11); §4.1).
    - Intuitively, larger `κ` means “more extreme outliers relative to typical scale.”
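Both quantities are one-liners in NumPy. A minimal sketch of Eq. (10) and Eq. (11) as stated above, with `σ` taken as the block RMS; the function names and toy inputs are illustrative:

```python
import numpy as np

def qsnr_db(x, x_q) -> float:
    """QSNR = -10 log10(||X - Xq||^2 / ||X||^2), in dB."""
    x = np.asarray(x, dtype=np.float64)
    x_q = np.asarray(x_q, dtype=np.float64)
    noise = np.sum((x - x_q) ** 2)
    signal = np.sum(x ** 2)
    return float(-10.0 * np.log10(noise / signal))

def crest_factor(block) -> float:
    """kappa = max|X| / RMS(X) for one block."""
    block = np.asarray(block, dtype=np.float64)
    rms = np.sqrt(np.mean(block ** 2))
    return float(np.max(np.abs(block)) / rms)

print(qsnr_db([1.0, 1.0], [0.9, 0.9]))      # 10% relative error, about 20 dB
print(crest_factor([1.0, -1.0, 1.0, -1.0])) # flat block: kappa = 1
print(crest_factor([0.0, 0.0, 0.0, 2.0]))   # single spike: kappa = 2
```

A perfectly flat block has the minimum crest factor of 1, and a block dominated by a single spike of length `k` reaches the maximum of `sqrt(k)`.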
- Theorem 1: INT QSNR scaling law (what it says and why).
  - Theorem 1 approximates INT QSNR as (Eq. (13)):
    - For UE8M0-scaled MX-like settings: `QSNR_INT ≈ 4.78 + 6.02 b − 20 log10(ρ) − 20 log10(κ)`.
    - For E4M3-scaled NV-like settings: similar but without `ρ`, plus a block-size factor `10 log10(g/(g−1))` (Eq. (13)).
  - Mechanistic interpretation given in §4.1:
    - Each additional bit gives ~`6.02 dB` (because uniform quantization error shrinks exponentially with bit-width).
    - Scale rounding overhead `ρ` can cost up to `6.02 dB`.
    - Larger crest factor `κ` reduces QSNR; smaller blocks tend to reduce `κ`.
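The scaling law is simple enough to evaluate by hand, which makes the three interpretations above easy to check numerically. The sketch below transcribes the approximations exactly as summarized here (it is not a derivation), and the function names are hypothetical:

```python
import math

def qsnr_int_mx_db(bits: int, kappa: float, rho: float = 1.0) -> float:
    """UE8M0-scaled (MX-like) approximation from Eq. (13) as summarized above."""
    return 4.78 + 6.02 * bits - 20 * math.log10(rho) - 20 * math.log10(kappa)

def qsnr_int_nv_db(bits: int, kappa: float, g: int) -> float:
    """E4M3-scaled (NV-like) variant: no rho term, plus the block-size factor."""
    return 4.78 + 6.02 * bits - 20 * math.log10(kappa) + 10 * math.log10(g / (g - 1))

for rho in (1.0, 1.5, 2.0):
    print(f"INT8, kappa=2.96, rho={rho}: {qsnr_int_mx_db(8, 2.96, rho):.2f} dB")
print(f"INT4, kappa=2.39, g=16: {qsnr_int_nv_db(4, 2.39, 16):.2f} dB")
```

Doubling `κ` costs the same 6.02 dB as losing one bit, which is why shrinking the block (and with it the typical `κ`) can substitute for extra bit-width.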
- Theorem 2: FP QSNR structure (mantissa-bound + subnormal effects).
  - Theorem 2 approximates FP QSNR as a function of:
    - A mantissa-resolution term `α_M` (depends on mantissa bit-width `M`),
    - Energy fraction in the normal region `w_norm`,
    - Probability of being subnormal `p_sub`,
    - Crest factor `κ`, scale overhead `ρ`, and block size `g` (Eq. (14); §4.1).
  - Key interpretation (§4.1):
    - With enough dynamic range so almost everything is “normal” (i.e., `w_norm ≈ 1` and `p_sub ≈ 0`), FP QSNR is bounded by mantissa bits: `QSNR ≈ 13.80 + 6.02 M` dB, largely independent of block size and distribution.
    - As crest factor grows, more values fall into subnormals and QSNR drops.
    - Finer blocks help by reducing `κ` (reducing subnormals) but FP still faces a mantissa-limited ceiling.
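The mantissa-limited ceiling depends only on `M`, so it can be tabulated for the MX element types named earlier. A small sketch of that arithmetic; it covers only the all-normal regime (`w_norm ≈ 1`, `p_sub ≈ 0`) and ignores the subnormal terms of Eq. (14), and the function name is hypothetical:

```python
def fp_qsnr_ceiling_db(mantissa_bits: int) -> float:
    """Mantissa-limited QSNR, valid only when nearly all values are normal."""
    return 13.80 + 6.02 * mantissa_bits

# Mantissa widths read off the element type names: E4M3, E2M3, E2M1.
for name, m in [("MXFP8 (E4M3)", 3), ("MXFP6 (E2M3)", 3), ("MXFP4 (E2M1)", 1)]:
    print(f"{name}: ceiling {fp_qsnr_ceiling_db(m):.2f} dB")
```

MXFP8 and MXFP6 share the same ceiling because both carry three mantissa bits; the extra exponent bits in E4M3 buy dynamic range, not resolution.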
- Crossover prediction and its operational meaning.
  - Combining Theorems 1 and 2, the paper plots theoretical QSNR vs crest factor for multiple formats and identifies crossover crest factors where INT becomes better than FP (Figure 3; §4.2).
  - Reported crossover points in §4.2 / Figure 3 include:
    - `MXINT8` beats `MXFP8` when `κ < 7.55`.
    - `MXINT6` beats `MXFP6` only when `κ < 1.96`.
    - `MXINT4` beats `MXFP4` only when `κ < 2.04`.
    - `NVINT4` beats `NVFP4` when `κ < 2.39`.
  - This sets up the empirical strategy: measure real-tensor `κ` under different block sizes and see which side of the crossover typical tensors fall on (§5.1; Table 2).
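The reported thresholds can be turned into a small lookup for a measured crest factor. This is only a sketch: the dictionary restates the §4.2 / Figure 3 values quoted above, and `predicted_winner` is a hypothetical helper name, not part of the paper's code.

```python
# Crossover crest factors as quoted above: INT is predicted to win below the threshold.
CROSSOVER_KAPPA = {
    ("MXINT8", "MXFP8"): 7.55,
    ("MXINT6", "MXFP6"): 1.96,
    ("MXINT4", "MXFP4"): 2.04,
    ("NVINT4", "NVFP4"): 2.39,
}

def predicted_winner(int_fmt: str, fp_fmt: str, kappa: float) -> str:
    """Return the format the theory favors for a block with crest factor kappa."""
    threshold = CROSSOVER_KAPPA[(int_fmt, fp_fmt)]
    return int_fmt if kappa < threshold else fp_fmt

kappa = 2.96  # example value; substitute a crest factor measured on your own tensors
for pair in CROSSOVER_KAPPA:
    print(pair, "->", predicted_winner(*pair, kappa))
```

The prediction is per block and purely theoretical; tensor-level and model-level outcomes depend on the full distribution of `κ`, not on a single value.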
- Worked micro-example (single block) to illustrate the mechanism (constructed from the paper’s definitions).
  - Suppose you quantize a block of `g = 32` values with `AbsMax` scaling (Eq. (7)) using an INT4 codebook `[-7, 7]` (Table 1).
  - Step 1: compute `AbsMax(X)` over the 32 values, then `s = AbsMax(X)/7` (Eq. (7)).
  - Step 2: (MX case) quantize the scale to a power-of-two `s′` via UE8M0 rounding-up (Eq. (9)), which introduces overhead `ρ` where `s′ = ρ s` (Eq. (12)).
  - Step 3: for each value `x` in the block, compute `q = round(x/s′)`, clamp `q` into `[-7,7]`, and dequantize `x_q = q s′` (Eq. (1) / Eq. (19) in Appendix B.2).
  - If the block has a large outlier (large `κ`), `AbsMax(X)` grows, `s′` grows, and the step size grows, so most non-outlier values get coarsely quantized, lowering QSNR (captured by the `-20 log10(κ)` term in Eq. (13)).
  - If you reduce block size (e.g., from per-channel to 32 or 16), the maximum within each block tends to be smaller relative to RMS (lower `κ`), so INT quantization becomes more competitive (Table 2; §5.1).
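The three steps can be run end to end on a synthetic block. This sketch uses made-up data (an evenly spaced block, then the same block with one value replaced by 16) purely to show the mechanism; the numbers it prints are not results from the paper.

```python
import math
import numpy as np

def quantize_block_int4_mx(x: np.ndarray):
    """Steps 1-3: AbsMax scale, UE8M0 round-up, round/clamp to [-7, 7], dequantize."""
    s = np.abs(x).max() / 7.0                                    # Step 1, Eq. (7)
    s_q = 2.0 ** max(-127, min(127, math.ceil(math.log2(s))))    # Step 2, Eq. (9)
    q = np.clip(np.rint(x / s_q), -7, 7)                         # Step 3
    return q * s_q, s_q

def qsnr_db(x, x_q) -> float:
    return float(-10.0 * np.log10(np.sum((x - x_q) ** 2) / np.sum(x ** 2)))

def crest_factor(x) -> float:
    return float(np.abs(x).max() / np.sqrt(np.mean(x ** 2)))

clean = np.linspace(-1.0, 1.0, 32)   # synthetic block without an outlier
outlier = clean.copy()
outlier[-1] = 16.0                   # same block with one large value

results = {}
for name, block in [("clean", clean), ("outlier", outlier)]:
    x_q, s_q = quantize_block_int4_mx(block)
    results[name] = (crest_factor(block), s_q, qsnr_db(block, x_q))
    print(f"{name}: kappa={results[name][0]:.2f} scale={s_q} QSNR={results[name][2]:.1f} dB")
```

In the outlier block the step size grows from 0.25 to 4, so every non-outlier value rounds to zero and only the outlier itself survives quantization.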
4. Key Insights and Innovations
- 1) “Crossover” framing: INT vs FP depends on crest factor and granularity, not just “FP is better for outliers.”
  - Novelty: The paper ties FP-vs-INT comparison to a single interpretable quantity, the crest factor `κ`, and shows how smaller blocks reduce `κ`, shifting the trade-off (Eq. (11); Figure 3; Table 2).
  - Significance: This provides a concrete design knob (block size) to decide whether INT’s uniform precision beats FP’s dynamic range for a given workload (§1; §4.2; §5.1).
- 2) Theoretical QSNR framework for both INT and FP under block-wise scaling.
  - Novelty: The paper provides explicit approximations for QSNR under INT (Theorem 1; Eq. (13)) and FP (Theorem 2; Eq. (14)) that incorporate:
    - bit-width `b` (INT),
    - mantissa bits `M` (FP),
    - scale overhead `ρ`,
    - block size `g`,
    - crest factor `κ`.
  - Significance: This enables format-level predictions like “MXINT8 should beat MXFP8 for most blocks if κ is typically below 7.55” (Figure 3; §4.2), which the paper later corroborates empirically (Figure 4; Table 3).
- 3) Empirical finding: `MXINT8` dominates `MXFP8` in fine-grained settings for both inference and training.
  - Inference: `MXINT8` wins over `MXFP8` on all 12 evaluated models in direct-cast inference, with or without random Hadamard rotation (Table 3; per-model KL in Table 12–13; perplexity in Table 14–15).
  - Training: `MXINT8` achieves nearly BF16-matching loss and task accuracy, and slightly lower loss than `MXFP8` in their runs (Figure 5; Table 4).
- 4) Algorithmic training fix: symmetric clipping to remove INT gradient bias under fine granularity.
  - Novelty: The paper isolates an INT-training-specific failure mode from two’s-complement asymmetry and BF16 scale precision, and fixes it by enforcing `[-(2^{b-1}-1), 2^{b-1}-1]` instead of the default INT range (§3.2; Figure 2; Table 10–11).
  - Significance: This turns INT8 training from “degraded” to “nearly lossless” in their reported experiments (Figure 5; Table 4; Table 10).
- 5) Hardware cost analysis suggesting INT block formats are materially cheaper at matched throughput.
  - Novelty: A gate-level MAC-centric model (Sec. C) is used to compare normalized energy/area across formats, including mixed-format chips with a 1:2 throughput ratio for 8-bit vs 4-bit (Table 5; Table 7).
  - Significance: Reported normalized costs show substantial energy/area advantages for INT pipelines (Table 5), challenging an FP-only hardware trajectory (§6; §7).
5. Experimental Analysis
Evaluation methodology
- Tensor-wise QSNR & crest factor measurement (§5.1)
  - Model: `Llama-3.1-8B`.
  - Data: 8 `WikiText2` sequences, length `4096` (§5.1).
  - Procedure:
    - Run forward+backward in BF16 and capture six intermediate tensors per linear layer corresponding to Figure 1’s quantization points ①–⑥ (Eq. (4)–(6)).
    - Across 224 linear layers × 6 tensors × 8 sequences = `10,752` tensors (§5.1).
    - Compute block-wise crest factor `κ` and tensor-wise QSNR under different formats and block sizes.
    - Also test random `Hadamard rotation` (dimension 32×32) as an outlier suppression technique (§5.1).
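A random Hadamard rotation spreads the energy of a single large value across the whole block without changing the block's norm. The sketch below assumes one common construction (a Sylvester Hadamard matrix with random column signs, scaled by `1/sqrt(n)`); the paper's exact construction may differ, and the synthetic block is illustrative.

```python
import numpy as np

def random_hadamard(n: int, rng: np.random.Generator) -> np.ndarray:
    """n x n orthogonal matrix (n a power of two): Sylvester Hadamard with random column signs."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    signs = rng.choice([-1.0, 1.0], size=n)
    return h * signs / np.sqrt(n)

def crest_factor(x) -> float:
    return float(np.abs(x).max() / np.sqrt(np.mean(x ** 2)))

rng = np.random.default_rng(0)
R = random_hadamard(32, rng)

block = np.linspace(-1.0, 1.0, 32)
block[-1] = 16.0                      # synthetic outlier
rotated = block @ R

print(f"kappa before: {crest_factor(block):.2f}, after: {crest_factor(rotated):.2f}")
```

Because `R` is orthogonal, the RMS is unchanged and only the maximum shrinks, which is exactly the effect needed to lower `κ`.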
- Direct-cast inference (§5.2)
  - Formats compared: `MXFP8`/`MXINT8`, `MXFP6`/`MXINT6`, `MXFP4`/`MXINT4`, `NVFP4`/`NVINT4` (Table 1; §5.2).
  - Models: 12 LLMs, spanning dense and MoE, from `0.6B` up to `235B` parameters (§5.2; Table 8 lists Hugging Face IDs).
  - Quantization scope: quantize all forward GEMMs (“direct-cast inference from a pretrained BF16 model”) (§5.2).
  - Metric:
    - KL divergence on WikiText2 between quantized model output distribution and BF16 counterpart, computed on the top-25 BF16 logits (to reduce noise), averaged over tokens and scaled by `10^6` in tables (Table 12–13; §5.2).
  - Also evaluated with random Hadamard rotation by quantizing `XR` and `R^T W`, with `h` (the rotation dimension) equal to the block size (32 for MX, 16 for NV) (§5.2).
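One way to compute a top-k KL metric of this kind is sketched below. Several details are assumptions, since the summary does not spell them out: the top-25 indices are taken from the reference (BF16) logits, both distributions are renormalized with a softmax over just those 25 logits, and the direction is KL(reference ‖ quantized). The function names and random logits are illustrative.

```python
import numpy as np

def log_softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def topk_kl(ref_logits: np.ndarray, quant_logits: np.ndarray, k: int = 25) -> float:
    """Mean over tokens of KL(ref || quant), restricted to the reference's top-k logits."""
    idx = np.argsort(ref_logits, axis=-1)[..., -k:]            # top-k positions per token
    ref = log_softmax(np.take_along_axis(ref_logits, idx, axis=-1))
    qnt = log_softmax(np.take_along_axis(quant_logits, idx, axis=-1))
    kl = (np.exp(ref) * (ref - qnt)).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
ref = rng.standard_normal((10, 100))                 # (tokens, vocab), synthetic
noisy = ref + 0.1 * rng.standard_normal((10, 100))   # stand-in for a quantized model

print(f"KL x 1e6, identical: {1e6 * topk_kl(ref, ref):.1f}")
print(f"KL x 1e6, perturbed: {1e6 * topk_kl(ref, noisy):.1f}")
```

Restricting to the top logits keeps the metric focused on tokens that carry probability mass, so noise in the long tail of the vocabulary does not dominate the average.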
- Low-bit training (§5.3)
  - Focus: only 8-bit “nearly lossless” training; compare `MXINT8` vs `MXFP8` because prior work already shows near-lossless FP8 training (as referenced by the paper) (§5.3).
  - Models and data:
    - 1B and 3B Llama-3-style models (§D.1), trained on the `OLMo2-Mix-1124` dataset with `100B` and `200B` tokens, respectively (§5.3).
  - Metrics:
    - Training loss curves (EMA smoothing coefficient `0.9`) (Figure 5; §5.3).
    - 5-shot evaluation accuracy via `lm_eval` on six tasks (Arc_Easy, Arc_Challenge, HellaSwag, OpenbookQA, PIQA, WinoGrande) (Table 4; §5.3).
Main quantitative results (with numbers)
- Crest factor statistics show granularity shifts κ into INT-favorable regimes (§5.1).
  - Table 2 (Q3 = 75% quantile) shows:
    - Per-channel (block size “-1”): Q3 crest factor `11.97` (very high, above many crossovers).
    - Block 32: Q3 `2.96`.
    - Block 16: Q3 `2.39`.
    - With Hadamard rotation, Q3 drops further (e.g., block 16 Q3 `2.11`) (Table 2).
  - The paper interprets this against Figure 3 crossovers:
    - `2.96 < 7.55` ⇒ MXINT8 should beat MXFP8 “in most cases.”
    - But `2.96 > 1.96` and `2.96 > 2.04` ⇒ MXINT6 and MXINT4 should lose vs FP counterparts (Table 2; §5.1; §4.2).
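The same kind of statistic can be computed on any captured tensor. The sketch below uses synthetic i.i.d. Gaussian data as a stand-in, so its numbers will not match Table 2 (real activations have much heavier per-channel outliers); it only shows how the Q3 crest factor is measured and that it shrinks with block size. The helper name is hypothetical.

```python
import numpy as np

def block_crest_factors(x: np.ndarray, k: int) -> np.ndarray:
    """Crest factor of every contiguous length-k block along the last axis."""
    blocks = x.reshape(-1, k)
    rms = np.sqrt(np.mean(blocks ** 2, axis=1))
    return np.abs(blocks).max(axis=1) / rms

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 4096))   # synthetic stand-in for a captured tensor

q3 = {
    "per-channel": float(np.quantile(block_crest_factors(x, 4096), 0.75)),
    "block 32": float(np.quantile(block_crest_factors(x, 32), 0.75)),
    "block 16": float(np.quantile(block_crest_factors(x, 16), 0.75)),
}
for name, value in q3.items():
    print(f"{name}: Q3 kappa = {value:.2f}")
```

Comparing the measured Q3 against the crossover thresholds gives a quick read on which side of the INT-vs-FP boundary most blocks of a tensor fall.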
- Measured QSNR over 10,752 real tensors matches theoretical predictions (Figure 4; §5.1).
  - Without rotation (Figure 4a):
    - `MXINT8` average QSNR `40.35 dB` vs `MXFP8` `31.50 dB`; INT win-rate `100%` for this dataset of tensors (Figure 4a).
    - `MXINT6` average `28.57 dB` vs `MXFP6` `30.39 dB`; FP wins `99.9%` (Figure 4a).
    - `MXINT4` average `15.89 dB` vs `MXFP4` `17.98 dB`; FP wins `99.7%` (Figure 4a).
    - `NVINT4` average `20.55 dB` vs `NVFP4` `20.60 dB`; INT win-rate `64.3%` but slightly lower average (Figure 4a).
  - With Hadamard rotation (Figure 4b):
    - `NVINT4` average `21.65 dB` vs `NVFP4` `20.35 dB`; INT wins `99.3%` (Figure 4b).
    - The paper notes `NVFP4` QSNR can decrease when κ decreases below ~4 due to its normal vs subnormal error trade-off (consistent with Figure 3 and Eq. (14)) (§5.1).
- Direct-cast inference: consistent wins/losses by bit-width and rotation (Table 3; Table 12–13).
  - Summary win counts across 12 models (Table 3):
    - `MXINT8` vs `MXFP8`: INT wins `12/12` (both original and with rotation).
    - `MXINT6` vs `MXFP6`: FP wins `12/12` originally; with rotation INT wins `1/12`.
    - `MXINT4` vs `MXFP4`: FP wins `12/12` (original and with rotation).
    - `NVINT4` vs `NVFP4`: FP wins `12/12` originally; with rotation INT wins `12/12`.
  - Per-model KL divergence tables (Table 12–13) concretely reflect these patterns; e.g., for Qwen3-8B:
    - Original: `NVINT4` `5120` vs `NVFP4` `3430` (FP better).
    - With rotation: `NVINT4` `4026` vs `NVFP4` `5912` (INT better) (Table 12).
  - Perplexity on WikiText2 (Table 14–15) is broadly consistent with KL trends, and shows MXINT8 is very close to BF16 perplexity in many cases (e.g., Qwen3-8B BF16 `6.5135` vs MXINT8 `6.5174`) (Table 14).
- Low-bit training: MXINT8 is nearly lossless and comparable to BF16, slightly better loss than MXFP8 (Figure 5; Table 4).
  - Loss curves (Figure 5): BF16, MXFP8, and MXINT8 curves “almost overlap,” with MXINT8 lower by ~`0.001` loss in the zoomed view (§5.3).
  - Final reported losses and task accuracies (Table 4):
    - 1B / 100B tokens: BF16 loss `2.6727`, MXFP8 `2.6767`, MXINT8 `2.6758`.
    - 3B / 200B tokens: BF16 loss `2.4794`, MXFP8 `2.4821`, MXINT8 `2.4812`.
    - Average task accuracy: 1B BF16 `56.89`, MXFP8 `56.86`, MXINT8 `57.02`; 3B BF16 `64.45`, MXFP8 `64.05`, MXINT8 `64.30` (Table 4).
- Symmetric clipping ablation: asymmetric INT8 range hurts training (Figure 2; Table 10; §D.2).
  - Figure 2 shows higher training loss when using `[-128,127]` compared to `[-127,127]` on a `145M` model trained with `20B` tokens, particularly as granularity becomes finer (block size 32).
  - Table 10 quantifies this across granularities and scale types:
    - For BF16 scale at block 32: `[-128,127]` loss `3.1354` vs `[-127,127]` `3.1251`.
    - Similar but smaller effects under UE8M0 scales (Table 10).
  - Algorithm 1 / Table 11 further support that BF16 arithmetic can cause unintended mapping to `-128` even when scaling is “theoretically” set to avoid it (Table 11 reports `16.82%` ratio for BF16 vs `0` for FP32).
- Hardware cost analysis: normalized energy/area favors INT (Sec. 6; Table 5).
  - At matched throughput, normalized costs (Table 5):
    - Single-format: `MXINT8` energy `0.63×` and area `0.79×` relative to `MXFP8` baseline; `NVINT4` energy `0.34×` and area `0.38×` relative to `MXFP8` baseline (Table 5).
    - Mixed-format (8-bit + 4-bit at throughput ratio 1:2): `MXINT8+NVINT4` energy `0.75×` and area `0.66×` relative to `MXFP8+NVFP4` baseline (Table 5).
  - The model scope explicitly excludes quantizer cost and focuses on MAC + dequantizer + FP32 accumulator (Sec. C), so the conclusions are “within that model.”
Do the experiments support the claims?
- The paper’s central empirical claims are strongly aligned across:
  - theory (Figure 3),
  - real-tensor crest factor distributions (Table 2),
  - real-tensor QSNR measurements (Figure 4),
  - model-level inference metrics across 12 LLMs (Table 3; Table 12–15),
  - and low-bit training comparisons for 8-bit (Figure 5; Table 4),
  - plus a hardware cost model (Table 5; Sec. C).
- The most conditional claim is the 4-bit NV result:
  - Without rotation, `NVFP4` wins `12/12`; with Hadamard rotation, `NVINT4` wins `12/12` (Table 3).
  - This is consistent with the paper’s framing that format choice depends on outlier mitigation changing crest factor (`κ`) (Table 2; Figure 4b).
6. Limitations and Trade-offs
- Theory relies on stylized distribution assumptions.
  - Theorem derivations assume i.i.d. Gaussian block entries (`X_i ~ N(0, σ^2)`) (§4.1; Appendix B.1).
  - Real LLM tensors may deviate (heavy tails, correlations, structured outliers). The paper partially addresses this by measuring real tensors (Table 2; Figure 4), but the theoretical accuracy outside these assumptions is not guaranteed.
- Training experiments cover only 8-bit and only MX formats.
  - Low-bit training evaluation is restricted to `MXINT8` vs `MXFP8` (no FP6/FP4 training comparisons in the provided content) (§5.3).
  - If the goal is to decide “FP4 vs INT4 for training,” this paper’s training section does not directly settle that.
- Outlier mitigation dependence (Hadamard rotation) is a significant condition for NVINT4.
  - `NVINT4` only becomes consistently better than `NVFP4` after random Hadamard rotation (Table 3; Figure 4b).
  - This introduces algorithmic complexity and may have practical constraints not fully quantified here (e.g., where rotation is inserted, overheads, interactions with model architecture); the provided text evaluates the effect on κ/QSNR and inference metrics, but does not provide a full end-to-end cost accounting for rotation.
- Hardware model scope is intentionally partial.
  - The cost model excludes the quantizer and ignores sequential elements, placement/routing, and interconnect (Sec. C).
  - It models combinational logic and assumes equal toggle rates (`τ_g = τ`) (Sec. C).
  - Therefore, the normalized area/energy results (Table 5) are best interpreted as relative trends under the model, not as full-chip power predictions.
- Format configuration details are incomplete for full reproduction in the provided excerpt.
  - Training hyperparameters are given for model sizes (layers, hidden size, heads, batch size, LR schedule, optimizer settings, seq length) (Table 9).
  - However, some items required for a fully specified training recipe are not present in the provided content, including:
    - tokenizer type/vocabulary,
    - exact context window used during training beyond sequence length `2048` (Table 9),
    - exact hardware (GPU type/count), total compute budget, and wall-clock training time.
  - The paper points to Sec. D for more reproduction details and code link, but only portions are shown here.
- Metric choice for inference focuses on distributional change rather than downstream task accuracy.
  - Direct-cast inference primarily uses KL divergence to BF16 outputs on WikiText2 (top-25 logits) (§5.2).
  - This is a valid “behavioral change” proxy (as motivated in §5.2), but it does not directly measure task accuracy across benchmarks for the quantized inference models in the provided main results.
7. Implications and Future Directions
- Implication for hardware design: “FP everywhere” is not optimal under fine-grained quantization.
  - The paper’s combined results argue that when block-wise scaling is fine (e.g., block 32 MX), `MXINT8` can be more accurate (Table 3; Table 12–15) and more hardware-efficient (Table 5) than `MXFP8`.
  - This directly challenges the trajectory described in the introduction where hardware shifts toward FP to handle outliers (§1; §7).
- Implication for algorithm–hardware co-design: optimize for crest factor, not just bit-width.
  - The paper elevates crest factor `κ` (Eq. (11)) as a practical “format-selection statistic,” with explicit crossover thresholds (Figure 3; §4.2).
  - This suggests a design loop:
    1) measure tensor crest factor distributions under your chosen granularity (Table 2; §5.1),
    2) compare to theoretical crossovers (Figure 3),
    3) decide FP vs INT and whether to add outlier mitigation (e.g., Hadamard rotation) to move κ into an INT-favorable region (Figure 4b; Table 3).
- Practical application guidance (from the paper’s evidence)
  - If you want nearly lossless 8-bit block-wise quantized training: prefer `MXINT8` with symmetric clipping, since it matches BF16 closely and slightly outperforms MXFP8 loss in the reported runs (Figure 5; Table 4; §3.2).
  - If you want 8-bit block-wise inference with MX formats: prefer `MXINT8` over `MXFP8` based on consistent wins across 12 models in their direct-cast setting (Table 3; Tables 12–15).
  - If you want 4-bit inference:
    - Without outlier suppression, FP4 variants (`MXFP4`, `NVFP4`) tend to win (Table 3).
    - With random Hadamard rotation, `NVINT4` can surpass `NVFP4` across all tested models (Table 3; Figure 4b), indicating that algorithmic outlier mitigation can flip the preferred format.
- Follow-up research directions suggested by the presented results
  - Extend “nearly lossless” training comparisons beyond 8-bit:
    - The paper’s framework predicts why INT4 often loses without additional κ reduction (Figure 3; Table 2), but training experiments for FP4/INT4 are not provided here.
  - Jointly optimize:
    - block size (granularity),
    - outlier mitigation (e.g., rotations),
    - and format choice (INT vs FP) using measured κ distributions and QSNR-based predictions (Eq. (10)–(14); Figure 3–4).
  - Improve hardware cost accounting by integrating omitted components (quantizer, memory/interconnect, sequential timing) to validate whether the combinational MAC-centric advantage in Table 5 persists in full accelerator designs (Sec. C).
- Repro/Integration Guidance (what the paper provides)
  - The paper provides:
    - a public code link (`https://github.com/ChenMnZ/INT_vs_FP`, shown in the abstract),
    - model IDs for inference evaluation (Table 8),
    - and detailed training hyperparameters for multiple Llama-3-style sizes (Table 9), including optimizer `AdamW (β1=0.9, β2=0.95)`, weight decay `0.1`, cosine LR schedule with warmup `500` steps, grad clip norm `1.0`, sequence length `2048`, and batch sizes (Table 9).
  - Missing from the provided content (and thus unclear here) are full system details like hardware used and compute budget, and some tokenizer/data processing specifics for training; those would matter for exact reproduction.