LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
ArXiv: 2208.07339
🎯 Pitch
The paper introduces LLM.int8(), a post-training method combining vector-wise Int8 quantization with a mixed-precision decomposition that isolates rare high-magnitude hidden-feature outliers into 16-bit operations, enabling nearly all matrix multiplications in transformers to run in 8-bit without accuracy loss. By cutting weight memory roughly in half (e.g., ~1.96× for BLOOM-176B) and preserving full-precision performance up to 175B parameters, it makes very large LLMs feasible to load and serve on far more modest GPU hardware.
1. Executive Summary
LLM.int8() is a post-training inference method that converts the large parameter-heavy matrix multiplications in transformer feed-forward and attention projection layers from 16/32-bit weights to Int8 while preserving full-precision task performance at very large scales (up to 175B parameters). The key significance is that it roughly halves inference-time model weight memory (e.g., 1.96× footprint reduction for BLOOM-176B) without needing fine-tuning, making previously impractical model sizes feasible on more modest GPU setups (Abstract; Section 3; Appendix A).
2. Context and Motivation
- Problem / gap.
  - Large transformer language models are widely used but are extremely memory-hungry at inference time, largely due to the weight matrices in feed-forward and attention projection linear layers (Introduction).
  - Prior 8-bit quantization methods reduce memory but (i) usually degrade accuracy, (ii) often require post-training tuning, and (iii) were mainly studied at smaller scales (≤ 350M parameters) (Introduction).
- Why it matters.
  - Memory is often the binding constraint for serving or even loading large models. Cutting weight memory by ~2× can change feasibility, e.g., enabling models like OPT-175B/BLOOM-176B on “consumer GPU” multi-GPU servers (Abstract; Table 2 / Appendix A Table 3).
- Where prior approaches fall short (as characterized in this paper).
  - Standard tensor-wise absmax quantization uses a single scale per tensor; one extreme value (“outlier”) can dominate that scale and collapse precision for the majority of values (Section 3).
  - Even improved schemes like row-wise or the paper’s own vector-wise quantization eventually fail once systematic outlier hidden-state feature dimensions emerge broadly in large models (Figure 1; Section 4).
- How the paper positions itself.
  - The paper frames its main contribution as the first “multi-billion-scale Int8 quantization procedure for transformers that does not incur any performance degradation,” demonstrated up to 175B parameters (Introduction; Figure 1).
  - The core positioning is: understanding and explicitly handling emergent outlier features is necessary for scaling Int8 inference without degradation (Sections 3–4).
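As a rough check on the ~2× memory argument, weight memory can be estimated from parameter count times bytes per parameter. This is a back-of-the-envelope sketch, not the paper's accounting: it assumes every parameter is stored at the stated width, ignores activations and caches, and the helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

params = 176e9                          # BLOOM-176B parameter count, rounded
fp16_gb = weight_memory_gb(params, 2)   # 16-bit weights: 2 bytes each
int8_gb = weight_memory_gb(params, 1)   # 8-bit weights: 1 byte each

print(f"FP16: {fp16_gb:.0f} GB, Int8: {int8_gb:.0f} GB, ratio: {fp16_gb / int8_gb:.2f}x")
```

The idealized ratio is exactly 2×. The reported figure for BLOOM-176B is 1.96×, which would be consistent with a small share of values remaining in 16-bit, though this sketch does not model that.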
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is an inference-time quantized matrix multiplication procedure for transformer linear layers, designed to replace most FP16/FP32 multiplications with Int8 multiplications.
- It solves the “Int8 breaks at scale” problem by combining (i) finer-grained scaling (vector-wise quantization) with (ii) a targeted fallback to FP16 compute for a tiny subset of “outlier” feature dimensions (mixed-precision decomposition) (Section 3; Figure 2).
3.2 Big-picture architecture (diagram in words)
- Inputs: hidden-state activations X_f16 (FP16) and a layer weight matrix W_f16 (FP16/FP32 checkpoint loaded then converted) (Abstract; Figure 2).
- Component A (outlier detection and split): identify feature dimensions with magnitude above a threshold α and split X/W into outlier vs non-outlier sub-matrices (Section 3.2; Eq. (8); Figure 2).
- Component B (Int8 path, majority): quantize the non-outlier parts using vector-wise absmax scaling (row scales for X, column scales for W), perform Int8×Int8 → Int32 GEMM, then dequantize back to FP16 (Section 3.1; Eq. (7); Figure 2).
- Component C (FP16 path, tiny minority): multiply the outlier sub-matrices with FP16 GEMM (Section 3.2; Eq. (8); Figure 2).
- Output combine: accumulate Int8-path output + FP16-outlier output in FP16, producing the layer output (Figure 2).
3.3 Roadmap for the deep dive
- I will explain (1) why naïve Int8 scaling fails at large model sizes, because that motivates the entire design (Section 3; Section 4).
- Then (2) I’ll detail absmax and zeropoint quantization and how they plug into matmul (Section 2.1; Eq. (1)–(6)).
- Next (3) I’ll cover vector-wise quantization as the “precision within Int8” improvement (Section 3.1; Eq. (7)).
- Finally (4) I’ll explain mixed-precision decomposition, the core trick for handling emergent outlier feature dimensions, and how it composes with vector-wise quantization into LLM.int8() (Section 3.2; Eq. (8); Figure 2; Section 4).
3.4 Detailed, sentence-based technical breakdown
Framing (one sentence).
This is an empirical systems-and-algorithms paper that introduces an inference-time Int8 matrix multiplication scheme for transformers, with the core idea of doing almost all multiplies in Int8 while surgically keeping a tiny set of critical outlier feature dimensions in FP16 (Abstract; Section 3; Figure 2).
3.4.1 Why standard Int8 quantization fails at scale (the motivating mechanism)
- In common absmax quantization, an FP16 tensor is scaled into [-127, 127] using a single scale derived from the tensor’s maximum absolute value (Section 2.1). Concretely, for X_f16, the paper defines:
  X_i8 = round(127 * X_f16 / max_ij |X_f16_ij|) (Section 2.1)
- This “one scale per tensor” approach is brittle because if even one entry is extremely large, then the scale is dominated by that outlier, causing most typical-magnitude values to map to a narrow band of integers (often to 0 after rounding), destroying information (Section 3; Section 4.2 point (3)).
- The paper’s key observation is that large transformers develop systematic outlier feature dimensions in hidden states: a small number of feature dimensions (columns of hidden states) have values with magnitudes up to 20× larger than other dimensions, and these outliers become ubiquitous across layers and tokens at around the 6.7B scale (Introduction; Section 4; Figure 3; Figure 4; Table 4).
- These outliers are not harmless noise: the paper reports that zeroing these outlier feature dimensions reduces top-1 attention softmax probability mass by >20% and increases validation perplexity by 600–1000%, despite being only ~0.1% of features; removing the same number of random features has tiny impact (Introduction; Section 4.2).
This sets up two requirements:
1. Improve precision for “normal” values under Int8 quantization, and
2. Preserve high precision for the small set of outlier feature dimensions that dominate performance.
3.4.2 Quantization primitives used (absmax and zeropoint)
Absmax quantization (symmetric).
- The scale is s_x = 127 / ||X||_∞, and quantization is X_i8 = round(s_x * X_f16) (Section 2.1).
- Symmetry means it dedicates equal integer range to positive and negative values; if the true distribution is one-sided, half the representable range is wasted, increasing error (Section 2.1).
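A minimal NumPy sketch of tensor-wise absmax quantization, following the scale s_x = 127 / ||X||_∞ given above. The function names and the synthetic data are illustrative assumptions, not the paper's code. It also shows how one injected outlier inflates the round-trip error for every other value in the tensor.

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    """Symmetric tensor-wise absmax quantization to the Int8 range [-127, 127]."""
    scale = 127.0 / np.max(np.abs(x))           # s_x = 127 / ||X||_inf
    x_i8 = np.round(scale * x).astype(np.int8)
    return x_i8, scale

def absmax_dequantize(x_i8: np.ndarray, scale: float) -> np.ndarray:
    """Map Int8 codes back to floating point."""
    return x_i8.astype(np.float32) / scale

rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.2, size=(4, 8)).astype(np.float32)

q, s = absmax_quantize(x)
err_clean = np.abs(absmax_dequantize(q, s) - x).max()

x_out = x.copy()
x_out[0, 0] = 12.0                               # inject a single outlier
q_out, s_out = absmax_quantize(x_out)
err_outlier = np.abs(absmax_dequantize(q_out, s_out) - x_out).max()

print(f"max abs error without outlier: {err_clean:.5f}")
print(f"max abs error with outlier:    {err_outlier:.5f}")
```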
Zeropoint quantization (asymmetric affine).
- The paper defines a normalized dynamic range nd_x and a zeropoint shift zp_x (Eq. (1)–(3) in Section 2.1), enabling the full [-127, 127] integer range to cover an asymmetric distribution.
- For computation, the paper notes a practical complication: some hardware provides a special instruction to handle zeropoints efficiently, but on GPUs/TPUs you may need to “unroll” the computation into extra terms (Eq. (4)–(5)), which adds overhead (Section 2.1).
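The benefit of an asymmetric scheme on one-sided data can be sketched with a generic affine quantizer. This is an assumption-laden illustration in the spirit of Eq. (1)–(3), not a transcription of them: the exact form of nd_x and zp_x in the paper may differ, and the function names are hypothetical.

```python
import numpy as np

def zeropoint_quantize(x: np.ndarray):
    """Asymmetric affine quantization mapping [min(x), max(x)] onto [-127, 127]."""
    lo, hi = float(x.min()), float(x.max())
    nd = 254.0 / (hi - lo)               # normalized dynamic range (assumes hi > lo)
    zp = np.round(-nd * lo) - 127.0      # shift so that min(x) lands on -127
    q = np.clip(np.round(nd * x + zp), -127, 127).astype(np.int8)
    return q, nd, zp

def zeropoint_dequantize(q: np.ndarray, nd: float, zp: float) -> np.ndarray:
    """Invert the affine map."""
    return (q.astype(np.float64) - zp) / nd

def absmax_roundtrip(x: np.ndarray) -> np.ndarray:
    """Quantize and dequantize with symmetric absmax scaling, for comparison."""
    s = 127.0 / np.max(np.abs(x))
    return np.round(s * x) / s

x = np.linspace(0.0, 4.0, 1000)          # one-sided values, e.g. post-ReLU activations
q, nd, zp = zeropoint_quantize(x)
zp_err = np.abs(zeropoint_dequantize(q, nd, zp) - x).max()
abs_err = np.abs(absmax_roundtrip(x) - x).max()

print(f"zeropoint max error: {zp_err:.5f}, absmax max error: {abs_err:.5f}")
```

On this one-sided input the affine version uses all 255 integer levels, while absmax leaves the negative half unused, so its step size is about twice as large.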
How quantized matmul is used in this paper’s setting.
- For hidden states X_f16 ∈ R^{s×h} and weights W_f16 ∈ R^{h×o}, they approximate:
X_f16 W_f16 ≈ S_f16 · (Q(X_f16) Q(W_f16))
where the quantized product accumulates into Int32 and is then rescaled/dequantized back to FP16 (Eq. (6) in Section 2.1).
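The approximation above can be sketched in NumPy with tensor-wise absmax scales. This is an illustrative sketch: NumPy has no Int8 GEMM, so the Int8 codes are widened to Int32 before the product to mimic Int32 accumulation, and the shapes and names are assumptions.

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    """Tensor-wise absmax quantization; returns Int8 codes and the scale."""
    s = 127.0 / np.max(np.abs(x))
    return np.round(s * x).astype(np.int8), s

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)).astype(np.float32)    # hidden states, s x h
W = rng.normal(size=(64, 16)).astype(np.float32)   # weights, h x o

Xq, sx = absmax_quantize(X)
Wq, sw = absmax_quantize(W)

# Accumulate in Int32: sums of Int8 products do not fit in 8 bits.
C_i32 = Xq.astype(np.int32) @ Wq.astype(np.int32)

# Dequantize by the product of the two scales, then store as FP16.
C_f16 = (C_i32.astype(np.float32) / (sx * sw)).astype(np.float16)

ref = X @ W
rel_err = np.abs(C_f16.astype(np.float32) - ref).max() / np.abs(ref).max()
print(f"max error relative to largest reference value: {rel_err:.4f}")
```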
3.4.3 Vector-wise quantization (the “precision for most values” part)
Core idea.
- Instead of one scale for the entire tensor, treat a matrix multiplication as many independent inner products between rows of X and columns of W, and give each inner product its own effective scaling (Section 3.1).
Mechanism in the paper’s notation.
- Assign:
- a separate scale c_x per row of X_f16 (so c_x ∈ R^{s}), and
- a separate scale c_w per column of W_f16 (so c_w ∈ R^{o}).
- Then dequantization is done by dividing each output element by the corresponding product of scales. The paper writes the matrix form as de-normalization by the outer product:
  C_f16 ≈ (1 / (c_x ⊗ c_w)) · C_i32 (Section 3.1; Eq. (7))
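The mechanism can be sketched in NumPy as follows. Here c_x and c_w are taken to be the multipliers 127 / absmax per row and per column, so that dequantization divides by their outer product; the function names and the synthetic test data are assumptions for illustration. A tensor-wise version is included to show the effect of a single large-magnitude row on the other rows.

```python
import numpy as np

def vectorwise_matmul(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Int8 matmul with one scale per row of X and one per column of W."""
    c_x = 127.0 / np.max(np.abs(X), axis=1, keepdims=True)   # shape (s, 1)
    c_w = 127.0 / np.max(np.abs(W), axis=0, keepdims=True)   # shape (1, o)
    Xq = np.round(c_x * X).astype(np.int8)
    Wq = np.round(c_w * W).astype(np.int8)
    C_i32 = Xq.astype(np.int32) @ Wq.astype(np.int32)
    return C_i32 / (c_x @ c_w)            # (s, 1) @ (1, o) is the outer product

def tensorwise_matmul(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Int8 matmul with a single scale per tensor, for comparison."""
    sx = 127.0 / np.max(np.abs(X))
    sw = 127.0 / np.max(np.abs(W))
    C_i32 = np.round(sx * X).astype(np.int32) @ np.round(sw * W).astype(np.int32)
    return C_i32 / (sx * sw)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))
X[0] *= 50.0                               # one row with much larger magnitude
W = rng.normal(size=(64, 16))
ref = X @ W

# Compare the error on the rows that do NOT contain the large values.
err_tensor = np.abs(tensorwise_matmul(X, W) - ref)[1:].mean()
err_vector = np.abs(vectorwise_matmul(X, W) - ref)[1:].mean()
print(f"tensor-wise: {err_tensor:.4f}, vector-wise: {err_vector:.4f}")
```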
Why this helps (and what it does not fix).
- It confines scale sensitivity: if one row of X has an outlier, it mainly harms that row’s scaling rather than the entire tensor; similarly for a column of W (Section 3.1).
- However, the paper argues this still fails once outliers are feature-wise (column-wise in hidden states) and systematic across many rows/tokens, because row scaling cannot isolate a column-specific phenomenon (Section 3.2; Section 4.3).
A worked micro-example (illustrative, matching the paper’s mechanism).
Assume one row vector x has typical values around 0.2, but one feature dimension spikes to 12.0. With absmax-style scaling, the row’s absolute maximum is 12.0, so every entry is multiplied by 127/12 ≈ 10.58 and the 0.2 entries become about 2 after rounding (0.2 × 10.58 ≈ 2.1). That is very low resolution: neighbouring representable values are roughly 0.09 apart. If this outlier were rare and isolated, vector-wise quantization would keep it from affecting other rows/columns. When the outlier is a systematic feature dimension across many rows, however, many rows inherit similarly bad scaling, which motivates the paper’s second component (Sections 3.1–3.2; Section 4.3).
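The arithmetic of this micro-example can be reproduced in a few lines of plain Python. The row length of eight is an arbitrary choice for illustration.

```python
row = [0.2] * 7 + [12.0]             # typical values plus one outlier feature

absmax = max(abs(v) for v in row)    # 12.0
scale = 127 / absmax                 # about 10.58
quantized = [round(v * scale) for v in row]
recovered = [q / scale for q in quantized]

print(quantized)                     # [2, 2, 2, 2, 2, 2, 2, 127]
print(f"0.2 is recovered as {recovered[0]:.4f}")

# Same typical values without the outlier: the full Int8 range is available.
clean = [0.2] * 7
clean_scale = 127 / max(abs(v) for v in clean)
clean_quantized = [round(v * clean_scale) for v in clean]
print(clean_quantized)               # [127, 127, 127, 127, 127, 127, 127]
```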
3.4.4 Mixed-precision decomposition (the “handle outlier feature dimensions” part)
Core idea.
- Identify a small set of hidden-state feature dimensions that contain outliers above a threshold α, and compute those dimensions’ contributions in FP16, while computing everything else in Int8 (Section 3.2; Figure 2).
Outlier set definition.
- The paper defines the outlier index set:
  O = { i | 0 ≤ i ≤ h and feature i has at least one outlier magnitude > α } (Section 3.2)
- The paper reports that α = 6.0 is sufficient to reduce degradation “close to zero” (Section 3.2), and later uses magnitude ≥ 6 as part of the outlier-feature analysis criteria (Section 4.1).
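Building the set O amounts to a column-wise threshold test on the hidden states. A minimal NumPy sketch, with a hypothetical function name and synthetic data chosen so that exactly two columns qualify:

```python
import numpy as np

def find_outlier_features(X: np.ndarray, alpha: float = 6.0) -> np.ndarray:
    """Return indices of feature dimensions (columns of X) with any |value| > alpha."""
    return np.flatnonzero((np.abs(X) > alpha).any(axis=0))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.5, size=(16, 512))   # synthetic hidden states, s x h
X[:, 42] -= 40.0                            # hypothetical systematic outlier feature
X[3, 300] = 9.0                             # a single large value also flags its column

O = find_outlier_features(X)
print(O.tolist())                           # columns 42 and 300
```

Note that a single value above α anywhere in a column is enough to route that entire feature dimension to the FP16 path.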
Decomposed matmul equation.
- Using Einstein notation, the paper defines the mixed-precision decomposition as:
  C_f16 ≈ Σ_{h ∈ O} X^h_f16 W^h_f16 + S_f16 · Σ_{h ∉ O} X^h_i8 W^h_i8 (Eq. (8) in Section 3.2)
- Interpretation in plain language:
  - First compute the contribution of outlier feature dimensions exactly (FP16 matmul restricted to those dimensions).
  - Then compute the contribution of all remaining dimensions using vector-wise quantized Int8 matmul, dequantize, and add it.
Why this is computationally and memory efficient (as argued in the paper).
- The paper reports that outlier feature dimensions are extremely few: for transformers up to 13B, |O| ≤ 7 (Section 3.2), and outliers are ~0.1% of feature dimensions (Introduction; Section 3.2).
- Because only a tiny slice is FP16, “more than 99.9% of values are multiplied in 8-bit,” while still protecting the critical dimensions (Abstract; Section 3.2).
3.4.5 End-to-end “pipeline diagram in words” (explicit sequence)
For a transformer linear layer (FFN or attention projection) at inference time, LLM.int8() proceeds as follows (Figure 2; Section 3):
1. Start with FP16 inputs and weights: take hidden states X_f16 and weights W_f16 from a loaded checkpoint (Abstract; Figure 2).
2. Detect outlier feature dimensions: find feature indices whose magnitude exceeds threshold α (paper uses α = 6.0) to form set O (Section 3.2; Section 4.1).
3. Split matrices by feature dimension: separate X and W into (a) outlier-feature sub-matrices (columns/rows indexed by O) and (b) the remaining “regular” part (Figure 2; Eq. (8)).
4. Compute the outlier contribution in FP16: perform FP16 matmul on the outlier slices to obtain Out_outlier_f16 (Figure 2; Section 3.2).
5. Compute scales for vector-wise quantization: compute per-row scales c_x (for rows of X) and per-column scales c_w (for columns of W) for the regular parts (Figure 2; Section 3.1; Eq. (7)).
6. Quantize regular parts to Int8: scale each row/column into [-127, 127] and round to Int8 (Figure 2; Section 2.1 + Section 3.1).
7. Int8 matmul and Int32 accumulation: run Int8×Int8 → Int32 GEMM to produce Out_i32 (Figure 2; Section 2.1).
8. Dequantize Int32 output back to FP16: rescale using the outer product c_x ⊗ c_w (Figure 2; Section 3.1; Eq. (7)).
9. Accumulate outputs: add Out_regular_f16 + Out_outlier_f16 to get the final FP16 output for the layer (Figure 2).
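The nine steps above can be sketched end to end in NumPy. This is a simplified illustration under stated assumptions, not the paper's implementation: the weights are quantized on the fly rather than stored in Int8, Int8 GEMM is emulated by widening to Int32, at least one non-outlier feature is assumed to exist, and all names and the synthetic data are hypothetical.

```python
import numpy as np

def vectorwise_int8_matmul(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Int8 path (steps 5-8): row scales for X, column scales for W, Int32 accumulation."""
    c_x = 127.0 / np.maximum(np.abs(X).max(axis=1, keepdims=True), 1e-8)
    c_w = 127.0 / np.maximum(np.abs(W).max(axis=0, keepdims=True), 1e-8)
    Xq = np.round(c_x * X).astype(np.int8)
    Wq = np.round(c_w * W).astype(np.int8)
    C_i32 = Xq.astype(np.int32) @ Wq.astype(np.int32)
    return C_i32 / (c_x @ c_w)               # dequantize with the outer product of scales

def llm_int8_matmul(X: np.ndarray, W: np.ndarray, alpha: float = 6.0) -> np.ndarray:
    """Mixed-precision decomposition: outlier features in FP16, the rest in Int8."""
    outlier = (np.abs(X) > alpha).any(axis=0)              # steps 2-3: detect and split
    X_o = X[:, outlier].astype(np.float16)
    W_o = W[outlier, :].astype(np.float16)
    out_outlier = (X_o @ W_o).astype(np.float32)            # step 4: FP16 path
    out_regular = vectorwise_int8_matmul(X[:, ~outlier], W[~outlier, :])
    return (out_regular + out_outlier).astype(np.float16)   # step 9: accumulate

rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.5, size=(32, 256)).astype(np.float32)
X[:, [7, 100]] += 40.0                       # two synthetic systematic outlier features
W = rng.normal(0.0, 0.05, size=(256, 64)).astype(np.float32)

ref = X @ W
out = llm_int8_matmul(X, W)
err_plain = np.abs(vectorwise_int8_matmul(X, W) - ref).mean()
err_mixed = np.abs(out.astype(np.float32) - ref).mean()
print(f"vector-wise only: {err_plain:.4f}, with decomposition: {err_mixed:.4f}")
```

On this synthetic input, routing the two outlier columns through FP16 lets the remaining columns use row scales set by typical values, which is where the error reduction comes from.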
3.4.6 Core configurations and hyperparameters (what is and is not specified)
- Inference precision targets:
  - Inputs/outputs are FP16; weights are stored in Int8 for most dimensions and FP16 for outlier dimensions during compute (Figure 2; Abstract; Section 3).
- Outlier threshold:
  - α = 6.0 (Section 3.2; also used in outlier analysis criteria in Section 4.1).
- Fraction of FP16 vs Int8 compute:
  - The paper states “more than 99.9% of values are multiplied in 8-bit” (Abstract; Section 3.2).
- Training-time hyperparameters (optimizer, LR schedule, batch size, total training tokens, compute budget, etc.):
  - Not provided in the included content for the pretrained models; the paper points to external training details for the fairseq models (Section 3.3 refers to Artetxe et al. (2021)) but does not list optimizer/schedule/batch/context/tokenizer details in the excerpted sections.
- Hardware for evaluation:
  - C4 perplexity evaluation uses NVIDIA A40 GPUs (Section 3.3).
  - End-to-end BLOOM-176B benchmarks use A100 80GB setups (Appendix D.2; Table 6).
4. Key Insights and Innovations
- Empirical identification of “systematic outlier feature dimensions” as the core blocker for Int8 at scale.
  - Novelty: The paper doesn’t just say “quantization error grows”; it isolates a concrete, structured phenomenon: a few feature dimensions with extreme magnitudes that become ubiquitous across layers/tokens as scale increases (Section 4; Figure 3; Figure 4; Table 4).
  - Significance: It explains why methods that improve scaling granularity (row-wise/vector-wise) still fail when outliers are column/feature-aligned (Section 4.3).
- Vector-wise quantization for matmul: scaling per inner product via row/column scales and outer-product dequantization.
  - Novelty relative to “one scale per tensor”: it increases scale granularity specifically aligned to matrix multiplication structure (Section 3.1; Eq. (7)).
  - Significance: The paper reports it preserves performance up to 2.7B in their scaling study (Introduction; Figure 1 discussion), and it is part of the final LLM.int8() recipe.
- Mixed-precision decomposition: a targeted FP16 fallback for only the outlier feature dimensions.
  - Novelty: Rather than a uniform mixed-precision or block scheme, it decomposes along feature dimensions determined by outlier detection and keeps those in FP16 (Section 3.2; Eq. (8); Figure 2).
  - Significance: It enables “no performance degradation” Int8 inference up to 175B parameters, while still keeping >99.9% of multiplies in Int8 (Abstract; Figure 1).
- Demonstration of “load FP16/FP32 checkpoint → convert → run immediately” without fine-tuning.
  - Novelty: Emphasis on post-training conversion with no further calibration/finetuning step, positioned as a practical usability win (Abstract; Introduction).
  - Significance: Makes large open models materially more accessible (Table 2 / Appendix A Table 3).
5. Experimental Analysis
5.1 Evaluation methodology (datasets, metrics, baselines, setup)
Perplexity scaling benchmark (language modeling).
- Metric: C4 validation perplexity (lower is better) (Section 3.3; Table 1).
- Models: dense autoregressive transformers trained in fairseq, sizes 125M to 13B (Section 3.3).
- Data: pretrained on Books, English Wikipedia, CC-News, OpenWebText, CC-Stories, English CC100; evaluated on C4 validation (Section 3.3).
- Hardware: NVIDIA A40 GPUs (Section 3.3).
- Baselines compared (Table 1):
- Int8 absmax (tensor-wise)
- Int8 zeropoint
- Int8 absmax row-wise
- Int8 absmax vector-wise
- Int8 zeropoint vector-wise
- plus decomposition variants and the full LLM.int8() variants.
Zero-shot accuracy scaling benchmark.
- Metric: mean zero-shot accuracy on tasks WinoGrande, HellaSwag, PIQA, LAMBADA (Figure 1 caption).
- Models: OPT family up to 175B parameters (Section 3.3; Figure 1).
- Harness: EleutherAI evaluation harness (Section 3.3).
Speed benchmarks (ancillary).
- Microbenchmarks: matrix multiplication speedups on GPT-3-sized dimensions (Appendix D.1; Table 5).
- End-to-end: BLOOM-176B generation latency per token across GPU counts/batch sizes (Appendix D.2; Table 6).
5.2 Main quantitative results (with numbers)
Perplexity: LLM.int8() matches full precision across scale (125M→13B).
Table 1 is the central numeric evidence. Key points:
- At 13B, full precision perplexity is 12.45 (Table 1).
- Absmax LLM.int8() (vector-wise + decomp) achieves 12.45 at 13B, matching the 32-bit Float row at that scale (Table 1).
- Many non-decomposition Int8 methods degrade badly at 13B:
  - Int8 absmax gives 19.08 perplexity at 13B (vs 12.45 full precision) (Table 1).
  - Int8 absmax vector-wise gives 16.48 at 13B (Table 1).
  - Even zeropoint vector-wise (which is best among non-decomposition in Table 1) is still worse than full precision at 13B: 13.47 vs 12.45 (Table 1).
- With decomposition included, both absmax and zeropoint versions of LLM.int8() match full precision at all reported scales in Table 1:
  - Example at 6.7B: full precision 13.30; Absmax LLM.int8() 13.24; Zeropoint LLM.int8() 13.24 (Table 1).
- Note: Table 1’s “32-bit Float” is the reference provided; the method operates with FP16 inputs/outputs (Section 2.1; Figure 2).
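To put the 13B numbers on a common footing, the perplexity increase over the full-precision reference can be computed directly. This is simple arithmetic on the Table 1 values quoted above; the dictionary keys are just labels.

```python
baseline = 12.45  # 32-bit Float perplexity at 13B (Table 1)
methods = {
    "Int8 absmax": 19.08,
    "Int8 absmax vector-wise": 16.48,
    "Int8 zeropoint vector-wise": 13.47,
    "Absmax LLM.int8()": 12.45,
}

# Relative perplexity increase in percent, versus the full-precision reference.
increase_pct = {name: 100 * (ppl - baseline) / baseline for name, ppl in methods.items()}

for name, pct in increase_pct.items():
    print(f"{name}: +{pct:.1f}%")
```

This works out to roughly +53% for tensor-wise absmax, +32% for absmax vector-wise, and +8% for zeropoint vector-wise, against no increase for LLM.int8().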
Zero-shot accuracy: LLM.int8() tracks the 16-bit baseline up to 175B, while baseline Int8 fails after outlier emergence.
- Figure 1 shows mean zero-shot accuracy vs parameter size for:
- 16-bit baseline
- an 8-bit baseline (caption indicates “most precise previous 8-bit quantization method as a baseline”)
- LLM.int8()
- The key qualitative result is the “emergence of outlier features” around 6.7B parameters; after that, regular 8-bit methods “fail,” while LLM.int8() maintains 16-bit accuracy (Figure 1 caption; Introduction).
Outlier analysis: small number of feature dimensions dominates and becomes ubiquitous.
- Figure 3 reports that after a phase shift (by parameter count, around 6.7B), outliers are present in 100% of layers and about 75% of sequence dimensions/tokens (Figure 3a; also summarized in Section 4.2 point (1)).
- Table 4 provides concrete counts and magnitudes:
- For 6.7B, outlier feature count is 6, “1-sided” count 6, layers 100%, sequence dimensions 75%, with quartiles like (-44, -40, -35) for the most common outlier (Table 4).
- For 13B, outlier feature count is 7, layers 100%, sequence dims 73%, quartiles (-63, -58, -45) (Table 4).
Speed: Int8 matmul speedups depend strongly on model dimension and overhead.
- Table 5 shows raw Int8 matmul can be faster than FP16 at large sizes (e.g., “Int8 without overhead” 2.29× at the “175B” shape), but when including quantize/dequantize and decomposition, small models slow down:
- LLM.int8() speedup at “13B” shape: 1.22×; at “175B” shape: 1.81× (Table 5).
- At small shapes, LLM.int8() is much slower (e.g., 0.14× at “Small”) (Table 5).
- End-to-end BLOOM-176B, Table 6 shows similar per-token ms to baseline, and the method allows using fewer GPUs:
- bfloat16 baseline on 8×A100 80GB: batch 32 is 9.94 ms/token.
- LLM.int8() on 3×A100 80GB: batch 32 is 9.11 ms/token (Table 6).
5.3 Do the experiments support the claims?
- Claim: no-degradation Int8 inference up to 175B.
  - Supported qualitatively by Figure 1 for OPT scaling to 175B and quantitatively by Table 1 up to 13B perplexity (Figure 1; Table 1).
  - The excerpted content does not provide a Table-1-like perplexity table up to 175B; the 175B evidence here is primarily the zero-shot curve plus memory/systems tables (Figure 1; Table 2 / Appendix A Table 3).
- Claim: outliers are the reason quantization breaks at scale.
  - Supported by the descriptive statistics and phase-shift plots (Figure 3; Figure 4; Table 4) and the reported strong functional impact of removing outliers vs random dimensions (Section 4.2).
- Ablations / comparisons present.
  - Table 1 effectively functions as an ablation over quantization type (absmax vs zeropoint), granularity (tensor/row/vector), and decomposition on/off.
- Robustness checks.
  - The paper checks outlier phenomena across models trained in different software frameworks (OpenAI GPT-2, fairseq, Tensorflow-mesh GPT-J) and across inference frameworks (fairseq vs Hugging Face) (Section 4.1; Table 4 note).
6. Limitations and Trade-offs
- Int8-only focus (data type scope).
  - The paper explicitly limits its study to the Int8 datatype and does not explore FP8 quantization, noting lack of current GPU/TPU support as rationale (Section 6).
- Scale ceiling of the study.
  - The method is demonstrated up to 175B parameters, and the paper notes that additional emergent properties might appear at larger scales that could break the approach (Section 6).
- Attention function not quantized.
  - The paper does not apply Int8 multiplication to the attention softmax/value aggregation “attention function” because it is parameter-free and the work’s primary goal is weight memory reduction; it notes that quantizing the attention function likely requires additional methods (Section 6).
- Training/finetuning not solved (and initial evidence suggests difficulty).
  - The core method is for inference; Appendix E/F show that doing training or fine-tuning with more components in Int8 (notably attention projections or attention itself) can degrade or become unstable, and mixed-precision decomposition only partially helps (Appendix E Table 7–8; Appendix F Table 10).
- Runtime trade-offs / overhead.
  - Quantization and decomposition introduce overhead; Table 5 shows slowdowns for small model dimensions, with speedups appearing mainly for large GEMMs (Appendix D.1).
- Thresholding assumptions.
  - The mixed-precision decomposition depends on a threshold (α = 6.0) and on the empirical fact that outliers are few and concentrated in feature dimensions (Section 3.2; Section 4). If a model’s outlier structure differs (more dimensions, less sparsity), overhead and memory could increase, and performance might degrade (this risk is implied by the method’s reliance on |O| being small, Section 3.2).
7. Implications and Future Directions
- Field-level implication: scaling-aware quantization needs model-behavior analysis, not just numeric tricks.
  - The paper suggests that quantization failure at scale is driven by emergent representational structure (systematic outlier feature dimensions), so effective low-precision inference must incorporate mechanisms that reflect that structure (Sections 3–4).
- Practical applications / downstream use cases.
  - The direct use case is serving or running very large pretrained LLMs under tight GPU memory constraints by converting linear layers to Int8 without losing accuracy (Abstract; Table 2 / Appendix A Table 3).
  - This is especially relevant when the barrier is model weight memory, since the method targets parameter-heavy layers (Introduction).
- Repro/Integration Guidance (from what the paper includes).
  - When to prefer LLM.int8() over simpler quantization:
    - If model scale is large enough that systematic outliers emerge (paper highlights issues beginning around 6.7B parameters), LLM.int8() is designed to preserve accuracy where standard absmax/row-wise/vector-wise fail (Figure 1; Table 1; Section 4.3).
    - For smaller models, quantization overhead may not be worth it, and Table 5 indicates slowdowns at small dimensions (Appendix D.1).
  - The paper provides an open-source implementation and a Hugging Face Transformers integration for models “that have linear layers” (Abstract; Introduction), which implies practical adoptability without custom training.
- Future research directions explicitly suggested or strongly implied by the paper.
  - Extend analysis and methods to FP8 when hardware support exists (Section 6).
  - Develop additional quantization techniques for the attention function (Section 6).
  - Address the harder problem of Int8 (or generally low-bit) training/finetuning at scale, where Appendix E/F show current approaches can degrade or become unstable (Section 6; Appendix E/F).
  - Improve engineering efficiency: Appendix D notes that optimized CUDA kernels for decomposition could reduce overhead and improve speedups (Appendix D.1).