The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
arXiv: 2402.17764
🎯 Pitch
This paper introduces BitNet b1.58, a Transformer-based large language model in which every weight is constrained to the ternary set {−1, 0, +1}, or about 1.58 bits per weight. It matches full-precision (FP16) baselines in perplexity and zero-shot accuracy from 3B parameters while substantially reducing inference memory, latency, and estimated energy relative to 16-bit models. This makes it a candidate for scalable, sustainable, and edge-friendly deployment and motivates new hardware optimized for ultra-low-bit computation.
1. Executive Summary
This paper introduces BitNet b1.58, a Transformer-based large language model (LLM) in which every weight is ternary {−1, 0, +1} (“1.58-bit” weights) and activations are 8-bit. Starting at 3B parameters, it achieves language modeling quality comparable to full-precision FP16/BF16 LLaMA-style baselines while reducing inference latency, memory footprint, and estimated energy and increasing throughput. The result points to a new compute paradigm with almost no multiplications (Fig. 1; Tables 1–3; Fig. 2–3).
2. Context and Motivation
- Problem addressed
  - Inference for modern LLMs is expensive in memory, latency, and energy because matrix multiplications in FP16/BF16 dominate cost. Memory bandwidth for loading weights from DRAM is a major bottleneck (Sec. 1).
  - Post-training quantization (PTQ) to 4–8 bits reduces costs but is suboptimal in accuracy and still relies on floating-point multiplies during matmul (Sec. 1, with prior work [XLS+23, FAHA23, LTT+23, CCKS23, TCS+24]).
- Why it matters
  - Power is the limiting factor for performance in many chips; reducing arithmetic energy accelerates computation (Sec. 1, citing [Hor14]). Smaller weights also reduce DRAM bandwidth and capacity needs, improving both throughput and latency in deployed systems (Sec. 1).
- Prior approaches and gaps
  - BitNet (1-bit) architectures remove multiplications by constraining weights to ±1, but modeling capacity can suffer and there is no explicit mechanism to “ignore” features (weight=0) (Sec. 1–2).
  - PTQ methods compress after training; they retain floating-point multiplications and require calibration procedures; accuracy degradation can be dataset/model-size dependent.
- Positioning of this work
  - BitNet b1.58 is trained from scratch with ternary weights {−1, 0, +1} and 8-bit activations, preserving the no-multiplication paradigm of 1-bit transformers while adding a zero weight for feature filtering and better modeling capacity (Sec. 2). It adopts LLaMA-like architectural components for compatibility with common LLM tooling (Sec. 2).
3. Technical Approach
High-level mechanism: replace every dense nn.Linear layer in a Transformer with BitLinear, whose weights are constrained to ternary values {−1, 0, +1}. During inference, multiplying by −1/0/+1 reduces to sign flips, zeroing, and additions; matrix multiplication no longer needs floating-point multiplications (Fig. 1; Sec. 1–2).
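To make the "sign flips, zeroing, and additions" point concrete, here is a minimal pure-Python sketch of a matrix-vector product with ternary weights. The function name `ternary_matvec` and the toy inputs are illustrative assumptions, not code from the paper; a real kernel would operate on packed weights in parallel.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix by an integer vector using only add/subtract.

    weights: list of rows, each entry in {-1, 0, +1}
    x: list of integer activations, one per column
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # +1 weight: add the activation
            elif w == -1:
                acc -= xi   # -1 weight: subtract the activation
            # 0 weight: the input is skipped entirely
        out.append(acc)
    return out


weights = [[1, 0, -1], [-1, 1, 1]]
x = [5, -3, 2]
result = ternary_matvec(weights, x)

# Reference result computed with ordinary multiplications.
reference = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
assert result == reference
print(result)  # [3, -6]
```

The inner loop never multiplies: each weight only selects between adding, subtracting, or skipping an activation.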
- Weight quantization to 1.58 bits (“ternary”)
  - Core idea: scale the full-precision weight matrix by the mean absolute value and then round-and-clip to {−1, 0, +1}.
  - In plain language: compute a single scalar γ per weight matrix equal to the average magnitude of its entries; divide all weights by γ, round each to the nearest integer in [−1, 1], and clip to this range to avoid overflow.
  - Formalization (Sec. 2, Eqs. (1)–(3)):

    $$W_f = \mathrm{RoundClip}\left(\frac{W}{\gamma + \epsilon},\ -1,\ 1\right) \tag{1}$$

    $$\mathrm{RoundClip}(x, a, b) = \max\bigl(a,\ \min(b,\ \mathrm{round}(x))\bigr) \tag{2}$$

    $$\gamma = \frac{1}{nm} \sum_{ij} |W_{ij}| \tag{3}$$

  - Rationale for this design:
    - Single-scale “absmean” normalization stabilizes ternarization by matching each weight matrix’s dynamic range to the 3-level codebook. Adding 0 to the original ±1 codebook enables explicit “feature filtering,” i.e., the model can drop unhelpful signals via zero weights (Sec. 1–2).
- Activation quantization to 8 bits
  - Activations are quantized to 8 bits with symmetric per-token scaling to [−Q_b, Q_b], removing zero-point quantization (Sec. 2).
  - Why symmetric and per-token? It simplifies implementation and system-level optimization and, in experiments, has “negligible effects” on performance (Sec. 2). Symmetry also matches the ternary, zero-centered weight distribution.
- Computation model and kernels
  - Because weights are ternary, matrix multiplication reduces to integer additions and sign flips; multiplications are largely eliminated (Sec. 1–2; Fig. 1).
  - In GPU experiments, a 2-bit kernel from Ladder [WMC+23] is used to implement fast low-bit operations (Sec. 3), indicating the system-level feasibility even without specialized hardware.
- Architecture and training recipe
  - BitNet b1.58 adopts LLaMA-like modules: RMSNorm [ZS19], SwiGLU [Sha20], rotary positional embeddings [SAL+24], and removes all biases (Sec. 2).
  - It is trained from scratch on the same data and tokens as full-precision baselines to ensure matched comparisons (Sec. 3: 100B tokens on RedPajama [Com23]; and a separate 2T-token experiment against StableLM-3B [TBMR]).
- Implementation friendliness
  - The LLaMA-alike design means it can be integrated with HuggingFace, vLLM [KLZ+23], and llama.cpp with minimal changes (Sec. 2).
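The absmean weight quantization in Eqs. (1)–(3) above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions: the function names, the value of `eps`, and the toy matrix are hypothetical, and returning γ alongside the ternary matrix is an illustrative choice rather than something this summary specifies.

```python
def round_clip(x, a, b):
    """Eq. (2): round to the nearest integer, then clip to [a, b]."""
    return max(a, min(b, round(x)))


def absmean_ternarize(w, eps=1e-5):
    """Eqs. (1) and (3): scale by the mean absolute value, then round and clip.

    w: weight matrix as a list of rows of floats.
    eps: small constant guarding against division by zero (assumed value).
    Returns (ternary matrix, gamma).
    """
    n_entries = sum(len(row) for row in w)
    gamma = sum(abs(v) for row in w for v in row) / n_entries  # Eq. (3)
    w_f = [[round_clip(v / (gamma + eps), -1, 1) for v in row] for row in w]  # Eq. (1)
    return w_f, gamma


w = [[0.9, -0.05, -1.4], [0.2, 0.6, -0.3]]
w_f, gamma = absmean_ternarize(w)
print(w_f)  # [[1, 0, -1], [0, 1, -1]]
```

In this toy matrix γ is about 0.575, so entries well below that magnitude (−0.05, 0.2) become 0 while larger ones saturate at ±1, which is the "feature filtering" behavior described above.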
Analogy: think of each weight as a tiny switch with three positions: “pass the signal” (+1), “invert the signal” (−1), or “ignore it” (0). With only these switches, the network avoids expensive multiplication, and computing matrix-vector products becomes flipping signs and summing integers.
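The three switch positions are also where the "1.58-bit" name comes from: a value with three equally likely states carries log2(3) bits of information. The short sketch below checks that figure and, as a purely hypothetical storage scheme that does not come from the paper, shows that five ternary values fit in one byte because 3^5 = 243 ≤ 256. The helper names `pack_trits` and `unpack_trits` are assumptions for illustration.

```python
import math

# Information content of one ternary weight.
bits_per_weight = math.log2(3)
print(round(bits_per_weight, 2))  # 1.58


def pack_trits(trits):
    """Pack five values from {-1, 0, +1} into one integer in [0, 242]."""
    assert len(trits) == 5
    code = 0
    for t in trits:
        code = code * 3 + (t + 1)  # map {-1, 0, +1} to base-3 digits {0, 1, 2}
    return code


def unpack_trits(code):
    """Inverse of pack_trits."""
    trits = []
    for _ in range(5):
        code, digit = divmod(code, 3)
        trits.append(digit - 1)
    return trits[::-1]


packed = pack_trits([1, 0, -1, -1, 1])
assert 0 <= packed < 256                       # fits in one byte
assert unpack_trits(packed) == [1, 0, -1, -1, 1]
print(8 / 5)  # 1.6 bits per weight with this hypothetical packing
```

A plain 2-bit encoding per weight, as used by the GPU kernel mentioned above, is simpler to decode at the cost of slightly more storage than this base-3 packing.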
4. Key Insights and Innovations
- Ternary weights with explicit zero (“1.58-bit”) as a Pareto improvement
  - What’s new: extends 1-bit {−1, +1} weights to ternary {−1, 0, +1} using a simple absmean scaling and rounding (Sec. 2).
  - Why it matters: keeps the no-multiplication paradigm and low memory footprint while improving modeling capacity via explicit “feature filtering” (zeros). The paper reports accuracy parity with FP16 starting from 3B parameters (Tables 1–2), which is a qualitative step beyond prior 1-bit results.
- Symmetric per-token 8-bit activation quantization
  - What’s new: activations are scaled to [−Q_b, Q_b] per token, removing zero points (Sec. 2).
  - Why it matters: simplifies hardware/software (no asymmetric zero-point handling) with negligible measured accuracy impact, and halves activation precision vs FP16, which is important for KV cache memory in long-context inference (Sec. 4).
- End-to-end system-level efficiency with measured speedups
  - What’s new: demonstrates practical inference benefits on GPUs using an existing low-bit kernel (Sec. 3), not just theoretical energy models.
  - Why it matters: Latency, memory, batch size, and throughput improvements are large and grow with model size (Fig. 2; Table 3), making the approach immediately relevant for serving.
- Energy scaling perspective and “new scaling law” equivalences
  - What’s new: estimates 71.4× lower arithmetic energy for matmul on 7nm and shows end-to-end energy advantages that increase with model size (Fig. 3).
  - Why it matters: It reframes how to scale LLMs. Bigger models may be cheaper to deploy if weights are 1.58-bit, leading to rules of thumb across model sizes (e.g., the paper describes a 70B BitNet b1.58 as more efficient than a 13B FP16 LLM in latency, memory usage, and energy consumption; Sec. 3, “BitNet b1.58 is enabling a new scaling law…”).
Fundamental innovations: the ternary no-multiplication compute paradigm at LLM scale and the demonstration of accuracy parity at multi-billion parameter scale. Incremental but important: the specific quantization choices (absmean, symmetric per-token) and LLaMA-compatible design for ecosystem adoption.
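The symmetric per-token activation quantization listed among the innovations can be sketched as follows. This is a hedged illustration: the summary only states that activations are scaled per token to [−Q_b, Q_b] without a zero point, so the absmax-based scale, Q_b = 127, the `eps` guard, and the function name are assumptions made for this example.

```python
def quantize_activations_per_token(x, q_b=127, eps=1e-5):
    """Symmetric per-token quantization of activations to [-q_b, q_b].

    x: list of tokens, each a list of float activations.
    q_b: largest integer magnitude (127 for signed 8-bit; assumed here).
    Returns (quantized integer rows, per-token scales).
    """
    quantized, scales = [], []
    for token in x:
        # One scale per token, taken from that token's largest magnitude
        # (absmax scaling is an assumed choice for this sketch).
        absmax = max(abs(v) for v in token)
        scale = q_b / max(absmax, eps)
        # No zero point: 0.0 always maps to integer 0.
        row = [max(-q_b, min(q_b, round(v * scale))) for v in token]
        quantized.append(row)
        scales.append(scale)
    return quantized, scales


x = [[0.5, -1.0, 0.25], [4.0, 2.0, -8.0]]
x_q, scales = quantize_activations_per_token(x)
print(x_q)  # [[64, -127, 32], [64, 32, -127]]
```

Both tokens end up with the same integer codes even though their magnitudes differ by 8×, which shows why a per-token scale preserves resolution for small-magnitude tokens.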
5. Experimental Analysis
- Setup
  - Pretraining: Both BitNet b1.58 and reproduced LLaMA FP16 baselines are trained on RedPajama for 100B tokens to match data/tokens (Sec. 3).
  - Metrics: Validation perplexity on WikiText2 and C4; zero-shot accuracy on ARC-Easy/Challenge, HellaSwag, Winogrande, PIQA, OpenBookQA, BoolQ using lm-evaluation-harness (Sec. 3).
  - System measurements: GPU runtime memory and latency using FasterTransformer with Ladder’s 2-bit kernel; time per output token (Sec. 3). Throughput measured on two 80GB A100s with pipeline parallelism (Sec. 3; Table 3).
  - Energy: Estimated using component-wise arithmetic energy on 7nm (Horowitz model) and reported end-to-end energy trends (Fig. 3).
- Main quantitative results
  - Accuracy and perplexity parity starting at 3B:
    - Perplexity (Table 1): at 3B, BitNet b1.58 PPL = 9.91 vs LLaMA FP16 = 10.04; at 1.3B and 700M, small gaps remain.
    - Zero-shot accuracy (Table 2): at 3B, average 50.2 vs 49.7 (BitNet higher).
  - Latency and memory (Fig. 2; Table 1):
    - At 3B: latency 2.71× faster (1.87 ms vs 5.07 ms); memory 3.55× lower (2.22 GB vs 7.89 GB) while matching perplexity (Table 1).
    - Scaling trends (Fig. 2, left/right): speedup grows with size: 1.67× (1.3B), 2.90× (7B), 3.68× (13B), 4.10× (70B); memory reductions similarly grow: 2.93×, 4.40×, 5.12×, 7.16×.
  - Throughput and batch size (Table 3):
    - 70B models on 2×A100: max batch size 176 vs 16 (11.0×), throughput 2977 vs 333 tokens/s (8.9×).
  - Energy (Fig. 3):
    - Arithmetic energy composition: majority INT8 adds for BitNet b1.58; FP16 adds and multiplies for LLaMA. Estimated 71.4× arithmetic energy savings for matmul on 7nm.
    - End-to-end energy advantage increases with model size: 18.6× (1.3B) up to 41.2× (70B).
  - More data (Table 4, 2T tokens):
    - Against StableLM-3B (2T tokens), BitNet b1.58 3B scores higher on all five tasks; average 74.34 vs 73.22.
- Representative citations of results
  - Table 1: “BitNet b1.58 3B … Latency 1.87 ms (2.71×), Memory 2.22 GB (3.55×), PPL 9.91 vs LLaMA 3B PPL 10.04.”
  - Table 2: “BitNet b1.58 3B: Avg. 50.2 vs LLaMA 3B: 49.7.”
  - Figure 2: “BitNet b1.58 70B is 4.1× faster … memory 7.16× lower.”
  - Table 3: “70B: Max batch 176 (11.0×), Throughput 2977 tokens/s (8.9×).”
  - Figure 3: “71.4× arithmetic ops energy savings; end-to-end energy 18.6× to 41.2× across sizes.”
- Do the experiments support the claims?
  - For equal-size comparisons on matched data/tokens, the 3B and larger results convincingly show quality parity or superiority with strong system-level wins (Tables 1–2, Fig. 2–3).
  - Throughput and memory experiments use established codebases (FasterTransformer; Ladder kernel), lending credibility to the reported speedups on current GPUs (Sec. 3).
  - Energy is estimated (not measured) based on a standard model; trends are plausible but depend on hardware assumptions (Fig. 3).
- Ablations and robustness
  - The excerpt does not report ablations isolating the impact of: absmean vs alternative quantizers, per-token activation scaling, or the contribution of the zero state (0) to accuracy.
  - No instruction-following or chat fine-tuning evaluations are reported here; evaluations are zero-shot on common benchmarks plus perplexity.
- Conditions and trade-offs observed
  - At small sizes (700M–1.3B), BitNet b1.58 lags slightly in perplexity and average zero-shot accuracy (Tables 1–2), but the gap closes and reverses by 3B. System gains exist at all scales.
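As a quick consistency check, the improvement factors quoted in this section can be recomputed from the raw values that accompany them. The sketch below uses only numbers stated above (Table 1 at 3B, Table 3 at 70B); the dictionary layout and the 0.05 tolerance are choices made for this example.

```python
# Values quoted above from Table 1 (3B models) and Table 3 (70B models).
reported = {
    "latency_3b": {"baseline": 5.07, "bitnet": 1.87, "claimed": 2.71},     # ms
    "memory_3b": {"baseline": 7.89, "bitnet": 2.22, "claimed": 3.55},      # GB
    "batch_70b": {"baseline": 16, "bitnet": 176, "claimed": 11.0},         # sequences
    "throughput_70b": {"baseline": 333, "bitnet": 2977, "claimed": 8.9},   # tokens/s
}
lower_is_better = {"latency_3b", "memory_3b"}


def improvement(name, row):
    """Return the improvement factor, oriented so that larger means better."""
    if name in lower_is_better:
        return row["baseline"] / row["bitnet"]
    return row["bitnet"] / row["baseline"]


for name, row in reported.items():
    ratio = improvement(name, row)
    # Allow for rounding in the reported factors.
    assert abs(ratio - row["claimed"]) < 0.05, name
    print(f"{name}: {ratio:.2f}x (reported {row['claimed']}x)")
```

All four recomputed ratios agree with the quoted factors to within rounding.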
6. Limitations and Trade-offs
- Training from scratch instead of post-training quantization
  - The approach relies on full pretraining with ternary weights and 8-bit activations. It does not address converting an existing FP16 model to 1.58-bit with minimal retraining (Sec. 2–3).
- Embeddings remain full-precision
  - Memory and speed improvements increase with size partly because embeddings (kept in full precision) occupy a smaller fraction of total parameters in larger models (Sec. 3, Memory discussion). This slightly limits benefits at small scales.
- Hardware dependencies and energy estimates
  - Latency and memory gains are measured on GPUs, but the largest energy savings are extrapolated from a 7nm energy model (Fig. 3 left). Real-world energy depends on platform specifics and system components beyond arithmetic (Fig. 3 right suggests non-matmul costs are non-negligible at smaller scales).
- Limited task coverage and analysis
  - Accuracy comparisons focus on perplexity and a standard zero-shot suite. There is no report here on instruction tuning, safety/alignment evaluations, multilingual performance, robustness to distribution shift, or long-context reasoning beyond KV-cache memory implications.
- Small-model performance
  - Below 3B parameters, BitNet b1.58 shows small quality deficits vs FP16 (Tables 1–2). The ternary constraint may hurt expressivity when capacity is limited.
- Implementation specifics not deeply detailed
  - The excerpt does not specify optimizer, learning-rate schedules, or gradient quantization. Practical stability and convergence details for training such low-bit models at scale remain an area where guidance would help adoption.
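The point about full-precision embeddings can be illustrated with a rough weight-memory estimate. Everything numeric here is hypothetical: the vocabulary size, hidden sizes, block parameter counts, the untied output head, and the 2-bit storage per ternary weight are assumptions chosen for the example, not the paper's configurations. The estimate also covers weights only, so it will not reproduce the runtime memory figures in Table 1 or Fig. 2.

```python
def weight_memory_ratio(block_params, embed_params, low_bits=2, full_bits=16):
    """FP16 weight memory divided by mixed-precision weight memory.

    block_params: parameters in the Transformer blocks (stored at low_bits).
    embed_params: embedding parameters (kept at full_bits).
    """
    fp16_bits = (block_params + embed_params) * full_bits
    mixed_bits = block_params * low_bits + embed_params * full_bits
    return fp16_bits / mixed_bits


VOCAB = 32_000  # hypothetical vocabulary size

# Hypothetical (hidden size, block parameter count) pairs, not the paper's configs.
configs = {"small": (2_048, 1.2e9), "large": (8_192, 68e9)}

ratios = {}
for name, (hidden, block_params) in configs.items():
    embed_params = 2 * VOCAB * hidden  # input embedding plus an untied output head
    ratios[name] = weight_memory_ratio(block_params, embed_params)
    print(f"{name}: weights {ratios[name]:.2f}x smaller than FP16")

# The ratio approaches the 16/2 = 8x bound as embeddings become a smaller share.
assert ratios["small"] < ratios["large"] < 8
```

Under these assumptions the small configuration compresses by under 5× and the large one by over 7×, matching the qualitative trend described above.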
7. Implications and Future Directions
- How this changes the landscape
  - A credible path to “no-multiplication” LLM inference: by achieving parity with FP16 at 3B and above, 1.58-bit weights plus 8-bit activations shift the cost-quality frontier, enabling larger models to be served at the cost of much smaller FP16 models (Sec. 3, “new scaling law”).
  - It invites hardware specialization: integer-add-dominant matmuls with ternary weights motivate accelerators optimized for bit-serial or lookup-based arithmetic (Fig. 1; Sec. 4 “New Hardware for 1-bit LLMs”).
- Follow-up research enabled or suggested
  - Quantization methodology
    - Ablate and optimize quantizers (e.g., per-channel vs per-matrix scaling; alternatives to absmean; learned codebooks).
    - Explore activation precision further; the paper notes potential lossless compression of activations to 4 bits “or even lower” for KV caches (Sec. 4).
  - Training algorithms
    - Study optimizers and regularizers tailored for ternary weights; investigate gradient quantization and straight-through estimators at scale.
    - Investigate curriculum or distillation methods to transfer from FP16 checkpoints to 1.58-bit with minimal retraining.
  - Architectures
    - 1.58-bit Mixture-of-Experts (MoE) to cut FLOPs while also reducing memory and communication overhead (Sec. 4). With smaller parameters and activations, more experts may fit on a single device.
    - Long-sequence models: 8-bit (and potentially 4-bit) KV caches directly address the main practical barrier for long-context inference (Sec. 4).
  - Systems and hardware
    - Integrate low-bit matmuls into mainstream inference stacks (e.g., vLLM, Triton kernels) and CPUs for edge/mobile, where the paper argues 1.58-bit is “more friendly” (Sec. 4).
    - Co-design with new hardware (e.g., LPUs), building dataflows specialized for ternary weights and integer additions (Sec. 4).
- Practical applications
  - Cost-effective deployment of larger models for chatbots, code assistants, and enterprise search where latency and energy budgets are strict.
  - Edge and mobile on-device inference with significantly lower memory and power, enabling privacy-preserving or offline use cases (Sec. 4).
  - High-throughput batch serving: 8.9× throughput at 70B on 2×A100 (Table 3) suggests strong benefits for large-scale API providers.
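The long-context point above comes down to simple arithmetic on the KV cache. The sketch below uses the standard cache-size formula (one key and one value tensor per layer) with a hypothetical model configuration; the layer count, head count, head dimension, and sequence length are assumptions for illustration and are not taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bits):
    """Size of the key/value cache: one K and one V tensor per layer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return elements * bits // 8


# Hypothetical configuration, chosen only to give round numbers.
config = dict(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)

GIB = 1024 ** 3
for bits in (16, 8, 4):
    size = kv_cache_bytes(bits=bits, **config)
    print(f"{bits}-bit KV cache: {size / GIB:.1f} GiB")

# Halving the bit width halves the cache, so the same memory budget
# holds twice the context length.
assert kv_cache_bytes(bits=8, **config) * 2 == kv_cache_bytes(bits=16, **config)
```

For this configuration the cache is 2 GiB at 16 bits, 1 GiB at 8 bits, and 0.5 GiB at 4 bits per sequence.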
In summary, BitNet b1.58 demonstrates that LLMs with ternary weights and 8-bit activations can match full-precision accuracy at multi-billion parameter scales while delivering large, increasing system-level gains. The combination of a simple absmean ternarization, symmetric per-token activation quantization, and LLaMA-compatible design makes this a compelling direction for both immediate deployment and future hardware-software co-design.