The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits¶

🎯 Pitch¶

This paper introduces BitNet b1.58, a Transformer-based large language model where every weight is constrained to the ternary set {−1, 0, +1}, achieving state-of-the-art language modeling performance at just 1.58 bits per weight. By enabling highly efficient inference—dramatically reducing memory, latency, and energy costs compared to traditional 16-bit models—while matching full-precision accuracy at scale, BitNet b1.58 paves the way for scalable, sustainable, and edge-friendly AI deployments and inspires new hardware optimized for ultra-low-bit computation.

1. Executive Summary (2-3 sentences)¶

This paper introduces BitNet b1.58, a Transformer-based large language model (LLM) where every weight is ternary {−1, 0, +1} (“1.58-bit” weights) and activations are 8-bit. It achieves comparable language modeling quality to full-precision FP16/BF16 LLaMA-style baselines starting at 3B parameters while dramatically reducing inference latency, memory footprint, throughput cost, and estimated energy—enabling a new compute paradigm with almost no multiplications (Fig. 1; Tables 1–3; Fig. 2–3).

2. Context and Motivation¶

Problem addressed
Inference for modern LLMs is expensive in memory, latency, and energy because matrix multiplications in FP16/BF16 dominate cost. Memory bandwidth for loading weights from DRAM is a major bottleneck (Sec. 1).
Post-training quantization (PTQ) to 4–8 bits reduces costs but is suboptimal in accuracy and still relies on floating-point multiplies during matmul (Sec. 1, with prior work [XLS+23, FAHA23, LTT+23, CCKS23, TCS+24]).
Why it matters
Power is the limiting factor for performance in many chips; reducing arithmetic energy accelerates computation (Sec. 1, citing [Hor14]). Smaller weights also reduce DRAM bandwidth and capacity needs, improving both throughput and latency in deployed systems (Sec. 1).
Prior approaches and gaps
BitNet (1-bit) architectures remove multiplications by constraining weights to ±1, but modeling capacity can suffer and there is no explicit mechanism to “ignore” features (weight=0) (Sec. 1–2).
PTQ methods compress after training; they retain floating-point multiplications and require calibration procedures; accuracy degradation can be dataset/model-size dependent.
Positioning of this work
BitNet b1.58 is trained from scratch with ternary weights {−1, 0, +1} and 8-bit activations, preserving the no-multiplication paradigm of 1-bit transformers while adding a zero weight for feature filtering and better modeling capacity (Sec. 2). It adopts LLaMA-like architectural components for compatibility with common LLM tooling (Sec. 2).

3. Technical Approach¶

High-level mechanism: replace every dense nn.Linear layer in a Transformer with BitLinear, whose weights are constrained to ternary values {−1, 0, +1}. During inference, multiplying by −1/0/+1 reduces to sign flips, zeroing, and additions; matrix multiplication no longer needs floating-point multiplications (Fig. 1; Sec. 1–2).

Weight quantization to 1.58 bits (“ternary”)
Core idea: scale the full-precision weight matrix by the mean absolute value and then round-and-clip to {−1, 0, +1}.
In plain language: compute a single scalar γ per weight matrix equal to the average magnitude of its entries; divide all weights by γ, round each to the nearest integer in [−1, 1], and clip to this range to avoid overflow.
Formalization (Sec. 2, Eqs. (1)–(3)):
- γ = (1/(nm)) Σ_ij |W_ij| (Eq. 3)
- W_f = RoundClip(W/(γ+ε), −1, 1) (Eq. 1)
- RoundClip(x, a, b) = max(a, min(b, round(x))) (Eq. 2)
Rationale for this design:
- Single-scale “absmean” normalization stabilizes ternarization by matching each weight matrix’s dynamic range to the 3-level codebook. Adding 0 to the original ±1 codebook enables explicit “feature filtering,” i.e., the model can drop unhelpful signals via zero weights (Sec. 1–2).
Activation quantization to 8 bits
Activations are quantized to 8 bits with symmetric per-token scaling to [-Q_b, Q_b], removing zero-point quantization (Sec. 2).
Why symmetric and per-token? It simplifies implementation and system-level optimization and, in experiments, has “negligible effects” on performance (Sec. 2). Symmetry also matches the ternary, zero-centered weight distribution.
Computation model and kernels
Because weights are ternary, matrix multiplication reduces to integer additions and sign flips; multiplications are largely eliminated (Sec. 1–2; Fig. 1).
In GPU experiments, a 2-bit kernel from Ladder [WMC+23] is used to implement fast low-bit operations (Sec. 3), indicating the system-level feasibility even without specialized hardware.
Architecture and training recipe
BitNet b1.58 adopts LLaMA-like modules: RMSNorm [ZS19], SwiGLU [Sha20], rotary positional embeddings [SAL+24], and removes all biases (Sec. 2).
It is trained from scratch on the same data and tokens as full-precision baselines to ensure matched comparisons (Sec. 3: 100B tokens on RedPajama [Com23]; and a separate 2T-token experiment against StableLM-3B [TBMR]).
Implementation friendliness
The LLaMA-alike design means it can be integrated with HuggingFace, vLLM [KLZ+23], and llama.cpp with minimal changes (Sec. 2).

Analogy: think of each weight as a tiny switch with three positions: “pass the signal” (+1), “invert the signal” (−1), or “ignore it” (0). With only these switches, the network avoids expensive multiplication, and computing matrix-vector products becomes flipping signs and summing integers.

4. Key Insights and Innovations¶

Ternary weights with explicit zero (“1.58-bit”) as a Pareto improvement
What’s new: extends 1-bit {−1, +1} weights to ternary {−1, 0, +1} using a simple absmean scaling and rounding (Sec. 2).
Why it matters: keeps the no-multiplication paradigm and low memory footprint while improving modeling capacity via explicit “feature filtering” (zeros). The paper reports accuracy parity with FP16 starting from 3B parameters (Tables 1–2), which is a qualitative step beyond prior 1-bit results.
Symmetric per-token 8-bit activation quantization
What’s new: activations are scaled to [-Q_b, Q_b] per token, removing zero points (Sec. 2).
Why it matters: simplifies hardware/software (no asymmetric zero-point handling) with negligible measured accuracy impact, and halves activation precision vs FP16—important for KV cache memory in long-context inference (Sec. 4).
End-to-end system-level efficiency with measured speedups
What’s new: demonstrates practical inference benefits on GPUs using an existing low-bit kernel (Sec. 3), not just theoretical energy models.
Why it matters: Latency, memory, batch size, and throughput improvements are large and grow with model size (Fig. 2; Table 3), making the approach immediately relevant for serving.
Energy scaling perspective and “new scaling law” equivalences
What’s new: estimates 71.4× lower arithmetic energy for matmul on 7nm and shows end-to-end energy advantages that increase with model size (Fig. 3).
Why it matters: It reframes how to scale LLMs—bigger models may be cheaper to deploy if weights are 1.58-bit, leading to equivalence rules of thumb (e.g., 70B 1.58-bit ≈ 13B FP16 in cost; Sec. 3, “BitNet b1.58 is enabling a new scaling law…”).

Fundamental innovations: the ternary no-multiplication compute paradigm at LLM scale and the demonstration of accuracy parity at multi-billion parameter scale. Incremental but important: the specific quantization choices (absmean, symmetric per-token) and LLaMA-compatible design for ecosystem adoption.

5. Experimental Analysis¶

Setup
Pretraining: Both BitNet b1.58 and reproduced LLaMA FP16 baselines are trained on RedPajama for 100B tokens to match data/tokens (Sec. 3).
Metrics: Validation perplexity on WikiText2 and C4; zero-shot accuracy on ARC-Easy/Challenge, HellaSwag, Winogrande, PIQA, OpenBookQA, BoolQ using lm-evaluation-harness (Sec. 3).
System measurements: GPU runtime memory and latency using FasterTransformer with Ladder’s 2-bit kernel; time per output token (Sec. 3). Throughput measured on two 80GB A100s with pipeline parallelism (Sec. 3; Table 3).
Energy: Estimated using component-wise arithmetic energy on 7nm (Horowitz model) and reported end-to-end energy trends (Fig. 3).
Main quantitative results
Accuracy and perplexity parity starting at 3B:
- Perplexity (Table 1): at 3B, BitNet b1.58 PPL = 9.91 vs LLaMA FP16 = 10.04; at 1.3B and 700M, small gaps remain.
- Zero-shot accuracy (Table 2): at 3B, average 50.2 vs 49.7 (BitNet higher).
Latency and memory (Fig. 2; Table 1):
- At 3B: latency 2.71× faster (1.87 ms vs 5.07 ms); memory 3.55× lower (2.22 GB vs 7.89 GB) while matching perplexity (Table 1).
- Scaling trends (Fig. 2, left/right): speedup grows with size—1.67× (1.3B), 2.90× (7B), 3.68× (13B), 4.10× (70B); memory reductions similarly grow—2.93×, 4.40×, 5.12×, 7.16×.
Throughput and batch size (Table 3):
- 70B models on 2×A100: max batch size 176 vs 16 (11.0×), throughput 2977 vs 333 tokens/s (8.9×).
Energy (Fig. 3):
- Arithmetic energy composition: majority INT8 adds for BitNet b1.58; FP16 adds and multiplies for LLaMA. Estimated 71.4× arithmetic energy savings for matmul on 7nm.
- End-to-end energy advantage increases with model size: 18.6× (1.3B) up to 41.2× (70B).
More data (Table 4, 2T tokens):
- Against StableLM-3B (2T tokens), BitNet b1.58 3B scores higher on all five tasks; average 74.34 vs 73.22.
Representative citations of results

Table 1: “BitNet b1.58 3B … Latency 1.87 ms (2.71×), Memory 2.22 GB (3.55×), PPL 9.91 vs LLaMA 3B PPL 10.04.”

Table 2: “BitNet b1.58 3B: Avg. 50.2 vs LLaMA 3B: 49.7.”

Figure 2: “BitNet b1.58 70B is 4.1× faster … memory 7.16× lower.”

Table 3: “70B: Max batch 176 (11.0×), Throughput 2977 tokens/s (8.9×).”

Figure 3: “71.4× arithmetic ops energy savings; end-to-end energy 18.6× to 41.2× across sizes.”

Do the experiments support the claims?
For equal-size comparisons on matched data/tokens, the 3B and larger results convincingly show quality parity or superiority with strong system-level wins (Tables 1–2, Fig. 2–3).
Throughput and memory experiments use established codebases (FasterTransformer; Ladder kernel), lending credibility to the reported speedups on current GPUs (Sec. 3).
Energy is estimated (not measured) based on a standard model; trends are plausible but depend on hardware assumptions (Fig. 3).
Ablations and robustness
The excerpt does not report ablations isolating the impact of: absmean vs alternative quantizers, per-token activation scaling, or the contribution of the zero state (0) to accuracy.
No instruction-following or chat fine-tuning evaluations are reported here; evaluations are zero-shot on common benchmarks plus perplexity.
Conditions and trade-offs observed
At small sizes (700M–1.3B), BitNet b1.58 lags slightly in perplexity and average zero-shot accuracy (Tables 1–2), but the gap closes and reverses by 3B. System gains exist at all scales.

6. Limitations and Trade-offs¶

Training from scratch instead of post-training quantization
The approach relies on full pretraining with ternary weights and 8-bit activations. It does not address converting an existing FP16 model to 1.58-bit with minimal retraining (Sec. 2–3).
Embeddings remain full-precision
Memory and speed improvements increase with size partly because embeddings (kept in full precision) occupy a smaller fraction of total parameters in larger models (Sec. 3, Memory discussion). This slightly limits benefits at small scales.
Hardware dependencies and energy estimates
Latency and memory gains are measured on GPUs, but the largest energy savings are extrapolated from a 7nm energy model (Fig. 3 left). Real-world energy depends on platform specifics and system components beyond arithmetic (Fig. 3 right suggests non-matmul costs are non-negligible at smaller scales).
Limited task coverage and analysis
Accuracy comparisons focus on perplexity and a standard zero-shot suite. There is no report here on instruction tuning, safety/alignment evaluations, multilingual performance, robustness to distribution shift, or long-context reasoning beyond KV-cache memory implications.
Small-model performance
Below 3B parameters, BitNet b1.58 shows small quality deficits vs FP16 (Tables 1–2). The ternary constraint may hurt expressivity when capacity is limited.
Implementation specifics not deeply detailed
The excerpt does not specify optimizer, learning-rate schedules, or gradient quantization. Practical stability and convergence details for training such low-bit models at scale remain an area where guidance would help adoption.

7. Implications and Future Directions¶

How this changes the landscape
A credible path to “no-multiplication” LLM inference: by achieving parity with FP16 at 3B and above, 1.58-bit weights plus 8-bit activations shift the cost-quality frontier, enabling larger models to be served at the cost of much smaller FP16 models (Sec. 3, “new scaling law”).
It invites hardware specialization: integer-add-dominant matmuls with ternary weights motivate accelerators optimized for bit-serial or lookup-based arithmetic (Fig. 1; Sec. 4 “New Hardware for 1-bit LLMs”).
Follow-up research enabled or suggested
Quantization methodology
- Ablate and optimize quantizers (e.g., per-channel vs per-matrix scaling; alternatives to absmean; learned codebooks).
- Explore activation precision further; the paper notes potential lossless compression of activations to 4 bits “or even lower” for KV caches (Sec. 4).
Training algorithms
- Study optimizers and regularizers tailored for ternary weights; investigate gradient quantization and straight-through estimators at scale.
- Investigate curriculum or distillation methods to transfer from FP16 checkpoints to 1.58-bit with minimal retraining.
Architectures
- 1.58-bit Mixture-of-Experts (MoE) to cut FLOPs while also reducing memory and communication overhead (Sec. 4). With smaller parameters and activations, more experts may fit on a single device.
- Long-sequence models: 8-bit (and potentially 4-bit) KV caches directly address the main practical barrier for long-context inference (Sec. 4).
Systems and hardware
- Integrate low-bit matmuls into mainstream inference stacks (e.g., vLLM, Triton kernels) and CPUs for edge/mobile, where the paper argues 1.58-bit is “more friendly” (Sec. 4).
- Co-design with new hardware (e.g., LPUs), building dataflows specialized for ternary weights and integer additions (Sec. 4).
Practical applications
Cost-effective deployment of larger models for chatbots, code assistants, and enterprise search where latency and energy budgets are strict.
Edge and mobile on-device inference with significantly lower memory and power, enabling privacy-preserving or offline use cases (Sec. 4).
High-throughput batch serving: 8.9× throughput at 70B on 2×A100 (Table 3) suggests strong benefits for large-scale API providers.

In summary, BitNet b1.58 demonstrates that LLMs with ternary weights and 8-bit activations can match full-precision accuracy at multi-billion parameter scales while delivering large, increasing system-level gains. The combination of a simple absmean ternarization, symmetric per-token activation quantization, and LLaMA-compatible design makes this a compelling direction for both immediate deployment and future hardware-software co-design.