SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

🎯 Pitch

SigLIP 2 is a family of multilingual vision-language encoders that combines sigmoid-based image-text alignment, decoder-based localization/captioning pretraining, and self-supervised local feature learning in a single open-source model suite. The combined recipe improves zero-shot classification, retrieval, localization, and dense prediction, adds multi-resolution, aspect-ratio–preserving inference, and substantially reduces measured representation bias. These gains target real-world applications such as search, OCR, document analysis, and multilingual use, and they make the encoders stronger front-ends for multimodal LLMs.

1. Executive Summary

SigLIP 2 is a family of open, multilingual vision–language encoders that unifies three previously separate ideas—sigmoid-based image–text alignment, decoder-based caption/localization pretraining, and self-supervised local feature learning—into one training recipe. Across four model sizes, it improves zero-shot classification, retrieval, localization, and dense prediction, while adding single-checkpoint multi-resolution and aspect-ratio–preserving inference (NaFlex) and significantly reducing measured representation bias (Sec. 2; Tables 1–5; Figs. 3–6).

2. Context and Motivation

Problem addressed
CLIP-style encoders are strong at semantic alignment (e.g., “a dog” ↔ a dog image) but lag on localization (where in the image?), dense prediction (pixel-level tasks), and multilingual robustness; open models often follow older CLIP-style recipes and miss recent advances (Sec. 1).
Standard fixed-square resizing distorts aspect ratios, hurting OCR, documents, UI screenshots, and other aspect-sensitive inputs (Sec. 2.4.2; Fig. 3).
Smaller models underperform at low compute budgets; multilingual training and fairness require care in data mixture and filtering (Sec. 2.1; Sec. 3.5).
Why it matters
Real-world uses (search, retrieval, OCR, document understanding, mobile UIs, robotics) need both global semantics and local/dense understanding, often across many languages and aspect ratios.
These encoders are the front-end to multimodal LLMs; better encoders directly raise the ceiling of VLM performance (Sec. 3.2; Fig. 4).
Prior approaches and gaps
CLIP/ALIGN contrastive training excels at alignment but lacks strong local/dense features (Sec. 1).
Additions from separate lines of work:
- Re-captioning with stronger captions (e.g., CoCa, TIPS) and captioner-based pretraining with decoders (e.g., LocCa) improve OCR/localization (Sec. 1; Sec. 2.2).
- Self-supervision (DINO-v2, SILC) improves dense features (Sec. 2.3).
- Open-weight encoders exist (OpenCLIP, EVA-CLIP, MetaCLIP, DFN) but generally hew close to the CLIP recipe and are largely English-centric (Table 1; Sec. 1).
Missing unification: no single open model combined all of these improvements with multilingual support, aspect-ratio preservation, and small-model optimization.
How this work positions itself
Builds on SigLIP (sigmoid loss) and integrates:
- A decoder for captioning, grounded captioning, and referring expressions (LocCa-style),
- Self-distillation and masked prediction for local/dense features (SILC/TIPS),
- A NaFlex variant for native aspect ratio and variable token counts,
- Multilingual tokenizer and data with de-biasing,
- Active data curation to lift small models (Sec. 2).
Releases open checkpoints at four sizes, ViT-B (86M), L (303M), So400m (400M), and g (1B), and remains backward-compatible with the SigLIP architecture (Sec. 1; Sec. 2.1).

3. Technical Approach

This section unpacks “what is trained,” “how it is trained,” and “why those choices were made.”

Architecture (Sec. 2.1)
The vision and text towers share the same ViT-style Transformer architecture; each tower pools its output tokens with a MAP head (multi-head attention pooling, used in place of a CLS token for global pooling).
Text length is 64 tokens with a multilingual Gemma tokenizer (256k vocab); text is lower-cased before tokenization.
For the largest model g/16, the vision encoder is paired with an So400m-sized text encoder (Sec. 2.1).
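To make the pooling step above concrete, here is a minimal sketch of attention pooling with a single probe query. It is a deliberate simplification: an actual MAP head uses multiple heads, learned projections, and an MLP, none of which are modeled here. The function name `attention_pool` and the toy inputs are hypothetical.

```python
import math

def attention_pool(tokens, query):
    """Pool a list of token vectors into one vector using one probe query.

    Simplified, single-head stand-in for a MAP head (an assumption, not the
    released implementation): no learned projections, no MLP.
    """
    dim = len(query)
    # Scaled dot-product score between the probe query and every token.
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
              for tok in tokens]
    # Softmax over tokens (subtract the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The pooled embedding is the attention-weighted average of the tokens.
    return [sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(len(tokens[0]))]
```

With an all-zero query the weights are uniform, so the result is the plain mean of the tokens; a query aligned with one token shifts the pooled vector toward it.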
Data and optimization (Sec. 2.1)
Training corpus: WebLI (10B images, 12B alt-texts) spanning 109 languages.
Mixture: 90% English, 10% non-English (recommended balance from prior multilingual work).
De-biasing: filters from [2] applied to mitigate first-order (representation imbalance) and second-order (attribute associations) biases.
Optimization and compute: Adam (lr 1e-3), decoupled weight decay 1e-4, gradient clipping to norm 1; batch size 32k, cosine schedule with 20k warmup steps, 40B examples in total; trained on up to 2048 TPUv5e chips with fully sharded data parallelism (Sec. 2.1).
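The schedule implied by these numbers can be sketched as follows. This is an illustration under stated assumptions, not the training code: "32k" is read as 2**15 = 32,768, the warmup is assumed to be linear, and the cosine is assumed to decay to zero. The function name `learning_rate` is hypothetical.

```python
import math

BATCH_SIZE = 32_768               # assumption: "32k" means 2**15
TOTAL_EXAMPLES = 40_000_000_000   # 40B examples seen during pretraining
TOTAL_STEPS = TOTAL_EXAMPLES // BATCH_SIZE
WARMUP_STEPS = 20_000
PEAK_LR = 1e-3

def learning_rate(step):
    """Linear warmup (assumed) followed by cosine decay to zero (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(1.0, progress)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the run lasts roughly 1.22 million optimizer steps, so the 20k warmup covers under 2% of training.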
Training step 1: Global alignment + decoder pretraining (Sec. 2.2)
Sigmoid loss (“SigLIP”): Instead of CLIP’s softmax over the batch, it treats each image–text pair in the batch as a binary classification (match vs non-match) with a logistic regression per pair. This avoids normalization across the entire batch and has empirically strong alignment properties (Fig. 1, “Sigmoid loss (100%)”).
Decoder-based pretraining (“LocCa”):
- Attach a transformer decoder with cross-attention to the un-pooled vision tokens.
- Train it on three tasks, all in-batch:
  - Image captioning,
  - Grounded/dense captioning (predict region-specific captions given box coordinates),
  - Referring expression comprehension (predict bounding boxes for region-descriptive captions).
- Region–caption pairs are auto-annotated via open-vocabulary detection with n-grams and a fixed object-category set (Sec. 2.2).
- With 50% probability, captioning uses “parallel prediction”: all caption tokens are predicted at once from mask tokens, without a causal attention mask (Sec. 2.2).
- A chunked decoder loss reduces memory with the large 256k vocabulary.
Losses: Sigmoid loss and decoder loss are combined with equal weight during this stage (Fig. 1; Sec. 2.2).
Why: Caption/grounding tasks improve OCR/localization; sigmoid loss keeps strong global alignment.
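The pairwise sigmoid loss described at the start of this step can be sketched in a few lines. This is a minimal pure-Python illustration, not the released implementation: `temperature` and `bias` are learned during training and are fixed here to illustrative values, and the helper names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def l2_normalize(vec):
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]

def sigmoid_pairwise_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Binary classification over every image-text pair in the batch.

    Pair (i, j) is a positive when i == j and a negative otherwise, so no
    softmax normalization across the batch is needed.
    """
    imgs = [l2_normalize(v) for v in img_emb]
    txts = [l2_normalize(v) for v in txt_emb]
    total = 0.0
    for i, x in enumerate(imgs):
        for j, y in enumerate(txts):
            logit = temperature * sum(a * b for a, b in zip(x, y)) + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    # Normalize by batch size (number of images), not by number of pairs.
    return total / len(imgs)
```

In this stage the total objective would be this term plus the decoder loss with equal weight, as stated above.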
Training step 2 (last 20%): Local feature learning via self-distillation and masked prediction (Sec. 2.3)
Self-distillation (local-to-global consistency):
- A teacher network is an exponential moving average (EMA) of the student parameters.
- Student sees 8 local crops (“partial views”); teacher sees 1 global view.
- Student matches teacher features via a separate MLP head in a high-dimensional feature space (“consistency loss”), encouraging local patches to align with the global semantics (Sec. 2.3).
Masked prediction:
- Replace 50% of student’s patch embeddings with a learned mask token and train to match teacher’s per-patch features at masked locations (same loss family as above) (Sec. 2.3).
Scheduling and weighting:
- Added at 80% of training to avoid early distortions of image–text alignment (data augmentations only applied to the extra views, not the alignment view) (Sec. 2.3).
- Loss weights: 1.0 (consistency) and 0.25 (masked); then globally re-weighted by model size (B/L/So/g use multipliers 0.25/0.5/1.0/0.5) to balance global vs dense-task quality (Sec. 2.3).
Why: This late-stage, view-separated local learning strengthens dense features while protecting global image–text alignment.
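The bookkeeping for this stage (when the local losses switch on, how they are weighted per model size, the EMA teacher, and the 50% patch mask) can be sketched as below. All function names are hypothetical, and the EMA `momentum` default is an assumed value that the summary does not specify.

```python
import random

# Per-size multipliers applied on top of the base weights (Sec. 2.3).
SIZE_MULTIPLIER = {"B": 0.25, "L": 0.5, "So400m": 1.0, "g": 0.5}

def local_loss_weights(model_size):
    """Base weights 1.0 (consistency) and 0.25 (masked), scaled by size."""
    scale = SIZE_MULTIPLIER[model_size]
    return {"consistency": 1.0 * scale, "masked": 0.25 * scale}

def local_losses_enabled(examples_seen, total_examples):
    """The local losses are only added for the last 20% of training."""
    return examples_seen >= 0.8 * total_examples

def ema_update(teacher, student, momentum=0.999):
    """Teacher parameters track the student as an exponential moving average.

    The momentum value is an assumption for illustration.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]

def sample_masked_patches(num_patches, ratio=0.5, seed=0):
    """Pick the patch positions replaced by the mask token in the student."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_patches), int(num_patches * ratio)))
```

For example, a B-sized model ends up with effective weights 0.25 and 0.0625, while So400m keeps the base 1.0 and 0.25.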
Adapting to different resolutions
Fixed-resolution checkpoints (Sec. 2.4.1)
- At 95% training, resume from the 256-token model and resize positional embeddings to the target sequence length; sometimes also resize the patch embedding (16→14) using FlexiViT’s pseudoinverse (PI) strategy.
- Continue training with all losses. The common “small LR, no weight decay” fine-tuning did not work consistently across sizes/resolutions (Sec. 2.4.1).
NaFlex: native aspect ratio, variable sequence length (Sec. 2.4.2)
- Preprocess each image to multiples of the patch size with minimal distortion, preserving aspect ratio (like NaViT). Limit total tokens to a target sequence length.
- Encode with non-square positional embeddings by bilinear-resizing the learned 2D grid to match the resized patch grid; mask out any padding tokens (Sec. 2.4.2).
- Train from 90% completion of the default model by switching to aspect-preserving resizing and uniformly sampling sequence lengths from {128, 256, 576, 784, 1024}; stretch the last-10% schedule by 3.75× so each length gets enough updates; halve batch size and double steps for the largest length to fit memory.
- To keep complexity manageable, NaFlex does not include the self-distillation/masked-prediction heads during this adaptation (Sec. 2.4.2).
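A rough sketch of the NaFlex preprocessing idea follows: choose a patch grid that approximately preserves the aspect ratio while staying within a token budget, then mark padding positions. The rounding and shrinking rules are assumptions made for illustration and are not taken from the paper; `naflex_grid` and `padding_mask` are hypothetical names.

```python
import math

def naflex_grid(height, width, patch=16, max_tokens=256):
    """Return (rows, cols) of patches with rows * cols <= max_tokens."""
    # Continuous scale at which the patch grid would use the full budget.
    scale = math.sqrt(max_tokens * patch * patch / (height * width))
    rows = max(1, round(height * scale / patch))
    cols = max(1, round(width * scale / patch))
    # Rounding can overshoot the budget; shrink the longer side if so.
    while rows * cols > max_tokens:
        if rows >= cols:
            rows -= 1
        else:
            cols -= 1
    return rows, cols

def padding_mask(rows, cols, max_tokens=256):
    """True for real patch tokens, False for padding up to max_tokens."""
    used = rows * cols
    return [True] * used + [False] * (max_tokens - used)
```

Under these rules a 200×800 (height×width) image maps to an 8×32 grid at 256 tokens, keeping the 1:4 aspect ratio, whereas square resizing would force it into 16×16. The image would then be resized to `rows * patch` by `cols * patch` pixels before patchification.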
Boosting small models via active data curation (Sec. 2.5)
For B/16 and B/32, continue training for 4B examples with only the sigmoid image–text loss, lr 1e-5, no weight decay.
Use ACID (Active Curation as Implicit Distillation): at each step, score many candidates from a larger super-batch for “learnability” using the frozen teacher and the current learner, then keep only the top-scoring fraction as the actual training batch (filtering ratio 0.5; for B/32, 0.75) (Sec. 2.5).
Teacher choice: fine-tune a strong diversified teacher (SigLIP 2 So400m) for 1B examples on a curated dataset so “implicit distillation through data” approximates ACED’s benefits without explicit soft-label loss—saving compute (Sec. 2.5).
Why: Curating harder/useful examples for the learner meaningfully raises small-model quality at constant compute.
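The selection step can be sketched as follows. This is a simplified illustration: it scores each example independently as learner loss minus teacher loss and keeps the top scorers, whereas the actual method may select examples jointly. Interpreting the filtering ratio as the fraction of the super-batch that is discarded is an assumption, and `select_batch` is a hypothetical name.

```python
def select_batch(learner_losses, teacher_losses, filter_ratio=0.5):
    """Return indices of super-batch examples kept for the training step.

    An example is considered "learnable" when the current learner still has
    a high loss on it but the frozen teacher has a low one.
    """
    scores = [l - t for l, t in zip(learner_losses, teacher_losses)]
    # Assumption: filter_ratio is the fraction of candidates thrown away.
    keep = max(1, round(len(scores) * (1.0 - filter_ratio)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Examples that both models find easy, or that both find hard (likely noise), receive low scores and are dropped, which is how the teacher's knowledge reaches the learner through the data alone.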

4. Key Insights and Innovations

Unifying three complementary training signals in one recipe (Fig. 1; Sec. 2)
What’s new: A single encoder is trained with (a) sigmoid alignment, (b) a decoder for caption/grounding/referring expressions, and (c) late-stage self-distillation + masked prediction.
Why it matters: Each piece targets a different weakness—global alignment, localization/OCR, and dense semantics—yielding broad gains with one set of weights (Tables 1–5).
Late-stage, view-separated local feature learning (Sec. 2.3)
Different from prior: Applies consistency/masking only in the final 20%, and only on additional augmented views, to avoid harming image–text alignment while still improving dense features (a known tension).
Impact: Large boosts on segmentation/depth/normals and localization tasks without sacrificing zero-shot retrieval/classification (Table 2; Table 5).
NaFlex: one checkpoint, many resolutions, native aspect ratio (Sec. 2.4.2; Fig. 3; Table 7)
Different from prior: Combines FlexiViT-style variable token counts with NaViT-style aspect preservation in an image–text encoder, using resized learned positional embeddings and attention masking.
Impact: Especially strong on OCR/document/screen benchmarks at low token budgets, where square-resize distortion is most harmful (Fig. 3; Table 7).
Multilingual encoder with de-biasing that retains English strength (Sec. 2.1; Sec. 3.1; Fig. 2; Tables 1, 8–9)
Different from prior: Uses a multilingual tokenizer and 90/10 EN/non-EN WebLI mixture with explicit de-biasing filters.
Impact: Large multilingual retrieval gains over SigLIP and near parity with multilingual SigLIP (mSigLIP) while improving English tasks; representation bias drops dramatically (Fig. 2; Table 1; Table 9).
Active data curation for small models (Sec. 2.5; Table 1)
Different from prior: Reuses ACID with a specifically fine-tuned teacher to capture ACED-like benefits without explicit distillation loss.
Impact: B-sized models get outsized improvements, closing much of the gap at lower inference cost (Table 1, B/16 and B/32 rows).

Overall, the unification and scheduling choices are the fundamental innovations; NaFlex and active curation are impactful engineering contributions that broaden applicability and improve the cost–quality trade-off.

5. Experimental Analysis

Evaluation setup
Zero-shot classification and retrieval (Sec. 3.1; Table 1)
- Datasets: ImageNet-1k (val), ImageNet-v2, ImageNet ReaL, ObjectNet; COCO and Flickr retrieval; multilingual Crossmodal-3600 (XM3600) retrieval.
- Metric: Accuracy for classification; recall@1 for retrieval (percentage of queries where the correct match is ranked first).
NaFlex vs standard (Sec. 3.1.1; Fig. 3; Table 7)
- Compare single NaFlex checkpoint vs separate fixed-resolution checkpoints across sequence lengths.
- Add OCR/document/screen retrieval: TextCaps, HierText, SciCap, Screen2Words.
VLM transfer (Sec. 3.2; Fig. 4; Table 6)
- Pair each encoder with the Gemma 2 (2B) LLM. Stage 1: train on 50M examples of a rich multi-task mixture with the vision encoder frozen; Stage 3: dataset-specific fine-tuning.
- Resolutions: 224px or 256px input (patch size 14 or 16, respectively; 256 image tokens in both cases) and 384px.
Dense prediction probing (Sec. 3.3.1; Table 2)
- Tasks: semantic segmentation (PASCAL, ADE20k), monocular depth (NYUv2, NAVI), surface normals (NYUv2, NAVI).
- Protocol: linear head or DPT decoder on frozen features; use MAP-pooled embedding in place of a CLS token as in [38].
Open-vocabulary segmentation (Sec. 3.3.2; Table 3)
- Framework: Cat-Seg; train on COCO-Stuff-164k; test on ADE20k (847/150 classes), Pascal Context (459/59), Pascal VOC (20/21). Metric: mIoU.
Referring expression comprehension (Sec. 3.4.1; Table 5)
- Attach a 6-layer cross-attention decoder to frozen vision tokens; train on all RefCOCO variants; metric: Acc@0.5 (IoU threshold).
Open-vocabulary detection (Sec. 3.4.2; Table 4)
- OWL-ViT fine-tuning for COCO and LVIS; metrics: AP (and rare-category AP on LVIS).
Cultural diversity and fairness (Sec. 3.5; Fig. 5–6; Tables 8–9)
- Cultural: Dollar Street, GeoDE, GLDv2; report 0-shot and 10-shot accuracies (geographical diversity, geolocalization).
- Fairness: “Representation bias” (tendency to associate random objects with one gender) and “disparity” (max difference in accuracy across Dollar Street income groups).
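Two of the metrics in the setup above, recall@1 for retrieval and Acc@0.5 for referring expressions, can be written down compactly. The sketch assumes exactly one correct candidate per query (real retrieval benchmarks may have several), boxes given as `(x1, y1, x2, y2)`, and a hit counted when IoU is at least 0.5; the function names are hypothetical.

```python
def recall_at_1(similarity, gold):
    """Percentage of queries whose top-ranked candidate is the gold match.

    similarity[q][c] is the score of candidate c for query q.
    """
    hits = 0
    for row, target in zip(similarity, gold):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == target)
    return 100.0 * hits / len(similarity)

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(predicted, gold, threshold=0.5):
    """Percentage of predicted boxes overlapping their gold box enough."""
    hits = sum(int(iou(p, g) >= threshold) for p, g in zip(predicted, gold))
    return 100.0 * hits / len(predicted)
```

Both functions return percentages, matching how the results below are reported.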
Main results (highlights with grounded references)
Zero-shot and retrieval (Table 1)
- B/16, 256 tokens: ImageNet val 79.1% vs 76.7% (SigLIP), ReaL 85.4 vs 83.1; COCO R@1 T→I 53.2 vs 47.4; XM3600 R@1 T→I 40.7 vs 22.5.
- L/16, 256: ImageNet val 82.5 vs 80.5; ObjectNet 78.8 vs 76.8; XM3600 T→I 46.5 vs 30.9.
- Across sizes and resolutions, SigLIP 2 outperforms OpenCLIP, MetaCLIP, EVA-CLIP, and DFN in most settings despite being multilingual.
Multilingual retrieval (Fig. 2; Table 1)
- Per-language XM3600: SigLIP 2 nearly matches multilingual SigLIP (mSigLIP) while greatly exceeding English-centric SigLIP. For example, average XM3600 R@1 T→I improves from 22.5 (SigLIP B/16-256) to 40.7 (SigLIP 2 B/16-256); for reference, mSigLIP reaches 50.0 at the larger So/16-256 size (Fig. 2; Table 1).
NaFlex vs standard (Fig. 3; Table 7)
- On OCR/document/screen retrieval at small sequence lengths, NaFlex is stronger (e.g., B/16 TextCaps R@1 at 256 tokens: 19.7/17.1 I→T/T→I for NaFlex vs 17.1/14.2 standard; Table 7).
- On natural-image classification/retrieval, standard B-sized checkpoints often edge out NaFlex, likely thanks to a distillation step that the NaFlex variant lacks (Fig. 3). NaFlex “interpolates fairly well” between trained lengths but “does not extrapolate well” to unseen lengths (Fig. 3 caption).
VLM transfer (Fig. 4; Table 6)
- With Gemma 2 (2B), SigLIP 2 beats SigLIP and AIMv2 across model sizes and resolutions.
- Examples at So400m/14, 384px: TextVQA val +4.3 points (69.7→74.0), ST-VQA +2.3 (75.0→77.3), DocVQA +3.2 (62.7→65.9), SciCap +2.1 (177.2→179.3); RefCOCO testA +1.6 (76.6→78.2). Many tasks show consistent but smaller gains (Fig. 4; Table 6).
Dense prediction probing (Table 2)
- So/14, 384px: PASCAL mIoU 78.1 vs 73.8 (SigLIP); ADE20k mIoU 45.4 vs 40.8; NYUv2 depth RMSE 0.466 vs 0.563 (lower is better); normals RMSE 23.0 vs 24.1. Even So/14, 224px shows strong gains.
Open-vocabulary segmentation (Table 3)
- L/16: On ADE20k-150, 38.8 mIoU vs 37.5 (SigLIP) and 36.2 (OpenCLIP G/14). Gains also on Pascal Context and VOC.
Referring expression comprehension (Table 5)
- Large gains over SigLIP at all sizes on RefCOCO val: B/16 (256 tokens) 83.76 vs 64.05; L/16 (256) 86.04 vs 67.33; So/14 (256) 86.42 vs 64.68. Only LocCa (English-only training) slightly exceeds SigLIP 2 in some splits.
Open-vocabulary detection (Table 4)
- OWL-ViT fine-tuning: B/16 COCO AP 42.8 vs 42.2; LVIS AP 34.4 vs 33.0; rare categories 32.7 vs 31.0. So/14 COCO 45.2 vs 44.3; LVIS 40.5 vs 39.5; rare 42.3 vs 40.9.
Cultural diversity and fairness (Fig. 5–6; Tables 8–9)
- 10-shot geolocalization on GeoDE (region), L/16 (256): 44.4% vs 36.2% (SigLIP). 0-shot Dollar Street: 55.2% vs 52.1%; GLDv2: 64.5% vs 56.7% (Table 8; Fig. 5).
- Representation bias drops sharply at L/16 (256): 7.3% vs 35.5% (lower is better), with larger models generally fairer (Table 9; Fig. 6).
- Income/region disparity reductions are modest (Table 9 and Sec. 3.5).
Do the experiments support the claims?
Breadth and consistency: Strong, spanning global alignment (Table 1), multilingual (Fig. 2), local/dense (Tables 2–3, 5), detection (Table 4), VLM transfer (Fig. 4), and fairness (Fig. 6; Table 9).
Component attribution: While the full recipe clearly outperforms baselines, there are limited explicit ablations isolating the contribution of each new loss (e.g., turning off decoder or self-distillation). The timing (late-stage) and view separation are design choices substantiated by results on dense/local tasks without alignment regressions.
Robustness and trade-offs visible in results
NaFlex trades a bit of performance on natural images at B-size (vs standard) but wins on OCR/screen with fewer tokens; it interpolates well but not beyond trained lengths (Fig. 3; Table 7).
Referring expressions: SigLIP 2 is very strong but English-only LocCa is slightly ahead in some splits (Table 5), suggesting multilingual training may slightly dilute maximal English-only specialization on this task.

6. Limitations and Trade-offs

Compute and reproducibility
Pretraining on 40B examples with up to 2048 TPUv5e and large batch FSDP is resource-intensive (Sec. 2.1). Small-model improvements add 4B more examples (Sec. 2.5).
Component attribution
The paper does not provide fine-grained ablations for each added loss and scheduling choice across all tasks, making it hard to quantify per-component gains beyond qualitative rationale (Sec. 2).
NaFlex caveats (Sec. 3.1.1; Fig. 3)
Omitting self-distillation/masked prediction during NaFlex adaptation (an engineering simplification) may leave natural-image performance slightly behind the standard B-size checkpoints.
Does not extrapolate well to untrained sequence lengths.
Decoder not shipped for inference
The decoder is only used during pretraining; tasks that fully exploit its cross-attention (e.g., referring expressions) may further improve if the pretrained decoder were retained (Table 5 discussion).
Multilingual scope and cultural fairness
Despite improvements, SigLIP 2 slightly trails mSigLIP on some multilingual retrieval languages (Fig. 2). Disparity by income/region shows only modest reductions (Table 9).
Data mixture choices
The 90/10 EN/non-EN split is a compromise; performance on truly low-resource languages or scripts not well represented in WebLI is not deeply examined.

7. Implications and Future Directions

Field impact
SigLIP 2 provides a new “default” open encoder for multimodal systems: stronger global alignment, better dense/local features, multilingual coverage, fairness improvements, and practical NaFlex inference.
For VLM builders, Fig. 4/Table 6 indicate that simply swapping the encoder lifts many downstream tasks without changing LLM or training recipes.
Practical applications
Cross-lingual retrieval/search; OCR-heavy tasks (TextVQA, documents, UI); open-vocabulary detection/segmentation for robotics and AR; geolocalization and diverse-object recognition; fairer and more culturally inclusive systems (Tables 1–4, 6–9; Figs. 3–6).
NaFlex checkpoints are especially attractive in production settings where images vary widely in aspect ratio and compute budgets fluctuate.
Research directions
Joint training of NaFlex with self-distillation/masking to close the small natural-image gap while retaining OCR/screen gains (Sec. 2.4.2; Fig. 3).
Detailed ablations to quantify the contribution and optimal scheduling/weighting of decoder vs self-distillation vs masked prediction, and their interactions.
Multilingual data mixtures beyond 90/10, targeted augmentation for low-resource scripts, and fairness evaluations across additional sensitive attributes and benchmarks (Sec. 3.5).
Retaining the pretrained decoder, or adapting it with a lightweight procedure, for localization tasks at inference time (Table 5 suggests headroom vs LocCa).
Extending active data curation beyond small models, or online curation that adapts to task-specific domains.

Bottom line (Sec. 5): SigLIP 2 demonstrates that carefully combining sigmoid alignment, decoder-based grounding, and late-stage local feature learning, together with pragmatic advances in aspect-ratio handling, multilingual tokenization, and data curation, produces a single family of encoders that advances the state of the art across classification, retrieval, VLM transfer, localization, dense prediction, and fairness (Tables 1–6; Figs. 2–6).