DINOv3¶

🎯 Pitch¶

DINOv3 introduces a new self-supervised vision foundation model that successfully scales Vision Transformers up to 7 billion parameters while preserving high-quality dense features—a breakthrough made possible by its innovative 'Gram anchoring' regularizer and scalable, metadata-free training pipeline. This enables a single frozen encoder to achieve state-of-the-art performance on dense computer vision tasks (like segmentation, depth, and tracking) while remaining competitive on global tasks, and crucially, it generalizes across diverse domains—including satellite imagery—without requiring fine-tuning, paving the way for robust, multi-task vision systems with dramatically lower deployment complexity.

1. Executive Summary¶

DINOv3 is a self-supervised vision foundation model and training recipe that scales a Vision Transformer (ViT) to 7B parameters while preserving high-quality dense features—solving a long-standing problem where large SSL models trained for long schedules lose local spatial consistency. It introduces a simple but powerful “Gram anchoring” regularizer, plus a robust data and training pipeline, to produce a frozen encoder that achieves state-of-the-art results on dense tasks (e.g., segmentation, depth, tracking) and competitive performance on global tasks, and it transfers across domains (including satellite imagery) without task-specific fine-tuning (Secs. 3–6; Figs. 2–4, 8–11; Tabs. 3–12, 17–19).

2. Context and Motivation¶

The problem addressed
Scaling self-supervised learning (SSL) in vision to large models and long training schedules typically improves global recognition but degrades dense features (local patch-level consistency), which are crucial for segmentation, depth, correspondence, and tracking (Sec. 1; Fig. 5b–c, Fig. 6).
Standard training practice also depends on pre-specified cosine schedules and curated metadata; both are difficult to set or obtain at scale without labels (Sec. 1; Sec. 3.1–3.2).
Why it matters
Dense features are the backbone of many high-value applications—autonomous driving, robotics, 3D vision, video understanding, medical and geospatial imaging—where pixel- or patch-level outputs are needed and annotations are scarce (Sec. 1; Sec. 2 “Dense Transformer Features”).
A frozen encoder that works off-the-shelf across tasks and domains can dramatically reduce compute and complexity in deployment, enabling multi-task systems with a single backbone (Sec. 1; Sec. 6.3).
What existed before and where it fell short
Weakly supervised (image–text) models such as CLIP families excel in global classification and zero-shot transfer but typically produce noisy or lower-quality dense features (Tab. 3, Tab. 5; Fig. 13).
Self-supervised models like DINOv2 offered strong dense features but did not scale cleanly: longer training and larger models led to patch-level degradation (Sec. 1; Fig. 5a–c, Fig. 6; also discussed by Fan et al., 2025).
“Agglomerative” dense models distilling from supervised segmenters (e.g., SAM) achieve strong dense outputs but rely on annotation-heavy teachers and still lag on some domains/tasks (Fig. 2; Tab. 3).
How this work positions itself
DINOv3 is an SSL-only pipeline that (1) scales model capacity and data with careful curation and constant hyperparameters; (2) introduces Gram anchoring to stabilize dense features during long training; (3) adds high-resolution post-training and efficient single-teacher/multi-student distillation; (4) extends to text alignment and a satellite domain variant—all while keeping the backbone frozen for most evaluations (Sec. 3–5; Sec. 6–8).

3. Technical Approach¶

This section explains the end-to-end recipe: data, model/losses, the Gram anchoring mechanism, post-training, and distillation.

Data creation and sampling (Sec. 3.1; Tab. 1)
Three data parts from a 17B-image pool: 1) A clustering-curated balanced subset LVD-1689M using hierarchical k-means over DINOv2 embeddings (Vo et al., 2024). 2) A retrieval-curated subset “close” to seed datasets (as in DINOv2). 3) Raw public CV datasets (ImageNet-1k/22k, Mapillary SLS).
Sampling strategy: in each iteration, either a homogeneous ImageNet-1k batch (10% of iterations) or a heterogeneous batch mixing the other components—balancing “high-quality focused” and “broad diverse” data (Sec. 3.1).
Ablation (Tab. 1): no single curation approach wins everywhere; the mixed pipeline gives the best average performance (e.g., IN1k-Linear 87.2, Paris Retrieval 85.9).
Model and SSL losses (Sec. 3.2; Tab. 2; Eq. 1)
Architecture: a custom ViT-7B with 40 blocks, 4096-d embeddings, 32×128 heads, a patch size of 16, and 4 register tokens (extra learned tokens that absorb global interactions and mitigate patch outliers; Darcet et al., 2024). Uses rotary positional embeddings (RoPE) with “box jitter” to improve robustness across scales and aspect ratios (Tab. 2).
Multi-crop SSL with two global and eight local crops; batch size 4096 across 256 GPUs; constant learning rate/weight decay/momentum (no cosine schedule) with warmup only, enabling indefinite training (Sec. 3.2).
Pretraining objective (Eq. 1): L_Pre = L_DINO + L_iBOT + 0.1 * L_DKoleo
- L_DINO: global discrimination via self-distillation with Sinkhorn-Knopp centering (like SwAV).
- L_iBOT: patch-level latent reconstruction objective.
- L_DKoleo: spreads features uniformly (Sablayrolles et al., 2018), implemented across GPUs in small batches.
- Dedicated heads and layer norms for global and local branches stabilize training (Sec. 3.2).
The observed failure mode: dense feature drift (Sec. 4.1; Fig. 5–6)
With scale and long training, classification keeps improving but dense performance declines (Fig. 5b–c).
Cosine similarity between CLS and patch tokens steadily increases, causing features to become less localized; similarity maps get noisier (Fig. 5a, Fig. 6).
Register tokens fix high-norm patch outliers (Appendix A.1; Fig. 20) but do not prevent the “similarity drift” that harms dense tasks.
Gram anchoring: stabilize local structure without freezing features (Sec. 4.2; Eq. 2–3; Fig. 7–10)
Key idea: constrain the student’s pairwise patch similarities (its Gram matrix) to match those from an earlier “Gram teacher” checkpoint that still has clean dense structure.
- Define Gram matrix as all pairwise dot-products between L2-normalized patch features within a crop.
- Loss (Eq. 2): L_Gram = || X_S X_S^T – X_G X_G^T ||_F^2, where X_S are student patch features and X_G are the teacher’s.
Training schedule:
- Run standard SSL for ~1M iterations; then start a refinement phase with L_Ref = w_D L_DINO + L_iBOT + w_DKL L_DKoleo + w_Gram L_Gram (Eq. 3).
- Update the Gram teacher every 10k iterations to the current EMA teacher for a few steps (Sec. 4.2; Fig. 7).
Effects:
- Immediate dense gains within ~10k iterations (Fig. 8), while global metrics remain stable or slightly improve.
- L_iBOT decreases faster under Gram anchoring, suggesting synergy between local reconstruction and similarity-structure constraints (Fig. 7a–c).
High-resolution Gram and high-resolution adaptation (Sec. 4.3; Sec. 5.1; Figs. 9–11)
Compute teacher features at 2× input resolution, then bicubic downsample features to the student resolution; use these smoothed high-res features for L_Gram (“LHRef” in Fig. 8–10).
- Qualitatively preserves fine structure in downsampled cosine maps (Fig. 9a).
- Quantitatively adds +2 mIoU on ADE20k over the non-HR Gram (Tab. in Fig. 9b).
Resolution adaptation (10k iterations) with mixed global/local crop sizes and Gram anchoring makes features robust up to very high resolutions (4K+)—improving dense tasks at high res while keeping global performance stable (Fig. 11; Fig. 4).
Efficient multi-student distillation (Sec. 5.2; Fig. 12; Fig. 16; Tab. 14–15)
Distill the 7B teacher into a family of smaller ViTs (S, S+, B, L, H+) and ConvNeXts (T, S, B, L).
A single-teacher/multi-student pipeline shares expensive teacher inference across all GPUs, then trains each student in its own synchronized group; group sizes are tuned so all students step in lockstep (Fig. 12).
Results: large students (e.g., ViT-H+) match the 7B teacher closely (Fig. 16b); ConvNeXt students inherit robust dense and OOD performance even though they are convolutional (Tab. 15).
Optional text alignment (Sec. 5.3; Tab. 16)
A lightweight LiT-style head trains a text encoder from scratch to match a frozen DINOv3 image encoder; concatenates mean-pooled patch embeddings with the CLS token to align both global and local signals.
Delivers competitive zero-shot classification/retrieval and strong open-vocabulary segmentation vs similarly sized models (Tab. 16).

4. Key Insights and Innovations¶

Gram anchoring for dense-feature stability (fundamental)
What’s new: regularize the student’s patch-similarity structure (Gram matrix) toward an earlier “clean” iteration of itself—preserving local consistency while allowing global features to continue improving (Sec. 4.2; Eq. 2–3).
Why it matters: it “repairs” dense features after long training at scale—immediate mIoU gains and visibly cleaner similarity maps (Figs. 8–10)—solving a central scalability barrier for SSL.
High-resolution Gram + resolution adaptation (fundamental + practical)
What’s new: compute teacher features at 2× resolution and distill their smoothed structure into the student; then run a short high-resolution mixed-crop training with Gram anchoring (Sec. 4.3; Sec. 5.1).
Why it matters: dense features remain crisp and semantically coherent up to very high resolutions (Figs. 4, 11), which is critical for dense tasks and large-scene imagery.
Constant-schedule large-scale SSL with robust ViT-7B design (incremental but enabling)
What’s new: constant LR/weight-decay/momentum after warmup, axial RoPE with “box jitter,” registers for outlier control, and a scalable multi-crop setup (Sec. 3.2; Tab. 2; Appendix A).
Why it matters: simplifies long, indefinite training and supports scaling to 7B parameters without dense-feature collapse (once Gram anchoring is applied).
Efficient multi-student distillation across architectures (practical innovation)
What’s new: a system to distill many students in parallel by sharing teacher inference, minimizing idle time with synchronized groups (Sec. 5.2; Fig. 12).
Why it matters: turns an expensive 7B teacher into a practical model family (ViT and ConvNeXt) that preserves the teacher’s dense and OOD strengths (Fig. 16; Tabs. 14–15).
Domain-general SSL that transfers to satellites (demonstration of generality)
What’s new: the same recipe directly applies to a 493M-image satellite dataset, yielding state-of-the-art canopy height mapping and strong results on GEO-Bench and high-res remote sensing datasets (Sec. 8; Tabs. 17–19; Fig. 18–19).
Why it matters: shows a label-free path for high-impact scientific and industrial domains with scarce metadata.

5. Experimental Analysis¶

Evaluation design: frozen encoders + light heads wherever possible (Sec. 6)
Dense linear probing: semantic segmentation (ADE20k/Cityscapes/VOC; mIoU) and depth (NYUv2/KITTI; RMSE) with a single linear layer on patch features (Sec. 6.1.2; Tab. 3).
Non-parametric dense tasks: 3D correspondences (NAVI/SPair; recall), unsupervised object discovery (TokenCut; CorLoc), video segmentation tracking (DAVIS/YouTube-VOS/MOSE; J&F) (Secs. 6.1.3–6.1.5; Tab. 4, Fig. 14, Tab. 5).
Video classification: attentive probe (4-layer transformer) on extracted per-frame patch features (UCF101/SSv2/Kinetics-400; top-1) (Sec. 6.1.6; Tab. 6).
Global probes: ImageNet and OOD variants (V2, ReaL, Renditions, Sketch, A, C (mCE), ObjectNet), fine-grained (Places205, iNat18/21, 12 small datasets) (Sec. 6.2; Tabs. 7–8, 22).
Instance retrieval: Oxford/Paris (mAP), Met (GAP), AmsterTime (mAP) (Sec. 6.2.2; Tabs. 9, 23).
Strong decoders on top of frozen backbone: detection (Plain-DETR on COCO/COCO-O), segmentation (ViT-Adapter + Mask2Former on ADE20k/COCO-Stuff/VOC/Cityscapes), depth (DPT in DAv2 pipeline), 3D (swap DINOv3 into VGGT) (Sec. 6.3; Tabs. 10–13, 24).
Main results (selected highlights)
Dense linear probes (Tab. 3): > ADE20k mIoU: DINOv3 55.9 vs DINOv2 49.5 (+6.4), AM-RADIO 53.0, PEspatial 49.3
> Cityscapes mIoU: DINOv3 81.1 vs DINOv2 75.6 (+5.5)
> NYUv2 RMSE↓: DINOv3 0.309 vs DINOv2 0.372 (better), PEspatial 0.362
> KITTI RMSE↓: DINOv3 2.346 vs DINOv2 2.624 (better)
3D correspondences (Tab. 4): > NAVI recall: DINOv3 64.4 vs DINOv2 60.1; SPair: 58.7 vs 56.1
Unsupervised object discovery (Fig. 14): > VOC07/12/COCO CorLoc: DINOv3 66.1 / 69.5 / 55.1 (best among compared models)
Video segmentation tracking (Tab. 5): > DAVIS J&F (L): DINOv3 83.3 vs DINOv2 76.6; YouTube-VOS (L): 80.7 vs 74.6; MOSE (L): 55.6 vs 48.5
Video classification (Tab. 6): > SSv2 Single: DINOv3 70.1 vs DINOv2 67.4; K400 Single: 87.8 vs 84.4; near SigLIP 2 / PEcore
Global linear probes (Tab. 7): > IN-1k val: DINOv3 88.4; OOD: ReaL 90.4, Rendition 91.1, Sketch 71.3, A 86.9, C mCE 19.6 (best), ObjectNet 79.0
Performance is close to or better than comparable weakly supervised models on several OOD sets.
Fine-grained classification (Tab. 8; Tab. 22): > iNat21: DINOv3 89.8 (best); Fine-S average: 93.0, competitive with PEcore 94.5
Instance retrieval (Tabs. 9, 23): > Oxford-H / Paris-H mAP: 60.7 / 87.1 (best)
> Met GAP: 55.4 (+10.8 over DINOv2)
> AmsterTime mAP: 56.5 (+7.6 over DINOv2)
Detection with frozen backbone (Tab. 10): > COCO mAP: 66.1 (TTA), state of the art among listed systems despite training only ~100M decoder parameters (backbone frozen)
> COCO-O mAP 66.4, ER 36.8 (highest ER listed)
Segmentation with frozen backbone (Tab. 11; Tab. 24): > ADE20k mIoU 63.0 (ties ONE-PEACE) at 896px; improves previous SOTA on COCO-Stuff/VOC among models listed, with frozen backbone
Depth (Tab. 12): > New SOTA on NYUv2, KITTI, ETH3D, ScanNet using frozen DINOv3 in a DPT head within DAv2 pipeline (e.g., NYUv2 ARel 4.3, δ1 98.0)
> On DIODE, ARel (25.6) lags behind DPT (18.2), though δ1 (82.2) is higher than some baselines—mixed result on that dataset
3D with VGGT (Tab. 13): > Improves over VGGT with DINOv2 across camera pose estimation (Re10K/CO3Dv2), DTU multi-view depth (lower overall error), and ScanNet-1500 matching (higher AUC)
Model family (Fig. 16; Tab. 14–15): > ViT-H+ nearly matches the 7B teacher while using ~1/10th the parameters (Fig. 16b)
> ConvNeXt students: large gains on OOD and dense tasks vs supervised ConvNeXts, especially at higher input resolutions (Tab. 15)
Text alignment (Tab. 16): > DINOv3-based dino.txt improves open-vocabulary segmentation notably (ADE20k 24.7, Cityscapes 36.9), competitive zero-shot classification/retrieval for its size
Geospatial transfer (Tabs. 17–19; Fig. 18–19): > Canopy height mapping SOTA MAE: Sat 7B = 2.2 (val), 3.2 (test); Open-Canopy MAE 2.02 (best) (Tab. 17)
> GEO-Bench: web 7B model tops mean scores across classification and segmentation even though it uses only RGB (Tab. 18)
> LoveDA/iSAID/DIOR: web 7B sets new or near-best marks with frozen backbone and standard decoders (Tab. 19)
Ablations, diagnostics, and robustness checks
Data curation ablation (Tab. 1): mixed curation is best overall.
Gram teacher ablation (Fig. 9b): early teachers (~100k–200k iters) work best; too-late teachers (1M) underperform, consistent with dense degradation.
Loss dynamics (Fig. 7): Gram anchoring primarily accelerates the patch-local objective (iBOT), minimal interference with global DINO.
Outlier analysis (Appendix A; Fig. 20): registers remove high-norm patch outliers better than attention/value-bias tricks; feature-dimension outliers are mostly neutralized by final layer norm.
Layer-wise utility (Appendix B.2; Fig. 21): depth/tracking/3D often peak around layer ~32, while classification/segmentation generally improve toward the last layer.
Do the experiments support the claims?
Yes on dense tasks: consistent gains across linear probes, non-parametric tasks, and strong decoders indicate genuinely higher-quality local features (Fig. 2; Tabs. 3–6, 10–13).
Yes on global tasks: DINOv3 is competitive with leading weakly supervised models without any labels, and excels on some robustness metrics (Tab. 7–9).
Domain generality is credible: the satellite model and even the web model perform strongly in geospatial settings (Sec. 8; Tabs. 17–19; Fig. 18–19).

6. Limitations and Trade-offs¶

Training cost and complexity
Training a 7B SSL ViT with 1M+ iterations is compute-intensive; the total project footprint is ~9M GPU hours (~2600 tCO2eq under the stated assumptions) even if a single 7B run is ~18 tCO2eq (Sec. 9; Tab. 20).
Although the backbone is frozen for most downstream tasks, the initial pretraining remains resource-heavy.
Dependence on early checkpoints for Gram anchoring
Gram anchoring requires a suitable “early teacher” checkpoint (best around 100–200k iterations; Fig. 9b). If those are missing or if pretraining starts already “too late,” dense repair is less effective.
Mixed results on OCR-heavy classification and specific datasets
On OCR-centric benchmarks, DINOv3 trails weakly supervised models trained on image–text pairs (Tab. 25). On DIODE depth (ARel), DINOv3+DPT underperforms DPT, though other metrics are strong (Tab. 12).
Architectural choices constrain granularity
Patch size 16 means native token grids can still be relatively coarse at low input resolution. The work partly offsets this via high-resolution adaptation and decoders (Sec. 5.1; Sec. 6.3), but patch size is a design trade-off.
Assumptions and scope
The training relies on large-scale web imagery (Instagram-derived pool with moderation) and curation heuristics; while transparent and ablated (Sec. 3.1; Tab. 1), this may not cover rare or sensitive domains.
Constant-schedule optimization is a design choice; while it simplifies indefinite training, it may not always be optimal for every setting or hardware profile.

7. Implications and Future Directions¶

Field-level impact
Demonstrates that SSL can match or surpass weakly/supervised pipelines on dense vision while remaining highly competitive on global tasks, with a single frozen backbone (Fig. 2; Tabs. 3–7, 10–11).
Provides a general recipe—data curation + robust architecture + Gram anchoring + HR polishing + efficient distillation—that can become a standard for training large-scale SSL vision encoders.
Research directions
Beyond patch-size limits: explore native high-resolution tokenization or multi-scale tokenization that integrates with Gram anchoring (cf. Fig. 4, 11).
Adaptive or learned Gram teachers: instead of fixed iteration snapshots, learn when and how to update the Gram teacher or combine multiple teachers.
Cross-modal SSL (not weakly supervised): use Gram-like constraints to stabilize local correspondence across modalities (e.g., video/audio), or to guide generative pretraining without text labels.
Robustness and fairness: the fairness table (Tab. 26) shows DINOv3 reduces some regional gaps vs DINOv2, but further work could equalize performance across income regions and cultural distributions.
Practical applications
Off-the-shelf dense backbone for robotics, AR/VR, autonomous systems, and 3D mapping without fine-tuning (Tabs. 3–6, 10, 13).
Scientific imaging (medical, astronomical, geospatial): high-resolution dense embeddings with no labels (Sec. 8; Tabs. 17–19; Fig. 18–19).
Multi-task inference at the edge: distillations (ViT-S/B/L/H+, ConvNeXt T–L) retain dense/global performance under tight FLOPs/latency budgets (Fig. 16; Tabs. 14–15).
Open-vocabulary workflows: DINOv3-based dino.txt provides strong dense alignment for zero-shot segmentation and competitive global zero-shot (Tab. 16).

In short, DINOv3 contributes a principled, scalable way to keep self-supervised dense features “clean” while pushing model and data scale, enabling a single frozen encoder to underpin a broad suite of vision tasks across domains. The central mechanism—Gram anchoring—addresses a core scalability failure mode and should be broadly useful wherever local feature structure matters.