DINOv3: Unified Global & Local Self-Supervision
DINOv3 extends the DINOv2 framework by combining global self-distillation with masked patch prediction, allowing the model to learn both image-level and dense spatial representations within a single self-supervised pipeline.
If you find DINOv3 useful, consider giving the repository a star ⭐. The citation for this work is provided in the References section.
1. Student–Teacher Architecture
As in DINOv2, DINOv3 uses a student–teacher setup:
- a student network with parameters \( \theta \)
- a teacher network with parameters \( \xi \)
Both networks receive different augmented views of the input image \(x\):

\[ v_1 = \mathrm{aug}_1(x), \qquad v_2 = \mathrm{aug}_2(x) \]

The student learns by matching the teacher's outputs, while the teacher is a momentum-averaged copy of the student.
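A minimal PyTorch sketch of this pairing (the tiny `student` module below is a hypothetical stand-in for the actual ViT backbone):

```python
import copy
import torch.nn as nn

# Tiny stand-in for the ViT backbone; any nn.Module works the same way.
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))  # parameters theta
teacher = copy.deepcopy(student)                                      # parameters xi

# The teacher is never trained by backprop; it only tracks the student via EMA.
for p in teacher.parameters():
    p.requires_grad = False
```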
2. Image & Patch-Level Outputs
2.1 Types of Output
A Vision Transformer outputs a sequence of tokens:

\[ [\, z,\; h(1),\; h(2),\; \dots,\; h(N) \,] \]

where:
- \( z \): global embedding (CLS token)
- \( h(i) \): patch embedding for the \(i\)-th patch
DINOv3 learns from both of these:
- Global features → for self-distillation (same as DINOv2)
- Local patch features → for masked reconstruction
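In code, separating the two kinds of outputs is a simple slice. A minimal sketch, assuming the backbone returns a token sequence of shape `(batch, 1 + N, dim)` with the CLS token first:

```python
import torch

batch, num_patches, dim = 4, 196, 768
tokens = torch.randn(batch, 1 + num_patches, dim)  # hypothetical ViT output

z = tokens[:, 0]   # (batch, dim): global CLS embedding
h = tokens[:, 1:]  # (batch, num_patches, dim): per-patch embeddings
```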
3. Global Embeddings (CLS Token)
The CLS token of each network produces a global embedding:

\[ z_s = f_\theta(v_1), \qquad z_t = f_\xi(v_2) \]

These are converted into probability distributions via a temperature-scaled softmax:

\[ P_s = \operatorname{softmax}\!\left( z_s / \tau_s \right), \qquad P_t = \operatorname{softmax}\!\left( (z_t - c) / \tau_t \right) \]

where:
- \( \tau_s \): student temperature (higher → smoother)
- \( \tau_t \): teacher temperature (lower → sharper)
- \( c \): centering vector applied to the teacher to prevent collapse

The student matches the teacher's global distribution, with a stop-gradient applied on the teacher side.
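A minimal PyTorch sketch of the sharpening and centering step (the temperatures and the center-update rate here are illustrative, not the paper's exact values):

```python
import torch
import torch.nn.functional as F

dim = 256
tau_s, tau_t = 0.1, 0.04           # illustrative temperatures

z_s = torch.randn(4, dim)          # student CLS outputs
z_t = torch.randn(4, dim)          # teacher CLS outputs
c = torch.zeros(dim)               # running center of teacher outputs

p_s = F.softmax(z_s / tau_s, dim=-1)
p_t = F.softmax((z_t - c) / tau_t, dim=-1).detach()  # stop-gradient on the teacher

# The center itself is maintained as an EMA of teacher batch statistics.
c = 0.9 * c + 0.1 * z_t.mean(dim=0)
```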
4. Patch-Level Embeddings (Local Tokens)
Each image is divided into \(N\) patches, and the backbone produces an embedding \(h(i)\) for each patch \(i\).

DINOv3 introduces masking: the student receives a masked image \(x_M\), while the teacher sees the full image:

\[ \hat{h}_s(i) = f_\theta(x_M)_i, \qquad h_t(i) = f_\xi(x)_i \]

The student must predict the teacher's patch embeddings at the masked positions.
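A sketch of the masking step, assuming masking is applied at the token level and a mask embedding replaces the selected patches (the `mask_ratio` is illustrative):

```python
import torch

batch, num_patches, dim = 4, 196, 768
patch_tokens = torch.randn(batch, num_patches, dim)  # patchified input for the student
mask_token = torch.zeros(dim)                        # in practice, a learned embedding

mask_ratio = 0.4                                     # illustrative fraction of masked patches
mask = torch.rand(batch, num_patches) < mask_ratio   # True = patch is masked

x_m = patch_tokens.clone()
x_m[mask] = mask_token   # the student encodes x_m; the teacher encodes patch_tokens
```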
5. Global Loss (DINO-Style)
The global DINOv3 loss is identical to the DINOv2 objective, a cross-entropy between the teacher and student distributions:

\[ \mathcal{L}_{\text{global}} = - \sum_{i} P_t(i)\, \log P_s(i) \]
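A minimal sketch of this cross-entropy on dummy logits, with `detach()` playing the role of the stop-gradient (temperatures again illustrative):

```python
import torch
import torch.nn.functional as F

logits_s = torch.randn(4, 256, requires_grad=True)  # student CLS logits
logits_t = torch.randn(4, 256)                      # teacher CLS logits

log_p_s = F.log_softmax(logits_s / 0.1, dim=-1)      # log-space for numerical stability
p_t = F.softmax(logits_t / 0.04, dim=-1).detach()    # stop-gradient on the teacher

loss_global = -(p_t * log_p_s).sum(dim=-1).mean()
```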
6. Masked Patch Reconstruction Loss
Let \(M\) be the set of masked patch indices. The student predicts \(\hat{h}_s(m)\), while the teacher provides the target \(h_t(m)\). The reconstruction loss can be written either as an L2 distance or as a cosine-similarity term.

L2 loss:

\[ \mathcal{L}_{\text{L2}} = \frac{1}{|M|} \sum_{m \in M} \bigl\| \hat{h}_s(m) - h_t(m) \bigr\|_2^2 \]

Cosine similarity loss:

\[ \mathcal{L}_{\cos} = \frac{1}{|M|} \sum_{m \in M} \left( 1 - \frac{\hat{h}_s(m) \cdot h_t(m)}{\bigl\| \hat{h}_s(m) \bigr\| \, \bigl\| h_t(m) \bigr\|} \right) \]
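Both variants are one-liners in PyTorch. A sketch on dummy tensors, with the teacher targets kept out of the gradient graph:

```python
import torch
import torch.nn.functional as F

num_masked, dim = 32, 768
h_s_hat = torch.randn(num_masked, dim, requires_grad=True)  # student predictions
h_t = torch.randn(num_masked, dim)                          # teacher targets (no grad)

loss_l2 = F.mse_loss(h_s_hat, h_t)                                 # L2 variant
loss_cos = (1 - F.cosine_similarity(h_s_hat, h_t, dim=-1)).mean()  # cosine variant
```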
7. Combined DINOv3 Loss
The full loss is a weighted combination of the global and patch-level terms:

\[ \mathcal{L}_{\text{DINOv3}} = \mathcal{L}_{\text{global}} + \lambda\, \mathcal{L}_{\text{patch}} \]

where \(\lambda\) balances the two objectives.
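In code, the combination is a single weighted sum (the weight `lam` and the loss values below are placeholders; the actual weight is a tuned training hyperparameter):

```python
import torch

# Placeholder loss values for illustration.
loss_global = torch.tensor(2.3)
loss_patch = torch.tensor(0.8)

lam = 1.0  # illustrative weight
loss = loss_global + lam * loss_patch
```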
8. Teacher Update: EMA
The teacher parameters evolve as a momentum average of the student parameters:

\[ \xi \leftarrow m\, \xi + (1 - m)\, \theta \]

where \(m\) is the momentum coefficient, typically close to 1.
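A minimal sketch of the EMA update in PyTorch (the momentum value is illustrative; training schedules typically anneal it toward 1.0):

```python
import copy
import torch
import torch.nn as nn

student = nn.Linear(8, 8)
teacher = copy.deepcopy(student)

momentum = 0.996  # illustrative
with torch.no_grad():
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_((1 - momentum) * p_s)
```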
9. Why DINOv3 Is More Powerful
- Global features from self-distillation
- Local features from masked patch prediction
- Better for semantic segmentation
- Improved depth, 3D understanding, and correspondence
- Still fully self-supervised
References
Siméoni, O., Vo, H. V., Seitzer, M., et al. (2025). DINOv3. arXiv preprint arXiv:2508.10104. https://arxiv.org/abs/2508.10104
License & Attribution
This blog includes images and media from the DINOv3 GitHub repository, which is licensed under the Apache License 2.0.
You must cite the original work if you use DINOv3 in research:
```bibtex
@misc{simeoni2025dinov3,
  title={{DINOv3}},
  author={Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'e}gou, Herv{\'e} and Labatut, Patrick and Bojanowski, Piotr},
  year={2025},
  eprint={2508.10104},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.10104},
}
```