DINOv3: Unified Global & Local Self-Supervision

DINOv3 extends the DINOv2 framework by combining global self-distillation with masked patch prediction. This lets the model learn both image-level and dense spatial representations within a single self-supervised pipeline.

This image shows the cosine similarity maps from DINOv3 output features, illustrating the relationships between the patch marked with a red cross and all other patches (as reported in the DINOv3 GitHub repository).

If you find DINOv3 useful, consider giving the GitHub repository a star ⭐. The citation for this work is provided in the References section.

1. Student–Teacher Architecture

As in DINOv2, DINOv3 uses a student–teacher setup:

  • a student network with parameters \( \theta \)
  • a teacher network with parameters \( \xi \)

Both networks receive different augmented views of the input image \(x\):

$$ x_s = \text{Aug}_{\text{student}}(x), \qquad x_t = \text{Aug}_{\text{teacher}}(x) $$

The student learns by matching the teacher, while the teacher is a momentum-averaged version of the student.
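As a concrete illustration, here is a minimal sketch (PyTorch / torchvision) of how two independently augmented views of the same image could be produced. The specific crop and color-jitter parameters are illustrative assumptions, not the official DINOv3 augmentation recipe.

```python
# Minimal sketch: two independently sampled augmentations of the same image
# give the student view x_s and the teacher view x_t.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.3, 1.0)),  # random crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),           # photometric jitter
    transforms.ToTensor(),
])

def make_views(pil_image):
    x_s = augment(pil_image)   # student view
    x_t = augment(pil_image)   # teacher view (different random draw)
    return x_s, x_t
```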

2. Image & Patch-Level Outputs

DINOv3 produces two kinds of outputs from the Vision Transformer: global (CLS token) and local (patch tokens). These two branches form the core of DINOv3’s unified objective.

2.1 Types of Output

A Vision Transformer outputs a sequence:

$$ \text{ViT}(x) = \Big[ \underbrace{\text{CLS}}_{z},\; \underbrace{h(1), h(2), \ldots, h(N)}_{\text{patch tokens}} \Big] $$

Where:

  • \( z \): global embedding (CLS token)
  • \( h(i) \): patch embedding for the \(i\)-th patch

DINOv3 learns from both of these:

  • Global features → for self-distillation (same as DINOv2)
  • Local patch features → for masked reconstruction
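As a small sketch of these two branches, the code below splits a ViT token sequence into the CLS embedding and the patch embeddings. It assumes the CLS token sits at position 0 and that no extra register tokens are present; released checkpoints may differ.

```python
import torch

def split_tokens(vit_output: torch.Tensor):
    """Split a ViT token sequence of shape (B, 1 + N, d) into the global
    CLS embedding z (B, d) and the patch embeddings h (B, N, d)."""
    z = vit_output[:, 0, :]    # global CLS token
    h = vit_output[:, 1:, :]   # patch tokens h(1), ..., h(N)
    return z, h
```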

3. Global Embeddings (CLS Token)

The backbones \( f \) produce global CLS embeddings \( z \), which the projection heads \( g \) map to scores \( q \):

$$ z_s = f_\theta(x_s), \qquad z_t = f_\xi(x_t) $$ $$ q_s = g_\theta(z_s), \qquad q_t = g_\xi(z_t) $$

These scores are converted into probability distributions:

$$ p_s = \text{Softmax}\!\left(\frac{q_s}{\tau_s}\right) $$ $$ p_t = \text{Softmax}\!\left(\frac{q_t - c}{\tau_t}\right) $$

Where:

  • \( \tau_s \): student temperature (higher → smoother)
  • \( \tau_t \): teacher temperature (lower → sharper)
  • \( c \): centering vector that prevents collapse

The student matches the teacher’s global distribution, with a stop-gradient applied to the teacher.
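A minimal sketch of these two softmaxes in PyTorch is shown below; the temperature values 0.1 and 0.04 are illustrative defaults, not values taken from the DINOv3 paper.

```python
import torch
import torch.nn.functional as F

def student_probs(q_s: torch.Tensor, tau_s: float = 0.1) -> torch.Tensor:
    # Higher student temperature -> smoother p_s.
    return F.softmax(q_s / tau_s, dim=-1)

@torch.no_grad()  # stop-gradient: no gradients flow through the teacher
def teacher_probs(q_t: torch.Tensor, c: torch.Tensor,
                  tau_t: float = 0.04) -> torch.Tensor:
    # Centering (subtract c) plus a sharper temperature prevent collapse.
    return F.softmax((q_t - c) / tau_t, dim=-1)
```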

4. Patch-Level Embeddings (Local Tokens)

Each image is divided into \(N\) patches. For each patch \(i\):

$$ h_s(i), \qquad h_t(i) \in \mathbb{R}^d $$

DINOv3 introduces masking: the student receives a masked image \(x_M\), while the teacher sees the full image:

$$ h_t(i) = \text{TeacherPatch}(i) $$ $$ \hat{h}_s(i) = \text{StudentPatch}(i \;|\; x_M) $$

The student must predict the teacher’s patch embeddings at the masked positions.
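The sketch below illustrates the masking idea: a random subset of the student’s patch embeddings is replaced by a mask token. In practice the mask token is a learned parameter and masking is applied to the student’s input; the mask ratio here is an arbitrary illustrative value.

```python
import torch

def random_patch_mask(x_patches: torch.Tensor, mask_token: torch.Tensor,
                      mask_ratio: float = 0.3):
    """Replace a random subset of patch embeddings with a mask token.
    x_patches: (B, N, d) float, mask_token: (d,) float.
    Returns the masked patches and the boolean mask."""
    B, N, _ = x_patches.shape
    mask = torch.rand(B, N, device=x_patches.device) < mask_ratio  # (B, N) bool
    x_masked = torch.where(mask.unsqueeze(-1), mask_token, x_patches)
    return x_masked, mask
```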

5. Global Loss (DINO-Style)

The global DINOv3 loss is identical to that of DINOv2: a cross-entropy between the teacher and student distributions over the \(K\) prototypes:

$$ \mathcal{L}_{\text{global}} = - \sum_{k=1}^{K} p_t^{(k)} \log p_s^{(k)} $$
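A minimal PyTorch version of this cross-entropy, assuming batched probability vectors of shape (B, K) and a teacher output that already carries no gradient:

```python
import torch

def global_loss(p_s: torch.Tensor, p_t: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the teacher distribution p_t and the student
    distribution p_s, averaged over the batch."""
    eps = 1e-8  # numerical safety inside the log
    return -(p_t * torch.log(p_s + eps)).sum(dim=-1).mean()
```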

6. Masked Patch Reconstruction Loss

Let \(M\) be the set of masked patch indices. The student predicts \(\hat{h}_s(m)\) while the teacher provides \(h_t(m)\).

L2 Loss

$$ \mathcal{L}_{\text{recon}}^{\ell_2} = \sum_{m \in M} \| \hat{h}_s(m) - h_t(m) \|_2^2 $$

Cosine Similarity Loss

$$ \mathcal{L}_{\text{recon}}^{\cos} = \sum_{m \in M} \left( 1 - \frac{\hat{h}_s(m) \cdot h_t(m)}{\|\hat{h}_s(m)\|\, \|h_t(m)\|} \right) $$

Patch-level learning captures local structure: shapes, boundaries, and texture.
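Both reconstruction variants are easy to sketch in PyTorch. The formulas above sum over the masked set \(M\); the sketch averages over the masked positions instead, which only changes the scale of the loss. Shapes are assumed to be (B, N, d) for the embeddings and (B, N) boolean for the mask.

```python
import torch
import torch.nn.functional as F

def recon_l2(h_s_hat, h_t, mask):
    # Squared error, counted only where mask is True (masked patches).
    sq_err = ((h_s_hat - h_t.detach()) ** 2).sum(-1)           # (B, N)
    return (sq_err * mask).sum() / mask.sum().clamp(min=1)

def recon_cosine(h_s_hat, h_t, mask):
    # 1 - cosine similarity, counted only over masked patches.
    cos = F.cosine_similarity(h_s_hat, h_t.detach(), dim=-1)   # (B, N)
    return ((1.0 - cos) * mask).sum() / mask.sum().clamp(min=1)
```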

7. Combined DINOv3 Loss

The full loss is a weighted combination:

$$ \mathcal{L}_{\text{DINOv3}} = \lambda_{\text{global}} \mathcal{L}_{\text{global}} + \lambda_{\text{recon}} \mathcal{L}_{\text{recon}} $$
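In code this is a one-line weighted sum; the \(\lambda\) weights are training hyperparameters, and the defaults below are placeholders rather than values from the paper.

```python
def dinov3_total_loss(loss_global, loss_recon,
                      lambda_global: float = 1.0, lambda_recon: float = 1.0):
    # Weighted combination of the global distillation loss and the
    # masked patch reconstruction loss.
    return lambda_global * loss_global + lambda_recon * loss_recon
```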

8. Teacher Update: EMA

The teacher parameters evolve as a momentum average:

$$ \xi \leftarrow m \cdot \xi + (1 - m)\cdot \theta $$
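Here \(m\) is a momentum coefficient close to 1. A minimal parameter-wise EMA update in PyTorch might look like the sketch below; the value 0.996 is a typical DINO-style momentum, used here only as an illustrative default.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               m: float = 0.996):
    # xi <- m * xi + (1 - m) * theta, applied parameter by parameter.
    for p_teacher, p_student in zip(teacher.parameters(), student.parameters()):
        p_teacher.mul_(m).add_(p_student.detach(), alpha=1.0 - m)
```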

9. Why DINOv3 Is More Powerful

  • Global features from self-distillation
  • Local features from masked patch prediction
  • Better for semantic segmentation
  • Improved depth, 3D understanding, and correspondence
  • Still fully self-supervised

References

Siméoni, O., Vo, H. V., Seitzer, M., et al. (2025). DINOv3. arXiv preprint, arXiv:2508.10104.

License & Attribution

This blog includes images and media from the DINOv3 GitHub repository, which is licensed under the Apache License 2.0.

You must cite the original work if you use DINOv3 in research:

@misc{simeoni2025dinov3,
  title={{DINOv3}},
  author={Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'e}gou, Herv{\'e} and Labatut, Patrick and Bojanowski, Piotr},
  year={2025},
  eprint={2508.10104},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.10104},
}
