DINOv3: Unified Global & Local Self-Supervision

DINOv3 extends the DINOv2 framework by combining global self-distillation with masked patch prediction. This lets the model learn both image-level and dense spatial representations within a single self-supervised pipeline.

This image shows the cosine similarity maps from DINOv3 output features, illustrating the relationships between the patch marked with a red cross and all other patches (as reported in the DINOv3 GitHub repository).

If you find DINOv3 useful, consider giving the GitHub repository a star ⭐. The citation for this work is provided in the References section.

1. Student–Teacher Architecture

As in DINOv2, DINOv3 uses a student–teacher setup:

  • a student network with parameters \( \theta \)
  • a teacher network with parameters \( \xi \)

Both networks receive different augmented views of the input image \(x\):

$$ x_s = \text{Aug}_{\text{student}}(x), \qquad x_t = \text{Aug}_{\text{teacher}}(x) $$

The student learns by matching the teacher, while the teacher is a momentum-averaged version of the student.
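As a concrete illustration, here is a minimal sketch (PyTorch / torchvision) of how two independently augmented views of the same image could be produced. The specific crop and color-jitter parameters are illustrative assumptions, not the official DINOv3 augmentation recipe.

```python
# Minimal sketch: two independently sampled augmentations of the same image
# give the student view x_s and the teacher view x_t.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.3, 1.0)),  # random crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),           # photometric jitter
    transforms.ToTensor(),
])

def make_views(pil_image):
    x_s = augment(pil_image)   # student view
    x_t = augment(pil_image)   # teacher view (different random draw)
    return x_s, x_t
```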

2. Image & Patch-Level Outputs

DINOv3 produces two kinds of outputs from the Vision Transformer: global (CLS token) and local (patch tokens). These two branches form the core of DINOv3’s unified objective.

2.1 Types of Output

A Vision Transformer outputs a sequence:

$$ \text{ViT}(x) = \Big[ \underbrace{\text{CLS}}_{z},\; \underbrace{h(1), h(2), \ldots, h(N)}_{\text{patch tokens}} \Big] $$

Where:

  • \( z \): global embedding (CLS token)
  • \( h(i) \): patch embedding for the \(i\)-th patch

DINOv3 learns from both of these:

  • Global features → for self-distillation (same as DINOv2)
  • Local patch features → for masked reconstruction
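As a small sketch of these two branches, the code below splits a ViT token sequence into the CLS embedding and the patch embeddings. It assumes the CLS token sits at position 0 and that no extra register tokens are present; released checkpoints may differ.

```python
import torch

def split_tokens(vit_output: torch.Tensor):
    """Split a ViT token sequence of shape (B, 1 + N, d) into the global
    CLS embedding z (B, d) and the patch embeddings h (B, N, d)."""
    z = vit_output[:, 0, :]    # global CLS token
    h = vit_output[:, 1:, :]   # patch tokens h(1), ..., h(N)
    return z, h
```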

3. Global Embeddings (CLS Token)

The backbones \( f \) produce global CLS embeddings \( z \), which the projection heads \( g \) map to scores \( q \):

$$ z_s = f_\theta(x_s), \qquad z_t = f_\xi(x_t) $$ $$ q_s = g_\theta(z_s), \qquad q_t = g_\xi(z_t) $$

These scores are converted into probability distributions:

$$ p_s = \text{Softmax}\!\left(\frac{q_s}{\tau_s}\right) $$ $$ p_t = \text{Softmax}\!\left(\frac{q_t - c}{\tau_t}\right) $$

Where:

  • \( \tau_s \): student temperature (higher → smoother)
  • \( \tau_t \): teacher temperature (lower → sharper)
  • \( c \): centering vector that prevents collapse

The student matches the teacher’s global distribution, with a stop-gradient applied to the teacher.
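A minimal sketch of these two softmaxes in PyTorch is shown below; the temperature values 0.1 and 0.04 are illustrative defaults, not values taken from the DINOv3 paper.

```python
import torch
import torch.nn.functional as F

def student_probs(q_s: torch.Tensor, tau_s: float = 0.1) -> torch.Tensor:
    # Higher student temperature -> smoother p_s.
    return F.softmax(q_s / tau_s, dim=-1)

@torch.no_grad()  # stop-gradient: no gradients flow through the teacher
def teacher_probs(q_t: torch.Tensor, c: torch.Tensor,
                  tau_t: float = 0.04) -> torch.Tensor:
    # Centering (subtract c) plus a sharper temperature prevent collapse.
    return F.softmax((q_t - c) / tau_t, dim=-1)
```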

4. Patch-Level Embeddings (Local Tokens)

Each image is divided into \(N\) patches. For each patch \(i\):

$$ h_s(i), \qquad h_t(i) \in \mathbb{R}^d $$

DINOv3 introduces masking: the student receives a masked image \(x_M\), while the teacher sees the full image:

$$ h_t(i) = \text{TeacherPatch}(i) $$ $$ \hat{h}_s(i) = \text{StudentPatch}(i \;|\; x_M) $$

The student must predict the teacher’s patch embeddings at the masked positions.
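The sketch below illustrates the masking idea: a random subset of the student’s patch embeddings is replaced by a mask token. In practice the mask token is a learned parameter and masking is applied to the student’s input; the mask ratio here is an arbitrary illustrative value.

```python
import torch

def random_patch_mask(x_patches: torch.Tensor, mask_token: torch.Tensor,
                      mask_ratio: float = 0.3):
    """Replace a random subset of patch embeddings with a mask token.
    x_patches: (B, N, d) float, mask_token: (d,) float.
    Returns the masked patches and the boolean mask."""
    B, N, _ = x_patches.shape
    mask = torch.rand(B, N, device=x_patches.device) < mask_ratio  # (B, N) bool
    x_masked = torch.where(mask.unsqueeze(-1), mask_token, x_patches)
    return x_masked, mask
```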

5. Global Loss (DINO-Style)

The global DINOv3 loss is identical to that of DINOv2: a cross-entropy between the teacher and student distributions over the \(K\) prototypes:

$$ \mathcal{L}_{\text{global}} = - \sum_{k=1}^{K} p_t^{(k)} \log p_s^{(k)} $$
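A minimal PyTorch version of this cross-entropy, assuming batched probability vectors of shape (B, K) and a teacher output that already carries no gradient:

```python
import torch

def global_loss(p_s: torch.Tensor, p_t: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the teacher distribution p_t and the student
    distribution p_s, averaged over the batch."""
    eps = 1e-8  # numerical safety inside the log
    return -(p_t * torch.log(p_s + eps)).sum(dim=-1).mean()
```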

6. Masked Patch Reconstruction Loss

Let \(M\) be the set of masked patch indices. The student predicts \(\hat{h}_s(m)\) while the teacher provides \(h_t(m)\).

L2 Loss

$$ \mathcal{L}_{\text{recon}}^{\ell_2} = \sum_{m \in M} \| \hat{h}_s(m) - h_t(m) \|_2^2 $$

Cosine Similarity Loss

$$ \mathcal{L}_{\text{recon}}^{\cos} = \sum_{m \in M} \left( 1 - \frac{\hat{h}_s(m) \cdot h_t(m)}{\|\hat{h}_s(m)\|\, \|h_t(m)\|} \right) $$

Patch-level learning captures local structure: shapes, boundaries, and texture.
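Both reconstruction variants are easy to sketch in PyTorch. The formulas above sum over the masked set \(M\); the sketch averages over the masked positions instead, which only changes the scale of the loss. Shapes are assumed to be (B, N, d) for the embeddings and (B, N) boolean for the mask.

```python
import torch
import torch.nn.functional as F

def recon_l2(h_s_hat, h_t, mask):
    # Squared error, counted only where mask is True (masked patches).
    sq_err = ((h_s_hat - h_t.detach()) ** 2).sum(-1)           # (B, N)
    return (sq_err * mask).sum() / mask.sum().clamp(min=1)

def recon_cosine(h_s_hat, h_t, mask):
    # 1 - cosine similarity, counted only over masked patches.
    cos = F.cosine_similarity(h_s_hat, h_t.detach(), dim=-1)   # (B, N)
    return ((1.0 - cos) * mask).sum() / mask.sum().clamp(min=1)
```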

7. Combined DINOv3 Loss

The full loss is a weighted combination:

$$ \mathcal{L}_{\text{DINOv3}} = \lambda_{\text{global}} \mathcal{L}_{\text{global}} + \lambda_{\text{recon}} \mathcal{L}_{\text{recon}} $$
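In code this is a one-line weighted sum; the \(\lambda\) weights are training hyperparameters, and the defaults below are placeholders rather than values from the paper.

```python
def dinov3_total_loss(loss_global, loss_recon,
                      lambda_global: float = 1.0, lambda_recon: float = 1.0):
    # Weighted combination of the global distillation loss and the
    # masked patch reconstruction loss.
    return lambda_global * loss_global + lambda_recon * loss_recon
```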

8. Teacher Update: EMA

The teacher parameters evolve as a momentum average:

$$ \xi \leftarrow m \cdot \xi + (1 - m)\cdot \theta $$
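Here \(m\) is a momentum coefficient close to 1. A minimal parameter-wise EMA update in PyTorch might look like the sketch below; the value 0.996 is a typical DINO-style momentum, used here only as an illustrative default.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               m: float = 0.996):
    # xi <- m * xi + (1 - m) * theta, applied parameter by parameter.
    for p_teacher, p_student in zip(teacher.parameters(), student.parameters()):
        p_teacher.mul_(m).add_(p_student.detach(), alpha=1.0 - m)
```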

9. Why DINOv3 Is More Powerful

  • Global features from self-distillation
  • Local features from masked patch prediction
  • Better for semantic segmentation
  • Improved depth, 3D understanding, and correspondence
  • Still fully self-supervised

References

Siméoni, O., Vo, H. V., Seitzer, M., et al. (2025). DINOv3. arXiv preprint, arXiv:2508.10104.

License & Attribution

This blog includes images and media from the DINOv3 GitHub repository, which is licensed under the Apache License 2.0.

You must cite the original work if you use DINOv3 in research:

@misc{simeoni2025dinov3,
  title={{DINOv3}},
  author={Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'e}gou, Herv{\'e} and Labatut, Patrick and Bojanowski, Piotr},
  year={2025},
  eprint={2508.10104},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.10104},
}
