
DINOv2: A Mathematical Explanation of Self-Supervised Vision Learning

DINOv2: Self-Distillation for Vision Without Labels

DINOv2 is a powerful self-supervised vision model that learns visual representations without using labels. It builds on the original DINO framework, using a student–teacher architecture and advanced augmentations to produce strong, semantically rich embeddings.

1. Student–Teacher Architecture

DINOv2 uses two networks:

  • a student network with parameters \( \theta \)
  • a teacher network with parameters \( \xi \)

Both networks receive different augmented views of the same image.

$$ x_s = \text{Aug}_{\text{student}}(x), \qquad x_t = \text{Aug}_{\text{teacher}}(x) $$

The student learns by matching the teacher’s output distribution. The teacher is updated using an exponential moving average (EMA) of the student.
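
As a rough sketch of this setup (using a toy stand-in backbone rather than the actual DINOv2 Vision Transformer, and toy augmentations), the student and teacher can be built as two copies of the same network, with the teacher excluded from gradient-based training:

import copy
import torch
import torch.nn as nn

# Stand-in backbone for illustration; DINOv2 uses a Vision Transformer here.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))

student = backbone                         # parameters theta, trained by gradient descent
teacher = copy.deepcopy(backbone)          # parameters xi, updated only via EMA (Section 5)
for p in teacher.parameters():
    p.requires_grad = False                # gradients never flow into the teacher

x = torch.randn(8, 3, 32, 32)              # a batch of images
x_s = x + 0.05 * torch.randn_like(x)       # toy "student" augmentation
x_t = torch.flip(x, dims=[-1])             # toy "teacher" augmentation (horizontal flip)
z_s, z_t = student(x_s), teacher(x_t)      # embeddings of the two views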

2. Image Embeddings

The student and teacher backbones (Vision Transformers in DINOv2) produce embeddings:

$$ z_s = f_\theta(x_s), \qquad z_t = f_\xi(x_t) $$

These representations are then projected by small MLP heads (projection heads) to produce logits:

$$ q_s = g_\theta(z_s), \qquad q_t = g_\xi(z_t) $$

The logits are converted into probability distributions using temperature-scaled softmax.
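
A minimal sketch of the backbone-plus-projection-head pipeline is shown below; the layer sizes and the two-layer MLP are illustrative assumptions, not the exact DINOv2 head:

import torch
import torch.nn as nn

D, K = 256, 4096                           # embedding and output dims (illustrative values)

class ProjectionHead(nn.Module):
    """Small MLP g(.) mapping backbone embeddings z to K-dimensional logits q."""
    def __init__(self, in_dim, out_dim, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, z):
        return self.mlp(z)

g_theta = ProjectionHead(D, K)             # student head
g_xi = ProjectionHead(D, K)                # teacher head (kept in sync via EMA in practice)

z_s = torch.randn(8, D)                    # z_s = f_theta(x_s) from the student backbone
q_s = g_theta(z_s)                         # student logits, shape (8, K)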

3. Teacher and Student Distributions

$$ p_s = \text{Softmax}\!\left(\frac{q_s}{\tau_s}\right) $$ $$ p_t = \text{Softmax}\!\left(\frac{q_t - c}{\tau_t}\right) $$
Where:
  • \( \tau_s \): student temperature (higher → smoother)
  • \( \tau_t \): teacher temperature (lower → sharper)
  • \( c \): centering vector to prevent collapse

Sharpening the teacher's output provides confident training targets, while centering keeps any single output dimension from dominating; together, these two operations prevent collapse and stabilize training.
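
In code, the two distributions can be computed roughly as follows; the temperature values and the batch-mean center are illustrative (DINO-style implementations maintain the center as a running average):

import torch
import torch.nn.functional as F

tau_s, tau_t = 0.1, 0.04                   # student smoother, teacher sharper (illustrative values)
q_s = torch.randn(8, 4096)                 # student logits
q_t = torch.randn(8, 4096)                 # teacher logits
c = q_t.mean(dim=0)                        # batch mean here for brevity; an EMA center in practice

p_s = F.softmax(q_s / tau_s, dim=-1)       # smoother student distribution
p_t = F.softmax((q_t - c) / tau_t, dim=-1) # centered, sharpened teacher distribution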

4. DINO Loss: Cross-Entropy Between Student and Teacher

The main training objective is to make the student match the teacher:

$$ \mathcal{L} = - \sum_{i=1}^{K} p_t^{(i)} \log p_s^{(i)} $$
  • \( K \): projection dimension (size of the softmax output)

A stop-gradient is applied to the teacher output, so no gradient flows into the teacher.

This self-distillation forces the model to develop rich, semantically coherent features.
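
A minimal sketch of this objective, reusing the logits and center from the previous sections, might look like:

import torch
import torch.nn.functional as F

def dino_loss(q_s, q_t, c, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the sharpened teacher and the student distributions."""
    p_t = F.softmax((q_t - c) / tau_t, dim=-1).detach()   # stop-gradient on the teacher
    log_p_s = F.log_softmax(q_s / tau_s, dim=-1)          # log p_s, computed stably
    return -(p_t * log_p_s).sum(dim=-1).mean()            # sum over K, average over the batch

q_s = torch.randn(8, 4096, requires_grad=True)
q_t = torch.randn(8, 4096)
loss = dino_loss(q_s, q_t, c=q_t.mean(dim=0))
loss.backward()                                           # gradients reach only the student side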

5. Teacher Update: Exponential Moving Average (EMA)

The teacher is never directly optimized. Instead, it is updated as a smoothed version of the student:

$$ \xi \leftarrow m \cdot \xi + (1-m) \cdot \theta $$
  • \( m \): momentum term (typically 0.996–0.999)

This stabilizes training and prevents collapse.
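
A minimal sketch of the EMA update, applied parameter-wise after each optimizer step:

import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    """xi <- m * xi + (1 - m) * theta, applied to each parameter pair."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)

student = nn.Linear(16, 16)
teacher = copy.deepcopy(student)
ema_update(teacher, student)               # teacher drifts slowly toward the student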

6. Normalized Embeddings

DINOv2 normalizes embeddings to lie on a hypersphere:

$$ \hat{z} = \frac{z}{\|z\|} $$

Normalization makes embeddings directly comparable, since cosine similarity reduces to a dot product; this matters for:

  • image retrieval
  • clustering
  • semantic search
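
For example, with unit-norm embeddings, retrieval reduces to a dot product (cosine similarity); the sketch below uses random tensors as stand-ins for real DINOv2 embeddings:

import torch
import torch.nn.functional as F

gallery = torch.randn(100, 256)            # embeddings for 100 database images
query = torch.randn(1, 256)                # embedding of one query image

gallery_hat = F.normalize(gallery, dim=-1) # project onto the unit hypersphere
query_hat = F.normalize(query, dim=-1)

scores = query_hat @ gallery_hat.T         # cosine similarity, since vectors are unit-norm
top5 = scores.topk(k=5, dim=-1).indices    # indices of the 5 most similar gallery images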

7. Advanced Multi-Crop Augmentation

DINOv2 uses multi-view augmentations:

  • Global crops: large views
  • Local crops: small, zoomed-in views
$$ \{x_s^i\}_{i=1}^{M_s}, \qquad \{x_t^j\}_{j=1}^{M_t} $$

The student sees all crops, while the teacher sees only the global crops, which encourages the student to match local views to the teacher's global representation.
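
A rough multi-crop sketch using torchvision's RandomResizedCrop; the crop sizes, scale ranges, and crop counts here are illustrative assumptions rather than the exact DINOv2 recipe:

import torch
from torchvision import transforms

global_crop = transforms.RandomResizedCrop(224, scale=(0.4, 1.0))   # large views
local_crop = transforms.RandomResizedCrop(96, scale=(0.05, 0.4))    # small, zoomed-in views

def multi_crop(img, n_global=2, n_local=8):
    """Global crops go to student and teacher; local crops go to the student only."""
    global_views = [global_crop(img) for _ in range(n_global)]
    local_views = [local_crop(img) for _ in range(n_local)]
    return global_views, local_views

img = torch.rand(3, 256, 256)              # stand-in image tensor
global_views, local_views = multi_crop(img)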

8. Final Representation Quality

After training, the backbone’s output embeddings are directly used for downstream tasks. DINOv2 achieves strong performance even without finetuning.
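
For instance, the pretrained backbone can be loaded through torch.hub (the entrypoint names are documented in the DINOv2 repository) and used as a frozen feature extractor; the preprocessing here is simplified to a random tensor, whereas real images would be resized and ImageNet-normalized:

import torch

# "dinov2_vits14" is the small ViT variant; downloading the weights requires internet access.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

x = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed image
with torch.no_grad():
    emb = model(x)                         # frozen image embedding

print(emb.shape)                           # used as-is for k-NN, linear probes, retrieval, ...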

9. Summary of Key Ideas

  • No labels needed (self-supervised)
  • Student tries to match teacher outputs
  • Teacher updated via EMA (not trained directly)
  • Temperature scaling + centering prevent collapse
  • Multi-crop augmentation enhances invariance
  • Produces state-of-the-art visual embeddings

References

Oquab, M., Darcet, T., Moutakanni, T., et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint, arXiv:2304.07193.

License & Attribution

This blog includes video/media from the DINOv2 GitHub repository, which is licensed under the Apache License 2.0.

You must cite the original work if you use DINOv2 in research:

@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Théo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}
