Vision Transformer (ViT): A Mathematical Explanation

[Figure: Vision Transformer architecture diagram]

The Vision Transformer (ViT) is a deep learning model that applies the Transformer architecture—originally designed for language processing—to visual data. Unlike CNNs, which operate on local pixel neighborhoods, ViT divides an image into patches and models global relationships among them via self-attention.

1. Image to Patch Embeddings

The input image:

$$ \mathbf{x} \in \mathbb{R}^{H \times W \times C} $$

is divided into non-overlapping patches of size \( P \times P \), giving a total of

$$ N = \frac{H \times W}{P^2} $$

patches (for example, a \( 224 \times 224 \) image with \( P = 16 \) gives \( N = 196 \)). Each patch \( \mathbf{x}^{(i)} \) is flattened and linearly projected into a \( D \)-dimensional embedding:

$$ \mathbf{e}^{(i)} = \mathbf{W}_{\text{embed}} \, \text{vec}(\mathbf{x}^{(i)}) \in \mathbb{R}^D, \quad i = 1, \dots, N $$

After stacking all patch embeddings, we form:

$$ \mathbf{E} = [\mathbf{e}^{(1)}, \dots, \mathbf{e}^{(N)}]^\top \in \mathbb{R}^{N \times D} $$
Notation:
\( H, W \): image height and width
\( C \): number of channels (e.g., 3 for RGB)
\( P \): patch size
\( N \): number of patches
\( D \): embedding dimension
\( \mathbf{W}_{\text{embed}} \): learnable patch projection matrix
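
As a concrete sketch of this step (assuming PyTorch, and using the common equivalence between the patch projection \( \mathbf{W}_{\text{embed}} \) and a convolution whose kernel size and stride both equal \( P \)), patchification and projection might look as follows; the class name and the default sizes (224×224 input, \( P = 16 \), \( D = 768 \)) are illustrative choices, not part of the text above:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and project each one to a D-dim embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # N = H*W / P^2
        # A Conv2d with kernel = stride = P is equivalent to flattening each
        # patch and applying the shared projection W_embed.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D) -- rows are the e^(i)
        return x

# Example: a 224x224 RGB image with P = 16 yields N = 196 patch embeddings.
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```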

2. Positional Embeddings

Transformers have no inherent notion of spatial order, so positional embeddings are added to retain the location of each patch:

$$ \mathbf{Z}_0 = \mathbf{E} + \mathbf{P} $$
\( \mathbf{P} \in \mathbb{R}^{N \times D} \): learnable positional embeddings
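
A minimal sketch of adding the learnable positional table, again assuming PyTorch; the shapes (\( N = 196 \), \( D = 768 \)) and the truncated-normal initialization are illustrative assumptions rather than part of the text above:

```python
import torch
import torch.nn as nn

N, D = 196, 768                                   # number of patches, embedding dim
pos_embed = nn.Parameter(torch.zeros(1, N, D))    # learnable P, broadcast over the batch
nn.init.trunc_normal_(pos_embed, std=0.02)        # a common initialization choice

E = torch.randn(8, N, D)                          # a batch of patch embeddings
Z0 = E + pos_embed                                # Z_0 = E + P
```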

3. Multi-Head Self-Attention (MHSA)

The self-attention mechanism models global dependencies among all patches. For each head \( h = 1, \dots, H \):

$$ \mathbf{Q}_h = \mathbf{Z} \mathbf{W}_Q^{(h)}, \quad \mathbf{K}_h = \mathbf{Z} \mathbf{W}_K^{(h)}, \quad \mathbf{V}_h = \mathbf{Z} \mathbf{W}_V^{(h)} $$
$$ \text{Attention}_h = \text{Softmax}\!\left( \frac{\mathbf{Q}_h \mathbf{K}_h^\top}{\sqrt{d_k}} \right)\mathbf{V}_h $$
$$ \text{MHSA}(\mathbf{Z}) = [\text{Attention}_1; \dots; \text{Attention}_H] \mathbf{W}_O $$
Notation:
\( H \): number of attention heads (here \( H \) denotes the head count, not the image height from Section 1)
\( d_k = D / H \): dimension per head
\( \mathbf{W}_Q^{(h)}, \mathbf{W}_K^{(h)}, \mathbf{W}_V^{(h)} \): projection matrices for queries, keys, and values
\( \mathbf{W}_O \): output projection matrix
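
The following is a hedged PyTorch sketch of the equations above; fusing \( \mathbf{W}_Q^{(h)}, \mathbf{W}_K^{(h)}, \mathbf{W}_V^{(h)} \) for all heads into a single linear layer is an implementation convenience, and the defaults (\( D = 768 \), 12 heads) are assumptions in line with ViT-Base:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention over the patch sequence Z of shape (B, N, D)."""
    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.d_k = embed_dim // num_heads                # d_k = D / H
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # fused W_Q, W_K, W_V for all heads
        self.out = nn.Linear(embed_dim, embed_dim)       # W_O

    def forward(self, z):                                # z: (B, N, D)
        B, N, D = z.shape
        qkv = self.qkv(z).reshape(B, N, 3, self.num_heads, self.d_k)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                  # each: (B, H, N, d_k)
        attn = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5    # scaled dot-product scores
        attn = attn.softmax(dim=-1)                           # attention weights per head
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)     # concatenate the heads
        return self.out(out)                                  # project with W_O
```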

4. Feed-Forward Network (FFN)

Each Transformer block contains a position-wise FFN applied to every patch embedding independently:

$$ \text{FFN}(\mathbf{z}) = \text{GELU}(\mathbf{z}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2 $$

Each encoder layer wraps both sub-blocks in residual connections, applying Layer Normalization before each sub-block (the pre-norm arrangement used in ViT):

$$ \mathbf{Z}' = \mathbf{Z} + \text{MHSA}(\text{LayerNorm}(\mathbf{Z})) $$

$$ \mathbf{Z}_{\text{next}} = \mathbf{Z}' + \text{FFN}(\text{LayerNorm}(\mathbf{Z}')) $$
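
Putting the two sub-blocks together, one encoder layer could be sketched as below. PyTorch's built-in nn.MultiheadAttention stands in for the manual sketch in Section 3, and the FFN expansion ratio of 4 and the depth of 12 layers are assumed values in line with ViT-Base:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder layer: pre-norm MHSA and FFN, each with a residual connection."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(                            # position-wise FFN
            nn.Linear(embed_dim, mlp_ratio * embed_dim),     # W_1, b_1
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),     # W_2, b_2
        )

    def forward(self, z):                                    # z: (B, N, D)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]    # Z' = Z + MHSA(LN(Z))
        z = z + self.ffn(self.norm2(z))                      # Z_next = Z' + FFN(LN(Z'))
        return z

blocks = nn.Sequential(*[EncoderBlock() for _ in range(12)])  # e.g. 12 layers as in ViT-Base
out = blocks(torch.randn(2, 196, 768))                        # (B, N, D) in, (B, N, D) out
```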

5. Classification Head

A special learnable classification token \( \mathbf{z}_{\text{CLS}} \) is prepended to the patch embeddings, so the Transformer processes \( N + 1 \) tokens (and the positional embedding table correspondingly has \( N + 1 \) rows). After the final Transformer layer, the output representation at the CLS position is used for classification:

$$ \text{logits} = \mathbf{z}_{\text{CLS}} \mathbf{W}_{\text{cls}} \in \mathbb{R}^C $$
\( C \): number of classes (the symbol \( C \) is reused here; it is distinct from the channel count in Section 1), \( \mathbf{W}_{\text{cls}} \): classification weight matrix.
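
A sketch of how the CLS token and the head might be wired together, assuming PyTorch; the batch size, the number of classes, the bias in the linear head, and the elided encoder stack are all illustrative:

```python
import torch
import torch.nn as nn

B, N, D, num_classes = 8, 196, 768, 1000

cls_token = nn.Parameter(torch.zeros(1, 1, D))     # learnable z_CLS
head = nn.Linear(D, num_classes)                   # W_cls (with a bias term)

patch_tokens = torch.randn(B, N, D)                # embeddings after patch projection
z = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1)  # (B, N + 1, D)

# ... z would then pass through the stack of encoder blocks ...

logits = head(z[:, 0])                             # classify from the CLS position
print(logits.shape)                                # torch.Size([8, 1000])
```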

6. Training Objective

The model is trained by minimizing the cross-entropy loss between the predicted class distribution and the one-hot label vector \( \mathbf{y} \):

$$ \mathcal{L} = - \sum_{i=1}^{C} y_i \log\!\big(\text{Softmax}(\text{logits})_i\big) $$
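
As a sketch (with random placeholder tensors standing in for real model outputs and labels), PyTorch's F.cross_entropy combines the Softmax and the negative log-likelihood above in a single call:

```python
import torch
import torch.nn.functional as F

num_classes = 1000
logits = torch.randn(8, num_classes, requires_grad=True)  # stand-in for ViT outputs
labels = torch.randint(0, num_classes, (8,))               # ground-truth class indices

# cross_entropy applies Softmax internally and averages -log p(y) over the batch
loss = F.cross_entropy(logits, labels)
loss.backward()  # in training, gradients flow back through the full ViT
```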

References

Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint, arXiv:2010.11929.
