Vision Transformer (ViT)
The Vision Transformer (ViT) is a deep learning model that applies the Transformer architecture—originally designed for language processing—to visual data. Unlike CNNs, which operate on local pixel neighborhoods, ViT divides an image into patches and models global relationships among them via self-attention.
1. Image to Patch Embeddings
The input image
\[
\mathbf{x} \in \mathbb{R}^{H \times W \times C}
\]
is divided into non-overlapping patches of size \( P \times P \), giving a total of
\[
N = \frac{HW}{P^2}
\]
patches. Each patch \( \mathbf{x}^{(i)} \in \mathbb{R}^{P^2 C} \) is flattened and linearly projected into a \( D \)-dimensional embedding:
\[
\mathbf{z}^{(i)} = \mathbf{W}_{\text{embed}} \, \mathbf{x}^{(i)}
\]
After stacking all patch embeddings, we form the input sequence:
\[
\mathbf{Z}_0 = \big[ \mathbf{z}^{(1)}; \mathbf{z}^{(2)}; \dots; \mathbf{z}^{(N)} \big] \in \mathbb{R}^{N \times D}
\]
where:
\( H, W \): image height and width
\( C \): number of channels (e.g., 3 for RGB)
\( P \): patch size
\( N \): number of patches
\( D \): embedding dimension
\( \mathbf{W}_{\text{embed}} \): learnable patch projection matrix
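As a concrete illustration, here is a minimal patch-embedding sketch in PyTorch (an assumed framework; the class name and default sizes are illustrative, not from the original paper). A convolution with kernel size and stride both equal to \( P \) is a standard trick that is equivalent to flattening each non-overlapping patch and applying the linear projection \( \mathbf{W}_{\text{embed}} \):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and project each one to a D-dim embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # N = HW / P^2
        # Conv with kernel = stride = P is equivalent to "flatten patch + linear projection"
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D)
        return x

# Example: two 224x224 RGB images -> 196 patch embeddings of dimension 768 each
patches = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 196, 768])
```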
2. Positional Embeddings
Transformers have no inherent notion of spatial order, so a learnable positional embedding is added to each patch embedding to retain its location in the image:
\[
\mathbf{Z}_0 \leftarrow \mathbf{Z}_0 + \mathbf{E}_{\text{pos}}, \qquad \mathbf{E}_{\text{pos}} \in \mathbb{R}^{N \times D}
\]
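A minimal sketch of this step, again assuming PyTorch (the truncated-normal initialization with std 0.02 is a common choice in ViT implementations, not prescribed by the equation above):

```python
import torch
import torch.nn as nn

class AddPositionalEmbedding(nn.Module):
    """Add a learnable positional embedding to each patch embedding."""
    def __init__(self, num_patches=196, embed_dim=768):
        super().__init__()
        # One learnable D-dimensional vector per patch position
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, z):           # z: (B, N, D)
        return z + self.pos_embed   # broadcast over the batch dimension
```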
3. Multi-Head Self-Attention (MHSA)
The self-attention mechanism models global dependencies among all patches. For each head \( h = 1, \dots, H \), the patch sequence is projected to queries, keys, and values, and attention weights are computed with a scaled dot product:
\[
\mathbf{Q}^{(h)} = \mathbf{Z} \mathbf{W}_Q^{(h)}, \qquad
\mathbf{K}^{(h)} = \mathbf{Z} \mathbf{W}_K^{(h)}, \qquad
\mathbf{V}^{(h)} = \mathbf{Z} \mathbf{W}_V^{(h)}
\]
\[
\text{head}^{(h)} = \text{softmax}\!\left( \frac{\mathbf{Q}^{(h)} {\mathbf{K}^{(h)}}^{\top}}{\sqrt{d_k}} \right) \mathbf{V}^{(h)}
\]
\[
\text{MHSA}(\mathbf{Z}) = \big[ \text{head}^{(1)}; \dots; \text{head}^{(H)} \big] \mathbf{W}_O
\]
where:
\( H \): number of attention heads
\( d_k = D / H \): dimension per head
\( \mathbf{W}_Q^{(h)}, \mathbf{W}_K^{(h)}, \mathbf{W}_V^{(h)} \): projection matrices for queries, keys, and values
\( \mathbf{W}_O \): output projection matrix
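A compact sketch of multi-head self-attention following the equations above (PyTorch assumed; the fused QKV projection is an implementation convenience equivalent to the separate per-head matrices):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.d_k = embed_dim // num_heads                 # dimension per head
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)    # fused W_Q, W_K, W_V
        self.out_proj = nn.Linear(embed_dim, embed_dim)   # W_O

    def forward(self, z):                                 # z: (B, N, D)
        B, N, D = z.shape
        # Project to queries, keys, values and split into heads
        qkv = self.qkv(z).reshape(B, N, 3, self.num_heads, self.d_k)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (B, H, N, d_k)
        # Scaled dot-product attention over all patches (global receptive field)
        attn = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)  # concatenate heads
        return self.out_proj(out)
```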
4. Feed-Forward Network (FFN)
Each Transformer block contains a position-wise FFN applied to every patch embedding independently, typically a two-layer MLP with a GELU non-linearity:
\[
\text{FFN}(\mathbf{z}) = \mathbf{W}_2 \, \text{GELU}(\mathbf{W}_1 \mathbf{z} + \mathbf{b}_1) + \mathbf{b}_2
\]
Each encoder layer combines MHSA and the FFN with Layer Normalization and residual connections:
\[
\mathbf{Z}' = \mathbf{Z} + \text{MHSA}\big(\text{LN}(\mathbf{Z})\big), \qquad
\mathbf{Z}'' = \mathbf{Z}' + \text{FFN}\big(\text{LN}(\mathbf{Z}')\big)
\]
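Putting the two sub-layers together, one pre-norm encoder block might look like the sketch below (PyTorch assumed; the hidden expansion factor of 4 follows common ViT configurations and is an assumption here):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder layer: LN -> MHSA -> residual, then LN -> FFN -> residual."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden = int(embed_dim * mlp_ratio)
        self.ffn = nn.Sequential(                 # position-wise feed-forward network
            nn.Linear(embed_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, z):                                     # z: (B, N, D)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]     # residual around MHSA
        z = z + self.ffn(self.norm2(z))                       # residual around FFN
        return z
```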
5. Classification Head
A special learnable classification token \( \mathbf{z}_{\text{CLS}} \) is prepended to the patch embeddings. After the final Transformer layer \( L \), its output representation is normalized and passed to a linear classifier:
\[
\hat{\mathbf{y}} = \text{softmax}\!\left( \mathbf{W}_{\text{head}} \, \text{LN}\!\big(\mathbf{z}_{\text{CLS}}^{(L)}\big) + \mathbf{b}_{\text{head}} \right)
\]
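A sketch of the CLS-token handling and classification head (PyTorch assumed; `num_classes` and the class name are illustrative placeholders):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Prepend a learnable [CLS] token; classify from its final representation."""
    def __init__(self, embed_dim=768, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def prepend_cls(self, z):                     # z: patch embeddings, (B, N, D)
        cls = self.cls_token.expand(z.shape[0], -1, -1)
        return torch.cat([cls, z], dim=1)         # (B, N + 1, D), fed to the encoder

    def forward(self, z):                         # z: encoder output, (B, N + 1, D)
        return self.head(self.norm(z[:, 0]))      # logits from the CLS position
```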
6. Training Objective
For image classification, ViT is trained end-to-end with the standard cross-entropy loss between the predicted class distribution \( \hat{\mathbf{y}} \) and the one-hot ground-truth label \( \mathbf{y} \):
\[
\mathcal{L}_{\text{CE}} = -\sum_{c=1}^{K} y_c \log \hat{y}_c
\]
where \( K \) is the number of classes.
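A minimal training-step sketch under the cross-entropy objective (PyTorch assumed; `model`, `images`, `labels`, and `optimizer` are placeholders for an assembled ViT, a data batch, and any standard optimizer):

```python
import torch
import torch.nn.functional as F

def training_step(model, images, labels, optimizer):
    """One optimization step with the cross-entropy objective."""
    logits = model(images)                    # (B, num_classes)
    loss = F.cross_entropy(logits, labels)    # softmax + negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```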
References
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint, arXiv:2010.11929.