ViTs work in the following manner (a minimal PyTorch sketch of each step is given after the list):
1. Patch embedding: The image is split into fixed-size patches. For an $H \times W \times C$ image and patch size $P \times P$, the number of patches is $N = \frac{HW}{P^2}$. Each patch is flattened into a vector of length $P^2 C$ and linearly projected to a $D$-dimensional embedding; after stacking these patch embeddings we get the patch embedding matrix of shape $N \times D$.
2. Positional embedding: Since Transformers don't inherently handle spatial information like CNNs, positional encodings are added to each patch embedding to provide information about the position of each patch in the image. The encoder input is therefore $Z_0 = E_{\text{patch}} + E_{\text{pos}}$, with $E_{\text{pos}} \in \mathbb{R}^{N \times D}$ (in the original ViT a learnable [CLS] token is also prepended, giving $(N+1) \times D$).
3. Attention mechanism: The sequence of patch embeddings is processed by a stack of Transformer encoder layers. In each layer, multi-head self-attention computes queries, keys, and values as linear projections of the input and combines them as $\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$, so every patch can attend to every other patch.
4. Feed-forward network: The attention output is then passed through a position-wise MLP (two linear layers with a GELU activation in between), with residual connections and layer normalization around both the attention and MLP sub-layers.
5. Final logits: After the last encoder layer, the [CLS] token representation (or a mean-pooled patch representation) is fed to a linear classification head, which produces one logit per class.
6. Loss function: For image classification, the model is trained by minimizing the cross-entropy loss between the softmaxed logits and the ground-truth labels.
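To make steps 1 and 2 concrete, here is a minimal PyTorch sketch of patch and positional embedding. The module name `PatchEmbed`, the use of a strided `Conv2d` as the patch projection, and the default sizes (224×224 image, 16×16 patches, embedding dimension 768) are illustrative assumptions, not details from the description above.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches, project each to D dims, add positions (steps 1-2)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # N = HW / P^2
        # A Conv2d with kernel = stride = P is equivalent to flattening each
        # patch and applying a shared linear projection to dimension D.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable positional embeddings, one per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D) patch embedding matrix
        return x + self.pos_embed              # add positional information
```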
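Steps 3 and 4 correspond to a standard Transformer encoder block. The sketch below uses PyTorch's built-in `nn.MultiheadAttention`; the pre-norm layout and the 4× MLP expansion are common design choices assumed here, not taken from the post.

```python
class EncoderBlock(nn.Module):
    """One Transformer encoder layer: self-attention (step 3) + feed-forward MLP (step 4)."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )

    def forward(self, x):                       # x: (B, N, D)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)        # every patch attends to every other patch
        x = x + attn_out                        # residual connection
        x = x + self.mlp(self.norm2(x))         # position-wise feed-forward with residual
        return x
```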
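Finally, steps 5 and 6 (producing logits and computing the loss) can be sketched as below. Mean-pooling the patch tokens is used here as a simple stand-in for a [CLS] token, and the class name `ViTClassifier`, the depth, and the number of classes are assumptions for illustration.

```python
class ViTClassifier(nn.Module):
    """Tiny ViT: patch/positional embedding -> encoder blocks -> class logits (step 5)."""
    def __init__(self, num_classes=1000, embed_dim=768, depth=2):
        super().__init__()
        self.embed = PatchEmbed(embed_dim=embed_dim)
        self.blocks = nn.Sequential(*[EncoderBlock(embed_dim) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)   # linear classification head

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.blocks(self.embed(x))       # (B, N, D)
        x = self.norm(x).mean(dim=1)         # mean-pool patch tokens instead of a [CLS] token
        return self.head(x)                  # (B, num_classes) logits


model = ViTClassifier(num_classes=10, depth=2)
images = torch.randn(4, 3, 224, 224)                  # dummy batch
labels = torch.randint(0, 10, (4,))
logits = model(images)
loss = nn.functional.cross_entropy(logits, labels)    # step 6: cross-entropy loss
```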