
Vision Transformers

The Vision Transformer (ViT) is a transformer-based deep learning model primarily used for image classification tasks. It processes an image by dividing it into patches, then learns relationships between these patches using the Transformer architecture. After processing, it produces a classification output, just like other models designed for image classification, such as Convolutional Neural Networks (CNNs).

ViTs work in the following manner:

1. Patch embedding: The image is divided into fixed-size patches. For an image of size H x W with C channels and patch size P x P, the number of patches is N = HW / P^2. Each patch is flattened into a vector of length P^2 * C and linearly projected to a D-dimensional embedding; stacking these patch embeddings gives the patch embedding matrix X of shape N x D.

2. Positional embedding: Since Transformers don't inherently handle spatial information like CNNs, positional encodings are added to each patch embedding to provide information about the position of each patch in the image: Z = X + E_pos.

3. Attention mechanism: Each embedding is projected into queries, keys, and values, Q = Z W_Q, K = Z W_K, V = Z W_V, and self-attention is computed as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

4. Feed-forward network: ...
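The steps above can be sketched in a few lines of NumPy. This is a minimal, untrained illustration: the image size, patch size, embedding dimension, and random weight matrices are all placeholder assumptions chosen to keep the shapes easy to follow, not part of any real ViT checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 8; C = 3; P = 4; D = 16        # toy sizes: 8x8x3 image, 4x4 patches, embed dim 16
N = (H * W) // (P * P)                 # number of patches: N = HW / P^2 = 4

image = rng.standard_normal((H, W, C))

# 1. Patch embedding: split into P x P patches, flatten each to P^2*C dims,
#    then linearly project to D dims (W_embed is a random stand-in here).
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)          # (N, P^2 * C)
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
X = patches @ W_embed                            # patch embedding matrix, (N, D)

# 2. Positional embedding: add a per-patch position vector (learned in practice).
E_pos = rng.standard_normal((N, D)) * 0.02
Z = X + E_pos

# 3. Self-attention: project to queries, keys, values; weight the values by the
#    softmax-normalised similarity between queries and keys.
W_q, W_k, W_v = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
scores = Q @ K.T / np.sqrt(D)                    # (N, N) attention logits
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn = attn / attn.sum(axis=-1, keepdims=True)   # row-wise softmax
out = attn @ V                                   # (N, D): one updated vector per patch

print(out.shape)  # (4, 16)
```

In a full ViT, each patch vector would then pass through the feed-forward network, and the block (attention + MLP, with residual connections and layer norm) is stacked several times before classification.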