Vision transformer(ViT) is a transformer based deep learning model primarily used for image classification task. It processes images by dividing them into patches, then learns relationships between these patches using the Transformer architecture. After processing, it generates a classification output, just like other models designed for image classification, such as Convolutional Neural Networks (CNNs). ViTs work in the following manner: 1. Patch embedding: It divides images into patches. So, for a image, number of patches , then and after stacking these patch embeddings we will get patch embedding matrix . 2. Positional embedding: Since Transformers don’t inherently handle spatial information like CNNs, positional encodings are added to each patch embedding to provide information about the position of each patch in the image. . 3. Attention mechanism: , , , , . 4. Feed-forward network: ...
Linear regression is used to predict real-valued output for a given input data point . Linear regression establishes a relationship of dependent variable with the features of the input data with an assumption that the expected value of the output(dependent variable) is a linear function of the input ( ). Let's assume our training dataset is where is the number of data points and is the number of dimension or number of features in our dataset. From now on we will write our dataset as where each for is a column vector. We can write the output as: or we can write it as: Before computing the final weights for this equation, we need to figure out what degree we should choose. We usually select the degree for which we get less mean squared error(MSE). The most common form of linear regression is degree 1 form: There are two ways by which we can estimate the parameters: Normal equation: Weight vector is estimated by matrix multiplication o...