Basis Change and Matrix Approximation with SVD

1. Basis Change

In this section, we examine how the transformation matrix of a linear mapping changes when we change the bases. Consider a linear transformation:

$$ \phi: V \rightarrow W $$

where \( \dim V = n \) and \( \dim W = m \). Let the ordered bases for \( V \) and \( W \) be:

$$ B = (b_1, \dots, b_n), \quad C = (c_1, \dots, c_m) $$

and the new bases be:

$$ \tilde{B} = (\tilde{b}_1, \dots, \tilde{b}_n), \quad \tilde{C} = (\tilde{c}_1, \dots, \tilde{c}_m) $$

If \( A \) represents the transformation matrix of \( \phi \) with respect to the bases \( (B, C) \), and \( \tilde{A} \) is the corresponding matrix with respect to \( (\tilde{B}, \tilde{C}) \), we aim to relate \( A \) and \( \tilde{A} \).

Basis change mapping diagram

1.1 Relationship between old and new bases

Each new basis vector in \( \tilde{B} \) can be written as a linear combination of the old basis vectors in \( B \):

$$ \tilde{b}_j = \sum_{i=1}^{n} s_{ij} b_i, \quad \forall j = 1, \dots, n $$

where \( S \in \mathbb{R}^{n \times n} \) is the basis change matrix for \( V \):

$$ S = [s_{:,1}, \dots, s_{:,n}] $$

Similarly, for the codomain basis:

$$ \tilde{c}_l = \sum_{k=1}^{m} t_{kl} c_k, \quad \forall l = 1, \dots, m $$

where \( T \in \mathbb{R}^{m \times m} \) is the basis change matrix for \( W \):

$$ T = [t_{:,1}, \dots, t_{:,m}] $$
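As a concrete sketch (with hypothetical values, and assuming the basis vectors are available as matrix columns in standard coordinates), column \( j \) of \( S \) holds the coordinates of \( \tilde{b}_j \) with respect to \( B \), so \( S \) solves \( B S = \tilde{B} \):

```python
import numpy as np

# Hypothetical example: old basis B and new basis B~ of R^2,
# basis vectors stored as matrix columns in standard coordinates.
B = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # old basis (here, the standard basis)
B_tilde = np.array([[1.0, 1.0],
                    [0.0, 1.0]])    # new basis

# Column j of S expresses b~_j in the old basis: B @ S = B_tilde.
S = np.linalg.solve(B, B_tilde)

# Sanity check: the old basis times S reproduces the new basis.
assert np.allclose(B @ S, B_tilde)
print(S)
```

The codomain matrix \( T \) is built the same way from \( C \) and \( \tilde{C} \).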

1.2 Relationship between transformation matrices

Since \( \phi \) is linear, we can write:

$$ \phi(\tilde{b}_j) = \sum_{l=1}^{m} \tilde{a}_{lj} \tilde{c}_l = \sum_{l=1}^{m} \tilde{a}_{lj} \sum_{k=1}^{m} t_{kl} c_k = \sum_{k=1}^{m} \left( \sum_{l=1}^{m} t_{kl} \tilde{a}_{lj} \right) c_k $$

Also, from the old basis representation:

$$ \phi(\tilde{b}_j) = \phi\left( \sum_{i=1}^{n} s_{ij} b_i \right) = \sum_{i=1}^{n} s_{ij} \phi(b_i) = \sum_{i=1}^{n} s_{ij} \sum_{k=1}^{m} a_{ki} c_k = \sum_{k=1}^{m} \left( \sum_{i=1}^{n} a_{ki} s_{ij} \right) c_k $$

Comparing both forms, we get:

$$ \sum_{l=1}^{m} t_{kl} \tilde{a}_{lj} = \sum_{i=1}^{n} a_{ki} s_{ij} $$

In matrix form:

$$ T \tilde{A} = A S $$

Thus, the new transformation matrix under the changed bases is:

$$ \tilde{A} = T^{-1} A S $$
Interpretation:
\( S \) expresses the new domain basis \( \tilde{B} \) in terms of the old basis \( B \), and \( T \) does the same for the codomain bases \( \tilde{C} \) and \( C \). The formula \( \tilde{A} = T^{-1} A S \) therefore represents the same linear mapping \( \phi \) in the new coordinate systems.
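A quick numerical check of this relation, using randomly chosen matrices (square random matrices are invertible with probability one, so \( T^{-1} \) exists here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: phi: R^3 -> R^2 with matrix A in the old bases (B, C).
A = rng.standard_normal((2, 3))

# Random invertible basis change matrices S (domain) and T (codomain).
S = rng.standard_normal((3, 3))
T = rng.standard_normal((2, 2))

# New transformation matrix under the changed bases.
A_tilde = np.linalg.inv(T) @ A @ S

# Consistency check of the derived relation  T A~ = A S.
assert np.allclose(T @ A_tilde, A @ S)
```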

2. Matrix Approximation with Singular Value Decomposition (SVD)

A matrix \( A \in \mathbb{R}^{m \times n} \) can be factorized using SVD as:

$$ A = U \Sigma V^\top $$

where:

  • \( U \in \mathbb{R}^{m \times m} \): orthogonal matrix of left-singular vectors (eigenvectors of \( A A^\top \))
  • \( V \in \mathbb{R}^{n \times n} \): orthogonal matrix of right-singular vectors (eigenvectors of \( A^\top A \))
  • \( \Sigma \in \mathbb{R}^{m \times n} \): rectangular diagonal matrix of singular values \( \sigma_i = \sqrt{\lambda_i} \), where the \( \lambda_i \) are the eigenvalues of \( A^\top A \) (we write \( \Sigma \) rather than \( S \) to avoid a clash with the basis change matrix of Section 1)
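The factorization and the eigenvalue relation can be verified numerically (here with a small random matrix; NumPy returns the singular values as a vector in descending order):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

# Thin SVD: U is 4x3, s is a length-3 vector, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruction: A = U diag(s) V^T.
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(s, np.sqrt(eigvals))
```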

Each rank-1 component of \( A \) can be expressed as:

$$ A_i = u_i v_i^\top $$

Thus, a rank-\( k \) approximation of \( A \) can be constructed as:

$$ \tilde{A}(k) = \sum_{i=1}^{k} \sigma_i u_i v_i^\top = \sum_{i=1}^{k} \sigma_i A_i $$
Remark: We typically keep singular vectors corresponding to the largest singular values to retain most of the data variance while reducing dimensionality.
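The truncation can be sketched in a few lines. By the Eckart–Young theorem, the resulting \( \tilde{A}(k) \) is the best rank-\( k \) approximation, and its spectral-norm error equals the first discarded singular value \( \sigma_{k+1} \), which the code below checks:

```python
import numpy as np

def rank_k_approx(A, k):
    """Rank-k approximation of A via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
k = 2

A_k = rank_k_approx(A, k)
s = np.linalg.svd(A, compute_uv=False)

# Eckart-Young: the spectral-norm error is the first discarded
# singular value, sigma_{k+1} (index k, zero-based).
err = np.linalg.norm(A - A_k, ord=2)
assert np.allclose(err, s[k])
assert np.linalg.matrix_rank(A_k) == k
```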

2.1 Visual Example: Image Approximation

Consider an image \( I \) (e.g., a picture of a Pug). Below are visualizations of rank-\( k \) approximations for \( k \in \{1, 2, 3, 4, 10, 20, 50, 100\} \).

Original image and its SVD rank-\( k \) approximations

Images \( I_1, I_2, I_3, I_4 \) are the outer products \( I_i = u_i v_i^\top \) of the first four singular vector pairs. For instance:

  • Rank-1 approximation: \( \sigma_1 u_1 v_1^\top = \sigma_1 I_1 \)
  • Rank-2 approximation: \( \sigma_1 u_1 v_1^\top + \sigma_2 u_2 v_2^\top = \sigma_1 I_1 + \sigma_2 I_2 \)
  • Rank-3 approximation: \( \sigma_1 I_1 + \sigma_2 I_2 + \sigma_3 I_3 \)
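The experiment can be reproduced in outline. A synthetic grayscale array stands in for the Pug photo here, to keep the sketch self-contained; in practice you would load a real image (e.g. via PIL or matplotlib) as a 2-D array:

```python
import numpy as np

# Synthetic grayscale "image": a smooth pattern plus noise, so that
# the singular values decay but the matrix is full rank.
rng = np.random.default_rng(3)
h, w = 64, 64
y, x = np.mgrid[0:h, 0:w]
I = np.sin(x / 7.0) * np.cos(y / 5.0) + 0.1 * rng.standard_normal((h, w))

U, s, Vt = np.linalg.svd(I, full_matrices=False)

errors = {}
for k in [1, 2, 3, 4, 10, 20, 50]:
    # Rank-k approximation: keep only the k largest singular triplets.
    I_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    errors[k] = np.linalg.norm(I - I_k, "fro") / np.linalg.norm(I, "fro")
    print(f"rank {k:3d}: relative Frobenius error = {errors[k]:.4f}")
```

The relative error drops sharply over the first few ranks and then flattens, mirroring the visual convergence of the approximations above.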

