Dimensionality Reduction: Trade-offs and Techniques

In this post, we discuss some of the trade-offs involved in choosing a dimensionality reduction (DR) technique. Dimensionality reduction aims to represent high-dimensional data in a lower-dimensional space while preserving as much meaningful structure as possible.

$$ f : \mathbb{R}^n \to \mathbb{R}^k, \quad k < n $$

The goal is to find a mapping \( f \) that captures the essential relationships among the data points in \( \mathbb{R}^n \) within a lower-dimensional space \( \mathbb{R}^k \). However, depending on the technique used, distortions such as artificial clusters or misplaced neighbors may appear. We will focus on three widely used methods: t-SNE, UMAP, and TriMap.

1. t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE models the pairwise similarities between points in both the high-dimensional and low-dimensional spaces using probability distributions.

In the high-dimensional space, the conditional probability of point \( x_j \) being a neighbor of \( x_i \) is defined as:

$$ p_{j|i} = \frac{\exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right)} {\sum_{k \neq i} \exp\!\left(-\frac{\|x_i - x_k\|^2}{2\sigma_i^2}\right)} $$

Here, \( \sigma_i \) controls the local neighborhood size around point \( x_i \) and is chosen such that the perplexity of the distribution equals a user-defined value:

$$ \text{Perplexity}(P_i) = 2^{-\sum_j p_{j|i} \log_2 p_{j|i}} $$
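
As an illustration, the following minimal NumPy sketch binary-searches \( \sigma_i \) for a single point until the perplexity of \( P_i \) matches a chosen target. The helper names and the toy data are illustrative assumptions, not part of any reference implementation.

```python
import numpy as np

def conditional_probs(X, i, sigma):
    """p_{j|i} for a single point i, using a Gaussian kernel with bandwidth sigma."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)            # squared distances ||x_i - x_j||^2
    logits = -d2 / (2.0 * sigma ** 2)
    logits[i] = -np.inf                             # exclude j = i
    p = np.exp(logits - logits.max())               # numerically stable softmax
    return p / p.sum()

def calibrate_sigma(X, i, target_perplexity=30.0, tol=1e-5, n_iter=50):
    """Binary search for sigma_i so that 2^H(P_i) matches the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = conditional_probs(X, i, sigma)
        h = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # Shannon entropy H(P_i)
        perp = 2.0 ** h
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma                              # distribution too flat: shrink the kernel
        else:
            lo = sigma                              # distribution too peaked: widen the kernel
    return sigma, p

# toy data: 100 points in R^10
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
sigma_0, p_0 = calibrate_sigma(X, i=0, target_perplexity=30.0)
```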

To make the similarity symmetric, we define:

$$ p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} $$

In the low-dimensional space, a Student-t distribution with one degree of freedom (equivalent to a Cauchy distribution) is used to measure similarity:

$$ q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}} {\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} $$

The cost function minimizes the Kullback–Leibler (KL) divergence between the high- and low-dimensional distributions:

$$ \mathcal{L}_{\text{t-SNE}} = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} $$

The embedding \( y_i \) is initialized from a small Gaussian distribution:

$$ y_i \sim \mathcal{N}(0, 10^{-4} I) $$
Notation:
  • \( x_i, x_j \in \mathbb{R}^n \): high-dimensional data points
  • \( y_i, y_j \in \mathbb{R}^k \): corresponding low-dimensional embeddings
  • \( \sigma_i \): local bandwidth parameter (its value is set by the perplexity)
  • \( p_{ij}, q_{ij} \): pairwise similarity probabilities in the high- and low-dimensional spaces
  • \( n \): number of data points
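
Continuing the toy example above (the snippet reuses X, rng, and calibrate_sigma from the previous sketch), the code below symmetrizes the conditional probabilities, draws an initial embedding \( y_i \sim \mathcal{N}(0, 10^{-4} I) \), and evaluates the KL objective. In practice the gradient-based minimization of \( \mathcal{L}_{\text{t-SNE}} \) is handled by libraries such as scikit-learn's sklearn.manifold.TSNE.

```python
import numpy as np

def kl_tsne(P, Y):
    """t-SNE objective: KL(P || Q) for a symmetric P and a candidate embedding Y."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)   # ||y_i - y_j||^2
    Q = 1.0 / (1.0 + d2)                                         # Student-t kernel
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()                                                 # normalize over all pairs
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12)))

n = X.shape[0]
P_cond = np.stack([calibrate_sigma(X, i)[1] for i in range(n)])  # row i holds p_{j|i}
P = (P_cond + P_cond.T) / (2.0 * n)                              # symmetric p_ij

Y0 = rng.normal(scale=1e-2, size=(n, 2))   # y_i ~ N(0, 1e-4 I): std 1e-2 = sqrt(1e-4)
print("KL divergence at initialization:", kl_tsne(P, Y0))
```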

2. UMAP (Uniform Manifold Approximation and Projection)

UMAP constructs a weighted graph from high-dimensional data that approximates the local manifold structure, then optimizes a corresponding low-dimensional layout that preserves these relationships.

The algorithm proceeds in two stages:

  1. Graph Construction: A fuzzy topological representation is built by connecting each point to its nearest neighbors with membership strength based on distance and local connectivity.
  2. Graph Optimization: The low-dimensional embedding is found by minimizing a cross-entropy loss between high- and low-dimensional graphs.
$$ \mathcal{L}_{\text{UMAP}} = \sum_{(i,j)} \Big[ w_{ij} \log \frac{w_{ij}}{w'_{ij}} + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - w'_{ij}} \Big] $$
Notation:
  • \( w_{ij} \): edge weight (similarity) in the high-dimensional graph
  • \( w'_{ij} \): corresponding edge weight in the low-dimensional graph
  • \( (i,j) \): pair of neighboring points
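
To ground the loss above, here is a minimal NumPy sketch that evaluates the cross-entropy for given edge weights of the two graphs; the toy weights are made up for illustration. Graph construction and the actual layout optimization are what implementations such as the umap-learn package (its umap.UMAP class) handle for you.

```python
import numpy as np

def umap_cross_entropy(w_high, w_low, eps=1e-12):
    """Binary cross-entropy between high- and low-dimensional edge weights (the loss above)."""
    w_high = np.clip(w_high, eps, 1.0 - eps)
    w_low = np.clip(w_low, eps, 1.0 - eps)
    attract = w_high * np.log(w_high / w_low)                         # pulls true neighbors together
    repel = (1.0 - w_high) * np.log((1.0 - w_high) / (1.0 - w_low))   # pushes non-neighbors apart
    return np.sum(attract + repel)

# toy example: three edges (i, j) with their fuzzy membership strengths
w_ij     = np.array([0.9, 0.6, 0.2])   # weights in the high-dimensional graph
w_ij_low = np.array([0.8, 0.5, 0.4])   # weights implied by the current embedding
print(umap_cross_entropy(w_ij, w_ij_low))
```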

3. TriMap

TriMap takes a different approach by considering triplets of points rather than pairs. It seeks an embedding that preserves the relative ordering of distances among triplets.

Given a triplet \((i, j, k)\), the objective is to ensure that if \(x_i\) is closer to \(x_j\) than to \(x_k\) in high-dimensional space, the same holds true for their low-dimensional representations:

$$ \|x_i - x_j\| < \|x_i - x_k\| \quad \Longrightarrow \quad \|y_i - y_j\| < \|y_i - y_k\| $$

TriMap minimizes a weighted sum of triplet-based loss terms using a smooth logistic function:

$$ \mathcal{L}_{\text{TriMap}} = \sum_{(i,j,k)} w_{ijk} \, \log\!\left(1 + \exp\!\big(\|y_i - y_j\|^2 - \|y_i - y_k\|^2\big)\right) $$
Notation:
  • \( (i,j,k) \): triplet of indices
  • \( w_{ijk} \): weight indicating the importance of triplet \((i,j,k)\)
  • \( y_i, y_j, y_k \in \mathbb{R}^k \): low-dimensional representations
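
As a sketch, the loss above can be evaluated directly for a candidate embedding and a set of sampled triplets. The triplets and weights below are toy values; a full TriMap implementation (such as the authors' trimap package) also handles triplet sampling and weighting.

```python
import numpy as np

def trimap_loss(Y, triplets, weights):
    """Weighted logistic triplet loss from the formula above, for an embedding Y."""
    i, j, k = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    d_ij = np.sum((Y[i] - Y[j]) ** 2, axis=1)   # ||y_i - y_j||^2, the pair that should stay close
    d_ik = np.sum((Y[i] - Y[k]) ** 2, axis=1)   # ||y_i - y_k||^2, the pair that should stay far
    # log(1 + exp(d_ij - d_ik)), computed stably via logaddexp
    return np.sum(weights * np.logaddexp(0.0, d_ij - d_ik))

# toy example: a 2-D embedding of five points and two sampled triplets (i, j, k)
rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 2))
triplets = np.array([[0, 1, 2],
                     [3, 4, 0]])
weights = np.array([1.0, 0.5])
print(trimap_loss(Y, triplets, weights))
```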

4. Parametric vs. Non-Parametric Mapping

Dimensionality reduction methods can be categorized as:

  • Parametric: Learn a mapping function \( f_\theta \) so that new points can be embedded directly using the same learned parameters \( \theta \). Examples: parametric t-SNE, autoencoders.
  • Non-Parametric: The embedding is learned only for the training data; embedding new points requires recomputing the entire mapping. Examples: standard t-SNE, UMAP, TriMap. A minimal sketch contrasting the two settings follows this list.
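
The sketch below contrasts the two settings, assuming scikit-learn is available: a standard (non-parametric) t-SNE embedding is computed for the training set, and a small MLP is then fitted as an illustrative stand-in for a parametric map \( f_\theta \), so unseen points can be embedded without recomputing the embedding. The network choice and hyperparameters are arbitrary, not those of the original Parametric t-SNE.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 20))   # toy training data in R^20
X_new = rng.normal(size=(10, 20))      # unseen points

# Non-parametric: the embedding exists only for the training points;
# embedding new data would require re-running the whole optimization.
Y_train = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_train)

# Parametric workaround: fit f_theta (here a small MLP) to reproduce the embedding,
# then reuse the learned parameters theta to map unseen points directly.
f_theta = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
f_theta.fit(X_train, Y_train)
Y_new = f_theta.predict(X_new)         # shape (10, 2), no recomputation of t-SNE
```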

References

  1. TriMap: Large-scale Dimensionality Reduction Using Triplets
  2. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
  3. Parametric t-SNE
