
Posts

Showing posts from November, 2025

Classical Computer Vision Methods

This article provides a concise, mathematically grounded explanation of the most important classical computer vision techniques: edge detection, feature descriptors, tracking, segmentation, transforms, stereo, motion analysis, and more. 1. Edge, Corner & Keypoint Detectors 1.1 Sobel, Prewitt, Roberts Operators These operators detect edges by convolving the image with horizontal and vertical gradient kernels. $$ G_x = I * S_x, \qquad G_y = I * S_y $$ $$ |G| = \sqrt{G_x^2 + G_y^2} $$ 1.2 Laplacian of Gaussian (LoG) Detects edges via second derivatives and zero-crossings. $$ \text{LoG}(x) = \nabla^2 (G_\sigma * I) $$ 1.3 Difference of Gaussians (DoG) $$ \text{DoG} = G_{\sigma_1} - G_{\sigma_2} $$ 1.4 Canny Edge Detector Involves Gaussian smoothing, gradient computation, non-max su...
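As a concrete illustration of the Sobel formulas above, here is a minimal NumPy sketch: convolve with \( S_x \) and \( S_y \), then combine the two responses into the gradient magnitude. The `conv2` helper and the toy step-edge image are my own illustrative choices, not code from the article.

```python
import numpy as np

def conv2(img, k):
    """2-D 'valid' convolution (kernel flipped, matching the definition I * S)."""
    kh, kw = k.shape
    kf = k[::-1, ::-1]  # flip both axes: convolution, not correlation
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kf)
    return out

# Sobel kernels for horizontal and vertical gradients.
Sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
Sy = Sx.T

# Toy image with a vertical step edge between columns 2 and 3.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

Gx, Gy = conv2(img, Sx), conv2(img, Sy)
mag = np.sqrt(Gx ** 2 + Gy ** 2)  # |G| peaks along the edge, vanishes in flat regions
```

The magnitude map is large exactly at the step boundary and zero in the flat regions on either side, which is the behavior the equations describe.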

DINOv3

DINOv3: Unified Global & Local Self-Supervision DINOv3 extends the DINOv2 framework by combining global self-distillation with masked patch prediction. This lets the model learn both image-level and dense spatial representations within a single self-supervised pipeline. This image shows the cosine-similarity maps of DINOv3 output features, illustrating the relationships between the patch marked with a red cross and all other patches (as reported in the DINOv3 GitHub repository). If you find DINOv3 useful, consider giving it a star ⭐. Citation for this work is provided in the References section. 1. Student–Teacher Architecture As in DINOv2, DINOv3 uses a student–teacher setup: a student network with parameters \( \theta \) a teacher network with parameters \( \xi \) Both networks receive different augmented views of the inpu...
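The "global self-distillation + masked patch prediction" combination can be sketched as a two-term loss. This is a toy NumPy illustration with made-up tensor names (`s_cls`, `t_patch`, `mask`) and temperature values; the actual DINOv3 objective involves additional machinery (multi-crop views, normalization of teacher targets) not shown here.

```python
import numpy as np

def softmax(logits, tau):
    """Temperature-scaled softmax over the last axis."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(t_logits, s_logits, tau_t, tau_s):
    """H(teacher, student) = -sum_k p_t(k) log p_s(k), per token."""
    p_t = softmax(t_logits, tau_t)  # teacher target (treated as constant)
    log_p_s = np.log(softmax(s_logits, tau_s) + 1e-12)
    return -(p_t * log_p_s).sum(axis=-1)

def dinov3_style_loss(s_cls, t_cls, s_patch, t_patch, mask, tau_s=0.1, tau_t=0.05):
    # Global term: image-level self-distillation on the [CLS] outputs.
    global_term = cross_entropy(t_cls, s_cls, tau_t, tau_s).mean()
    # Local term: the same objective, restricted to masked patch tokens.
    ce = cross_entropy(t_patch, s_patch, tau_t, tau_s)
    local_term = (ce * mask).sum() / max(mask.sum(), 1)
    return global_term + local_term

rng = np.random.default_rng(0)
B, N, K = 2, 4, 8  # toy batch size, patch count, prototype dimension
s_cls, t_cls = rng.normal(size=(B, K)), rng.normal(size=(B, K))
s_patch, t_patch = rng.normal(size=(B, N, K)), rng.normal(size=(B, N, K))
mask = rng.integers(0, 2, size=(B, N)).astype(float)  # 1 = patch was masked for the student
loss = dinov3_style_loss(s_cls, t_cls, s_patch, t_patch, mask)
```

The point of the sketch is the structure: one cross-entropy on the image-level token, one on masked patch tokens only, summed into a single training loss.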

Deep Learning: From Simple Linear Pieces to Powerful Models

Deep Learning: From Local Linearity to Compact Architectures At its heart, deep learning is a function-approximation engine. The central intuition is simple: zoom in on a complex curve and it looks almost linear. Neural networks exploit this by composing many linear transformations with nonlinear activations, producing a highly expressive, piecewise-linear (or smooth) approximation of the target function. 1. Local Linearity: The Basic Building Block The basic operation in a neural network is a linear transformation followed by a nonlinearity: $$ \mathbf{z} = W \mathbf{x} + b, \quad \mathbf{a} = \sigma(\mathbf{z}) $$ Repeated over layers: $$ f(\mathbf{x}) = \sigma_n(W_n(\sigma_{n-1}(W_{n-1}(...\sigma_1(W_1 \mathbf{x} + b_1)...) + b_{n-1})) + b_n) $$ Notation: \( W_i \), \( b_i \): learnable weights and biases \( \sigma_i \): nonlinear activation (ReLU, GELU, etc.) \( f(\mathbf...
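The layered composition \( f(\mathbf{x}) \) above can be sketched directly as a loop over (weight, bias) pairs. A minimal NumPy forward pass with ReLU hidden activations; the layer sizes and random initialization are arbitrary stand-ins.

```python
import numpy as np

def relu(z):
    """Elementwise nonlinearity sigma(z) = max(z, 0)."""
    return np.maximum(z, 0.0)

def mlp_forward(x, params):
    """f(x) = sigma_n(W_n ... sigma_1(W_1 x + b_1) ... + b_n).
    Hidden layers use ReLU; the final layer is left linear."""
    a = x
    for i, (W, b) in enumerate(params):
        z = W @ a + b                                  # linear transformation
        a = relu(z) if i < len(params) - 1 else z      # nonlinearity between layers
    return a

rng = np.random.default_rng(0)
sizes = [3, 16, 16, 1]  # input -> two hidden layers -> scalar output
params = [(rng.normal(scale=0.5, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = mlp_forward(np.array([0.5, -1.0, 2.0]), params)
```

With ReLU activations this network is exactly the piecewise-linear approximator the text describes: each layer is affine, and the nonlinearity switches between linear pieces.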

DINOv2

DINOv2: Self-Distillation for Vision Without Labels DINOv2 is a self-supervised vision model that learns visual representations without using labels. It builds on the original DINO framework, using a student–teacher architecture and strong augmentations to produce semantically rich embeddings. 1. Student–Teacher Architecture DINOv2 uses two networks: a student network with parameters \( \theta \) a teacher network with parameters \( \xi \) Both networks receive different augmented views of the same image. $$ x_s = \text{Aug}_{\text{student}}(x), \qquad x_t = \text{Aug}_{\text{teacher}}(x) $$ The student learns by matching the teacher’s output distribution, while the teacher is updated as an exponential moving average (EMA) of the student. 2. Image Embeddings The student and teacher networks (often Vision Transformers) pr...
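The EMA teacher update mentioned above, \( \xi \leftarrow \lambda \xi + (1 - \lambda)\theta \), is a few lines of NumPy. The parameter lists and the decay value `lam` here are toy stand-ins for illustration.

```python
import numpy as np

def ema_update(teacher, student, lam=0.996):
    """xi <- lam * xi + (1 - lam) * theta, applied parameter-wise.
    No gradients flow into the teacher; it only tracks the student."""
    return [lam * xi + (1.0 - lam) * th for xi, th in zip(teacher, student)]

# Toy parameter lists standing in for network weights.
student = [np.ones((2, 2)), np.zeros(3)]
teacher = [np.zeros((2, 2)), np.ones(3)]
teacher = ema_update(teacher, student, lam=0.9)
```

A decay close to 1 means the teacher changes slowly, giving the student a stable distillation target across training steps.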

Variational Inference

Variational Inference (VI) Variational Inference (VI) is a mathematical framework for approximating complex posterior distributions in probabilistic models. Instead of sampling (as in Monte Carlo methods), VI turns inference into an optimization problem, making it scalable and efficient for deep learning. 1. Problem Setup We consider a probabilistic model with observed variables \( x \) and latent variables \( z \). The joint distribution is defined as: $$ p_\theta(x, z) = p_\theta(x \mid z)\, p_\theta(z) $$ Notation: \( x \): observed data \( z \): latent (hidden) variable \( p_\theta(x \mid z) \): likelihood (decoder) \( p_\theta(z) \): prior distribution \( \theta \): model parameters The goal is to find parameters that maximize the marginal likelihood of the observed data: $$ p_\theta(x) = \int p_\theta(x, z)\, dz $$ However, this integral is often intractable due to hig...
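For one tiny model the intractable integral above is actually tractable, which makes the variational idea easy to check numerically. A NumPy sketch under my own toy choice \( p(z) = \mathcal{N}(0, 1) \), \( p(x \mid z) = \mathcal{N}(z, 1) \): the Monte Carlo estimate of the standard evidence lower bound \( \mathbb{E}_q[\log p_\theta(x, z) - \log q(z)] \) never exceeds \( \log p_\theta(x) \), and matches it exactly when \( q \) equals the true posterior.

```python
import numpy as np

def log_normal(z, mu, var):
    """Log-density of N(mu, var) evaluated at z."""
    return -0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

def elbo(x, q_mu, q_var, n=5000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)]
    for the toy model p(z) = N(0, 1), p(x | z) = N(z, 1)."""
    rng = np.random.default_rng(seed)
    z = q_mu + np.sqrt(q_var) * rng.standard_normal(n)   # samples from q
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    return (log_joint - log_normal(z, q_mu, q_var)).mean()

x = 1.0
log_px = log_normal(x, 0.0, 2.0)      # exact marginal: z integrates out to N(0, 2)
tight = elbo(x, q_mu=0.5, q_var=0.5)  # q = exact posterior N(x/2, 1/2): bound is tight
loose = elbo(x, q_mu=0.0, q_var=1.0)  # mismatched q: gap equals KL(q || p(z|x)) > 0
```

This is the heart of VI: since the bound is tight exactly when \( q \) matches the posterior, maximizing it over the parameters of \( q \) performs approximate inference.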

TrOCR

TrOCR: A Mathematical Explanation TrOCR (Transformer-based Optical Character Recognition) maps an input image \( \mathbf{x} \) into a text sequence \( \mathbf{y} = (y_1, \dots, y_T) \). It models the conditional probability: $$ p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, \mathbf{x}) $$ Notation: \( \mathbf{x} \): input image, \( y_t \): predicted token at step \( t \), \( y_{<t} \): sequence of previously generated tokens, \( T \): output sequence length. TrOCR consists of two major components: a Vision Transformer (ViT) as the encoder and a Text Transformer as the decoder. 1. Encoder: Vision Transformer (ViT) 1.1 Image to Patch Embeddings Input image: $$ \mathbf{x} \in \mathbb{R}^{H \times W \times 3} $$ The image is divided into \( N = \frac{H \times W}{P^2} \) patches of size \( P \times P \). Each patch is flattened and projected into a \( D \)-dimensional embedding: $$ \mathbf{z}_0^{...
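The patchify-and-project step can be sketched in NumPy. Here `E` stands in for the learned linear projection; positional embeddings and the real tokenizer/decoder are omitted, and the sizes are arbitrary toy choices.

```python
import numpy as np

def patch_embed(x, P, E):
    """Split an H x W x C image into N = HW / P^2 non-overlapping P x P patches,
    flatten each to length P^2 * C, and project to D dimensions via E."""
    H, W, C = x.shape
    patches = (x.reshape(H // P, P, W // P, P, C)
                .transpose(0, 2, 1, 3, 4)   # reorder axes to (grid_h, grid_w, P, P, C)
                .reshape(-1, P * P * C))    # (N, P^2 * C) flattened patches
    return patches @ E                      # (N, D) patch embeddings

rng = np.random.default_rng(0)
H = W = 32
P, D = 8, 64                      # patch size and embedding dimension (toy values)
x = rng.random((H, W, 3))         # stand-in for an input image
E = rng.normal(size=(P * P * 3, D))
z0 = patch_embed(x, P, E)         # N = (32 * 32) / 8^2 = 16 patch tokens
```

The reshape/transpose trick avoids explicit loops: grouping the two spatial axes into (grid, within-patch) pairs makes each row of the result one flattened patch, exactly the sequence the ViT encoder consumes.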