Classical Computer Vision Methods


This article gives a concise, mathematically grounded overview of the major classical computer vision techniques: edge and corner detection, feature descriptors, matching and tracking, filters and transforms, segmentation, object detection, stereo vision, and motion analysis.

1. Edge, Corner & Keypoint Detectors

1.1 Sobel, Prewitt, Roberts Operators

These operators detect edges by convolving the image with small horizontal and vertical gradient kernels and combining the two responses.

$$ G_x = I * S_x, \qquad G_y = I * S_y $$
$$ |G| = \sqrt{G_x^2 + G_y^2} $$
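
A minimal OpenCV sketch of Sobel gradients and their magnitude; the file name image.png and the 3×3 kernel size are placeholders.

```python
import cv2
import numpy as np

# Load a grayscale image (the path is a placeholder).
img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Horizontal and vertical gradient responses G_x and G_y.
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)

# Gradient magnitude |G| = sqrt(G_x^2 + G_y^2).
magnitude = np.hypot(gx, gy)
```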

1.2 Laplacian of Gaussian (LoG)

Detects edges via second derivatives and zero-crossings.

$$ \text{LoG}(x) = \nabla^2 (G_\sigma * I) $$

1.3 Difference of Gaussian (DoG)

$$ \text{DoG} = G_{\sigma_1} - G_{\sigma_2} $$

1.4 Canny Edge Detector

Involves four stages: Gaussian smoothing, gradient computation, non-maximum suppression, and hysteresis thresholding.
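
In OpenCV the whole Canny pipeline is a single call; this is a minimal sketch, and the two hysteresis thresholds are illustrative values.

```python
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

# Smoothing, gradients, non-maximum suppression, and hysteresis
# are all performed internally by cv2.Canny.
edges = cv2.Canny(img, threshold1=100, threshold2=200)
```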

1.5 Harris Corner Detector

$$ M= \sum_{x,y} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} $$
where $w(x,y)$ is a (typically Gaussian) window.
$$ R = \det(M) - k(\text{trace}(M))^2 $$
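
A minimal sketch of the Harris detector using cv2.cornerHarris; the window size, Sobel aperture, k, and the 0.01 response threshold are illustrative.

```python
import cv2
import numpy as np

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)
gray = np.float32(img)

# blockSize: window over which M is accumulated, ksize: Sobel aperture,
# k: the constant in R = det(M) - k * trace(M)^2.
R = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep locations whose response is a large fraction of the maximum.
corners = np.argwhere(R > 0.01 * R.max())
```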

1.6 Shi–Tomasi Corner Detector

$$ R = \min(\lambda_1,\lambda_2) $$
where $\lambda_1, \lambda_2$ are the eigenvalues of the structure tensor $M$.

This figure demonstrates classical edge and corner detection techniques applied to the same grayscale image. It includes Sobel and Laplacian of Gaussian (LoG) for gradient-based edge detection, Difference of Gaussian (DoG) for multi-scale edge detection, and Canny for robust edge extraction. Corner detection is shown with Harris, Shi-Tomasi, and FAST, highlighting distinctive points in the image. Each method emphasizes different structural details, helping visualize edges, corners, and key features for image analysis tasks.
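
A minimal sketch of Shi–Tomasi corner selection via cv2.goodFeaturesToTrack, which uses the min-eigenvalue criterion by default; the parameter values are illustrative.

```python
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

# Set useHarrisDetector=True to switch to the Harris response instead.
corners = cv2.goodFeaturesToTrack(img, maxCorners=200,
                                  qualityLevel=0.01, minDistance=10)
corners = corners.reshape(-1, 2)  # (N, 2) array of (x, y) corner positions
```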

1.7 SUSAN (Smallest Univalue Segment Assimilating Nucleus)

Counts neighborhood pixels whose intensity is similar to the nucleus (center pixel); a corner is detected when this USAN area becomes small.

1.8 FAST (Features from Accelerated Segment Test)

Declares a corner when at least N contiguous pixels on a 16-pixel Bresenham circle are all brighter or all darker than the center by a threshold.

1.9 AGAST (Adaptive and Generic Accelerated Segment Test)

An improvement of FAST that replaces its fixed decision tree with adaptive, generic decision trees for the segment test.

2. Feature Descriptors

2.1 SIFT (Scale-Invariant Feature Transform)

Uses DoG keypoints, orientation histograms, and 128-D descriptors.
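
A minimal sketch of SIFT keypoint detection and description; SIFT is available in the main OpenCV package from version 4.4 onward.

```python
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# descriptors has shape (number_of_keypoints, 128)
```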

2.2 SURF (Speeded Up Robust Features)

Efficient LoG approximation using box filters and integral images.

2.3 BRIEF (Binary Robust Independent Elementary Features)

$$ d_i = \begin{cases} 1 & I(p_i) < I(q_i) \\ 0 & \text{otherwise} \end{cases} $$

2.4 ORB (Oriented FAST and Rotated BRIEF)

Combines FAST keypoints with rotation-corrected BRIEF descriptors.
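
A minimal sketch of ORB extraction and brute-force matching; the file names are placeholders.

```python
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance is the natural metric for binary descriptors.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
```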

2.5 BRISK (Binary Robust Invariant Scalable Keypoints)

Scale-space sampling and binary intensity comparisons.

2.6 FREAK (Fast Retina Keypoint)

Binary descriptor based on retina-inspired sampling.

2.7 HOG (Histogram of Oriented Gradients)

Histograms of gradient orientations inside cells.
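
A minimal sketch of computing a HOG feature vector with OpenCV's default pedestrian window (64×128 pixels, 8×8 cells, 16×16 blocks, 9 orientation bins).

```python
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

# The default descriptor expects a 64x128 window.
window = cv2.resize(img, (64, 128))
hog = cv2.HOGDescriptor()
features = hog.compute(window)  # 3780-dimensional feature vector
```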

This figure illustrates various classical feature detection and description techniques applied to the same grayscale image. It includes SIFT, ORB, BRISK for keypoint detection, highlighting distinctive points in the image; HOG for gradient-based texture representation; LBP for local texture patterns; and an approximate GIST descriptor using multiple Gabor filters to capture global scene structure. Each method visualizes different aspects of the image, helping in feature extraction, texture analysis, and object recognition tasks.

2.8 LBP (Local Binary Patterns)

$$ \text{LBP}(x_c)=\sum_{p=0}^{P-1}s(I_p-I_c)\,2^p, \qquad s(z)=\begin{cases}1 & z \ge 0\\ 0 & z < 0\end{cases} $$
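
A small NumPy sketch of the basic 8-neighbor LBP code, mirroring the formula above; lbp_8neighbors is an illustrative helper, not a library function, and image borders are simply excluded.

```python
import numpy as np

def lbp_8neighbors(img):
    """Basic 8-neighbor LBP over a 2-D array (border pixels excluded)."""
    c = img[1:-1, 1:-1]                       # center intensities I_c
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for p, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy: img.shape[0] - 1 + dy,
                       1 + dx: img.shape[1] - 1 + dx]
        # s(I_p - I_c) contributes bit p of the code.
        code += (neighbor >= c).astype(np.uint8) << p
    return code
```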

2.9 Shape Context

Histogram describing relative spatial distribution of points.

2.10 GIST Descriptor

Global scene representation using multi-scale Gabor filters.

3. Feature Matching & Tracking

3.1 RANSAC (Random Sample Consensus)

Fits a robust model by sampling minimal sets and counting inliers.
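
A small NumPy sketch of RANSAC for robust 2-D line fitting; ransac_line is an illustrative helper, and the iteration count and inlier threshold are placeholders to be tuned per problem.

```python
import numpy as np

def ransac_line(points, n_iters=1000, inlier_thresh=1.0):
    """Robustly fit a line a*x + b*y + c = 0 to an (N, 2) point array."""
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        # Minimal sample: two distinct points define a candidate line.
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        a, b = p2[1] - p1[1], p1[0] - p2[0]   # normal to the segment
        norm = np.hypot(a, b)
        if norm == 0:
            continue
        a, b = a / norm, b / norm
        c = -(a * p1[0] + b * p1[1])
        # Inliers are points within the distance threshold of the line.
        inliers = np.abs(points @ np.array([a, b]) + c) < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```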

3.2 Lucas–Kanade Optical Flow

$$ I_x u + I_y v + I_t = 0 $$
This single constraint is under-determined, so Lucas–Kanade solves for $(u, v)$ by least squares over a small window around each pixel.

3.3 Horn–Schunck Optical Flow

$$ E = \iint (I_x u + I_y v + I_t)^2 + \lambda(|\nabla u|^2 + |\nabla v|^2)\,dx\,dy $$

3.4 KLT (Kanade–Lucas–Tomasi) Tracker

Tracks Shi–Tomasi features using Lucas–Kanade optical flow.
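
A minimal sketch of KLT tracking with OpenCV: Shi–Tomasi corners from the first frame are tracked into the second with pyramidal Lucas–Kanade; the file names and window parameters are placeholders.

```python
import cv2

prev = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Shi-Tomasi corners in the first frame...
pts0 = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                               qualityLevel=0.01, minDistance=10)

# ...tracked into the second frame with pyramidal Lucas-Kanade.
pts1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, pts0, None,
                                             winSize=(21, 21), maxLevel=3)
tracked = pts1[status.flatten() == 1]   # successfully tracked points
```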

4. Filters & Transform Methods

4.1 Gaussian Filter

$$ G_\sigma(x,y)=\frac{1}{2\pi\sigma^2}e^{-\frac{x^2+y^2}{2\sigma^2}} $$

4.2 Median Filter

Replaces each pixel with the neighborhood median.

4.3 Bilateral Filter

$$ I'(x)=\frac{1}{W_x}\sum_p I(p)\, e^{-\frac{\|x-p\|^2}{2\sigma_s^2}} e^{-\frac{|I(x)-I(p)|^2}{2\sigma_r^2}} $$
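
A minimal sketch of the three smoothing filters from sections 4.1–4.3 in OpenCV; the kernel sizes and sigma values are illustrative.

```python
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

gauss = cv2.GaussianBlur(img, ksize=(5, 5), sigmaX=1.5)
median = cv2.medianBlur(img, ksize=5)
# d: neighborhood diameter; sigmaColor ~ sigma_r, sigmaSpace ~ sigma_s.
bilateral = cv2.bilateralFilter(img, d=9, sigmaColor=75, sigmaSpace=75)
```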

4.4 Anisotropic Diffusion (Perona–Malik)

$$ \frac{\partial I}{\partial t}=\nabla \cdot (c(\|\nabla I\|)\nabla I) $$
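
A small NumPy sketch of Perona–Malik diffusion with the exponential conductance g(s) = exp(-(s/κ)²); anisotropic_diffusion is an illustrative helper, np.roll gives periodic borders purely for brevity, and κ, λ, and the iteration count are placeholders.

```python
import numpy as np

def anisotropic_diffusion(img, n_iters=20, kappa=30.0, lam=0.2):
    """Perona-Malik diffusion with an exponential edge-stopping function."""
    I = img.astype(np.float64)
    for _ in range(n_iters):
        # Finite differences toward the four neighbors.
        dN = np.roll(I, -1, axis=0) - I
        dS = np.roll(I,  1, axis=0) - I
        dE = np.roll(I, -1, axis=1) - I
        dW = np.roll(I,  1, axis=1) - I
        # Conductance is small where the gradient is large (edges preserved).
        cN, cS = np.exp(-(dN / kappa) ** 2), np.exp(-(dS / kappa) ** 2)
        cE, cW = np.exp(-(dE / kappa) ** 2), np.exp(-(dW / kappa) ** 2)
        I += lam * (cN * dN + cS * dS + cE * dE + cW * dW)
    return I
```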

4.5 Fourier Transform

$$ F(u,v)=\sum_x\sum_y I(x,y)e^{-j2\pi(ux/M+vy/N)} $$
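
A minimal sketch of the 2-D discrete Fourier transform with NumPy, producing the log-magnitude spectrum commonly visualized for images.

```python
import cv2
import numpy as np

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

F = np.fft.fft2(img)                    # 2-D DFT
F_shifted = np.fft.fftshift(F)          # move zero frequency to the center
magnitude = 20 * np.log(np.abs(F_shifted) + 1)   # log-magnitude spectrum
```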

4.6 Discrete Cosine Transform (DCT)

Expresses an image block as a weighted sum of cosine basis functions; its strong energy compaction makes it the core transform of JPEG compression.

4.7 Wavelet Transform

Multi-resolution analysis using scalable basis functions.

This figure shows the same grayscale image processed with different techniques. It includes smoothing filters (Gaussian, Median, Bilateral), edge-preserving denoising (Anisotropic Diffusion), frequency analysis (Fourier Transform), multi-scale approximation (Wavelet), and line detection (Hough Transform). Each method highlights different aspects of the image, helping visualize noise reduction, structure, and key features.

4.8 Radon Transform

$$ R(\rho,\theta)=\int I(x,y)\delta(\rho - x\cos\theta - y\sin\theta)\,dx\,dy $$

4.9 Hough Transform

Votes in parameter space to detect lines and shapes.
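
A minimal sketch of the standard Hough line transform on a Canny edge map; the vote threshold and edge thresholds are illustrative.

```python
import cv2
import numpy as np

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 100, 200)

# Each returned (rho, theta) pair is a line with enough accumulator votes.
lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=150)
```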

5. Segmentation Methods

5.1 K-means Segmentation

$$ \arg\min \sum_i\|x_i - \mu_{c_i}\|^2 $$
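
A minimal sketch of K-means color segmentation with cv2.kmeans; K, the termination criteria, and the number of attempts are illustrative.

```python
import cv2
import numpy as np

img = cv2.imread("image.png")                     # BGR color image
pixels = img.reshape(-1, 3).astype(np.float32)    # one row per pixel

K = 4
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centers = cv2.kmeans(pixels, K, None, criteria,
                                attempts=5, flags=cv2.KMEANS_RANDOM_CENTERS)

# Replace each pixel with its cluster center to visualize the segmentation.
segmented = centers[labels.flatten()].astype(np.uint8).reshape(img.shape)
```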

5.2 Graph Cut

$$ E(L)=U(L)+V(L) $$
where $U(L)$ is the data term and $V(L)$ the smoothness term; for binary labels the energy is minimized exactly via min-cut/max-flow.

5.3 GrabCut

Iterates Graph Cut with Gaussian Mixture Models of the foreground and background colors, initialized from a user-supplied rectangle.
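
A minimal sketch of GrabCut in OpenCV, initialized from a bounding rectangle; the rectangle coordinates and iteration count are placeholders.

```python
import cv2
import numpy as np

img = cv2.imread("image.png")                    # BGR color image
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)        # background GMM state
fgd_model = np.zeros((1, 65), np.float64)        # foreground GMM state

rect = (50, 50, 200, 200)                        # (x, y, w, h), placeholder
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Definite or probable foreground pixels form the extracted object.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
result = img * fg[:, :, None]
```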

5.4 Watershed

Treats gradient magnitude as a topographic map.

5.5 Mean Shift Segmentation

Clusters pixels by iteratively shifting them toward local maxima (modes) of the joint color–spatial density.

5.6 Felzenszwalb–Huttenlocher Algorithm

Graph-based region merging based on internal variation.

5.7 SLIC (Simple Linear Iterative Clustering) Superpixels

Clusters pixels in the 5-dimensional space of CIELAB color (L, a, b) and position (x, y), with a compactness term that keeps superpixels spatially coherent.
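
A minimal sketch of SLIC superpixels, assuming scikit-image is available; the segment count and compactness are illustrative values.

```python
from skimage import io, segmentation, color

img = io.imread("image.png")   # RGB image

# compactness trades color similarity against spatial proximity
# in the 5-D (L, a, b, x, y) clustering space.
labels = segmentation.slic(img, n_segments=250, compactness=10, start_label=1)
superpixel_means = color.label2rgb(labels, img, kind="avg")
```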

This figure shows several classical image segmentation techniques applied to the same input image. It includes K-means clustering, GrabCut foreground extraction, Watershed based on image gradients, Felzenszwalb graph-based segmentation, and SLIC superpixels. Each method splits the image into meaningful regions in different ways, highlighting boundaries, objects, and structural elements.

5.8 Active Contours (Snakes)

$$ E = \int_0^1 \Big( \alpha |\mathbf{v}'(s)|^2 + \beta |\mathbf{v}''(s)|^2 + E_{\text{image}}(\mathbf{v}(s)) \Big)\, ds $$

5.9 Level Set Methods

$$ \frac{\partial \phi}{\partial t} = F|\nabla\phi| $$

6. Classical Object Detection

6.1 Viola–Jones (Haar Cascade)

Uses Haar features, integral images, AdaBoost, and cascaded classifiers.
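
A minimal sketch of Viola–Jones face detection using the pre-trained frontal-face cascade shipped with OpenCV; scaleFactor and minNeighbors are illustrative.

```python
import cv2

# OpenCV ships trained Haar cascades alongside the package.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)
faces = face_cascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5)
# faces is an array of (x, y, w, h) bounding boxes
```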

6.2 HOG + SVM (Support Vector Machine)

$$ \min_w\|w\|^2 \quad \text{s.t. } y_i(w^\top x_i + b) \ge 1 $$
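
A minimal sketch of pedestrian detection with OpenCV's default HOG descriptor and its pre-trained linear SVM; the stride, scale step, and file name are illustrative.

```python
import cv2

hog = cv2.HOGDescriptor()
# Pre-trained linear SVM weights for the 64x128 pedestrian window.
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street.png")   # placeholder file name
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
```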

6.3 Deformable Part Models (DPM)

$$ S = w_0\phi(\text{root}) + \sum_i (w_i\phi(\text{part}_i) - d_i) $$
This figure demonstrates traditional object detection and tracking techniques. It includes Viola–Jones face detection, HOG+SVM pedestrian detection, and Template Matching for locating repeated patterns. Motion-related methods include Optical Flow, Background Subtraction, and Mean Shift tracking, showcasing how classical algorithms analyze movement and detect objects in images.

6.4 Template Matching

$$ R(x,y)=\sum_{u,v} I(x+u,y+v)T(u,v) $$
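
A minimal sketch of template matching in OpenCV; normalized cross-correlation (TM_CCORR_NORMED) is used here instead of the raw sum above because it is more robust to lighting changes.

```python
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

response = cv2.matchTemplate(img, template, cv2.TM_CCORR_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(response)   # max_loc = best (x, y) match
```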

7. Stereo Vision & 3D

7.1 Block Matching

$$ \text{SAD}(d)=\sum |I_L(x,y)-I_R(x-d,y)| $$
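
A minimal sketch of block-matching stereo with cv2.StereoBM on a rectified image pair; numDisparities (a multiple of 16) and blockSize are illustrative.

```python
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
# StereoBM returns fixed-point disparities scaled by 16.
disparity = stereo.compute(left, right).astype(float) / 16.0
```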

7.2 Semi-Global Matching (SGM)

Aggregates matching cost across multiple directions.

7.3 Epipolar Geometry

$$ x_2^\top F x_1 = 0 $$

7.4 Essential Matrix

$$ E = [t]_\times R $$

7.5 Triangulation

$$ X = \arg\min_X \sum_i \|x_i - P_i X\|^2 $$

7.6 Structure from Motion (SfM)

Estimates camera poses and 3D structure from multiple views.

7.7 Bundle Adjustment

$$ \arg\min \sum_{i,j} \|x_{ij} - P_i X_j\|^2 $$

7.8 Visual Odometry

Estimates camera motion using sequential feature correspondences.

8. Motion Analysis & Tracking

8.1 Background Subtraction (Mixture of Gaussians - MOG/MOG2)

$$ p(x)=\sum_k w_k \mathcal{N}(\mu_k,\Sigma_k) $$
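
A minimal sketch of MOG2 background subtraction over a video stream; the file name, history length, and variance threshold are placeholders.

```python
import cv2

cap = cv2.VideoCapture("video.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pixels poorly explained by the per-pixel Gaussian mixture become foreground.
    fg_mask = subtractor.apply(frame)
cap.release()
```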

8.2 Kalman Filter

$$ x_k = A x_{k-1}+w,\qquad z_k = Hx_k+v $$
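
A small NumPy sketch of a 1-D constant-velocity Kalman filter matching the model above; the noise covariances and the toy measurements are placeholders.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition (position, velocity)
H = np.array([[1.0, 0.0]])               # we observe position only
Q = 1e-3 * np.eye(2)                     # process noise covariance
R = np.array([[0.5]])                    # measurement noise covariance

x = np.zeros((2, 1))                     # initial state estimate
P = np.eye(2)                            # initial state covariance
for z in [1.1, 2.0, 2.9, 4.2, 5.1]:      # toy position measurements
    # Predict
    x = A @ x
    P = A @ P @ A.T + Q
    # Update with the Kalman gain K
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (np.array([[z]]) - H @ x)
    P = (np.eye(2) - K @ H) @ P
```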

8.3 Particle Filter

Approximates posterior distribution using weighted particles.

8.4 Mean Shift Tracking

Tracks objects by iteratively shifting kernel windows.

8.5 CAMShift (Continuously Adaptive Mean Shift)

Enhances Mean Shift with adaptive window size.

