Computer Vision Foundations and Model Architectures

Computer Vision (CV) focuses on enabling machines to understand visual data. Modern CV systems rely on deep neural networks that perform tasks such as image classification, object detection, and image segmentation. This blog provides a structured overview of these tasks and the most commonly used architectures behind them.

Rather than treating models as black boxes, we focus on why each architecture was introduced, what problems it solved, and where it is used today.

1. Core Vision Tasks

Image Classification

Assigns a single label (or multiple labels) to an entire image.

$$ \hat{y} = \arg\max_y p(y \mid x) $$

Object Detection

Predicts both what objects are present and where they are.

$$ (\text{class}, x, y, w, h) $$

Segmentation

Assigns a class label to each pixel.

$$ p(y_i \mid x) $$
Classification answers what, detection answers what and where, and segmentation answers what, where, and which pixels.

2. Image Classification Architectures

ResNet

Introduced: 2015 (Kaiming He et al., Microsoft Research)

As convolutional networks became deeper, researchers observed the degradation problem: increasing depth led to higher training error, even when overfitting was not the issue. ResNet addresses this by reformulating the learning objective through residual learning, enabling very deep networks to be trained effectively.

Instead of directly learning a mapping \( H(x) \), a residual block learns a residual function:

$$ F(x) = H(x) - x $$

The original mapping can then be recovered via an identity skip connection:

$$ y = F(x) + x $$

This simple formulation allows gradients to flow directly through the identity path during backpropagation:

$$ \frac{\partial y}{\partial x} = \frac{\partial F(x)}{\partial x} + 1 $$
The identity term ensures that gradients do not vanish, making it possible to train extremely deep networks without degradation.

ResNet is composed of stacked residual blocks, typically using two main variants:

  • Basic blocks: used in ResNet-18 and ResNet-34; each block consists of two 3×3 convolutions.
  • Bottleneck blocks: used in ResNet-50, 101, and 152; each block compresses and expands channels with 1×1 → 3×3 → 1×1 convolutions, reducing computation while preserving representational power.

A bottleneck residual block can be represented as:

$$ x \;\rightarrow\; \text{Conv}_{1\times1} \;\rightarrow\; \text{Conv}_{3\times3} \;\rightarrow\; \text{Conv}_{1\times1} + x $$
Key contributions:

  • Enabled training of very deep networks (50–152+ layers)
  • Improved optimization stability and convergence
  • Strong generalization across image recognition, detection, and segmentation tasks
  • Widely adopted as a backbone for detection (FPN, Faster R-CNN) and segmentation models (DeepLab)
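The effect of the identity path on gradients can be checked numerically. Here is a minimal sketch using a toy scalar residual function \( F(x) = wx \) as a stand-in for the convolutional stack (an illustrative assumption, not ResNet itself):

```python
# Numerical check of the gradient identity dy/dx = dF/dx + 1, using a toy
# scalar residual F(x) = w * x as a stand-in for the convolutional stack.

def residual_forward(x, w):
    return w * x + x          # y = F(x) + x

def residual_grad(x, w):
    return w + 1.0            # dF/dx + 1: the identity path contributes +1

x, w, eps = 2.0, 0.3, 1e-6
numeric = (residual_forward(x + eps, w) - residual_forward(x - eps, w)) / (2 * eps)
print(abs(numeric - residual_grad(x, w)) < 1e-6)  # True: gradients agree
```

Even if \( \partial F / \partial x \) shrinks toward zero in a deep stack, the constant 1 from the identity path keeps the overall gradient from vanishing.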

EfficientNet

Introduced: 2019 (Google, Mingxing Tan & Quoc V. Le)

EfficientNet revolutionized CNN design by introducing a principled compound scaling method. Instead of arbitrarily scaling network depth, width, or input resolution, EfficientNet scales all three dimensions in a balanced way:

$$ \text{depth} \;\propto \; \alpha^\phi, \quad \text{width} \;\propto \; \beta^\phi, \quad \text{resolution} \;\propto \; \gamma^\phi $$

Here:

  • \( \phi \) is the compound coefficient controlling overall model size
  • \( \alpha, \beta, \gamma \) are constants determined via a small grid search to satisfy \( \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2 \), ensuring roughly doubled FLOPs when \( \phi \) increases by 1
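The scaling rule can be sketched in a few lines. The constants \( \alpha = 1.2, \beta = 1.1, \gamma = 1.15 \) below are the values reported in the EfficientNet paper; the helper function itself is illustrative, not the official implementation:

```python
# Sketch of compound scaling. alpha=1.2, beta=1.1, gamma=1.15 are the
# grid-searched values from the EfficientNet paper; the helper itself is
# illustrative, not the official implementation.

def compound_scale(base_depth, base_width, base_res, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Scale depth, width, and input resolution with one coefficient phi."""
    return (base_depth * alpha ** phi,
            base_width * beta ** phi,
            base_res * gamma ** phi)

# One step in phi should roughly double FLOPs, since FLOPs grow with
# depth * width^2 * resolution^2:
flops_factor = 1.2 * 1.1 ** 2 * 1.15 ** 2
print(round(flops_factor, 2))  # 1.92, close to the target of 2
```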

Key architectural features of EfficientNet include:

  • MBConv blocks: Mobile Inverted Bottleneck Convolutions from MobileNetV2, with depthwise separable convolutions for efficiency.
  • Squeeze-and-Excitation (SE) modules: Channel-wise attention mechanism that recalibrates feature maps adaptively, improving representation quality.
  • Swish activation: Smooth, non-monotonic activation function improving training stability and accuracy.

EfficientNet achieves state-of-the-art accuracy per parameter on ImageNet while using significantly fewer FLOPs compared to older architectures like ResNet or Inception.

By combining MBConv, SE modules, and compound scaling, EfficientNet provides a highly efficient backbone suitable for classification, detection, and segmentation tasks.

ConvNeXt

Introduced: 2022 (Meta AI, Liu et al.)

ConvNeXt revisited convolutional neural networks (CNNs) and modernized them by adopting several design principles inspired by Vision Transformers (ViTs), while retaining the efficiency and inductive biases of convolutions.

The key idea was to show that a well-tuned CNN can match or exceed ViTs in performance on image classification benchmarks.

ConvNeXt introduces several modifications:

  • Large kernel convolutions: Uses 7×7 depthwise convolutions instead of traditional 3×3, increasing the receptive field for global context:
$$ y_{i,j} = \sum_{m=-3}^{3} \sum_{n=-3}^{3} w_{m,n} \cdot x_{i+m, j+n} $$
  • LayerNorm instead of BatchNorm: Normalizes features across channels for better stability, especially in deeper networks:
$$ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta $$
  • Inverted bottleneck structure: Expands channels before depthwise convolution and projects back, similar to MobileNetV2 blocks.
  • Stochastic depth: Randomly drops residual blocks during training for regularization.

ConvNeXt achieves competitive performance with Vision Transformers on ImageNet-1k classification while maintaining:

  • High parameter and FLOPs efficiency
  • Strong generalization to downstream tasks (detection, segmentation)
By modernizing CNN design with large kernels, LayerNorm, and residual bottlenecks, ConvNeXt demonstrates that convolutional architectures remain highly competitive in the era of Transformers.
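As a small illustration of the LayerNorm step, here is per-position channel normalization in NumPy (a minimal sketch of the normalization alone, not a full ConvNeXt block):

```python
import numpy as np

# Minimal sketch of LayerNorm: normalize one spatial position across its C
# channels, then apply a learned scale (gamma) and shift (beta).

def layer_norm(x, gamma, beta, eps=1e-6):
    """x: (C,) channel vector at a single spatial location."""
    mu, var = x.mean(), x.var()
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x, gamma=1.0, beta=0.0)
print(abs(y.mean()) < 1e-9, abs(y.std() - 1.0) < 1e-3)  # True True
```

Unlike BatchNorm, the statistics depend only on the single input vector, so behavior is identical at training and inference time and independent of batch size.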

3. Object Detection Architectures

ResNet + FPN

Introduced: 2017 (Lin et al., Facebook AI Research)

When we detect objects in images, one big challenge is scale variation: objects can be tiny or huge, but a standard CNN backbone like ResNet produces feature maps at fixed scales. Feature Pyramid Networks (FPN) solve this problem by combining features from different layers of the backbone to create a multi-scale feature pyramid.

Here’s the intuition step by step:

  1. ResNet extracts features: Deep layers capture strong semantic meaning (what the object is), but low resolution; shallow layers retain high spatial resolution (where the object is).
  2. Top-down pathway: Start from the deepest (smallest) feature map, then upsample it to the next higher resolution layer.
  3. Lateral connections: Combine the upsampled feature with the corresponding feature from the backbone (same spatial size) to keep fine details.

Mathematically, this fusion is:

$$ P_l = C_l + \text{Upsample}(P_{l+1}) $$
  • \(C_l\) = feature from ResNet at level \(l\)
  • \(P_l\) = final pyramid feature at level \(l\)
  • \(\text{Upsample}(P_{l+1})\) = upsampled higher-level feature

The result is a set of feature maps at different scales, each containing both semantic richness and spatial detail. These features are then used by the detection heads (like Faster R-CNN or RetinaNet) to detect objects of all sizes.

  • Helps detect small objects using high-resolution fused features
  • Keeps large-object features semantically strong
  • Allows a single backbone to serve multiple scales efficiently
Think of FPN as giving your network “super vision”: it can see tiny details without losing the big picture. That’s why almost every modern detector uses FPN with a ResNet backbone.
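The top-down fusion above can be sketched on toy single-channel feature maps. Real FPNs first apply 1×1 convolutions so all levels share the same channel count; that step is omitted here:

```python
import numpy as np

# Toy sketch of the fusion rule P_l = C_l + Upsample(P_{l+1}) on
# single-channel feature maps, using nearest-neighbor upsampling.

def upsample2x(p):
    return np.repeat(np.repeat(p, 2, axis=0), 2, axis=1)

def fpn_fuse(c_levels):
    """c_levels: backbone features ordered deepest (coarsest) first."""
    p = c_levels[0]                    # the deepest map starts the pathway
    pyramid = [p]
    for c in c_levels[1:]:
        p = c + upsample2x(p)          # lateral connection + upsampled P
        pyramid.append(p)
    return pyramid

pyramid = fpn_fuse([np.ones((2, 2)), np.ones((4, 4)), np.ones((8, 8))])
print([p.shape for p in pyramid])  # [(2, 2), (4, 4), (8, 8)]
```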

EfficientDet

Introduced: 2020 (Tan et al., Google AI)

EfficientDet is a modern object detector designed for both high accuracy and efficiency. It builds on two main ideas:

  1. EfficientNet backbone: Provides strong, lightweight feature extraction for the image, balancing depth, width, and resolution efficiently.
  2. Bi-directional Feature Pyramid Network (BiFPN): Improves multi-scale feature fusion for detecting objects of all sizes.

Here’s the reasoning behind BiFPN step by step:

  1. Problem with standard FPN: Normal FPN fuses features in a top-down pathway only, giving equal weight to all inputs.
  2. Bi-directional fusion: BiFPN introduces both top-down and bottom-up paths, allowing features to influence each other in both directions.
  3. Weighted feature fusion: BiFPN learns the importance of each input automatically, instead of just adding them equally:
$$ \hat{P}_l = \frac{\sum_i w_i \cdot P_i}{\sum_i w_i + \epsilon} $$
  • \(P_i\) = input features to the fusion node
  • \(w_i\) = learnable weight for each input
  • \(\epsilon\) = small constant to avoid division by zero

This weighted fusion ensures the network focuses more on the informative features at each level, improving both small and large object detection.
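The fusion formula can be sketched directly. Following the paper's fast normalized fusion, weights are clamped to be non-negative before normalizing (the paper applies ReLU to the learned weights); the function below is illustrative, shown on scalar "features":

```python
# Sketch of BiFPN's fast normalized fusion:
# hat{P} = sum(w_i * P_i) / (sum(w_i) + eps), with non-negative weights.

def weighted_fusion(features, weights, eps=1e-4):
    w = [max(0.0, wi) for wi in weights]          # keep weights non-negative
    fused = sum(wi * pi for wi, pi in zip(w, features))
    return fused / (sum(w) + eps)

# Two scalar "features" with learned weights 2.0 and 1.0: the first input
# contributes twice as much to the fused output.
print(round(weighted_fusion([3.0, 6.0], [2.0, 1.0]), 3))  # 4.0
```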

EfficientDet also introduces compound scaling for the backbone, BiFPN, and box/class prediction layers, producing models from D0 (lightweight) to D7 (very powerful) while maintaining efficiency.

  • Strong accuracy-to-parameter ratio compared to traditional detectors
  • Efficient for real-time or resource-limited applications
  • Automatically balances multi-scale features through learnable fusion
Think of EfficientDet as giving your detector a “smart lens”: it decides which features matter most at each scale while staying lightweight and fast.

MobileNet (SSD / YOLO Backbones)

Introduced: 2017 (Howard et al., Google)

MobileNet was designed for real-time applications on edge devices like smartphones and drones. The key idea is to reduce computation while maintaining reasonable accuracy, making it ideal as a backbone for lightweight object detectors such as SSD and YOLO.

Core Idea: Depthwise Separable Convolutions

Traditional convolution combines spatial and channel-wise operations in one step, which is computationally expensive. MobileNet factorizes this into two steps:

$$ \text{Conv}_{\text{standard}}: H \times W \times C_{in} \rightarrow H \times W \times C_{out} $$ $$ \text{Cost} \sim H \cdot W \cdot C_{in} \cdot C_{out} \cdot K \cdot K $$
$$ \text{Depthwise Separable Conv: } \text{Depthwise } (K \times K \text{ per channel}) + \text{Pointwise } (1 \times 1 \text{ across channels}) $$ $$ \text{Cost} \sim H \cdot W \cdot C_{in} \cdot K^2 + H \cdot W \cdot C_{in} \cdot C_{out} $$

This reduces computation roughly by a factor of \( \frac{1}{C_{out}} + \frac{1}{K^2} \), enabling real-time inference on low-power devices.
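The cost comparison can be verified with a quick FLOP count (a rough sketch that ignores bias terms and padding):

```python
# Quick FLOP count comparing a standard KxK convolution with its depthwise
# separable factorization (rough sketch: bias terms and padding ignored).

def conv_flops(h, w, c_in, c_out, k):
    return h * w * c_in * c_out * k * k

def depthwise_separable_flops(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k    # one KxK filter per input channel
    pointwise = h * w * c_in * c_out    # 1x1 conv mixing channels
    return depthwise + pointwise

h = w = 56
c_in = c_out = 128
k = 3
ratio = depthwise_separable_flops(h, w, c_in, c_out, k) / conv_flops(h, w, c_in, c_out, k)
print(round(ratio, 4))  # 0.1189, matching 1/c_out + 1/k^2
```

For a typical 3×3 layer the savings are dominated by the \( 1/K^2 \) term: roughly an 8–9× reduction in compute.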

MobileNet as a Backbone

  • SSD (Single Shot Detector): MobileNet features are used for multi-scale detection while keeping the network lightweight.
  • YOLO Variants: MobileNet provides a fast backbone that balances speed and accuracy for embedded systems.
  • Low latency and small model size are critical for edge deployment.
MobileNet is like giving your object detector “a light engine”: it extracts features efficiently without consuming much power, perfect for real-time edge scenarios.

4. Segmentation Architectures

UNet

Introduced: 2015 (Ronneberger et al., MICCAI)

UNet is a specialized convolutional network for image segmentation, particularly designed for medical imaging tasks where annotated data is scarce. It follows an encoder–decoder architecture with skip connections that combine low-level spatial information with high-level semantic features.

Encoder–Decoder Architecture

The encoder (contracting path) progressively reduces spatial dimensions while increasing feature channels, extracting high-level context:

$$ x_{l+1} = \text{Conv}_{3\times3}(\text{ReLU}(\text{Conv}_{3\times3}(x_l))) $$ $$ x_{l+1} = \text{MaxPool}(x_{l+1}) $$

The decoder (expanding path) upsamples the features to recover spatial resolution:

$$ y_{l} = \text{Concat}(\text{UpConv}(y_{l+1}),\; x_{l}) \quad \text{(skip connection)} $$ $$ y_{l} = \text{Conv}_{3\times3}(\text{ReLU}(\text{Conv}_{3\times3}(y_l))) $$

Skip Connections

The skip connections directly transfer feature maps from the encoder to the decoder, where they are concatenated with the upsampled decoder features along the channel axis (the original UNet concatenates; some later variants add instead). Mathematically, if \( x_l \) is the encoder feature and \( y_l \) is the decoder feature at the same level:

$$ y_l = \text{Concat}(f_{\text{decoder}}(y_{l+1}),\; x_l) $$

This allows the network to leverage both high-resolution spatial information and deep semantic context, improving segmentation accuracy, especially for small structures.
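A toy NumPy sketch of the skip connection, assuming (for illustration) that the decoder feature has twice the channels and half the resolution of the matching encoder feature:

```python
import numpy as np

# Toy sketch of a UNet skip connection: upsample the decoder feature and
# concatenate it with the matching encoder feature along the channel axis.
# Shapes are illustrative assumptions, not the exact UNet channel counts.

def upsample2x(y):
    return np.repeat(np.repeat(y, 2, axis=1), 2, axis=2)

def skip_connect(encoder_feat, decoder_feat):
    """encoder_feat: (C_e, H, W); decoder_feat: (C_d, H/2, W/2)."""
    return np.concatenate([encoder_feat, upsample2x(decoder_feat)], axis=0)

enc = np.zeros((64, 8, 8))     # high-resolution, low-level encoder feature
dec = np.zeros((128, 4, 4))    # low-resolution, high-level decoder feature
print(skip_connect(enc, dec).shape)  # (192, 8, 8)
```

The following 3×3 convolutions then mix the concatenated channels, combining fine spatial detail with deep semantic context.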

Key Strengths

  • Excellent localization due to skip connections
  • Efficient with limited training data
  • Widely used in biomedical imaging and other dense prediction tasks
UNet can be thought of as a “funnel with bridges”: the encoder compresses context, the decoder reconstructs details, and skip connections act as bridges to preserve fine-grained spatial information.

DeepLab

Introduced: 2016–2018 (Chen et al., Google)

DeepLab is a semantic segmentation architecture that focuses on capturing multi-scale contextual information while maintaining high spatial resolution. It achieves this using atrous (dilated) convolutions and a specially designed Atrous Spatial Pyramid Pooling (ASPP) module.

Atrous (Dilated) Convolutions

Standard convolutions reduce spatial resolution as the network deepens, which can hurt segmentation accuracy. Atrous convolutions introduce a dilation rate \(r\), which effectively enlarges the convolutional kernel without increasing the number of parameters:

$$ y[i] = \sum_k x[i + r \cdot k] \cdot w[k] $$

Here:

  • \(x\) = input feature map
  • \(w\) = convolution kernel
  • \(r\) = dilation rate (spacing between kernel elements)

This allows DeepLab to capture larger receptive fields while preserving spatial resolution.
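The atrous formula can be checked on a 1-D toy example (boundary handling and padding are omitted for brevity):

```python
# 1-D toy version of the atrous convolution y[i] = sum_k x[i + r*k] * w[k];
# padding is omitted, so the output shrinks by the kernel's dilated span.

def atrous_conv1d(x, w, r):
    span = r * (len(w) - 1)            # how far the dilated kernel reaches
    return [sum(x[i + r * k] * w[k] for k in range(len(w)))
            for i in range(len(x) - span)]

x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 1, 1]                           # 3-tap summing kernel
print(atrous_conv1d(x, w, r=1))  # [6, 9, 12, 15, 18]
print(atrous_conv1d(x, w, r=2))  # [9, 12, 15]  (same kernel, wider span)
```

With \( r = 2 \) the same 3-tap kernel spans 5 input positions: a larger receptive field with no extra parameters.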

Atrous Spatial Pyramid Pooling (ASPP)

To further capture multi-scale context, DeepLab uses ASPP, which applies parallel atrous convolutions with different dilation rates:

$$ \text{ASPP}(x) = \text{Concat}(\text{Conv}_{r_1}(x), \text{Conv}_{r_2}(x), \dots, \text{GlobalAvgPool}(x)) $$

The concatenated features combine information from multiple scales, improving segmentation of objects of different sizes.

Backbones

DeepLab commonly uses powerful CNN backbones such as ResNet or Xception to extract deep semantic features before applying ASPP.

Key Strengths

  • High-resolution feature maps for accurate segmentation
  • Multi-scale context via ASPP
  • Compatible with strong backbone networks
  • Widely used in semantic segmentation benchmarks (PASCAL VOC, Cityscapes)
DeepLab can be seen as “zooming out and then focusing”: atrous convolutions expand the receptive field without downsampling, while ASPP gathers multi-scale context for precise segmentation.

5. Embedded and Mobile Models

Embedded and mobile models are designed to run efficiently on resource-constrained devices such as smartphones, drones, and IoT hardware. The key goal is to reduce computation, memory footprint, and latency, while maintaining competitive accuracy.

MobileNet

Introduced: 2017 (Howard et al., Google)

MobileNet uses depthwise separable convolutions, which factorize standard convolutions into two steps:

$$ \text{Conv}_{\text{standard}}\;(K \times K,\; M \rightarrow N \text{ channels}) \quad \Longrightarrow \quad \text{DepthwiseConv} \;+\; \text{PointwiseConv} $$

Here:

  • Depthwise convolution: applies a single filter per input channel
  • Pointwise convolution: 1×1 convolution to combine channels

This reduces computation and parameters dramatically:

$$ \text{Cost reduction} \approx \frac{1}{N} + \frac{1}{K^2} $$
  • Low latency, suitable for mobile devices
  • Used as backbone for SSD, YOLO-lite, and other lightweight detectors

ShuffleNet

Introduced: 2018 (Zhang et al., Megvii)

ShuffleNet focuses on group convolutions and channel shuffling to reduce computation while maintaining cross-channel information flow:

  • Group convolutions split channels into groups to save computation
  • Channel shuffle mixes features across groups to prevent information bottlenecks

This design achieves high efficiency and competitive accuracy for mobile vision tasks.
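The shuffle itself is just a reshape, transpose, and flatten, shown here on channel indices rather than real feature maps (a minimal sketch):

```python
# ShuffleNet's channel shuffle: reshape to (groups, channels_per_group),
# transpose, flatten — shown on channel indices for clarity.

def channel_shuffle(channels, groups):
    c_per_g = len(channels) // groups
    # element i of group g moves to position i * groups + g
    return [channels[g * c_per_g + i]
            for i in range(c_per_g)
            for g in range(groups)]

# 6 channels in 2 groups: [0, 1, 2 | 3, 4, 5] -> interleaved across groups,
# so the next grouped convolution sees channels from both groups.
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```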

EfficientNet-Lite

Introduced: 2020 (Google)

EfficientNet-Lite is a mobile-optimized variant of EfficientNet that keeps the compound scaling strategy but introduces:

  • Quantization-friendly operations for integer inference (e.g., ReLU6 in place of Swish)
  • Removal of squeeze-and-excitation blocks, which are not well supported on many mobile accelerators
  • Smaller models for low-power devices that maintain high accuracy per parameter
Mobile and embedded models strike a balance between accuracy, speed, and memory usage, making them essential for real-time AI applications on devices with limited resources.

6. Transfer Learning Backbones

In modern computer vision, transfer learning is a key strategy: models pre-trained on large datasets (e.g., ImageNet) are used as feature extractors for downstream tasks such as detection, segmentation, or classification. Using pre-trained backbones drastically reduces training time and improves performance, especially when labeled data is limited.

VGG

Introduced: 2014 (Simonyan & Zisserman)

VGG networks are known for their simple and uniform design using 3×3 convolutions stacked deeply:

  • VGG-16 and VGG-19 are popular variants
  • Deep but parameter-heavy (~138M parameters for VGG-16)
  • Effective feature extractor, but computationally expensive

ResNet

Introduced: 2015 (He et al.)

ResNet remains the most widely used backbone due to residual connections, which allow very deep networks without vanishing gradients:

  • Variants: ResNet-18, 34, 50, 101, 152
  • Residual blocks: \(y = F(x) + x\)
  • Strong generalization for detection, segmentation, and classification

EfficientNet

Introduced: 2019 (Tan & Le)

EfficientNet achieves high accuracy per parameter using compound scaling:

  • Scales depth, width, and resolution simultaneously
  • Variants: EfficientNet-B0 to B7, and Lite versions for mobile
  • Excellent trade-off between model size, speed, and accuracy
Modern computer vision pipelines almost always rely on pre-trained backbones (VGG, ResNet, EfficientNet, etc.) and fine-tune them for downstream tasks, providing a strong starting point for both accuracy and efficiency.
