Computer Vision Foundations and Model Architectures

Computer Vision (CV) focuses on enabling machines to understand visual data. Modern CV systems rely on deep neural networks that perform tasks such as image classification, object detection, and image segmentation. This blog provides a structured overview of these tasks and the most commonly used architectures behind them.

Rather than treating models as black boxes, we focus on why each architecture was introduced, what problems it solved, and where it is used today.

1. Core Vision Tasks

Image Classification

Assigns a single label (or multiple labels) to an entire image.

$$ \hat{y} = \arg\max_y p(y \mid x) $$

Object Detection

Predicts both what objects are present and where they are.

$$ (\text{class}, x, y, w, h) $$

Segmentation

Assigns a class label to each pixel.

$$ p(y_i \mid x) $$
Classification answers what, detection answers what and where, and segmentation answers what, where, and which pixels.

2. Image Classification Architectures

ResNet

Introduced: 2015 (Kaiming He et al., Microsoft Research)

As convolutional networks became deeper, researchers observed the degradation problem: increasing depth led to higher training error, even when overfitting was not the issue. ResNet addresses this by reformulating the learning objective through residual learning, enabling very deep networks to be trained effectively.

Instead of directly learning a mapping \( H(x) \), a residual block learns a residual function:

$$ F(x) = H(x) - x $$

The original mapping can then be recovered via an identity skip connection:

$$ y = F(x) + x $$

This simple formulation allows gradients to flow directly through the identity path during backpropagation:

$$ \frac{\partial y}{\partial x} = \frac{\partial F(x)}{\partial x} + 1 $$
The identity term ensures that gradients do not vanish, making it possible to train extremely deep networks without degradation.

ResNet is composed of stacked residual blocks, typically using two main variants:

  • Basic blocks: used in ResNet-18 and ResNet-34; each block consists of two 3×3 convolutions.
  • Bottleneck blocks: used in ResNet-50, 101, and 152; each block compresses and expands channels with 1×1 → 3×3 → 1×1 convolutions, reducing computation while preserving representational power.

A bottleneck residual block can be represented as:

$$ x \;\rightarrow\; \text{Conv}_{1\times1} \;\rightarrow\; \text{Conv}_{3\times3} \;\rightarrow\; \text{Conv}_{1\times1} + x $$
Key contributions:

  • Enabled training of very deep networks (50–152+ layers)
  • Improved optimization stability and convergence
  • Strong generalization across image recognition, detection, and segmentation tasks
  • Widely adopted as a backbone for detection (FPN, Faster R-CNN) and segmentation models (DeepLab)
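The effect of the identity path on gradients can be checked numerically. Here is a minimal sketch using a toy scalar residual function \( F(x) = wx \) as a stand-in for the convolutional stack (an illustrative assumption, not ResNet itself):

```python
# Numerical check of the gradient identity dy/dx = dF/dx + 1, using a toy
# scalar residual F(x) = w * x as a stand-in for the convolutional stack.

def residual_forward(x, w):
    return w * x + x          # y = F(x) + x

def residual_grad(x, w):
    return w + 1.0            # dF/dx + 1: the identity path contributes +1

x, w, eps = 2.0, 0.3, 1e-6
numeric = (residual_forward(x + eps, w) - residual_forward(x - eps, w)) / (2 * eps)
print(abs(numeric - residual_grad(x, w)) < 1e-6)  # True: gradients agree
```

Even if \( \partial F / \partial x \) shrinks toward zero in a deep stack, the constant 1 from the identity path keeps the overall gradient from vanishing.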

EfficientNet

Introduced: 2019 (Google, Mingxing Tan & Quoc V. Le)

EfficientNet revolutionized CNN design by introducing a principled compound scaling method. Instead of arbitrarily scaling network depth, width, or input resolution, EfficientNet scales all three dimensions in a balanced way:

$$ \text{depth} \;\propto \; \alpha^\phi, \quad \text{width} \;\propto \; \beta^\phi, \quad \text{resolution} \;\propto \; \gamma^\phi $$

Here:

  • \( \phi \) is the compound coefficient controlling overall model size
  • \( \alpha, \beta, \gamma \) are constants determined via a small grid search to satisfy \( \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2 \), ensuring roughly doubled FLOPs when \( \phi \) increases by 1
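The scaling rule can be sketched in a few lines. The constants \( \alpha = 1.2, \beta = 1.1, \gamma = 1.15 \) below are the values reported in the EfficientNet paper; the helper function itself is illustrative, not the official implementation:

```python
# Sketch of compound scaling. alpha=1.2, beta=1.1, gamma=1.15 are the
# grid-searched values from the EfficientNet paper; the helper itself is
# illustrative, not the official implementation.

def compound_scale(base_depth, base_width, base_res, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Scale depth, width, and input resolution with one coefficient phi."""
    return (base_depth * alpha ** phi,
            base_width * beta ** phi,
            base_res * gamma ** phi)

# One step in phi should roughly double FLOPs, since FLOPs grow with
# depth * width^2 * resolution^2:
flops_factor = 1.2 * 1.1 ** 2 * 1.15 ** 2
print(round(flops_factor, 2))  # 1.92, close to the target of 2
```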

Key architectural features of EfficientNet include:

  • MBConv blocks: Mobile Inverted Bottleneck Convolutions from MobileNetV2, with depthwise separable convolutions for efficiency.
  • Squeeze-and-Excitation (SE) modules: Channel-wise attention mechanism that recalibrates feature maps adaptively, improving representation quality.
  • Swish activation: Smooth, non-monotonic activation function improving training stability and accuracy.

EfficientNet achieves state-of-the-art accuracy per parameter on ImageNet while using significantly fewer FLOPs compared to older architectures like ResNet or Inception.

By combining MBConv, SE modules, and compound scaling, EfficientNet provides a highly efficient backbone suitable for classification, detection, and segmentation tasks.

ConvNeXt

Introduced: 2022 (Meta AI, Liu et al.)

ConvNeXt revisited convolutional neural networks (CNNs) and modernized them by adopting several design principles inspired by Vision Transformers (ViTs), while retaining the efficiency and inductive biases of convolutions.

The key idea was to show that a well-tuned CNN can match or exceed ViTs in performance on image classification benchmarks.

ConvNeXt introduces several modifications:

  • Large kernel convolutions: Uses 7×7 depthwise convolutions instead of traditional 3×3, increasing the receptive field for global context:
$$ y_{i,j} = \sum_{m=-3}^{3} \sum_{n=-3}^{3} w_{m,n} \cdot x_{i+m, j+n} $$
  • LayerNorm instead of BatchNorm: Normalizes features across channels for better stability, especially in deeper networks:
$$ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta $$
  • Inverted bottleneck structure: Expands channels before depthwise convolution and projects back, similar to MobileNetV2 blocks.
  • Stochastic depth: Randomly drops residual blocks during training for regularization.

ConvNeXt achieves competitive performance with Vision Transformers on ImageNet-1k classification while maintaining:

  • High parameter and FLOPs efficiency
  • Strong generalization to downstream tasks (detection, segmentation)
By modernizing CNN design with large kernels, LayerNorm, and residual bottlenecks, ConvNeXt demonstrates that convolutional architectures remain highly competitive in the era of Transformers.
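As a small illustration of the LayerNorm step, here is per-position channel normalization in NumPy (a minimal sketch of the normalization alone, not a full ConvNeXt block):

```python
import numpy as np

# Minimal sketch of LayerNorm: normalize one spatial position across its C
# channels, then apply a learned scale (gamma) and shift (beta).

def layer_norm(x, gamma, beta, eps=1e-6):
    """x: (C,) channel vector at a single spatial location."""
    mu, var = x.mean(), x.var()
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x, gamma=1.0, beta=0.0)
print(abs(y.mean()) < 1e-9, abs(y.std() - 1.0) < 1e-3)  # True True
```

Unlike BatchNorm, the statistics depend only on the single input vector, so behavior is identical at training and inference time and independent of batch size.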

3. Object Detection Architectures

ResNet + FPN

Introduced: 2017 (Lin et al., Facebook AI Research)

When we detect objects in images, one big challenge is scale variation: objects can be tiny or huge, but a standard CNN backbone like ResNet produces feature maps at fixed scales. Feature Pyramid Networks (FPN) solve this problem by combining features from different layers of the backbone to create a multi-scale feature pyramid.

Here’s the intuition step by step:

  1. ResNet extracts features: Deep layers capture strong semantic meaning (what the object is), but low resolution; shallow layers retain high spatial resolution (where the object is).
  2. Top-down pathway: Start from the deepest (smallest) feature map, then upsample it to the next higher resolution layer.
  3. Lateral connections: Combine the upsampled feature with the corresponding feature from the backbone (same spatial size) to keep fine details.

Mathematically, this fusion is:

$$ P_l = C_l + \text{Upsample}(P_{l+1}) $$
  • \(C_l\) = feature from ResNet at level \(l\)
  • \(P_l\) = final pyramid feature at level \(l\)
  • \(\text{Upsample}(P_{l+1})\) = upsampled higher-level feature

The result is a set of feature maps at different scales, each containing both semantic richness and spatial detail. These features are then used by the detection heads (like Faster R-CNN or RetinaNet) to detect objects of all sizes.

  • Helps detect small objects using high-resolution fused features
  • Keeps large-object features semantically strong
  • Allows a single backbone to serve multiple scales efficiently
Think of FPN as giving your network “super vision”: it can see tiny details without losing the big picture. That’s why almost every modern detector uses FPN with a ResNet backbone.
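The top-down fusion above can be sketched on toy single-channel feature maps. Real FPNs first apply 1×1 convolutions so all levels share the same channel count; that step is omitted here:

```python
import numpy as np

# Toy sketch of the fusion rule P_l = C_l + Upsample(P_{l+1}) on
# single-channel feature maps, using nearest-neighbor upsampling.

def upsample2x(p):
    return np.repeat(np.repeat(p, 2, axis=0), 2, axis=1)

def fpn_fuse(c_levels):
    """c_levels: backbone features ordered deepest (coarsest) first."""
    p = c_levels[0]                    # the deepest map starts the pathway
    pyramid = [p]
    for c in c_levels[1:]:
        p = c + upsample2x(p)          # lateral connection + upsampled P
        pyramid.append(p)
    return pyramid

pyramid = fpn_fuse([np.ones((2, 2)), np.ones((4, 4)), np.ones((8, 8))])
print([p.shape for p in pyramid])  # [(2, 2), (4, 4), (8, 8)]
```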

EfficientDet

Introduced: 2020 (Tan et al., Google AI)

EfficientDet is a modern object detector designed for both high accuracy and efficiency. It builds on two main ideas:

  1. EfficientNet backbone: Provides strong, lightweight feature extraction for the image, balancing depth, width, and resolution efficiently.
  2. Bi-directional Feature Pyramid Network (BiFPN): Improves multi-scale feature fusion for detecting objects of all sizes.

Here’s the reasoning behind BiFPN step by step:

  1. Problem with standard FPN: Normal FPN fuses features in a top-down pathway only, giving equal weight to all inputs.
  2. Bi-directional fusion: BiFPN introduces both top-down and bottom-up paths, allowing features to influence each other in both directions.
  3. Weighted feature fusion: BiFPN learns the importance of each input automatically, instead of just adding them equally:
$$ \hat{P}_l = \frac{\sum_i w_i \cdot P_i}{\sum_i w_i + \epsilon} $$
  • \(P_i\) = input features to the fusion node
  • \(w_i\) = learnable weight for each input
  • \(\epsilon\) = small constant to avoid division by zero

This weighted fusion ensures the network focuses more on the informative features at each level, improving both small and large object detection.
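The fusion formula can be sketched directly. Following the paper's fast normalized fusion, weights are clamped to be non-negative before normalizing (the paper applies ReLU to the learned weights); the function below is illustrative, shown on scalar "features":

```python
# Sketch of BiFPN's fast normalized fusion:
# hat{P} = sum(w_i * P_i) / (sum(w_i) + eps), with non-negative weights.

def weighted_fusion(features, weights, eps=1e-4):
    w = [max(0.0, wi) for wi in weights]          # keep weights non-negative
    fused = sum(wi * pi for wi, pi in zip(w, features))
    return fused / (sum(w) + eps)

# Two scalar "features" with learned weights 2.0 and 1.0: the first input
# contributes twice as much to the fused output.
print(round(weighted_fusion([3.0, 6.0], [2.0, 1.0]), 3))  # 4.0
```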

EfficientDet also introduces compound scaling for the backbone, BiFPN, and box/class prediction layers, producing models from D0 (lightweight) to D7 (very powerful) while maintaining efficiency.

  • Strong accuracy-to-parameter ratio compared to traditional detectors
  • Efficient for real-time or resource-limited applications
  • Automatically balances multi-scale features through learnable fusion
Think of EfficientDet as giving your detector a “smart lens”: it decides which features matter most at each scale while staying lightweight and fast.

MobileNet (SSD / YOLO Backbones)

Introduced: 2017 (Howard et al., Google)

MobileNet was designed for real-time applications on edge devices like smartphones and drones. The key idea is to reduce computation while maintaining reasonable accuracy, making it ideal as a backbone for lightweight object detectors such as SSD and YOLO.

Core Idea: Depthwise Separable Convolutions

Traditional convolution combines spatial and channel-wise operations in one step, which is computationally expensive. MobileNet factorizes this into two steps:

$$ \text{Conv}_{\text{standard}}: H \times W \times C_{in} \rightarrow H \times W \times C_{out} $$ $$ \text{Cost} \sim H \cdot W \cdot C_{in} \cdot C_{out} \cdot K \cdot K $$
$$ \text{Depthwise Separable Conv: } \text{Depthwise } (K \times K \text{ per channel}) + \text{Pointwise } (1 \times 1 \text{ across channels}) $$ $$ \text{Cost} \sim H \cdot W \cdot C_{in} \cdot K^2 + H \cdot W \cdot C_{in} \cdot C_{out} $$

This reduces computation roughly by a factor of \( \frac{1}{C_{out}} + \frac{1}{K^2} \), enabling real-time inference on low-power devices.
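The cost comparison can be verified with a quick FLOP count (a rough sketch that ignores bias terms and padding):

```python
# Quick FLOP count comparing a standard KxK convolution with its depthwise
# separable factorization (rough sketch: bias terms and padding ignored).

def conv_flops(h, w, c_in, c_out, k):
    return h * w * c_in * c_out * k * k

def depthwise_separable_flops(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k    # one KxK filter per input channel
    pointwise = h * w * c_in * c_out    # 1x1 conv mixing channels
    return depthwise + pointwise

h = w = 56
c_in = c_out = 128
k = 3
ratio = depthwise_separable_flops(h, w, c_in, c_out, k) / conv_flops(h, w, c_in, c_out, k)
print(round(ratio, 4))  # 0.1189, matching 1/c_out + 1/k^2
```

For a typical 3×3 layer the savings are dominated by the \( 1/K^2 \) term: roughly an 8–9× reduction in compute.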

MobileNet as a Backbone

  • SSD (Single Shot Detector): MobileNet features are used for multi-scale detection while keeping the network lightweight.
  • YOLO Variants: MobileNet provides a fast backbone that balances speed and accuracy for embedded systems.
  • Low latency and small model size are critical for edge deployment.
MobileNet is like giving your object detector “a light engine”: it extracts features efficiently without consuming much power, perfect for real-time edge scenarios.

4. Segmentation Architectures

UNet

Introduced: 2015 (Ronneberger et al., MICCAI)

UNet is a specialized convolutional network for image segmentation, particularly designed for medical imaging tasks where annotated data is scarce. It follows an encoder–decoder architecture with skip connections that combine low-level spatial information with high-level semantic features.

Encoder–Decoder Architecture

The encoder (contracting path) progressively reduces spatial dimensions while increasing feature channels, extracting high-level context:

$$ x_{l+1} = \text{Conv}_{3\times3}(\text{ReLU}(\text{Conv}_{3\times3}(x_l))) $$ $$ x_{l+1} = \text{MaxPool}(x_{l+1}) $$

The decoder (expanding path) upsamples the features to recover spatial resolution:

$$ y_{l} = \text{Concat}(\text{UpConv}(y_{l+1}),\; x_{l}) \quad \text{(skip connection)} $$ $$ y_{l} = \text{Conv}_{3\times3}(\text{ReLU}(\text{Conv}_{3\times3}(y_l))) $$

Skip Connections

The skip connections directly transfer feature maps from the encoder to the decoder, where they are concatenated with the upsampled decoder features along the channel axis (the original UNet concatenates; some later variants add instead). Mathematically, if \( x_l \) is the encoder feature and \( y_l \) is the decoder feature at the same level:

$$ y_l = \text{Concat}(f_{\text{decoder}}(y_{l+1}),\; x_l) $$

This allows the network to leverage both high-resolution spatial information and deep semantic context, improving segmentation accuracy, especially for small structures.
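A toy NumPy sketch of the skip connection, assuming (for illustration) that the decoder feature has twice the channels and half the resolution of the matching encoder feature:

```python
import numpy as np

# Toy sketch of a UNet skip connection: upsample the decoder feature and
# concatenate it with the matching encoder feature along the channel axis.
# Shapes are illustrative assumptions, not the exact UNet channel counts.

def upsample2x(y):
    return np.repeat(np.repeat(y, 2, axis=1), 2, axis=2)

def skip_connect(encoder_feat, decoder_feat):
    """encoder_feat: (C_e, H, W); decoder_feat: (C_d, H/2, W/2)."""
    return np.concatenate([encoder_feat, upsample2x(decoder_feat)], axis=0)

enc = np.zeros((64, 8, 8))     # high-resolution, low-level encoder feature
dec = np.zeros((128, 4, 4))    # low-resolution, high-level decoder feature
print(skip_connect(enc, dec).shape)  # (192, 8, 8)
```

The following 3×3 convolutions then mix the concatenated channels, combining fine spatial detail with deep semantic context.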

Key Strengths

  • Excellent localization due to skip connections
  • Efficient with limited training data
  • Widely used in biomedical imaging and other dense prediction tasks
UNet can be thought of as a “funnel with bridges”: the encoder compresses context, the decoder reconstructs details, and skip connections act as bridges to preserve fine-grained spatial information.

DeepLab

Introduced: 2016–2018 (Chen et al., Google)

DeepLab is a semantic segmentation architecture that focuses on capturing multi-scale contextual information while maintaining high spatial resolution. It achieves this using atrous (dilated) convolutions and a specially designed Atrous Spatial Pyramid Pooling (ASPP) module.

Atrous (Dilated) Convolutions

Standard convolutions reduce spatial resolution as the network deepens, which can hurt segmentation accuracy. Atrous convolutions introduce a dilation rate \(r\), which effectively enlarges the convolutional kernel without increasing the number of parameters:

$$ y[i] = \sum_k x[i + r \cdot k] \cdot w[k] $$

Here:

  • \(x\) = input feature map
  • \(w\) = convolution kernel
  • \(r\) = dilation rate (spacing between kernel elements)

This allows DeepLab to capture larger receptive fields while preserving spatial resolution.
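The atrous formula can be checked on a 1-D toy example (boundary handling and padding are omitted for brevity):

```python
# 1-D toy version of the atrous convolution y[i] = sum_k x[i + r*k] * w[k];
# padding is omitted, so the output shrinks by the kernel's dilated span.

def atrous_conv1d(x, w, r):
    span = r * (len(w) - 1)            # how far the dilated kernel reaches
    return [sum(x[i + r * k] * w[k] for k in range(len(w)))
            for i in range(len(x) - span)]

x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 1, 1]                           # 3-tap summing kernel
print(atrous_conv1d(x, w, r=1))  # [6, 9, 12, 15, 18]
print(atrous_conv1d(x, w, r=2))  # [9, 12, 15]  (same kernel, wider span)
```

With \( r = 2 \) the same 3-tap kernel spans 5 input positions: a larger receptive field with no extra parameters.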

Atrous Spatial Pyramid Pooling (ASPP)

To further capture multi-scale context, DeepLab uses ASPP, which applies parallel atrous convolutions with different dilation rates:

$$ \text{ASPP}(x) = \text{Concat}(\text{Conv}_{r_1}(x), \text{Conv}_{r_2}(x), \dots, \text{GlobalAvgPool}(x)) $$

The concatenated features combine information from multiple scales, improving segmentation of objects of different sizes.

Backbones

DeepLab commonly uses powerful CNN backbones such as ResNet or Xception to extract deep semantic features before applying ASPP.

Key Strengths

  • High-resolution feature maps for accurate segmentation
  • Multi-scale context via ASPP
  • Compatible with strong backbone networks
  • Widely used in semantic segmentation benchmarks (PASCAL VOC, Cityscapes)
DeepLab can be seen as “zooming out and then focusing”: atrous convolutions expand the receptive field without downsampling, while ASPP gathers multi-scale context for precise segmentation.

5. Embedded and Mobile Models

Embedded and mobile models are designed to run efficiently on resource-constrained devices such as smartphones, drones, and IoT hardware. The key goal is to reduce computation, memory footprint, and latency, while maintaining competitive accuracy.

MobileNet

Introduced: 2017 (Howard et al., Google)

MobileNet uses depthwise separable convolutions, which factorize standard convolutions into two steps:

$$ \text{Conv}_{\text{standard}}\;(K \times K,\; M \rightarrow N \text{ channels}) \quad \Longrightarrow \quad \text{DepthwiseConv} \;+\; \text{PointwiseConv} $$

Here:

  • Depthwise convolution: applies a single filter per input channel
  • Pointwise convolution: 1×1 convolution to combine channels

This reduces computation and parameters dramatically:

$$ \text{Cost reduction} \approx \frac{1}{N} + \frac{1}{K^2} $$
  • Low latency, suitable for mobile devices
  • Used as backbone for SSD, YOLO-lite, and other lightweight detectors

ShuffleNet

Introduced: 2018 (Zhang et al., Megvii)

ShuffleNet focuses on group convolutions and channel shuffling to reduce computation while maintaining cross-channel information flow:

  • Group convolutions split channels into groups to save computation
  • Channel shuffle mixes features across groups to prevent information bottlenecks

This design achieves high efficiency and competitive accuracy for mobile vision tasks.
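The shuffle itself is just a reshape, transpose, and flatten, shown here on channel indices rather than real feature maps (a minimal sketch):

```python
# ShuffleNet's channel shuffle: reshape to (groups, channels_per_group),
# transpose, flatten — shown on channel indices for clarity.

def channel_shuffle(channels, groups):
    c_per_g = len(channels) // groups
    # element i of group g moves to position i * groups + g
    return [channels[g * c_per_g + i]
            for i in range(c_per_g)
            for g in range(groups)]

# 6 channels in 2 groups: [0, 1, 2 | 3, 4, 5] -> interleaved across groups,
# so the next grouped convolution sees channels from both groups.
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```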

EfficientNet-Lite

Introduced: 2020 (Google)

EfficientNet-Lite is a mobile-optimized variant of EfficientNet that keeps the compound scaling strategy but introduces:

  • Quantization-friendly operations for integer inference (e.g., ReLU6 in place of Swish)
  • Removal of squeeze-and-excitation blocks, which are not well supported on many mobile accelerators
  • Smaller models for low-power devices that maintain high accuracy per parameter
Mobile and embedded models strike a balance between accuracy, speed, and memory usage, making them essential for real-time AI applications on devices with limited resources.

6. Transfer Learning Backbones

In modern computer vision, transfer learning is a key strategy: models pre-trained on large datasets (e.g., ImageNet) are used as feature extractors for downstream tasks such as detection, segmentation, or classification. Using pre-trained backbones drastically reduces training time and improves performance, especially when labeled data is limited.

VGG

Introduced: 2014 (Simonyan & Zisserman)

VGG networks are known for their simple and uniform design using 3×3 convolutions stacked deeply:

  • VGG-16 and VGG-19 are popular variants
  • Deep but parameter-heavy (~138M parameters for VGG-16)
  • Effective feature extractor, but computationally expensive

ResNet

Introduced: 2015 (He et al.)

ResNet remains the most widely used backbone due to residual connections, which allow very deep networks without vanishing gradients:

  • Variants: ResNet-18, 34, 50, 101, 152
  • Residual blocks: \(y = F(x) + x\)
  • Strong generalization for detection, segmentation, and classification

EfficientNet

Introduced: 2019 (Tan & Le)

EfficientNet achieves high accuracy per parameter using compound scaling:

  • Scales depth, width, and resolution simultaneously
  • Variants: EfficientNet-B0 to B7, and Lite versions for mobile
  • Excellent trade-off between model size, speed, and accuracy
Modern computer vision pipelines almost always rely on pre-trained backbones (VGG, ResNet, EfficientNet, etc.) and fine-tune them for downstream tasks, providing a strong starting point for both accuracy and efficiency.
