Foundations of Computer Vision and Model Architectures
Computer Vision (CV) focuses on enabling machines to understand visual data. Modern CV systems rely on deep neural networks that perform tasks such as image classification, object detection, and image segmentation. This blog provides a structured overview of these tasks and the most commonly used architectures behind them.
1. Core Vision Tasks
Image Classification
Assigns a single label (or multiple labels) to an entire image.
Object Detection
Predicts both what objects are present and where they are.
Segmentation
Assigns a class label to each pixel.
2. Image Classification Architectures
ResNet
Introduced: 2015 (Kaiming He et al., Microsoft Research)
As convolutional networks became deeper, researchers observed the degradation problem: increasing depth led to higher training error, even when overfitting was not the issue. ResNet addresses this by reformulating the learning objective through residual learning, enabling very deep networks to be trained effectively.
Instead of directly learning a mapping \( H(x) \), a residual block learns a residual function \( F(x) = H(x) - x \).
The original mapping is then recovered via an identity skip connection: \( H(x) = F(x) + x \).
This simple formulation allows gradients to flow directly through the identity path during backpropagation, since \( \frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + I \): even when the residual branch's gradient is small, the identity term keeps the signal alive.
ResNet is composed of stacked residual blocks, typically using two main variants:
- Basic blocks: used in ResNet-18 and ResNet-34; consist of two 3×3 convolutions per block.
- Bottleneck blocks: used in ResNet-50, 101, and 152; compress and expand channels with 1×1 → 3×3 → 1×1 convolutions, reducing computation while preserving representational power.
A bottleneck residual block can be represented as \( y = x + W_3^{1\times 1}\, \sigma\big(W_2^{3\times 3}\, \sigma(W_1^{1\times 1} x)\big) \), where \( \sigma \) denotes the ReLU activation.
Key Strengths
- Enabled training of very deep networks (50–152+ layers)
- Improved optimization stability and convergence
- Strong generalization across image recognition, detection, and segmentation tasks
- Widely adopted as a backbone for detection (FPN, Faster R-CNN) and segmentation models (DeepLab)
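In code, residual learning reduces to adding the input back onto the branch output. A minimal NumPy sketch, with a two-layer transform standing in for the block's convolutions (the shapes and weights here are illustrative, not the paper's):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = F(x) + x, where F is a small two-layer transform.

    The identity path means the block can do no worse than the
    identity mapping: dy/dx = dF/dx + I.
    """
    f = relu(x @ w1) @ w2   # residual branch F(x)
    return f + x            # identity skip connection

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
# Zero weights make F(x) = 0, so the block reduces to the identity:
y = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
assert np.allclose(y, x)
```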
EfficientNet
Introduced: 2019 (Google, Mingxing Tan & Quoc V. Le)
EfficientNet revolutionized CNN design by introducing a principled compound scaling method. Instead of arbitrarily scaling network depth, width, or input resolution, EfficientNet scales all three dimensions in a balanced way: depth \( d = \alpha^{\phi} \), width \( w = \beta^{\phi} \), and resolution \( r = \gamma^{\phi} \).
Here:
- \( \phi \) is the compound coefficient controlling overall model size
- \( \alpha, \beta, \gamma \) are constants determined via a small grid search to satisfy \( \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2 \), ensuring roughly doubled FLOPs when \( \phi \) increases by 1
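The constants above translate directly into concrete multipliers; a short sketch using the values reported for the B0 baseline (\( \alpha = 1.2, \beta = 1.1, \gamma = 1.15 \)):

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Multipliers for depth, width, and input resolution under
    compound scaling (constants are EfficientNet's reported values)."""
    return alpha ** phi, beta ** phi, gamma ** phi

# FLOPs grow roughly with depth * width^2 * resolution^2, so each
# increment of phi multiplies cost by alpha * beta^2 * gamma^2 ≈ 2.
d, w, r = compound_scale(3)
flops_factor = d * w ** 2 * r ** 2
assert 2 ** 2.5 < flops_factor < 2 ** 3.5   # close to 2^phi
```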
Key architectural features of EfficientNet include:
- MBConv blocks: Mobile Inverted Bottleneck Convolutions from MobileNetV2, with depthwise separable convolutions for efficiency.
- Squeeze-and-Excitation (SE) modules: Channel-wise attention mechanism that recalibrates feature maps adaptively, improving representation quality.
- Swish activation: Smooth, non-monotonic activation function improving training stability and accuracy.
EfficientNet achieves state-of-the-art accuracy per parameter on ImageNet while using significantly fewer FLOPs compared to older architectures like ResNet or Inception.
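The SE recalibration mentioned above is compact enough to sketch directly: global-average-pool each channel, pass the result through a small two-layer gate, then rescale the channels. The weights and reduction ratio below are illustrative, not the library's values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, w1, w2):
    """Channel attention on a feature map x of shape (C, H, W).

    Squeeze: global average pool to one value per channel.
    Excite:  a two-layer gate producing per-channel weights in (0, 1).
    """
    s = x.mean(axis=(1, 2))                     # (C,) squeeze
    gate = sigmoid(np.maximum(s @ w1, 0) @ w2)  # (C,) excitation
    return x * gate[:, None, None]              # recalibrated map

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))
w1 = rng.standard_normal((16, 4))   # reduction ratio 4 (illustrative)
w2 = rng.standard_normal((4, 16))
y = squeeze_excite(x, w1, w2)
assert y.shape == x.shape
```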
ConvNeXt
Introduced: 2022 (Meta AI, Liu et al.)
ConvNeXt revisited convolutional neural networks (CNNs) and modernized them by adopting several design principles inspired by Vision Transformers (ViTs), while retaining the efficiency and inductive biases of convolutions.
The key idea was to show that a well-tuned CNN can match or exceed ViTs in performance on image classification benchmarks.
ConvNeXt introduces several modifications:
- Large kernel convolutions: Uses 7×7 depthwise convolutions instead of the traditional 3×3, enlarging the receptive field for more global context.
- LayerNorm instead of BatchNorm: Normalizes each feature vector across channels, improving stability, especially in deeper networks.
- Inverted bottleneck structure: Expands channels before the depthwise convolution and projects back, similar to MobileNetV2 blocks.
- Stochastic depth: Randomly drops residual blocks during training for regularization, replacing \( y = x + F(x) \) with \( y = x + b \cdot F(x) \), where \( b \sim \text{Bernoulli}(p) \).
ConvNeXt achieves competitive performance with Vision Transformers on ImageNet-1k classification while maintaining:
- High parameter and FLOPs efficiency
- Strong generalization to downstream tasks (detection, segmentation)
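Of the modifications above, stochastic depth is the easiest to sketch: during training each residual branch is kept with some survival probability, and at inference the branch is scaled by that probability to match the expectation. A toy sketch (the survival probability and branch function are assumptions):

```python
import numpy as np

def stochastic_depth(x, branch, p_survive, training, rng):
    """Residual block with stochastic depth.

    Training:  y = x + b * F(x), with b ~ Bernoulli(p_survive)
    Inference: y = x + p_survive * F(x)  (the expected value)
    """
    if training:
        b = float(rng.random() < p_survive)
        return x + b * branch(x)
    return x + p_survive * branch(x)

rng = np.random.default_rng(0)
x = np.ones(4)
branch = lambda t: 2.0 * t   # stand-in for the block's transform
y_eval = stochastic_depth(x, branch, 0.8, training=False, rng=rng)
# Inference is deterministic: 1 + 0.8 * 2 = 2.6 per element
assert np.allclose(y_eval, 2.6)
```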
3. Object Detection Architectures
ResNet + FPN
Introduced: 2017 (Lin et al., Facebook AI Research)
When we detect objects in images, one big challenge is scale variation: objects can be tiny or huge, but a standard CNN backbone like ResNet produces feature maps at fixed scales. Feature Pyramid Networks (FPN) solve this problem by combining features from different layers of the backbone to create a multi-scale feature pyramid.
Here’s the intuition step by step:
- ResNet extracts features: Deep layers capture strong semantic meaning (what the object is), but low resolution; shallow layers retain high spatial resolution (where the object is).
- Top-down pathway: Start from the deepest (smallest) feature map, then upsample it to the next higher resolution layer.
- Lateral connections: Combine the upsampled feature with the corresponding feature from the backbone (same spatial size) to keep fine details.
Mathematically, this fusion is \( P_l = \text{Conv}_{1\times 1}(C_l) + \text{Upsample}(P_{l+1}) \), where:
- \(C_l\) = feature from ResNet at level \(l\)
- \(P_l\) = final pyramid feature at level \(l\)
- \(\text{Upsample}(P_{l+1})\) = upsampled higher-level feature
The result is a set of feature maps at different scales, each containing both semantic richness and spatial detail. These features are then used by the detection heads (like Faster R-CNN or RetinaNet) to detect objects of all sizes.
- Helps detect small objects using high-resolution fused features
- Keeps large-object features semantically strong
- Allows a single backbone to serve multiple scales efficiently
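One fusion step can be sketched with nearest-neighbor upsampling; for brevity the 1×1 lateral convolution is reduced to the identity, which is a simplification of the real FPN:

```python
import numpy as np

def upsample2x(p):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return p.repeat(2, axis=1).repeat(2, axis=2)

def fpn_level(c_l, p_next):
    """P_l = lateral(C_l) + Upsample(P_{l+1}); the 1x1 lateral
    convolution is simplified to the identity here."""
    return c_l + upsample2x(p_next)

c3 = np.ones((8, 16, 16))      # shallower, higher-resolution feature
p4 = np.full((8, 8, 8), 2.0)   # deeper pyramid level
p3 = fpn_level(c3, p4)
assert p3.shape == (8, 16, 16)
assert np.allclose(p3, 3.0)    # 1 (lateral) + 2 (upsampled)
```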
EfficientDet
Introduced: 2020 (Tan et al., Google AI)
EfficientDet is a modern object detector designed for both high accuracy and efficiency. It builds on two main ideas:
- EfficientNet backbone: Provides strong, lightweight feature extraction for the image, balancing depth, width, and resolution efficiently.
- Bi-directional Feature Pyramid Network (BiFPN): Improves multi-scale feature fusion for detecting objects of all sizes.
Here’s the reasoning behind BiFPN step by step:
- Problem with standard FPN: A plain FPN fuses features through a top-down pathway only and gives equal weight to every input during fusion.
- Bi-directional fusion: BiFPN introduces both top-down and bottom-up paths, allowing features to influence each other in both directions.
- Weighted feature fusion: BiFPN learns the importance of each input automatically, instead of just adding them equally, using fast normalized fusion \( O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j}\, P_i \), where:
- \(P_i\) = input features to the fusion node
- \(w_i\) = learnable weight for each input
- \(\epsilon\) = small constant to avoid division by zero
This weighted fusion ensures the network focuses more on the informative features at each level, improving both small and large object detection.
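A minimal sketch of this fast normalized fusion, assuming two input feature maps and scalar weights:

```python
import numpy as np

def weighted_fusion(features, w, eps=1e-4):
    """O = sum_i (w_i / (eps + sum_j w_j)) * P_i.

    Weights are clipped to be non-negative (as with a ReLU), so the
    normalization behaves like a softmax-free attention over inputs.
    """
    w = np.maximum(np.asarray(w, dtype=float), 0.0)
    norm = w / (eps + w.sum())
    return sum(n * f for n, f in zip(norm, features))

p_a = np.full((4, 4), 1.0)
p_b = np.full((4, 4), 3.0)
out = weighted_fusion([p_a, p_b], w=[1.0, 1.0])
# Equal weights reduce to (almost) plain averaging: (1 + 3) / 2 = 2
assert np.allclose(out, 2.0, atol=1e-3)
```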
EfficientDet also introduces compound scaling for the backbone, BiFPN, and box/class prediction layers, producing models from D0 (lightweight) to D7 (very powerful) while maintaining efficiency.
- Strong accuracy-to-parameter ratio compared to traditional detectors
- Efficient for real-time or resource-limited applications
- Automatically balances multi-scale features through learnable fusion
MobileNet (SSD / YOLO Backbones)
Introduced: 2017 (Howard et al., Google)
MobileNet was designed for real-time applications on edge devices like smartphones and drones. The key idea is to reduce computation while maintaining reasonable accuracy, making it ideal as a backbone for lightweight object detectors such as SSD and YOLO.
Core Idea: Depthwise Separable Convolutions
Traditional convolution combines spatial and channel-wise operations in one step, which is computationally expensive. MobileNet factorizes this into two steps: a depthwise convolution that applies one \( K \times K \) filter per input channel, followed by a 1×1 pointwise convolution that mixes information across channels.
This reduces computation to roughly a fraction \( \frac{1}{C_{out}} + \frac{1}{K^2} \) of a standard convolution's cost, enabling real-time inference on low-power devices.
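The saving is easy to verify by counting multiplications; the sketch below compares a standard K×K convolution with its depthwise-separable factorization (the feature-map sizes are illustrative):

```python
def conv_flops(h, w, k, c_in, c_out):
    """Multiply counts for one layer over an h x w feature map."""
    standard = h * w * k * k * c_in * c_out
    depthwise = h * w * k * k * c_in   # one k x k filter per channel
    pointwise = h * w * c_in * c_out   # 1x1 channel mixing
    return standard, depthwise + pointwise

std, sep = conv_flops(h=56, w=56, k=3, c_in=128, c_out=128)
ratio = sep / std
# Matches the analytic factor 1/C_out + 1/K^2 = 1/128 + 1/9 ≈ 0.119
assert abs(ratio - (1 / 128 + 1 / 9)) < 1e-12
```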
MobileNet as a Backbone
- SSD (Single Shot Detector): MobileNet features are used for multi-scale detection while keeping the network lightweight.
- YOLO Variants: MobileNet provides a fast backbone that balances speed and accuracy for embedded systems.
- Low latency and small model size are critical for edge deployment.
4. Segmentation Architectures
UNet
Introduced: 2015 (Ronneberger et al., MICCAI)
UNet is a specialized convolutional network for image segmentation, particularly designed for medical imaging tasks where annotated data is scarce. It follows an encoder–decoder architecture with skip connections that combine low-level spatial information with high-level semantic features.
Encoder–Decoder Architecture
The encoder (contracting path) progressively reduces spatial dimensions while increasing feature channels, extracting high-level context; in the original design, each stage applies two 3×3 convolutions followed by 2×2 max pooling.
The decoder (expanding path) upsamples the features with 2×2 up-convolutions to recover spatial resolution.
Skip Connections
The skip connections directly transfer feature maps from the encoder to the decoder. Mathematically, if \( x_l \) is the encoder feature and \( y_l \) is the decoder feature at the same level, the decoder continues from the channel-wise concatenation \( [\,x_l;\, y_l\,] \) rather than from \( y_l \) alone.
This allows the network to leverage both high-resolution spatial information and deep semantic context, improving segmentation accuracy, especially for small structures.
Key Strengths
- Excellent localization due to skip connections
- Efficient with limited training data
- Widely used in biomedical imaging and other dense prediction tasks
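The skip connection itself is just a channel-wise concatenation; a minimal sketch, omitting the center-cropping the original UNet applies when the spatial sizes of the two maps differ:

```python
import numpy as np

def skip_concat(x_enc, y_dec):
    """Concatenate encoder and decoder features along the channel axis.

    Both maps must already share the same spatial size; the original
    UNet center-crops the encoder feature first (omitted here).
    """
    return np.concatenate([x_enc, y_dec], axis=0)

x_enc = np.zeros((64, 32, 32))   # high-resolution encoder feature
y_dec = np.ones((64, 32, 32))    # upsampled decoder feature
z = skip_concat(x_enc, y_dec)
assert z.shape == (128, 32, 32)  # channels add, spatial size preserved
```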
DeepLab
Introduced: 2016–2018 (Chen et al., Google)
DeepLab is a semantic segmentation architecture that focuses on capturing multi-scale contextual information while maintaining high spatial resolution. It achieves this using atrous (dilated) convolutions and a specially designed Atrous Spatial Pyramid Pooling (ASPP) module.
Atrous (Dilated) Convolutions
Standard convolutions reduce spatial resolution as the network deepens, which can hurt segmentation accuracy. Atrous convolutions introduce a dilation rate \(r\), which effectively enlarges the convolutional kernel without increasing the number of parameters. In one dimension, \( y[i] = \sum_{k} x[i + r \cdot k]\, w[k] \).
Here:
- \(x\) = input feature map
- \(w\) = convolution kernel
- \(r\) = dilation rate (spacing between kernel elements)
This allows DeepLab to capture larger receptive fields while preserving spatial resolution.
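The 1-D form of the operation is short enough to implement directly; a sketch assuming "valid" output positions only:

```python
import numpy as np

def atrous_conv1d(x, w, r):
    """y[i] = sum_k x[i + r*k] * w[k], for valid positions only.

    The dilation rate r spaces the kernel taps, widening the receptive
    field from len(w) to r*(len(w)-1)+1 without extra parameters.
    """
    span = r * (len(w) - 1)
    return np.array([
        sum(x[i + r * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ])

x = np.arange(8.0)            # [0, 1, ..., 7]
w = np.array([1.0, 1.0, 1.0])
y = atrous_conv1d(x, w, r=2)  # taps at i, i+2, i+4
assert np.allclose(y, [6, 9, 12, 15])
```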
Atrous Spatial Pyramid Pooling (ASPP)
To further capture multi-scale context, DeepLab uses ASPP, which applies parallel atrous convolutions with different dilation rates (e.g., \( r = 6, 12, 18 \)) to the same feature map and concatenates their outputs.
The concatenated features combine information from multiple scales, improving segmentation of objects of different sizes.
Backbones
DeepLab commonly uses powerful CNN backbones such as ResNet or Xception to extract deep semantic features before applying ASPP.
Key Strengths
- High-resolution feature maps for accurate segmentation
- Multi-scale context via ASPP
- Compatible with strong backbone networks
- Widely used in semantic segmentation benchmarks (PASCAL VOC, Cityscapes)
5. Embedded and Mobile Models
Embedded and mobile models are designed to run efficiently on resource-constrained devices such as smartphones, drones, and IoT hardware. The key goal is to reduce computation, memory footprint, and latency, while maintaining competitive accuracy.
MobileNet
Introduced: 2017 (Howard et al., Google)
MobileNet uses depthwise separable convolutions, which factorize standard convolutions into two steps:
Here:
- Depthwise convolution: applies a single filter per input channel
- Pointwise convolution: 1×1 convolution to combine channels
This reduces computation and parameters dramatically, to roughly \( \frac{1}{C_{out}} + \frac{1}{K^2} \) of a standard convolution's cost.
- Low latency, suitable for mobile devices
- Used as backbone for SSD, YOLO-lite, and other lightweight detectors
ShuffleNet
Introduced: 2018 (Zhang et al., Megvii)
ShuffleNet focuses on group convolutions and channel shuffling to reduce computation while maintaining cross-channel information flow:
- Group convolutions split channels into groups to save computation
- Channel shuffle mixes features across groups to prevent information bottlenecks
This design achieves high efficiency and competitive accuracy for mobile vision tasks.
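The channel shuffle is just a reshape, transpose, reshape; a minimal sketch with channels tagged by index so the interleaving is visible:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups on a (C, H, W) feature map.

    After a grouped convolution, this lets every group see channels
    produced by every other group in the next layer.
    """
    c, h, w = x.shape
    y = x.reshape(groups, c // groups, h, w)  # split into groups
    y = y.transpose(1, 0, 2, 3)               # interleave groups
    return y.reshape(c, h, w)

x = np.arange(6.0).reshape(6, 1, 1)   # channels tagged 0..5
y = channel_shuffle(x, groups=2)
# Groups [0,1,2] and [3,4,5] interleave to [0,3,1,4,2,5]
assert np.allclose(y.ravel(), [0, 3, 1, 4, 2, 5])
```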
EfficientNet-Lite
Introduced: 2020 (Google)
EfficientNet-Lite is a mobile-optimized variant of EfficientNet that keeps the compound scaling strategy but introduces:
- Quantization-friendly operations for integer inference
- Smaller models for low-power devices
- Maintains high accuracy per parameter
6. Transfer Learning Backbones
In modern computer vision, transfer learning is a key strategy: models pre-trained on large datasets (e.g., ImageNet) are used as feature extractors for downstream tasks such as detection, segmentation, or classification. Using pre-trained backbones drastically reduces training time and improves performance, especially when labeled data is limited.
VGG
Introduced: 2014 (Simonyan & Zisserman)
VGG networks are known for their simple and uniform design using 3×3 convolutions stacked deeply:
- VGG-16 and VGG-19 are popular variants
- Deep but parameter-heavy (~138M parameters for VGG-16)
- Effective feature extractor, but computationally expensive
ResNet
Introduced: 2015 (He et al.)
ResNet remains the most widely used backbone due to residual connections, which allow very deep networks without vanishing gradients:
- Variants: ResNet-18, 34, 50, 101, 152
- Residual blocks: \(y = F(x) + x\)
- Strong generalization for detection, segmentation, and classification
EfficientNet
Introduced: 2019 (Tan & Le)
EfficientNet achieves high accuracy per parameter using compound scaling:
- Scales depth, width, and resolution simultaneously
- Variants: EfficientNet-B0 to B7, and Lite versions for mobile
- Excellent trade-off between model size, speed, and accuracy