Deep Learning — From Local Linearity to Compact Architectures

At its heart, deep learning is a function-approximation engine. The central intuition is simple: zoom in on a complex curve and it looks almost linear. Neural networks exploit that by composing many linear transformations with nonlinear activations — producing a highly expressive, piecewise-linear (or smooth) approximation of the target function.

1. Local Linearity — the basic building block

The basic operation in a neural network is a linear transformation followed by a nonlinearity:

$$ \mathbf{z} = W \mathbf{x} + b, \quad \mathbf{a} = \sigma(\mathbf{z}) $$

Repeated over layers:

$$ f(\mathbf{x}) = \sigma_n\big(W_n\,\sigma_{n-1}\big(W_{n-1} \cdots \sigma_1(W_1 \mathbf{x} + b_1) \cdots + b_{n-1}\big) + b_n\big) $$

Notation:
  • \( W_i \), \( b_i \): learnable weights and biases
  • \( \sigma_i \): nonlinear activation (ReLU, GELU, etc.)
  • \( f(\mathbf{x}) \): the deep network function mapping input to output
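A minimal NumPy sketch of this layered composition (the layer sizes and random weights below are illustrative assumptions, not values from any particular model):

```python
import numpy as np

def relu(z):
    """Elementwise ReLU nonlinearity: sigma(z) = max(0, z)."""
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases):
    """Apply f(x) = sigma_n(W_n(... sigma_1(W_1 x + b_1) ...) + b_n)."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b   # linear transformation
        a = relu(z)     # nonlinear activation
    return a

# Illustrative 3-layer network: 4 -> 8 -> 8 -> 2
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
weights = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.normal(size=4)
print(mlp_forward(x, weights, biases).shape)  # (2,)
```

With ReLU as the activation, the resulting \( f(\mathbf{x}) \) is exactly a piecewise-linear function of the input, which is the picture the rest of this post builds on.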

2. Problem: Many local linear pieces → huge model

Approximating a complex function with tiny locally-linear patches can explode the number of pieces and parameters. If each neuron or small subnetwork learns an independent region, you quickly end up with models too large to train or deploy.

Key idea: introduce assumptions (inductive biases) to share and reuse pieces across regions, compressing the representation.

3. How architectures compress the pieces

Here are practical strategies for compressing many local linear pieces into compact, efficient networks:

CNNs — spatial locality & weight sharing

Small kernels slide across images; weights are shared across locations, massively reducing parameters.
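To make the savings concrete, here is a back-of-the-envelope parameter count comparing a dense layer with a weight-shared 3×3 convolution; the image and channel sizes are illustrative assumptions (biases ignored):

```python
# Parameter count: fully connected layer vs. a weight-shared 3x3 convolution
H, W = 32, 32        # input spatial size (assumption)
C_in, C_out = 3, 16  # input/output channels (assumption)
K = 3                # kernel size

dense_params = (H * W * C_in) * (H * W * C_out)  # every input-output pixel pair gets its own weight
conv_params = K * K * C_in * C_out               # one small kernel shared across all locations

print(f"dense: {dense_params:,}  conv: {conv_params:,}")
# dense: 50,331,648  conv: 432
```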

Transformers — attention & dynamic routing

Use attention to focus only where interactions matter. Conditional computation avoids modeling unnecessary pieces.
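A minimal NumPy sketch of scaled dot-product attention, the operation behind this dynamic routing; the token count and embedding dimension are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # pairwise interaction strengths
    weights = softmax(scores, axis=-1)  # each token decides where to look
    return weights @ V                  # mix values only where attention is high

# Illustrative shapes: 5 tokens, dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 8)
```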

Patch-based models (ViT, MobileViT)

Split images into patches. Fewer tokens = fewer pieces. MobileViT mixes convolutions with attention for efficiency.
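A small sketch of the patching step, assuming the common 224×224 RGB input and 16×16 patches (sizes are illustrative, not a requirement):

```python
import numpy as np

H, W, C, P = 224, 224, 3, 16   # image size and patch size (assumptions)
img = np.zeros((H, W, C))

# Split into N = (H * W) / P^2 non-overlapping patches, each flattened to P*P*C values
patches = (img.reshape(H // P, P, W // P, P, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * C))
print(patches.shape)  # (196, 768): 196 tokens instead of 50,176 pixels
```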

Depthwise separable convolutions

Separate spatial and channel mixing; reduces parameters ~8–9× (used in MobileNet).
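A quick count showing where the ~8–9× figure comes from (the channel counts are illustrative assumptions; biases ignored):

```python
# Standard convolution vs. depthwise separable convolution
C_in, C_out, K = 128, 128, 3

standard = K * K * C_in * C_out          # one full K x K x C_in kernel per output channel
separable = K * K * C_in + C_in * C_out  # depthwise (spatial) step + pointwise (channel) step

print(standard, separable, round(standard / separable, 1))
# 147456 17536 8.4
```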

Bottlenecks & residual connections

Compress features into a smaller subspace, then expand back. Skip connections make it easier for each block to learn just a correction.
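A toy sketch of a residual bottleneck on a plain vector (the widths are illustrative assumptions; real blocks use convolutions or attention plus normalization):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def bottleneck_block(x, W_down, W_up):
    """Compress, transform, expand, then add the skip path."""
    h = relu(W_down @ x)  # project into a smaller subspace
    out = W_up @ h        # expand back to the original width
    return x + out        # skip connection: the block only learns a correction

# Illustrative widths: 64 -> 16 -> 64
rng = np.random.default_rng(0)
d, r = 64, 16
W_down = rng.normal(scale=0.1, size=(r, d))
W_up = rng.normal(scale=0.1, size=(d, r))

x = rng.normal(size=d)
print(bottleneck_block(x, W_down, W_up).shape)  # (64,)
```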

Sparsity, pruning & MoE

Only a subset of parameters activate per input (Mixture-of-Experts), keeping effective pieces low but global capacity high.
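A toy top-k routing sketch in the Mixture-of-Experts spirit; the experts here are plain linear maps and all sizes are illustrative assumptions (real MoE layers use learned routers inside transformer blocks):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def moe_forward(x, experts, W_gate, k=2):
    """Route the input to the top-k experts; the remaining experts stay inactive."""
    logits = W_gate @ x
    top = np.argsort(logits)[-k:]  # indices of the k best-scoring experts
    gate = softmax(logits[top])    # renormalize over the selected experts only
    return sum(g * experts[i](x) for g, i in zip(gate, top))

# Illustrative setup: 8 linear experts on a 16-dimensional input
rng = np.random.default_rng(0)
d, n_experts = 16, 8
expert_weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: W @ x for W in expert_weights]
W_gate = rng.normal(scale=0.1, size=(n_experts, d))

x = rng.normal(size=d)
print(moe_forward(x, experts, W_gate).shape)  # (16,), computed by only 2 of the 8 experts
```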

Compression & deployment tricks

  • Quantization: low-precision weights/activations (8-bit, 4-bit).
  • Pruning: remove unimportant weights post-training.
  • Knowledge distillation: train a smaller model to mimic a larger one.
  • Parameter-efficient fine-tuning: LoRA, adapters, prompt tuning (a LoRA-style sketch follows this list).
  • Neural architecture search (NAS): find compact architectures automatically.
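
As a sketch of the parameter-efficient fine-tuning bullet, here is the LoRA-style idea: keep the pretrained weight frozen and train only a low-rank update. The layer width and rank are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8  # rank r is much smaller than the layer width (assumption)

W = rng.normal(scale=0.02, size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(scale=0.02, size=(r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                        # trainable up-projection, starts at zero

def lora_forward(x):
    return W @ x + B @ (A @ x)  # original path plus low-rank correction

x = rng.normal(size=d_in)
print(lora_forward(x).shape)             # (512,)
print(d_out * d_in, r * (d_in + d_out))  # 262144 frozen vs 8192 trainable parameters (~32x fewer)
```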

4. Newer ideas worth knowing

  • Equivariance / group convolutions: encode symmetries to reuse pieces across transformations.
  • Sparsely-gated models & routing: conditional compute per input.
  • Low-rank factorization: decompose weight matrices into smaller factors (sketched after this list).
  • Attention approximations: Performer, Linformer, BigBird to reduce quadratic cost.
  • Mixture-of-Adapters & modular learning: compose small task-specific modules.
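
A small sketch of the low-rank factorization idea using a truncated SVD; the matrix size and target rank are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))  # weight matrix to compress (assumption)
r = 32                           # target rank (assumption)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = U[:, :r] * s[:r]  # (256, r) factor, with singular values folded in
W2 = Vt[:r, :]         # (r, 256) factor

rel_error = np.linalg.norm(W - W1 @ W2) / np.linalg.norm(W)
print(W.size, W1.size + W2.size, round(rel_error, 3))  # 65536 vs 16384 parameters, plus the error paid
```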

5. Putting it together — an intuition recipe

  1. Start with local linearity: think in terms of small linear pieces.
  2. Ask: what structure in data can you exploit? locality, symmetry, sequence, permutation.
  3. Choose architectural priors: convs for locality, attention for interactions, recurrence for sequences.
  4. Compress with math: bottlenecks, separable operations, low-rank, quantization.
  5. Use conditional compute: sparsity, MoE, routing for efficiency.

6. Short mathematical note

Piecewise-linear activations (like ReLU) make the whole network a piecewise-linear function of its input. Each layer composes linear maps with “kinks” (breakpoints), and stacking layers compounds these breakpoints, so the number of linear regions can grow exponentially with depth.

Rough 1-D example: each ReLU neuron adds a “kink”; composing layers lets these kinks multiply, so the count of linear pieces can grow combinatorially with depth.
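
A quick numerical version of that 1-D picture: build a small random ReLU network and count how many distinct slopes it has over an interval. The architecture, weights, and interval are illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_net_1d(x, layers):
    """Evaluate a scalar-in, scalar-out ReLU network at the points in x."""
    a = x[None, :]                  # shape (1, num_points)
    for W, b in layers[:-1]:
        a = relu(W @ a + b[:, None])
    W, b = layers[-1]
    return (W @ a + b[:, None])[0]  # final layer is linear (no ReLU)

# Illustrative architecture: 1 -> 8 -> 8 -> 1 with random weights
rng = np.random.default_rng(0)
sizes = [1, 8, 8, 1]
layers = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

# Count linear pieces by detecting changes in the numerical slope on a fine grid
x = np.linspace(-5.0, 5.0, 200_001)
slopes = np.diff(relu_net_1d(x, layers)) / np.diff(x)
pieces = 1 + np.count_nonzero(np.abs(np.diff(slopes)) > 1e-6)
print(pieces)  # number of linear pieces of this network on [-5, 5]
```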

7. Quick analogy

Imagine tracing a wavy shoreline. You could use a million tiny straight planks—but a smarter approach is to notice repeating patterns and reuse modular pieces. That’s what neural network architectures do: layers capture reusable structures, efficiently building complex functions.
