Deep Learning — From Local Linearity to Compact Architectures

At its heart, deep learning is a function-approximation engine. The central intuition is simple: zoom in on a complex curve and it looks almost linear. Neural networks exploit that by composing many linear transformations with nonlinear activations — producing a highly expressive, piecewise-linear (or smooth) approximation of the target function.

1. Local Linearity — the basic building block

The basic operation in a neural network is a linear transformation followed by a nonlinearity:

$$ \mathbf{z} = W \mathbf{x} + b, \quad \mathbf{a} = \sigma(\mathbf{z}) $$

Repeated over layers:

$$ f(\mathbf{x}) = \sigma_n\big(W_n\,\sigma_{n-1}\big(W_{n-1} \cdots \sigma_1(W_1 \mathbf{x} + b_1) \cdots + b_{n-1}\big) + b_n\big) $$

Notation:
  • \( W_i \), \( b_i \): learnable weights and biases
  • \( \sigma_i \): nonlinear activation (ReLU, GELU, etc.)
  • \( f(\mathbf{x}) \): the deep network function mapping input to output
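A minimal NumPy sketch of this layered composition (the layer sizes and random weights below are illustrative assumptions, not values from any particular model):

```python
import numpy as np

def relu(z):
    """Elementwise ReLU nonlinearity: sigma(z) = max(0, z)."""
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases):
    """Apply f(x) = sigma_n(W_n(... sigma_1(W_1 x + b_1) ...) + b_n)."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b   # linear transformation
        a = relu(z)     # nonlinear activation
    return a

# Illustrative 3-layer network: 4 -> 8 -> 8 -> 2
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
weights = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.normal(size=4)
print(mlp_forward(x, weights, biases).shape)  # (2,)
```

With ReLU as the activation, the resulting \( f(\mathbf{x}) \) is exactly a piecewise-linear function of the input, which is the picture the rest of this post builds on.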

2. Problem: Many local linear pieces → huge model

Approximating a complex function with tiny locally-linear patches can explode the number of pieces and parameters. If each neuron or small subnetwork learns an independent region, you quickly end up with models too large to train or deploy.

Key idea: introduce assumptions (inductive biases) to share and reuse pieces across regions, compressing the representation.

3. How architectures compress the pieces

Here are practical strategies for compressing many local linear pieces into compact, efficient networks:

CNNs — spatial locality & weight sharing

Small kernels slide across images; weights are shared across locations, massively reducing parameters.
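To make the savings concrete, here is a back-of-the-envelope parameter count comparing a dense layer with a weight-shared 3×3 convolution; the image and channel sizes are illustrative assumptions (biases ignored):

```python
# Parameter count: fully connected layer vs. a weight-shared 3x3 convolution
H, W = 32, 32        # input spatial size (assumption)
C_in, C_out = 3, 16  # input/output channels (assumption)
K = 3                # kernel size

dense_params = (H * W * C_in) * (H * W * C_out)  # every input-output pixel pair gets its own weight
conv_params = K * K * C_in * C_out               # one small kernel shared across all locations

print(f"dense: {dense_params:,}  conv: {conv_params:,}")
# dense: 50,331,648  conv: 432
```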

Transformers — attention & dynamic routing

Use attention to focus only where interactions matter. Conditional computation avoids modeling unnecessary pieces.
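A minimal NumPy sketch of scaled dot-product attention, the operation behind this dynamic routing; the token count and embedding dimension are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # pairwise interaction strengths
    weights = softmax(scores, axis=-1)  # each token decides where to look
    return weights @ V                  # mix values only where attention is high

# Illustrative shapes: 5 tokens, dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 8)
```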

Patch-based models (ViT, MobileViT)

Split images into patches. Fewer tokens = fewer pieces. MobileViT mixes convolutions with attention for efficiency.
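A small sketch of the patching step, assuming the common 224×224 RGB input and 16×16 patches (sizes are illustrative, not a requirement):

```python
import numpy as np

H, W, C, P = 224, 224, 3, 16   # image size and patch size (assumptions)
img = np.zeros((H, W, C))

# Split into N = (H * W) / P^2 non-overlapping patches, each flattened to P*P*C values
patches = (img.reshape(H // P, P, W // P, P, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * C))
print(patches.shape)  # (196, 768): 196 tokens instead of 50,176 pixels
```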

Depthwise separable convolutions

Separate spatial and channel mixing; reduces parameters ~8–9× (used in MobileNet).
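A quick count showing where the ~8–9× figure comes from (the channel counts are illustrative assumptions; biases ignored):

```python
# Standard convolution vs. depthwise separable convolution
C_in, C_out, K = 128, 128, 3

standard = K * K * C_in * C_out          # one full K x K x C_in kernel per output channel
separable = K * K * C_in + C_in * C_out  # depthwise (spatial) step + pointwise (channel) step

print(standard, separable, round(standard / separable, 1))
# 147456 17536 8.4
```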

Bottlenecks & residual connections

Compress features into a smaller subspace, then expand back. Skip connections make it easier for each block to learn just a correction.
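A toy sketch of a residual bottleneck on a plain vector (the widths are illustrative assumptions; real blocks use convolutions or attention plus normalization):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def bottleneck_block(x, W_down, W_up):
    """Compress, transform, expand, then add the skip path."""
    h = relu(W_down @ x)  # project into a smaller subspace
    out = W_up @ h        # expand back to the original width
    return x + out        # skip connection: the block only learns a correction

# Illustrative widths: 64 -> 16 -> 64
rng = np.random.default_rng(0)
d, r = 64, 16
W_down = rng.normal(scale=0.1, size=(r, d))
W_up = rng.normal(scale=0.1, size=(d, r))

x = rng.normal(size=d)
print(bottleneck_block(x, W_down, W_up).shape)  # (64,)
```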

Sparsity, pruning & MoE

Only a subset of parameters activate per input (Mixture-of-Experts), keeping effective pieces low but global capacity high.
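A toy top-k routing sketch in the Mixture-of-Experts spirit; the experts here are plain linear maps and all sizes are illustrative assumptions (real MoE layers use learned routers inside transformer blocks):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def moe_forward(x, experts, W_gate, k=2):
    """Route the input to the top-k experts; the remaining experts stay inactive."""
    logits = W_gate @ x
    top = np.argsort(logits)[-k:]  # indices of the k best-scoring experts
    gate = softmax(logits[top])    # renormalize over the selected experts only
    return sum(g * experts[i](x) for g, i in zip(gate, top))

# Illustrative setup: 8 linear experts on a 16-dimensional input
rng = np.random.default_rng(0)
d, n_experts = 16, 8
expert_weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: W @ x for W in expert_weights]
W_gate = rng.normal(scale=0.1, size=(n_experts, d))

x = rng.normal(size=d)
print(moe_forward(x, experts, W_gate).shape)  # (16,), computed by only 2 of the 8 experts
```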

Compression & deployment tricks

  • Quantization: low-precision weights/activations (8-bit, 4-bit).
  • Pruning: remove unimportant weights post-training.
  • Knowledge distillation: train a smaller model to mimic a larger one.
  • Parameter-efficient fine-tuning: LoRA, adapters, prompt tuning (a LoRA-style sketch follows this list).
  • Neural architecture search (NAS): find compact architectures automatically.
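
As a sketch of the parameter-efficient fine-tuning bullet, here is the LoRA-style idea: keep the pretrained weight frozen and train only a low-rank update. The layer width and rank are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8  # rank r is much smaller than the layer width (assumption)

W = rng.normal(scale=0.02, size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(scale=0.02, size=(r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                        # trainable up-projection, starts at zero

def lora_forward(x):
    return W @ x + B @ (A @ x)  # original path plus low-rank correction

x = rng.normal(size=d_in)
print(lora_forward(x).shape)             # (512,)
print(d_out * d_in, r * (d_in + d_out))  # 262144 frozen vs 8192 trainable parameters (~32x fewer)
```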

4. Newer ideas worth knowing

  • Equivariance / group convolutions: encode symmetries to reuse pieces across transformations.
  • Sparsely-gated models & routing: conditional compute per input.
  • Low-rank factorization: decompose weight matrices into smaller factors (sketched after this list).
  • Attention approximations: Performer, Linformer, BigBird to reduce quadratic cost.
  • Mixture-of-Adapters & modular learning: compose small task-specific modules.
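
A small sketch of the low-rank factorization idea using a truncated SVD; the matrix size and target rank are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))  # weight matrix to compress (assumption)
r = 32                           # target rank (assumption)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = U[:, :r] * s[:r]  # (256, r) factor, with singular values folded in
W2 = Vt[:r, :]         # (r, 256) factor

rel_error = np.linalg.norm(W - W1 @ W2) / np.linalg.norm(W)
print(W.size, W1.size + W2.size, round(rel_error, 3))  # 65536 vs 16384 parameters, plus the error paid
```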

5. Putting it together — an intuition recipe

  1. Start with local linearity: think in terms of small linear pieces.
  2. Ask: what structure in data can you exploit? locality, symmetry, sequence, permutation.
  3. Choose architectural priors: convs for locality, attention for interactions, recurrence for sequences.
  4. Compress with math: bottlenecks, separable operations, low-rank, quantization.
  5. Use conditional compute: sparsity, MoE, routing for efficiency.

6. Short mathematical note

Piecewise-linear activations (like ReLU) make the whole network a piecewise-linear function of its input. Each layer composes linear maps with “kinks” (breakpoints), and stacking layers compounds these breakpoints, so the number of linear regions can grow exponentially with depth.

Rough 1-D example: each ReLU neuron adds a “kink”; composing layers lets these kinks multiply, so the count of linear pieces can grow combinatorially with depth.
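
A quick numerical version of that 1-D picture: build a small random ReLU network and count how many distinct slopes it has over an interval. The architecture, weights, and interval are illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_net_1d(x, layers):
    """Evaluate a scalar-in, scalar-out ReLU network at the points in x."""
    a = x[None, :]                  # shape (1, num_points)
    for W, b in layers[:-1]:
        a = relu(W @ a + b[:, None])
    W, b = layers[-1]
    return (W @ a + b[:, None])[0]  # final layer is linear (no ReLU)

# Illustrative architecture: 1 -> 8 -> 8 -> 1 with random weights
rng = np.random.default_rng(0)
sizes = [1, 8, 8, 1]
layers = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

# Count linear pieces by detecting changes in the numerical slope on a fine grid
x = np.linspace(-5.0, 5.0, 200_001)
slopes = np.diff(relu_net_1d(x, layers)) / np.diff(x)
pieces = 1 + np.count_nonzero(np.abs(np.diff(slopes)) > 1e-6)
print(pieces)  # number of linear pieces of this network on [-5, 5]
```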

7. Quick analogy

Imagine tracing a wavy shoreline. You could use a million tiny straight planks—but a smarter approach is to notice repeating patterns and reuse modular pieces. That’s what neural network architectures do: layers capture reusable structures, efficiently building complex functions.
