Deep Learning — From Local Linearity to Compact Architectures
At its heart, deep learning is a function-approximation engine. The central intuition is simple: zoom in on a complex curve and it looks almost linear. Neural networks exploit that by composing many linear transformations with nonlinear activations — producing a highly expressive, piecewise-linear (or smooth) approximation of the target function.
1. Local Linearity — the basic building block
The basic operation in a neural network is a linear transformation followed by a nonlinearity:

\( \mathbf{h}_i = \sigma_i(W_i \mathbf{h}_{i-1} + b_i) \)

Repeated over layers (with \( \mathbf{h}_0 = \mathbf{x} \)):

\( f(\mathbf{x}) = \sigma_L\big(W_L \, \sigma_{L-1}(\cdots \sigma_1(W_1 \mathbf{x} + b_1) \cdots) + b_L\big) \)
\( W_i \), \( b_i \): learnable weights and biases
\( \sigma_i \): nonlinear activation (ReLU, GELU, etc.)
\( f(\mathbf{x}) \): deep network function mapping input to output
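To make the composition concrete, here is a minimal NumPy sketch of a two-hidden-layer ReLU network. The function name `mlp`, the layer widths, and the random weights are illustrative choices, not a reference implementation.

```python
import numpy as np

def relu(z):
    # Piecewise-linear activation: identity for positive inputs, zero otherwise.
    return np.maximum(z, 0.0)

def mlp(x, params):
    # Compose linear maps W_i h + b_i with nonlinearities sigma_i.
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)        # sigma_i(W_i h + b_i)
    W_out, b_out = params[-1]
    return W_out @ h + b_out       # final layer is usually left linear

# Illustrative shapes: 3 inputs -> 16 hidden -> 16 hidden -> 1 output.
rng = np.random.default_rng(0)
params = [
    (rng.standard_normal((16, 3)) * 0.5, np.zeros(16)),
    (rng.standard_normal((16, 16)) * 0.5, np.zeros(16)),
    (rng.standard_normal((1, 16)) * 0.5, np.zeros(1)),
]
print(mlp(rng.standard_normal(3), params))   # a 1-D output
```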
2. Problem: Many local linear pieces → huge model
Approximating a complex function with tiny locally-linear patches can explode the number of pieces and parameters. If each neuron or small subnetwork learns an independent region, you quickly end up with models too large to train or deploy.
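A back-of-the-envelope illustration of why naive tiling blows up: if you cover each input axis with k independent locally-linear patches, the total patch count grows as \( k^d \) with input dimension d. The resolution k = 10 below is an arbitrary, illustrative choice.

```python
# Covering each input axis with k independent locally-linear patches
# needs k**d patches in d dimensions -- the curse of dimensionality.
k = 10                              # patches per axis (illustrative)
for d in (1, 2, 4, 8, 16):
    print(f"d={d:2d}: {k**d:,} patches")
```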
3. How architectures compress the pieces
Here are practical strategies for compressing many local linear pieces into compact, efficient networks:
CNNs — spatial locality & weight sharing
Small kernels slide across images; weights are shared across locations, massively reducing parameters.
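A quick way to see the effect of weight sharing is to compare parameter counts. The sketch below uses PyTorch with illustrative sizes (a 32×32×16 feature map and 3×3 kernels).

```python
import torch.nn as nn

# One 3x3 conv layer (weights shared across every spatial position) vs. a
# fully connected layer mapping the same 32x32x16 feature map to another
# 32x32x16 feature map. Sizes are illustrative.
conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
dense = nn.Linear(32 * 32 * 16, 32 * 32 * 16)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"conv : {count(conv):,}")    # 2,320 parameters
print(f"dense: {count(dense):,}")   # ~268 million parameters
```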
Transformers — attention & dynamic routing
Use attention to focus only where interactions matter. Conditional computation avoids modeling unnecessary pieces.
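For intuition, here is a minimal single-head scaled dot-product attention. The helper name and the 5-token, 8-dimensional shapes are illustrative; real transformers add learned projections, multiple heads, and masking.

```python
import torch
import torch.nn.functional as F

def toy_attention(q, k, v):
    # Single-head scaled dot-product attention: each query mixes the values,
    # weighted by softmax-normalized similarity to every key.
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (tokens, tokens)
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v

# Illustrative shapes: 5 tokens, 8-dimensional head.
q = k = v = torch.randn(5, 8)
print(toy_attention(q, k, v).shape)   # torch.Size([5, 8])
```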
Patch-based models (ViT, MobileViT)
Split images into patches. Fewer tokens = fewer pieces. MobileViT mixes convolutions with attention for efficiency.
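A common way to implement patch splitting is a strided convolution. The sketch below uses illustrative ViT-like numbers (224×224 image, 16×16 patches, 192-dimensional embeddings).

```python
import torch
import torch.nn as nn

# ViT-style patch embedding via a strided convolution: a 224x224 RGB image
# becomes 14x14 = 196 tokens of dimension 192. All sizes are illustrative.
patch_embed = nn.Conv2d(in_channels=3, out_channels=192,
                        kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)
tokens = patch_embed(image)                  # (1, 192, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 192)
print(tokens.shape)
```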
Depthwise separable convolutions
Factor a convolution into per-channel spatial filtering plus 1×1 channel mixing; for 3×3 kernels this cuts parameters and compute roughly 8–9× (the core trick in MobileNet).
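The saving is easy to verify directly. The sketch below compares a standard 3×3 convolution with a depthwise + pointwise pair for an illustrative 256-channel layer.

```python
import torch.nn as nn

# Standard 3x3 conv vs. depthwise separable conv (depthwise 3x3 + pointwise
# 1x1) for 256 -> 256 channels. Channel count is illustrative.
standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256),  # depthwise: one 3x3 filter per channel
    nn.Conv2d(256, 256, kernel_size=1),                         # pointwise: 1x1 channel mixing
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"standard : {count(standard):,}")    # 590,080
print(f"separable: {count(separable):,}")   # 68,352
print(f"ratio    : {count(standard) / count(separable):.1f}x")  # ~8.6x
```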
Bottlenecks & residual connections
Compress features into a smaller subspace, then expand back. Skip connections let each block learn a small correction to the identity rather than a whole new transformation.
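Here is a minimal bottleneck residual block, loosely in the style of ResNet; the 256/64 channel split and the class name are illustrative.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    # Compress -> transform -> expand, with a skip connection so the block
    # only has to learn a correction on top of the identity.
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1),   # compress
            nn.ReLU(),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(bottleneck, channels, kernel_size=1),   # expand
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # residual: input + correction

x = torch.randn(1, 256, 32, 32)
print(BottleneckBlock()(x).shape)   # torch.Size([1, 256, 32, 32])
```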
Sparsity, pruning & MoE
Only a subset of parameters activate per input (Mixture-of-Experts), keeping effective pieces low but global capacity high.
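A toy mixture-of-experts layer with top-1 routing makes the idea concrete. Real MoE layers use top-k routing with load-balancing losses, so treat this as a simplified sketch with illustrative sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    # Top-1 mixture-of-experts: a gate picks one small expert MLP per input,
    # so only a fraction of the parameters is active per example.
    def __init__(self, dim=32, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (batch, dim)
        weights = F.softmax(self.gate(x), dim=-1)
        chosen = weights.argmax(dim=-1)          # index of the winning expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = chosen == i
            if mask.any():
                out[mask] = expert(x[mask])      # run each expert only on its inputs
        return out

print(TinyMoE()(torch.randn(8, 32)).shape)   # torch.Size([8, 32])
```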
Compression & deployment tricks
- Quantization: low-precision weights/activations (8-bit, 4-bit).
- Pruning: remove unimportant weights post-training.
- Knowledge distillation: train a smaller model to mimic a larger one.
- Parameter-efficient fine-tuning: LoRA, adapters, prompt tuning (a LoRA sketch follows this list).
- Neural architecture search (NAS): find compact architectures automatically.
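As one concrete example of parameter-efficient fine-tuning, here is a minimal LoRA-style wrapper around a frozen linear layer. The rank r = 8, the scaling, and the class name are illustrative choices, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base weight plus a trainable low-rank update B @ A, LoRA-style.
    # Only r * (in_features + out_features) extra parameters are trained.
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                             # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 8,192 trainable
print(layer(torch.randn(2, 512)).shape)   # torch.Size([2, 512])
```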
4. Newer ideas worth knowing
- Equivariance / group convolutions: encode symmetries to reuse pieces across transformations.
- Sparsely-gated models & routing: conditional compute per input.
- Low-rank factorization: decompose weight matrices into smaller factors (a small SVD sketch follows this list).
- Attention approximations: Performer, Linformer, BigBird to reduce quadratic cost.
- Mixture-of-Adapters & modular learning: compose small task-specific modules.
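To illustrate low-rank factorization, the sketch below truncates an SVD of a weight matrix. The 512×512 size and rank 32 are arbitrary, and the random matrix used here is deliberately not low-rank, so the point is the parameter count rather than the approximation error.

```python
import numpy as np

# Replace a 512x512 matrix (262,144 params) with two rank-32 factors
# (32,768 params) via truncated SVD. Sizes and rank are illustrative.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))

U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 32
A = U[:, :r] * S[:r]        # (512, r), columns scaled by singular values
B = Vt[:r, :]               # (r, 512)

W_approx = A @ B
print(A.size + B.size)                                    # 32,768 parameters
print(np.linalg.norm(W - W_approx) / np.linalg.norm(W))   # relative error (high: a
                                                          # random matrix isn't low-rank)
```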
5. Putting it together — an intuition recipe
- Start with local linearity: think in terms of small linear pieces.
- Ask: what structure in data can you exploit? locality, symmetry, sequence, permutation.
- Choose architectural priors: convs for locality, attention for interactions, recurrence for sequences.
- Compress with math: bottlenecks, separable operations, low-rank, quantization.
- Use conditional compute: sparsity, MoE, routing for efficiency.
6. Short mathematical note
Piecewise-linear activations (like ReLU) let deep networks represent highly expressive functions. Each layer composes linear maps with “kinks” (breakpoints), and composing layers can multiply these breakpoints, so the number of linear regions a network can express grows exponentially with depth rather than just linearly with width.
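A small numerical check makes “linear regions” concrete: sample a dense 1-D grid, record which ReLUs are on or off at each point, and count distinct on/off patterns (each pattern is one linear piece). Random weights will not reach the exponential worst case, but the pieces visibly multiply with depth; the widths, grid, and seed below are illustrative.

```python
import numpy as np

# Approximate the number of linear pieces of a random 1-D ReLU network by
# counting distinct ReLU on/off patterns over a dense grid of inputs.
rng = np.random.default_rng(0)

def count_pieces(depth, width=8, n_grid=200_000):
    xs = np.linspace(-10.0, 10.0, n_grid).reshape(-1, 1)
    h, patterns = xs, []
    for _ in range(depth):
        W = rng.standard_normal((h.shape[1], width))
        b = rng.standard_normal(width)
        pre = h @ W + b
        patterns.append(pre > 0)              # which units are "on" at each x
        h = np.maximum(pre, 0.0)
    codes = np.concatenate(patterns, axis=1)  # one on/off code per grid point
    return len(np.unique(codes, axis=0))      # distinct codes = linear pieces

for depth in (1, 2, 3):
    print(f"depth {depth}: ~{count_pieces(depth)} linear pieces")
```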
7. Quick analogy
Imagine tracing a wavy shoreline. You could use a million tiny straight planks—but a smarter approach is to notice repeating patterns and reuse modular pieces. That’s what neural network architectures do: layers capture reusable structures, efficiently building complex functions.