Stochastic Depth: Training a Deep Network That's Sometimes Shallow

Most regularization techniques you can name work at the level of activations or weights. Dropout zeroes out random neurons. Weight decay shrinks individual parameters. Label smoothing softens the training targets. These all share an assumption: the structure of the network is fixed, and we just perturb what flows through it.

Stochastic depth makes a stranger move. It perturbs the network itself. On each training step, entire layers vanish. The 50-layer ResNet is sometimes a 48-layer network, sometimes a 45-layer one; the depth is a random variable. By the time training finishes, the optimizer hasn't really trained one network. It has trained an ensemble of subnetworks that share the same weights but differ in which blocks they actually use.

This turns out to be a remarkably effective regularizer, and it is the main...
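The idea described above, whole blocks skipped at random on each training step while the weights stay shared, can be sketched in a few lines. This is a toy illustration only, not a reference implementation: each "block" is reduced to a scalar residual branch `w * x`, and the names `make_blocks`, `sample_active_blocks`, `forward`, and the constant `survival_prob=0.8` are hypothetical choices made for the example rather than anything specified in the text.

```python
import random


def make_blocks(num_blocks, seed=0):
    """Toy residual branches: block i computes f_i(x) = w_i * x (an assumption)."""
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(num_blocks)]


def sample_active_blocks(num_blocks, survival_prob, rng):
    """Draw one independent keep/drop decision per block for a training step."""
    return [rng.random() < survival_prob for _ in range(num_blocks)]


def forward(x, weights, active):
    """Residual forward pass; a dropped block reduces to its identity skip."""
    for w, keep in zip(weights, active):
        if keep:
            x = x + w * x  # skip connection plus residual branch
        # if the block is dropped, x passes through unchanged
    return x


if __name__ == "__main__":
    rng = random.Random(42)
    weights = make_blocks(num_blocks=16)  # the same weights are reused every step
    for step in range(5):
        # survival_prob=0.8 is an illustrative value, not taken from the text
        active = sample_active_blocks(len(weights), survival_prob=0.8, rng=rng)
        y = forward(1.0, weights, active)
        print(f"step {step}: {sum(active)} of {len(weights)} blocks active, output {y:.4f}")
```

Each step draws a different subset of blocks from the same `weights` list, which is the sense in which training touches many weight-sharing subnetworks of varying depth rather than a single fixed one.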