Part IV of VI

Training Components — Complete Computational Analysis

Loss functions, optimizers, regularization, normalization, and learning rate schedules — every update rule derived, every FLOP counted, every byte of memory tracked

Contents

§12 Loss Functions — Detailed Computation
§13 Optimization Algorithms — Step-by-Step Computation
§14 Regularization Techniques — Computational Cost
§15 Normalization Layers — Complete Analysis
§16 Learning Rate Scheduling
Computational Summary — All Training Components

The loss function is the objective being minimized by gradient descent. Its computational cost appears in every training step (forward pass), and its gradient is the starting point of backpropagation. While the loss computation itself is typically cheap compared to the network's forward pass, the properties of the loss function (smoothness, gradient behavior near zero) have profound effects on convergence speed and therefore total training FLOPs.

12.1 Mean Squared Error (MSE)

MSE Loss
$$L_{\text{MSE}} = \frac{1}{B}\sum_{i=1}^{B} \frac{1}{d}\sum_{j=1}^{d}(y_{ij} - \hat{y}_{ij})^2$$

where $B$ is batch size and $d$ is the output dimension (for regression) or number of classes.

Gradient:

$$\pd{L}{\hat{y}_{ij}} = \frac{2}{Bd}(\hat{y}_{ij} - y_{ij})$$
MSE — Computational Cost

| Operation | FLOPs |
|---|---|
| Difference: $\hat{y}_{ij} - y_{ij}$ | $Bd$ subtractions |
| Square: $(\cdot)^2$ | $Bd$ multiplications |
| Sum over $d$ and $B$ | $Bd - 1$ additions |
| Division by $Bd$ | 1 division |
| Total forward | $\approx 3Bd$ FLOPs |
| Gradient computation | $Bd$ subtractions + $Bd$ multiplications = $2Bd$ FLOPs |

Connection to MLE: Minimizing MSE is equivalent to maximizing the log-likelihood under a Gaussian model $P(y|\hat{y}) = \mathcal{N}(\hat{y}, \sigma^2 I)$ with fixed variance. The gradient $\propto (\hat{y} - y)$ pushes predictions toward targets linearly — no saturation, but also no gradient boost for badly wrong predictions.
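The forward pass and gradient above can be written directly in NumPy. This is a minimal sketch (the function name is illustrative, not from any library); the comments count the FLOPs from the table:

```python
import numpy as np

def mse_loss_and_grad(y_hat, y):
    """MSE forward (~3*B*d FLOPs) and gradient w.r.t. y_hat (~2*B*d FLOPs).

    y_hat, y: arrays of shape (B, d).
    """
    B, d = y.shape
    diff = y_hat - y                       # B*d subtractions
    loss = np.sum(diff * diff) / (B * d)   # B*d mults + (B*d - 1) adds + 1 div
    grad = (2.0 / (B * d)) * diff          # gradient: (2/(B*d)) * (y_hat - y)
    return loss, grad
```

A finite-difference check confirms the gradient formula matches the loss.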

12.2 Cross-Entropy Loss

Binary Cross-Entropy (BCE)
$$L_{\text{BCE}} = -\frac{1}{B}\sum_{i=1}^{B}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$

where $\hat{y}_i = \sigma(z_i) \in (0,1)$.

Gradient w.r.t. logit $z_i$ (fused with sigmoid):

$$\pd{L}{z_i} = \frac{1}{B}(\hat{y}_i - y_i)$$
Categorical Cross-Entropy
$$L_{\text{CE}} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{c=1}^{C} y_{ic}\log(\hat{y}_{ic})$$

For one-hot labels (standard case): $L_{\text{CE}} = -\frac{1}{B}\sum_{i=1}^{B}\log(\hat{y}_{i,k_i})$ where $k_i$ is the true class.

Gradient w.r.t. logits (fused with softmax):

$$\pd{L}{\bm{z}_i} = \frac{1}{B}(\hat{\bm{y}}_i - \bm{y}_i) \in \R^C$$
Cross-Entropy — Computational Cost

| Variant | Forward FLOPs | Gradient FLOPs | Notes |
|---|---|---|---|
| BCE (binary labels) | $\approx 4B$ (2 logs, 2 mults, sum) | $B$ (subtraction) | Fused with sigmoid |
| CE, one-hot labels | $\approx 2B$ (1 log per sample + sum) | $BC$ (subtraction) | Fused with softmax |
| CE, soft labels | $\approx 3BC$ ($C$ logs per sample) | $BC$ (subtraction) | Label smoothing, distillation |

Numerical stability: The fused log_softmax computation avoids computing $\log(\text{softmax})$ separately (which loses precision for extreme logits). Instead:

$$\log \text{softmax}(z_c) = z_c - \log\!\left(\sum_j e^{z_j}\right) = z_c - \left(\max(\bm{z}) + \log\!\left(\sum_j e^{z_j - \max(\bm{z})}\right)\right)$$

This log-sum-exp trick costs $\approx 3C$ FLOPs per sample (max + shift + sum of exp + log) and is numerically stable.
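A minimal NumPy sketch of the stable computation (function names are illustrative), fusing the log-sum-exp trick with the one-hot cross-entropy from above:

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax via the log-sum-exp trick (~3C FLOPs/sample)."""
    z_max = np.max(z, axis=-1, keepdims=True)   # shift by the max logit
    lse = z_max + np.log(np.sum(np.exp(z - z_max), axis=-1, keepdims=True))
    return z - lse

def cross_entropy(z, labels):
    """CE fused with softmax: L = -(1/B) * sum_i log_softmax(z)[i, k_i]."""
    B = z.shape[0]
    return -np.mean(log_softmax(z)[np.arange(B), labels])
```

With logits on the order of 1000, the naive `log(softmax(z))` overflows to `-inf`, while this version stays finite.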

Cross-Entropy vs. MSE for Classification — Gradient Behavior

Consider a sample where the true class is $k$ and the predicted probability $\hat{y}_k \approx 0$ (very wrong prediction):

CE gradient: $\pd{L}{z_k} \approx \hat{y}_k - 1 \approx -1$ — strong, constant signal to correct the mistake.

MSE gradient: $\pd{L}{\hat{y}_k} \propto \hat{y}_k - 1 \approx -1$, but this must pass through $\sigma'(z_k)$ which is near zero when $\hat{y}_k$ is near 0 or 1 — the gradient vanishes.

Cross-entropy's gradient is stronger when predictions are badly wrong, which leads to faster convergence and therefore fewer total training FLOPs to reach a given accuracy. This is why cross-entropy is universally preferred for classification.

12.3 Other Loss Functions

| Loss | Formula | Forward FLOPs | Use Case |
|---|---|---|---|
| Hinge | $\max(0, 1 - y \cdot \hat{y})$ | $3B$ (mult, sub, max) | SVMs, some classifiers |
| Huber | $\frac{1}{2}(y-\hat{y})^2$ if $\lvert y-\hat{y}\rvert \leq \delta$, else $\delta\lvert y-\hat{y}\rvert-\frac{1}{2}\delta^2$ | $\approx 4Bd$ | Robust regression |
| Focal | $-\alpha_t(1-\hat{y}_t)^\gamma \log(\hat{y}_t)$ | $\approx 6BC$ | Class imbalance (detection) |
| KL Divergence | $\sum_x p(x)\log\frac{p(x)}{q(x)}$ | $\approx 5BC$ | Knowledge distillation, VAEs |

References for §12

[1] Goodfellow et al. (2016). Deep Learning, Ch. 6.2.2: Cost Functions.

[2] Lin, T. et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017.


§13 Optimization Algorithms — Step-by-Step Computation

The optimizer is responsible for updating the model's parameters using the computed gradients. Each optimizer defines a different update rule with different memory requirements, computational costs, and convergence properties. We analyze each optimizer in terms of: the exact update equations, per-step FLOPs, additional memory beyond the parameters and gradients, and the hyperparameters that control its behavior.

Throughout this section, $P$ denotes the total parameter count, $\bm{g}_t = \nabla_{\bm{\theta}} L(\bm{\theta}_t)$ denotes the gradient at step $t$, and $\eta$ is the learning rate.

13.1 Stochastic Gradient Descent (SGD)

SGD Update Rule
$$\bm{\theta}_{t+1} = \bm{\theta}_t - \eta \, \bm{g}_t$$
| Metric | Value |
|---|---|
| Extra state memory | 0 |
| Per-step FLOPs | $P$ multiplications (scaling) + $P$ subtractions = $2P$ |
| Hyperparameters | $\eta$ (learning rate) |

SGD is the simplest optimizer with the lowest memory footprint and per-step cost. However, it typically requires many more iterations to converge, so the total training FLOPs (per-step cost × number of steps) may be higher than more sophisticated optimizers.
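The update rule is a one-liner. As a sketch (the function name is illustrative), with a usage example minimizing the quadratic $\|\bm{\theta}\|^2$:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Plain SGD: theta <- theta - lr * grad (2P FLOPs, no extra optimizer state)."""
    return theta - lr * grad

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
theta = np.array([5.0, -3.0])
for _ in range(200):
    theta = sgd_step(theta, 2 * theta, lr=0.1)
```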

Figure 13.1. Optimizer memory footprint comparison. SGD stores only parameters. Momentum adds one velocity vector ($+4P$ bytes). Adam adds first and second moment vectors ($+8P$ bytes). Including gradients, Adam uses 2× the memory of SGD. For a 100M parameter model in float32, the difference is 0.8 GB.

13.2 SGD with Momentum

Momentum Update Rule

Step 1 — Update velocity:

$$\bm{v}_t = \mu \, \bm{v}_{t-1} + \bm{g}_t$$

Step 2 — Update parameters:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \eta \, \bm{v}_t$$
| Metric | Value |
|---|---|
| Extra state memory | $\bm{v} \in \R^P$ → $4P$ bytes (float32) |
| Per-step FLOPs | $P$ (scale $\bm{v}$ by $\mu$) + $P$ (add $\bm{g}$) + $P$ (scale $\bm{v}$ by $\eta$) + $P$ (update $\bm{\theta}$) = $4P$ |
| Hyperparameters | $\eta$ (learning rate), $\mu$ (momentum, typically 0.9) |

Momentum accumulates a moving average of past gradients, which provides two benefits: (1) it accelerates convergence in directions of consistent gradient (like a ball rolling downhill), and (2) it dampens oscillation in directions of high curvature. For strongly convex quadratics with condition number $\kappa$, heavy-ball momentum reduces the iteration count by roughly a factor of $\sqrt{\kappa}$ compared to plain gradient descent (Polyak, 1964); Nesterov's accelerated variant attains the $O(1/t^2)$ rate for general smooth convex functions.
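The two update equations translate directly into code. A minimal sketch (function name illustrative); the caller carries the velocity state between steps:

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, mu=0.9):
    """Heavy-ball momentum: v <- mu*v + g, theta <- theta - lr*v.

    4P FLOPs per step; the velocity v is the +4P bytes of optimizer state.
    """
    v = mu * v + grad
    theta = theta - lr * v
    return theta, v
```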

13.3 Nesterov Accelerated Gradient (NAG)

Nesterov Update Rule

Step 1 — Look-ahead gradient: Evaluate gradient at $\bm{\theta}_t + \mu \bm{v}_{t-1}$ (the anticipated next position)

$$\bm{g}_t = \nabla L(\bm{\theta}_t + \mu \bm{v}_{t-1})$$

Step 2 — Update velocity:

$$\bm{v}_t = \mu \, \bm{v}_{t-1} + \bm{g}_t$$

Step 3 — Update parameters:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \eta \, \bm{v}_t$$
| Metric | Value |
|---|---|
| Extra state memory | Same as momentum: $4P$ bytes |
| Per-step FLOPs | Same as momentum: $4P$ (plus $P$ for the look-ahead position computation) |
| Hyperparameters | $\eta$, $\mu$ (same as momentum) |

NAG has the same memory and nearly the same per-step cost as standard momentum, but often converges faster because the look-ahead gradient provides a form of "correction" — it accounts for where the momentum will carry the parameters, not just where they are now.

Sutskever, I. et al. (2013), "On the importance of initialization and momentum in deep learning"

13.4 AdaGrad

AdaGrad Update Rule

Step 1 — Accumulate squared gradients:

$$\bm{G}_t = \bm{G}_{t-1} + \bm{g}_t \odot \bm{g}_t$$

Step 2 — Update parameters with per-parameter learning rate:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \frac{\eta}{\sqrt{\bm{G}_t + \epsilon}} \odot \bm{g}_t$$
| Metric | Value |
|---|---|
| Extra state memory | $\bm{G} \in \R^P$ → $4P$ bytes |
| Per-step FLOPs | $P$ (square) + $P$ (accumulate) + $P$ (sqrt) + $P$ (add $\epsilon$) + $P$ (divide) + $P$ (scale) + $P$ (update) $\approx 7P$ |
| Hyperparameters | $\eta$ (learning rate), $\epsilon$ (stability, $\sim10^{-8}$) |

Problem: $\bm{G}_t$ monotonically increases, so the effective learning rate $\eta/\sqrt{G_t}$ monotonically decreases, eventually reaching zero. Training effectively stops. This motivates RMSProp.
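The decay problem is easy to demonstrate numerically. In this sketch (function name illustrative), a constant gradient of 1 makes the effective per-parameter learning rate $\eta/\sqrt{G_t}$ shrink at every step:

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.1, eps=1e-8):
    """AdaGrad: accumulate squared gradients, divide the LR by their root (~7P FLOPs)."""
    G = G + grad * grad
    theta = theta - lr * grad / np.sqrt(G + eps)
    return theta, G

theta, G = np.array([1.0]), np.zeros(1)
effective_lrs = []
for _ in range(100):
    g = np.array([1.0])                            # constant gradient
    theta, G = adagrad_step(theta, G, g)
    effective_lrs.append(0.1 / np.sqrt(G + 1e-8)[0])  # eta / sqrt(G_t)
```

`effective_lrs` is monotonically nonincreasing, illustrating why training eventually stalls.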

Duchi, J. et al. (2011), "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization", JMLR

13.5 RMSProp

RMSProp Update Rule

Step 1 — Exponential moving average of squared gradients:

$$\bm{v}_t = \beta \, \bm{v}_{t-1} + (1 - \beta) \, \bm{g}_t \odot \bm{g}_t$$

Step 2 — Update:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \frac{\eta}{\sqrt{\bm{v}_t + \epsilon}} \odot \bm{g}_t$$
| Metric | Value |
|---|---|
| Extra state memory | $\bm{v} \in \R^P$ → $4P$ bytes |
| Per-step FLOPs | $\approx 8P$ |
| Hyperparameters | $\eta$, $\beta$ (decay rate, typically 0.9), $\epsilon$ |

RMSProp fixes AdaGrad's diminishing learning rate by using a moving average instead of a cumulative sum. The effective learning rate for each parameter is inversely proportional to the RMS of recent gradients — parameters with large gradients get smaller steps, and vice versa.

Hinton, G. (2012), Coursera Neural Networks lecture 6e (unpublished but universally cited)

13.6 Adam (Adaptive Moment Estimation)

Adam is the most widely used optimizer in deep learning. It combines the ideas of momentum (tracking the first moment of gradients) and RMSProp (tracking the second moment), with bias correction to handle the initialization at zero.

Adam Update Rule — Full Derivation

Step 1 — Update first moment estimate (momentum-like):

$$\bm{m}_t = \beta_1 \, \bm{m}_{t-1} + (1 - \beta_1) \, \bm{g}_t$$

Step 2 — Update second moment estimate (RMSProp-like):

$$\bm{v}_t = \beta_2 \, \bm{v}_{t-1} + (1 - \beta_2) \, \bm{g}_t \odot \bm{g}_t$$

Step 3 — Bias correction (compensates for zero initialization):

$$\hat{\bm{m}}_t = \frac{\bm{m}_t}{1 - \beta_1^t}, \qquad \hat{\bm{v}}_t = \frac{\bm{v}_t}{1 - \beta_2^t}$$

Step 4 — Update parameters:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \eta \cdot \frac{\hat{\bm{m}}_t}{\sqrt{\hat{\bm{v}}_t} + \epsilon}$$
Adam — Complete Computational Cost

| Sub-operation | Computation | FLOPs |
|---|---|---|
| Scale $m_{t-1}$ | $\beta_1 \bm{m}_{t-1}$ | $P$ |
| Scale gradient | $(1-\beta_1)\bm{g}_t$ | $P$ |
| Add for $\bm{m}_t$ | $+$ | $P$ |
| Square gradient | $\bm{g}_t \odot \bm{g}_t$ | $P$ |
| Scale $v_{t-1}$ | $\beta_2 \bm{v}_{t-1}$ | $P$ |
| Scale squared gradient | $(1-\beta_2)\bm{g}_t^2$ | $P$ |
| Add for $\bm{v}_t$ | $+$ | $P$ |
| Bias correct $\hat{m}$ | $\bm{m}_t / (1-\beta_1^t)$ | $P + 1$ |
| Bias correct $\hat{v}$ | $\bm{v}_t / (1-\beta_2^t)$ | $P + 1$ |
| Square root | $\sqrt{\hat{\bm{v}}_t}$ | $P$ |
| Add epsilon | $\sqrt{\hat{\bm{v}}_t} + \epsilon$ | $P$ |
| Division | $\hat{\bm{m}}_t / (\sqrt{\hat{\bm{v}}_t} + \epsilon)$ | $P$ |
| Scale by LR | $\times \eta$ | $P$ |
| Update params | $\bm{\theta}_t -$ | $P$ |
| Total per step | | $\approx 14P$ FLOPs |

Memory: $2 \times 4P = 8P$ bytes extra (for $\bm{m}$ and $\bm{v}$). Including parameters and gradients: $\mathbf{16P}$ bytes total.

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta$ (problem-dependent, often $10^{-3}$).
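The four steps above, written as a minimal NumPy sketch (the function name and state-passing convention are illustrative; real implementations keep `m`, `v`, and `t` inside an optimizer object):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step (~14P FLOPs; m and v are the +8P bytes of state).

    t is the 1-based step count used for bias correction.
    """
    m = b1 * m + (1 - b1) * grad             # Step 1: first moment (momentum-like)
    v = b2 * v + (1 - b2) * grad * grad      # Step 2: second moment (RMSProp-like)
    m_hat = m / (1 - b1 ** t)                # Step 3: bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # Step 4: update
    return theta, m, v
```

Note the adaptive scaling: at $t=1$ with fresh state, the update is $\approx -\eta \cdot \text{sign}(\bm{g}_1)$ regardless of gradient magnitude.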

Why Bias Correction Matters

At step $t=1$: $\bm{m}_1 = (1-\beta_1)\bm{g}_1$. Without correction, this is $(1-0.9)\bm{g}_1 = 0.1\bm{g}_1$ — the first moment estimate is biased toward zero by a factor of 10. The correction divides by $(1-0.9^1) = 0.1$, recovering $\hat{\bm{m}}_1 = \bm{g}_1$.

For $\bm{v}$: at $t=1$, $\bm{v}_1 = 0.001 \bm{g}_1^2$. Without correction, the effective learning rate is $\eta / \sqrt{0.001 \bm{g}_1^2} = \eta / (0.0316 |\bm{g}_1|)$ — about 30× too large. Correction divides by $1-0.999^1 = 0.001$, giving $\hat{\bm{v}}_1 = \bm{g}_1^2$ — the correct scale.

Bias correction costs $2P + 2$ FLOPs (negligible) but prevents catastrophic early updates.

Reference

[3] Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015.

13.7 AdamW (Adam with Decoupled Weight Decay)

AdamW Update Rule

The only difference from Adam is in the parameter update step. Instead of adding L2 regularization to the loss (which gets entangled with the adaptive learning rate), weight decay is applied directly:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \eta\left(\frac{\hat{\bm{m}}_t}{\sqrt{\hat{\bm{v}}_t} + \epsilon} + \lambda \bm{\theta}_t\right)$$

where $\lambda$ is the weight decay coefficient. The key insight is that in Adam, L2 regularization and weight decay are not equivalent (unlike in SGD), because Adam's adaptive scaling distorts the regularization gradient. Decoupling them restores proper weight decay behavior.

| Metric | Value |
|---|---|
| Extra FLOPs vs. Adam | $+2P$ (scale $\theta$ by $\lambda$, add to update) |
| Extra memory vs. Adam | 0 |
| Extra hyperparameters | $\lambda$ (weight decay, typically $10^{-2}$) |
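In code, the change from Adam is a single term in the update line. A minimal sketch (names illustrative); with a zero gradient the update reduces to pure multiplicative decay $\theta \leftarrow \theta(1 - \eta\lambda)$:

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=1e-2):
    """AdamW: Adam plus decoupled weight decay lr*wd*theta (+2P FLOPs vs. Adam)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decay is applied to theta directly, outside the adaptive scaling.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```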

Loshchilov, I. & Hutter, F. (2019), "Decoupled Weight Decay Regularization", ICLR 2019

13.8 Optimizer Comparison — Master Table

| Optimizer | Extra Memory | Per-Step FLOPs | Hyperparams | Key Property |
|---|---|---|---|---|
| SGD | 0 | $2P$ | $\eta$ | Simplest, lowest memory |
| SGD + Momentum | $4P$ bytes | $4P$ | $\eta, \mu$ | Accelerated convergence |
| Nesterov | $4P$ bytes | $5P$ | $\eta, \mu$ | Look-ahead correction |
| AdaGrad | $4P$ bytes | $7P$ | $\eta, \epsilon$ | Per-param LR; decays to 0 |
| RMSProp | $4P$ bytes | $8P$ | $\eta, \beta, \epsilon$ | Fixes AdaGrad decay |
| Adam | $8P$ bytes | $14P$ | $\eta, \beta_1, \beta_2, \epsilon$ | Combined momentum + RMSProp |
| AdamW | $8P$ bytes | $16P$ | $\eta, \beta_1, \beta_2, \epsilon, \lambda$ | Proper weight decay for Adam |

Numerical Comparison — ResNet-50 (25.6M params)

| Optimizer | State Memory | Per-Step Optimizer FLOPs | % of Backprop FLOPs |
|---|---|---|---|
| SGD | 0 MB | 51.2M | 0.67% |
| Momentum | 97.7 MB | 102.4M | 1.35% |
| Adam | 195.3 MB | 358.4M | 4.72% |

ResNet-50 backward pass: ~7.6G FLOPs. The optimizer update adds at most 4.7% extra FLOPs — the optimizer is never the computational bottleneck. However, the extra memory (195 MB for Adam) can be a constraint for large models on limited GPU memory.
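The numbers in the comparison table come from two small formulas: state memory is (number of extra $P$-sized vectors) × 4 bytes × $P$, and per-step FLOPs is (FLOPs per parameter) × $P$. A sketch (the helper name is illustrative):

```python
def optimizer_cost(P, state_vectors, flops_per_param, bytes_per_val=4):
    """Optimizer state memory (bytes) and per-step FLOPs for P parameters.

    state_vectors: extra P-sized state tensors (0 for SGD, 1 momentum, 2 Adam).
    """
    return state_vectors * bytes_per_val * P, flops_per_param * P

P = 25_600_000  # ResNet-50
adam_mem, adam_flops = optimizer_cost(P, state_vectors=2, flops_per_param=14)
```

This reproduces the table: Adam needs 195.3 MB of state and 358.4M FLOPs per step for ResNet-50.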

References for §13

[3] Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015.

[4] Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.

[5] Sutskever, I. et al. (2013). On the importance of initialization and momentum in deep learning. ICML 2013.

[6] Duchi, J. et al. (2011). Adaptive Subgradient Methods. JMLR, 12, 2121–2159.

[7] Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747.

[1] Goodfellow et al. (2016). Deep Learning, Ch. 8: Optimization for Training Deep Models.

[8] Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.


§14 Regularization Techniques — Computational Cost

Regularization prevents overfitting by constraining the model's capacity. From a computational efficiency perspective, the key question is: what is the per-step cost of each regularization technique, and how does it affect the total number of training steps needed?

14.1 L1 Regularization (Lasso)

L1 Regularization
$$L_{\text{reg}} = L_{\text{data}} + \lambda \sum_{i=1}^{P} |w_i|$$

Gradient contribution: $\pd{L_{\text{reg}}}{w_i} = \pd{L_{\text{data}}}{w_i} + \lambda \cdot \text{sign}(w_i)$

| Metric | Value |
|---|---|
| Extra forward FLOPs | $P$ (absolute values) + $P$ (sum) = $2P$ |
| Extra gradient FLOPs | $P$ (sign) + $P$ (scale by $\lambda$) + $P$ (add to gradient) = $3P$ |
| Extra memory | 0 |

Key property: L1 promotes sparsity — it drives many weights exactly to zero. Sparse weights can theoretically be exploited for computational savings at inference (via pruning), though in practice this requires hardware/software support for sparse operations.

14.2 L2 Regularization / Weight Decay

L2 Regularization
$$L_{\text{reg}} = L_{\text{data}} + \frac{\lambda}{2}\sum_{i=1}^{P} w_i^2 = L_{\text{data}} + \frac{\lambda}{2}\|\bm{W}\|_F^2$$

Gradient contribution: $\pd{L_{\text{reg}}}{w_i} = \pd{L_{\text{data}}}{w_i} + \lambda w_i$

| Metric | Value |
|---|---|
| Extra gradient FLOPs | $P$ (scale $w_i$ by $\lambda$) + $P$ (add to gradient) = $2P$ |
| Extra memory | 0 (uses existing parameter values) |

In SGD, L2 regularization is equivalent to weight decay: $w_i \leftarrow w_i(1 - \eta\lambda) - \eta \pd{L}{w_i}$. The factor $(1-\eta\lambda)$ simply shrinks each weight by a constant fraction per step. This costs $P$ extra multiplications — negligible.
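The equivalence is a one-line algebraic identity, which a quick sketch can verify numerically (function names are illustrative):

```python
import numpy as np

def sgd_l2(theta, grad, lr, lam):
    """SGD with the L2 penalty added to the gradient: theta - lr*(g + lam*theta)."""
    return theta - lr * (grad + lam * theta)

def sgd_weight_decay(theta, grad, lr, lam):
    """SGD with multiplicative weight decay: theta*(1 - lr*lam) - lr*g."""
    return theta * (1 - lr * lam) - lr * grad
```

Both produce identical updates in SGD; in Adam they differ, which is exactly the point of AdamW (§13.7).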

14.3 Dropout

Dropout (Inverted)

During training, for each activation $a_j$ in a layer with dropout rate $p$ (probability of dropping):

$$\text{mask}_j \sim \text{Bernoulli}(1-p)$$

$$\tilde{a}_j = \frac{a_j \cdot \text{mask}_j}{1-p}$$

During inference: no dropout applied, no scaling needed (because of the $1/(1-p)$ factor during training — this is "inverted dropout").

Dropout — Complete Computational Cost

For a layer with $n$ activations, batch size $B$:

| Operation | FLOPs (Training) | FLOPs (Inference) |
|---|---|---|
| Generate random mask | $Bn$ random number generations | 0 |
| Threshold to binary | $Bn$ comparisons | 0 |
| Multiply by mask | $Bn$ multiplications | 0 |
| Scale by $1/(1-p)$ | $Bn$ multiplications | 0 |
| Total | $\approx 4Bn$ FLOPs + RNG cost | 0 (free at inference) |

Extra memory (training): $Bn$ values for the binary mask — needed for the backward pass to apply the same mask to gradients.

Backward pass: $\pd{L}{\bm{a}} = \pd{L}{\tilde{\bm{a}}} \odot \frac{\text{mask}}{1-p}$ — same element-wise operation, $2Bn$ FLOPs.
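Forward and backward together in a minimal NumPy sketch (names illustrative; the scaled mask is cached, as noted above, so the backward pass reuses it):

```python
import numpy as np

def dropout_forward(a, p, rng, training=True):
    """Inverted dropout: zero each activation w.p. p, scale survivors by 1/(1-p)."""
    if not training:
        return a, None                              # free at inference
    mask = (rng.random(a.shape) >= p) / (1 - p)     # cached for the backward pass
    return a * mask, mask

def dropout_backward(dout, mask):
    """Backward pass: dL/da = dL/da_tilde * mask / (1-p) (mask is pre-scaled)."""
    return dout * mask
```

The $1/(1-p)$ scaling keeps the expected activation unchanged, which is why inference needs no correction.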

Dropout Efficiency Analysis

Dropout adds $\approx 4Bn$ FLOPs per layer during training. For a hidden layer of size 4096 with $B=32$: $4 \times 32 \times 4096 = 524{,}288$ FLOPs. The matrix multiplication for that same layer (assuming 4096→4096): $2 \times 32 \times 4096 \times 4096 = 1.07$G FLOPs. Dropout overhead: 0.05% — completely negligible.

The random number generation is typically the bottleneck (cuRAND on GPU), not the arithmetic. But even this is small relative to matmul cost.

At inference, dropout costs exactly zero — this makes dropout "free" at deployment time.

Reference

[9] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15, 1929–1958.

14.4 Data Augmentation

Data Augmentation — Computational Cost

Data augmentation (flipping, rotation, cropping, color jitter, mixup, cutout, etc.) is applied to input data on the CPU during data loading, typically in a separate process/thread that runs in parallel with GPU computation.

| Metric | Value |
|---|---|
| Extra GPU FLOPs | 0 (CPU preprocessing) |
| Extra GPU memory | 0 |
| Training throughput impact | Typically zero if the data pipeline is not the bottleneck |
| Extra training data | Effectively infinite (stochastic transforms) |

Data augmentation is unique among regularization techniques: it costs virtually nothing computationally but provides significant regularization benefit. This makes it one of the most FLOP-efficient ways to improve generalization.

References for §14

[9] Srivastava et al. (2014). Dropout. JMLR.

[1] Goodfellow et al. (2016). Deep Learning, Ch. 7: Regularization for Deep Learning.

[10] DeVries, T. & Taylor, G. W. (2017). Improved Regularization of Convolutional Neural Networks with Cutout. arXiv:1708.04552.

[11] Zhang, H. et al. (2018). mixup: Beyond Empirical Risk Minimization. ICLR 2018.


§15 Normalization Layers — Complete Analysis

Normalization layers standardize intermediate activations, which stabilizes training, enables larger learning rates, and reduces the number of iterations needed to converge. They add both parameters and per-step FLOPs, but the reduction in training iterations typically more than compensates. Understanding their exact computational cost is critical for efficiency analysis.

15.1 Batch Normalization (BatchNorm)

Batch Normalization — Forward Pass

For activations $\bm{x} \in \R^{B \times d}$ (or $\R^{B \times C \times H \times W}$ for CNNs, where statistics are computed per-channel across $B$, $H$, $W$):

Step 1 — Batch mean:

$$\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}$$

where $N = B$ for FC layers, $N = B \times H \times W$ for conv layers.

Step 2 — Batch variance:

$$\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N}(x_{ij} - \mu_j)^2$$

Step 3 — Normalize:

$$\hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$$

Step 4 — Scale and shift (learnable):

$$y_{ij} = \gamma_j \hat{x}_{ij} + \beta_j$$
BatchNorm — Complete Computational Cost

Let $d$ = number of features (or channels for CNNs), $N$ = number of elements per feature contributing to the statistic ($B$ for FC, $BHW$ for conv).

| Step | Operation | FLOPs |
|---|---|---|
| 1. Mean | Sum $N$ values per feature, divide | $Nd + d \approx Nd$ |
| 2. Variance | $(x-\mu)^2$: $Nd$ subs + $Nd$ squarings; sum; divide | $3Nd$ |
| 3. Normalize | $(x-\mu)$ already computed; $\sqrt{\sigma^2+\epsilon}$: $d$ sqrts; divide: $Nd$ | $Nd + d$ |
| 4. Scale+shift | $\gamma \hat{x} + \beta$: $Nd$ mults + $Nd$ adds | $2Nd$ |
| Total forward | | $\approx 7Nd + 2d$ |
| Running stats update | EMA for $\mu$, $\sigma^2$: $2d$ mults + $2d$ adds | $4d$ |

For a CNN conv layer: $N = B \times H \times W$, $d = C$ (channels).

Learnable parameters: $\gamma, \beta \in \R^C$ → $2C$ parameters

Running statistics (inference only): running mean + running variance → $2C$ values (not learnable, not parameters)

Memory for backprop: Must cache $\hat{\bm{x}}$ (normalized values), $\mu$, $\sigma^2$ → $Nd + 2d$ elements.
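The four forward steps, sketched in NumPy for a flattened $(N, d)$ input (function name illustrative; conv inputs would first be reshaped so that $N = BHW$):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """BatchNorm forward over a (N, d) input: ~7*N*d FLOPs.

    Returns the output and the (x_hat, mu, var) cache needed for backprop.
    """
    mu = x.mean(axis=0)                       # Step 1: per-feature mean
    var = ((x - mu) ** 2).mean(axis=0)        # Step 2: per-feature (biased) variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # Step 3: normalize
    y = gamma * x_hat + beta                  # Step 4: learnable scale and shift
    return y, (x_hat, mu, var)
```

With $\gamma = 1$, $\beta = 0$, the output has per-feature mean 0 and standard deviation ~1, whatever the input statistics.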

BatchNorm — Backward Pass (Full Derivation)

Given upstream gradient $\pd{L}{y_{ij}}$, we need $\pd{L}{x_{ij}}$, $\pd{L}{\gamma_j}$, and $\pd{L}{\beta_j}$.

(a) Gradient w.r.t. $\gamma$:

$$\pd{L}{\gamma_j} = \sum_{i=1}^{N} \pd{L}{y_{ij}} \cdot \hat{x}_{ij}$$

FLOPs: $Nd$ multiplications + $Nd$ additions = $2Nd$

(b) Gradient w.r.t. $\beta$:

$$\pd{L}{\beta_j} = \sum_{i=1}^{N} \pd{L}{y_{ij}}$$

FLOPs: $Nd$ additions

(c) Gradient w.r.t. input $x$ (the complex part):

$$\pd{L}{\hat{x}_{ij}} = \pd{L}{y_{ij}} \cdot \gamma_j$$

$$\pd{L}{\sigma_j^2} = \sum_{i=1}^{N}\pd{L}{\hat{x}_{ij}} \cdot (x_{ij} - \mu_j) \cdot \left(-\frac{1}{2}\right)(\sigma_j^2+\epsilon)^{-3/2}$$

$$\pd{L}{\mu_j} = \sum_{i=1}^{N}\pd{L}{\hat{x}_{ij}} \cdot \frac{-1}{\sqrt{\sigma_j^2+\epsilon}} + \pd{L}{\sigma_j^2}\cdot\frac{-2}{N}\sum_{i=1}^{N}(x_{ij}-\mu_j)$$

$$\pd{L}{x_{ij}} = \pd{L}{\hat{x}_{ij}}\cdot\frac{1}{\sqrt{\sigma_j^2+\epsilon}} + \pd{L}{\sigma_j^2}\cdot\frac{2(x_{ij}-\mu_j)}{N} + \pd{L}{\mu_j}\cdot\frac{1}{N}$$

Total backward FLOPs: approximately $\mathbf{10Nd + \text{lower order terms}}$

BatchNorm Cost — Concrete Example

Conv layer output: $B=32$, $C=256$, $H=W=56$.

$N = 32 \times 56 \times 56 = 100{,}352$. $d = 256$.

Forward: $7 \times 100{,}352 \times 256 \approx 180$M FLOPs

Backward: $\approx 10 \times 100{,}352 \times 256 \approx 257$M FLOPs

Compared to the conv layer itself: the preceding 3×3 conv (128→256) at this resolution costs $2 \times 32 \times 56 \times 56 \times 256 \times (3 \times 3 \times 128) \approx 59$G FLOPs for the batch, so BatchNorm forward adds 180M / 59G ≈ 0.3% overhead. BatchNorm is computationally cheap relative to convolution.

Parameters: $2 \times 256 = 512$ — negligible vs. the conv layer's ~295K params.

BatchNorm — Training vs. Inference
| | Training | Inference |
|---|---|---|
| Statistics used | Batch statistics ($\mu_B$, $\sigma_B^2$) | Running statistics ($\mu_{\text{run}}$, $\sigma_{\text{run}}^2$) |
| Compute per element | ~7 FLOPs (forward) | ~4 FLOPs (normalize + scale/shift) |
| Batch dependence | Yes — stats depend on the entire batch | No — uses fixed running stats |
| Can fuse with conv | No (stats depend on data) | Yes — BN can be folded into the preceding conv weights |

BN Fusion at inference: Since inference uses fixed $\mu$, $\sigma$, $\gamma$, $\beta$, the entire BN operation is a linear transform: $y = \frac{\gamma}{\sigma}x + (\beta - \frac{\gamma\mu}{\sigma})$. This can be folded into the preceding conv/FC layer's weights and biases, making BN zero-cost at inference.
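The folding identity is easy to verify. This sketch folds BN into a linear layer $y = Wx + b$ for clarity (the conv case is identical per output channel; the function name is illustrative):

```python
import numpy as np

def fold_bn_into_linear(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold inference-time BN into the preceding linear layer.

    BN(W x + b) == W_fold x + b_fold, so BN costs nothing at deployment.
    """
    scale = gamma / np.sqrt(var + eps)   # per-output-channel multiplier
    W_fold = W * scale[:, None]          # scale each output row of W
    b_fold = scale * (b - mu) + beta
    return W_fold, b_fold
```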

Reference

[12] Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.

15.2 Layer Normalization

Layer Normalization

Instead of normalizing across the batch (per feature), LayerNorm normalizes across features (per sample):

$$\mu_i = \frac{1}{d}\sum_{j=1}^{d} x_{ij}, \qquad \sigma_i^2 = \frac{1}{d}\sum_{j=1}^{d}(x_{ij} - \mu_i)^2$$

$$y_{ij} = \gamma_j \cdot \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta_j$$
| Metric | Value |
|---|---|
| Normalization axis | Features (per sample) |
| Parameters | $2d$ ($\gamma, \beta$) |
| Forward FLOPs | $\approx 7Bd$ (same order as BatchNorm) |
| Batch dependence | None — each sample normalized independently |
| Can fuse at inference | No (statistics are data-dependent per sample) |

LayerNorm is the standard normalization for Transformers and RNNs, where batch statistics are unreliable (variable sequence lengths, small effective batches). Its per-element computational cost is identical to BatchNorm, but it cannot be fused at inference — it always costs ~4 FLOPs per element.
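A minimal sketch for a $(B, d)$ input (function name illustrative). Note the only change from BatchNorm is the reduction axis, which is what removes the batch dependence:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """LayerNorm: normalize each sample over its d features (~7*B*d FLOPs)."""
    mu = x.mean(axis=-1, keepdims=True)   # per-sample mean over features
    var = x.var(axis=-1, keepdims=True)   # per-sample variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

Changing one sample in the batch leaves every other sample's output untouched, unlike BatchNorm.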

15.3 Group Normalization and Instance Normalization

Group Normalization (GN)

Divides $C$ channels into $G$ groups and normalizes within each group independently. For each group $g$ containing channels $c_{g,1}, \ldots, c_{g,C/G}$:

Statistics are computed over the $C/G$ channels and all spatial positions $H \times W$ within each sample. Each statistic is computed over $(C/G) \times H \times W$ elements.

| $G$ | Method | Normalize Over |
|---|---|---|
| $G = 1$ | Layer Normalization | All $C \times H \times W$ elements |
| $1 < G < C$ | Group Normalization | $(C/G) \times H \times W$ elements |
| $G = C$ | Instance Normalization | $H \times W$ elements per channel |

Parameters: $2C$ (same as BatchNorm). FLOPs: Same order as BatchNorm. Batch dependence: None.

Wu, Y. & He, K. (2018), "Group Normalization", ECCV 2018

15.4 Normalization Methods — Comparison Table

| Method | Normalize Over | Params | FLOPs | Batch Dep. | Fuse@Inf | Best For |
|---|---|---|---|---|---|---|
| BatchNorm | $B \times H \times W$ per channel | $2C$ | $\sim 7BCHW$ | Yes | Yes | CNNs (large batch) |
| LayerNorm | $d$ features per sample | $2d$ | $\sim 7Bd$ | No | No | Transformers, RNNs |
| GroupNorm | $(C/G) \times H \times W$ per sample | $2C$ | $\sim 7BCHW$ | No | No | CNNs (small batch) |
| InstanceNorm | $H \times W$ per channel per sample | $2C$ | $\sim 7BCHW$ | No | No | Style transfer |
Figure 15.1. Visualization of what each normalization method normalizes over, shown as colored regions on a (Batch × Channel) tensor. BatchNorm normalizes a column (one channel across all samples). LayerNorm normalizes a row (all channels for one sample). GroupNorm normalizes a subgroup of channels for one sample. InstanceNorm normalizes one channel for one sample. The spatial dimensions ($H \times W$) are always included in the normalization (not shown).

References for §15

[12] Ioffe, S. & Szegedy, C. (2015). Batch Normalization. ICML 2015.

[13] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450.

[14] Wu, Y. & He, K. (2018). Group Normalization. ECCV 2018.

[15] Santurkar, S. et al. (2018). How Does Batch Normalization Help Optimization? NeurIPS 2018.


§16 Learning Rate Scheduling

The learning rate schedule determines how the learning rate $\eta$ changes over the course of training. While the per-step cost of computing a new learning rate is negligible ($O(1)$ FLOPs), the choice of schedule profoundly affects convergence speed and therefore total training FLOPs. A good schedule can halve the number of training epochs needed.

16.1 Step Decay

Step Decay
$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / S \rfloor}$$

where $\gamma \in (0,1)$ is the decay factor (typically 0.1) and $S$ is the step interval (in epochs). Commonly: reduce LR by 10× every 30 epochs.

Computational cost: 1 floor division + 1 power + 1 multiplication = $O(1)$ per step.

16.2 Cosine Annealing

Cosine Annealing
$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\!\left(\frac{t}{T}\pi\right)\right)$$

where $T$ is the total number of training steps.

Computational cost: 1 cosine + 3 multiplications + 2 additions = $O(1)$.

Cosine annealing smoothly decreases the learning rate from $\eta_{\max}$ to $\eta_{\min}$, spending more time at moderate learning rates than step decay. Empirically, it often gives slightly better final accuracy.
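The schedule formula translates directly into a one-line function (the name is illustrative); the endpoints and midpoint follow from $\cos(0) = 1$, $\cos(\pi/2) = 0$, $\cos(\pi) = -1$:

```python
import math

def cosine_lr(t, T, lr_max, lr_min=0.0):
    """Cosine annealing from lr_max at t=0 to lr_min at t=T (O(1) per step)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))
```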

Loshchilov, I. & Hutter, F. (2017), "SGDR: Stochastic Gradient Descent with Warm Restarts", ICLR 2017

16.3 Warmup

Linear Warmup
$$\eta_t = \eta_{\text{target}} \cdot \frac{t}{T_{\text{warmup}}} \qquad \text{for } t \leq T_{\text{warmup}}$$

After warmup: transition to the main schedule (cosine, step decay, etc.).

Warmup prevents large, destabilizing updates in the early stages of training when: (a) the network is far from any reasonable solution, (b) batch statistics (for BatchNorm) are unreliable, or (c) the Adam second moment estimates are unreliable (bias correction partially addresses this, but warmup helps further).

Goyal, P. et al. (2017), "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour"

Warmup — Efficiency Implication

Warmup adds a small number of low-learning-rate steps at the start (typically 5–10 epochs out of 90+). These steps produce small weight updates and may seem "wasted." However, without warmup, large initial LR can cause divergence, requiring the entire training run to be restarted. Warmup is therefore a form of insurance: $\approx$5% extra steps to avoid $\gg$100% wasted steps from divergence.
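A sketch of how linear warmup composes with a main schedule (the function name and the cosine tail are illustrative choices; any main schedule can follow the warmup phase):

```python
import math

def warmup_cosine_lr(t, T_warmup, T_total, lr_target):
    """Linear warmup to lr_target over T_warmup steps, then cosine decay to 0."""
    if t <= T_warmup:
        return lr_target * t / T_warmup              # warmup phase
    frac = (t - T_warmup) / (T_total - T_warmup)     # progress through main phase
    return 0.5 * lr_target * (1 + math.cos(math.pi * frac))
```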

16.4 One-Cycle Policy

One-Cycle Policy (Smith & Topin, 2019)

The learning rate follows a cycle:

Phase 1 (warmup, ~30% of training): LR increases linearly from $\eta_{\text{low}}$ to $\eta_{\text{max}}$

Phase 2 (annealing, ~70% of training): LR decreases via cosine from $\eta_{\text{max}}$ to $\eta_{\text{low}}/10$

The key finding: using a much larger $\eta_{\text{max}}$ than typical (found via LR range test) enables "super-convergence" — training to the same accuracy in 5–10× fewer steps than standard schedules.

Computational implication: If one-cycle reaches target accuracy in 30 epochs instead of 90, the total training FLOPs are reduced by 3×. The per-step cost is identical; only the number of steps differs.

Smith, L. N. & Topin, N. (2019), "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates"

Figure 16.1. Three common learning rate schedules. Step decay reduces LR abruptly at fixed intervals. Cosine annealing provides a smooth decay. One-cycle includes a warmup phase where LR increases, then decays — often enabling convergence in fewer total steps.

References for §16

[16] Loshchilov, I. & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017.

[17] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.

[18] Smith, L. N. & Topin, N. (2019). Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. AISTATS 2019.


§ Computational Summary — All Training Components

Per-Step Cost of All Training Components ($P$ parameters, batch size $B$, $d$ features per layer)

| Component | Extra FLOPs per Step | Extra Memory (bytes) | % of Training Cost |
|---|---|---|---|
| Forward pass | $F$ | Activations: $O(BL\bar{d})$ | ~33% |
| Backward pass | $\approx 2F$ | Gradients: $4P$ | ~66% |
| Loss computation | $O(BC)$ | $O(BC)$ | <0.01% |
| Optimizer (SGD) | $2P$ | 0 | <1% |
| Optimizer (Adam) | $14P$ | $8P$ | ~2–5% |
| L2 regularization | $2P$ | 0 | <0.5% |
| Dropout (per layer) | $\approx 4Bn$ | $4Bn$ (mask) | <0.1% |
| BatchNorm (per layer, fwd) | $\approx 7Nd$ | $4(Nd + 2d)$ | ~1–2% |
| BatchNorm (per layer, bwd) | $\approx 10Nd$ | — | ~2–3% |
| LR schedule | $O(1)$ | 0 | $\approx 0$ |
Key Takeaway for Efficiency Research

The forward and backward passes through the network layers (matrix multiplications and convolutions) account for ~95–98% of total training FLOPs. All other components — optimizer updates, normalization, regularization, loss computation — are collectively less than 5%. Therefore:

(1) Reducing model FLOPs (efficient architectures, pruning, quantization) is the primary lever for training efficiency.

(2) Reducing the number of training steps (better optimizers, LR schedules, initialization) is the secondary lever — it multiplies the total by a smaller constant.

(3) Optimizer memory (not FLOPs) is often the binding constraint — Adam's $8P$ extra bytes can be the difference between fitting a model on a GPU or not.

Total Training Memory Formula

Peak Memory During Training (float32)
$$\text{Memory}_{\text{peak}} = \underbrace{4P}_{\text{params}} + \underbrace{4P}_{\text{gradients}} + \underbrace{4M_{\text{opt}}P}_{\text{optimizer state}} + \underbrace{4\sum_{l}\text{act}_l(B)}_{\text{cached activations}} + \underbrace{\text{buffers}}_{\text{im2col, workspace}}$$

where $M_{\text{opt}} = 0$ (SGD), $1$ (Momentum), $2$ (Adam), and $\text{act}_l(B)$ is the activation tensor size at layer $l$ for batch size $B$.

For ResNet-50 (25.6M params), Adam, B=32, 224×224 input:

Params: 97.7 MB | Grads: 97.7 MB | Adam state: 195.3 MB | Activations: ~2.5 GB | Total: ~2.9 GB

Activations dominate for this configuration — a common pattern in modern CNNs.
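The peak-memory formula is simple enough to evaluate directly. In this sketch (the function name is illustrative, and the activation total is passed in as a measured or estimated byte count):

```python
def peak_training_memory(P, m_opt, activation_bytes, bytes_per_val=4):
    """Peak float32 training memory: params + grads + optimizer state + activations.

    m_opt: optimizer state multiplier (0 for SGD, 1 for Momentum, 2 for Adam).
    Workspace buffers (im2col etc.) are excluded.
    """
    return bytes_per_val * P * (2 + m_opt) + activation_bytes

GB = 1024**3
total = peak_training_memory(25_600_000, m_opt=2, activation_bytes=int(2.5 * GB))
```

Plugging in the ResNet-50 numbers reproduces the ~2.9 GB total quoted above.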


§ Complete References

Loss Functions

[1] Goodfellow et al. (2016). Deep Learning, Ch. 6.2.2 (Cost Functions), Ch. 8 (Optimization).

[2] Lin, T. et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017.

Optimizers

[3] Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015.

[4] Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.

[5] Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. ICML 2013.

[6] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR, 12, 2121–2159.

[7] Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747.

[8] Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Comp. Math. and Math. Physics, 4(5), 1–17.

Regularization

[9] Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15, 1929–1958.

[10] DeVries, T. & Taylor, G. W. (2017). Improved Regularization with Cutout. arXiv:1708.04552.

[11] Zhang, H. et al. (2018). mixup: Beyond Empirical Risk Minimization. ICLR 2018.

Normalization

[12] Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training. ICML 2015.

[13] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450.

[14] Wu, Y. & He, K. (2018). Group Normalization. ECCV 2018.

[15] Santurkar, S. et al. (2018). How Does Batch Normalization Help Optimization? NeurIPS 2018.

Learning Rate Schedules

[16] Loshchilov, I. & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017.

[17] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.

[18] Smith, L. N. & Topin, N. (2019). Super-Convergence. AISTATS 2019.

Surveys

[19] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12), 2295–2329.
