The loss function is the objective being minimized by gradient descent. Its computational cost appears in every training step (forward pass), and its gradient is the starting point of backpropagation. While the loss computation itself is typically cheap compared to the network's forward pass, the properties of the loss function (smoothness, gradient behavior near zero) have profound effects on convergence speed and therefore total training FLOPs.
$$L_{\text{MSE}} = \frac{1}{Bd}\sum_{i=1}^{B}\sum_{j=1}^{d}(\hat{y}_{ij} - y_{ij})^2$$

where $B$ is the batch size and $d$ is the output dimension (for regression) or the number of classes.
Gradient:
$$\pd{L}{\hat{y}_{ij}} = \frac{2}{Bd}(\hat{y}_{ij} - y_{ij})$$

| Operation | FLOPs |
|---|---|
| Difference: $\hat{y}_{ij} - y_{ij}$ | $Bd$ subtractions |
| Square: $(\cdot)^2$ | $Bd$ multiplications |
| Sum over $d$ and $B$ | $Bd - 1$ additions |
| Division by $Bd$ | 1 division |
| Total forward | $\approx 3Bd$ FLOPs |
| Gradient computation | $Bd$ subtractions + $Bd$ multiplications = $2Bd$ FLOPs |
Connection to MLE: Minimizing MSE is equivalent to maximizing the log-likelihood under a Gaussian model $P(y|\hat{y}) = \mathcal{N}(\hat{y}, \sigma^2 I)$ with fixed variance. The gradient $\propto (\hat{y} - y)$ pushes predictions toward targets linearly — no saturation, but also no gradient boost for badly wrong predictions.
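To make the cost accounting concrete, here is a minimal NumPy sketch of the forward and gradient computations above (the function name and shapes are illustrative):

```python
import numpy as np

def mse_forward_backward(y_hat, y):
    """MSE loss and gradient for predictions of shape (B, d).

    Forward: ~3*B*d FLOPs (subtract, square, sum); gradient: ~2*B*d FLOPs.
    """
    B, d = y_hat.shape
    diff = y_hat - y                      # B*d subtractions (reused by the gradient)
    loss = np.sum(diff ** 2) / (B * d)    # B*d squarings + (B*d - 1) adds + 1 division
    grad = (2.0 / (B * d)) * diff         # B*d multiplications
    return loss, grad
```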
$$L_{\text{BCE}} = -\frac{1}{B}\sum_{i=1}^{B}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

where $\hat{y}_i = \sigma(z_i) \in (0,1)$.
Gradient w.r.t. logit $z_i$ (fused with sigmoid):
$$\pd{L}{z_i} = \frac{1}{B}(\hat{y}_i - y_i)$$

For one-hot labels (the standard case): $L_{\text{CE}} = -\frac{1}{B}\sum_{i=1}^{B}\log(\hat{y}_{i,k_i})$, where $k_i$ is the true class of sample $i$.
Gradient w.r.t. logits (fused with softmax):
$$\pd{L}{\bm{z}_i} = \frac{1}{B}(\hat{\bm{y}}_i - \bm{y}_i) \in \R^C$$

| Variant | Forward FLOPs | Gradient FLOPs | Notes |
|---|---|---|---|
| BCE (binary labels) | $\approx 4B$ (2 logs, 2 mults, sum) | $B$ (subtraction) | Fused with sigmoid |
| CE, one-hot labels | $\approx 2B$ (1 log per sample + sum) | $BC$ (subtraction) | Fused with softmax |
| CE, soft labels | $\approx 3BC$ (C logs per sample) | $BC$ (subtraction) | Label smoothing, distillation |
Numerical stability: The fused log_softmax computation avoids computing $\log(\text{softmax})$ separately, which loses precision for extreme logits. Instead:

$$\log \text{softmax}(z_k) = z_k - \max_j z_j - \log\sum_{j=1}^{C} e^{z_j - \max_j z_j}$$

This log-sum-exp trick costs $\approx 3C$ FLOPs per sample (max + shift + sum of exponentials + log) and is numerically stable.
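A minimal NumPy sketch of the fused, numerically stable computation (names illustrative); it implements the log-sum-exp trick above together with the fused softmax gradient:

```python
import numpy as np

def cross_entropy_from_logits(z, k):
    """Stable CE for logits z of shape (B, C) and true-class indices k of shape (B,)."""
    z_shift = z - z.max(axis=1, keepdims=True)    # subtract row max (stability)
    log_probs = z_shift - np.log(np.exp(z_shift).sum(axis=1, keepdims=True))
    B = z.shape[0]
    loss = -log_probs[np.arange(B), k].mean()     # -log p(true class), averaged
    grad = np.exp(log_probs)                      # softmax(z)
    grad[np.arange(B), k] -= 1.0                  # softmax(z) - onehot(k)
    return loss, grad / B                         # fused gradient w.r.t. logits
```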
Consider a sample where the true class is $k$ and the predicted probability $\hat{y}_k \approx 0$ (very wrong prediction):
CE gradient: $\pd{L}{z_k} \approx \hat{y}_k - 1 \approx -1$ — strong, constant signal to correct the mistake.
MSE gradient: $\pd{L}{\hat{y}_k} \propto \hat{y}_k - 1 \approx -1$, but this must pass through $\sigma'(z_k)$ which is near zero when $\hat{y}_k$ is near 0 or 1 — the gradient vanishes.
Cross-entropy's gradient is stronger when predictions are badly wrong, which leads to faster convergence and therefore fewer total training FLOPs to reach a given accuracy. This is why cross-entropy is universally preferred for classification.
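A quick numeric check of the argument, assuming a single sigmoid output unit and an illustrative logit of $-6$:

```python
import numpy as np

z = -6.0                       # logit for the true class; sigmoid(z) ~ 0.0025
p = 1 / (1 + np.exp(-z))       # predicted probability of the true class

ce_grad = p - 1.0                          # fused CE gradient w.r.t. z: ~ -1
mse_grad = 2 * (p - 1.0) * p * (1 - p)     # MSE gradient passes through sigmoid'(z) = p(1-p)

print(f"CE dL/dz = {ce_grad:.4f}, MSE dL/dz = {mse_grad:.6f}")
# CE dL/dz ~ -0.9975 vs. MSE dL/dz ~ -0.0049: the MSE signal has all but vanished.
```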
| Loss | Formula | Forward FLOPs | Use Case |
|---|---|---|---|
| Hinge | $\max(0, 1 - y \cdot \hat{y})$ | $3B$ (mult, sub, max) | SVMs, some classifiers |
| Huber | $\frac{1}{2}(y-\hat{y})^2$ if $|y-\hat{y}| \leq \delta$, else $\delta|y-\hat{y}|-\frac{1}{2}\delta^2$ | $\approx 4Bd$ | Robust regression |
| Focal | $-\alpha_t(1-\hat{y}_t)^\gamma \log(\hat{y}_t)$ | $\approx 6BC$ | Class imbalance (detection) |
| KL Divergence | $\sum p(x)\log\frac{p(x)}{q(x)}$ | $\approx 5BC$ | Knowledge distillation, VAEs |
[1] Goodfellow et al. (2016). Deep Learning, Ch. 6.2.2: Cost Functions.
[2] Lin, T. et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017.
The optimizer is responsible for updating the model's parameters using the computed gradients. Each optimizer defines a different update rule with different memory requirements, computational costs, and convergence properties. We analyze each optimizer in terms of: the exact update equations, per-step FLOPs, additional memory beyond the parameters and gradients, and the hyperparameters that control its behavior.
Throughout this section, $P$ denotes the total parameter count, $\bm{g}_t = \nabla_{\bm{\theta}} L(\bm{\theta}_t)$ denotes the gradient at step $t$, and $\eta$ is the learning rate.
Update rule:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \eta \, \bm{g}_t$$

| Metric | Value |
|---|---|
| Extra state memory | 0 |
| Per-step FLOPs | $P$ multiplications (scaling) + $P$ subtractions = $2P$ |
| Hyperparameters | $\eta$ (learning rate) |
SGD is the simplest optimizer with the lowest memory footprint and per-step cost. However, it typically requires many more iterations to converge, so the total training FLOPs (per-step cost × number of steps) may be higher than more sophisticated optimizers.
Step 1 — Update velocity:
$$\bm{v}_t = \mu \, \bm{v}_{t-1} + \bm{g}_t$$

Step 2 — Update parameters:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \eta \, \bm{v}_t$$

| Metric | Value |
|---|---|
| Extra state memory | $\bm{v} \in \R^P$ → $4P$ bytes (float32) |
| Per-step FLOPs | $P$ (scale $\bm{v}$ by $\mu$) + $P$ (add $\bm{g}$) + $P$ (scale $\bm{v}$ by $\eta$) + $P$ (update $\bm{\theta}$) = $4P$ |
| Hyperparameters | $\eta$ (learning rate), $\mu$ (momentum, typically 0.9) |
Momentum accumulates a moving average of past gradients, which provides two benefits: (1) it accelerates convergence in directions of consistent gradient (like a ball rolling downhill), and (2) it dampens oscillation in directions of high curvature. Theoretically, momentum improves convergence from $O(1/t)$ to $O(1/t^2)$ for convex quadratics, which can translate to roughly a $\sqrt{\kappa}$-fold reduction in iterations for condition number $\kappa$ (Polyak, 1964).
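A minimal sketch of both update rules in NumPy (names illustrative); the comments track the per-step FLOP counts from the tables above:

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """Plain SGD: P multiplications + P subtractions = 2P FLOPs, no extra state."""
    return theta - lr * grad

def momentum_step(theta, v, grad, lr, mu=0.9):
    """SGD with momentum: 4P FLOPs per step, 4P bytes of float32 state for v."""
    v = mu * v + grad        # P mults + P adds
    theta = theta - lr * v   # P mults + P subs
    return theta, v
```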
Step 1 — Look-ahead gradient: Evaluate gradient at $\bm{\theta}_t + \mu \bm{v}_{t-1}$ (the anticipated next position)
$$\bm{g}_t = \nabla L(\bm{\theta}_t + \mu \bm{v}_{t-1})$$

Step 2 — Update velocity:

$$\bm{v}_t = \mu \, \bm{v}_{t-1} + \bm{g}_t$$

Step 3 — Update parameters:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \eta \, \bm{v}_t$$

| Metric | Value |
|---|---|
| Extra state memory | Same as momentum: $4P$ bytes |
| Per-step FLOPs | Same as momentum: $4P$ (plus $P$ for the look-ahead position computation) |
| Hyperparameters | $\eta$, $\mu$ (same as momentum) |
NAG has the same memory and nearly the same per-step cost as standard momentum, but often converges faster because the look-ahead gradient provides a form of "correction" — it accounts for where the momentum will carry the parameters, not just where they are now.
[5] Sutskever, I. et al. (2013). On the importance of initialization and momentum in deep learning. ICML 2013.
Step 1 — Accumulate squared gradients:
$$\bm{G}_t = \bm{G}_{t-1} + \bm{g}_t \odot \bm{g}_t$$

Step 2 — Update parameters with per-parameter learning rate:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \frac{\eta}{\sqrt{\bm{G}_t + \epsilon}} \odot \bm{g}_t$$

| Metric | Value |
|---|---|
| Extra state memory | $\bm{G} \in \R^P$ → $4P$ bytes |
| Per-step FLOPs | $P$ (square) + $P$ (accumulate) + $P$ (sqrt) + $P$ (add $\epsilon$) + $P$ (divide) + $P$ (scale) + $P$ (update) = $\approx 7P$ |
| Hyperparameters | $\eta$ (learning rate), $\epsilon$ (stability, $\sim10^{-8}$) |
Problem: $\bm{G}_t$ monotonically increases, so the effective learning rate $\eta/\sqrt{G_t}$ monotonically decreases, eventually reaching zero. Training effectively stops. This motivates RMSProp.
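A sketch that makes the decay visible (assumed names; a constant unit gradient for illustration):

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """AdaGrad: ~7P FLOPs per step; G accumulates squared gradients without decay."""
    G = G + grad * grad                            # P squarings + P adds
    theta = theta - lr * grad / np.sqrt(G + eps)   # sqrt, add eps, divide, scale, update
    return theta, G

# With a constant gradient of 1, G_t = t, so the effective step is lr / sqrt(t) -> 0:
theta, G = np.zeros(1), np.zeros(1)
for t in range(1, 5):
    theta, G = adagrad_step(theta, G, np.ones(1))  # steps shrink as 1/sqrt(t)
```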
[6] Duchi, J. et al. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR, 12, 2121–2159.
Step 1 — Exponential moving average of squared gradients:
$$\bm{v}_t = \beta \, \bm{v}_{t-1} + (1 - \beta) \, \bm{g}_t \odot \bm{g}_t$$

Step 2 — Update:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \frac{\eta}{\sqrt{\bm{v}_t + \epsilon}} \odot \bm{g}_t$$

| Metric | Value |
|---|---|
| Extra state memory | $\bm{v} \in \R^P$ → $4P$ bytes |
| Per-step FLOPs | $\approx 8P$ |
| Hyperparameters | $\eta$, $\beta$ (decay rate, typically 0.9), $\epsilon$ |
RMSProp fixes AdaGrad's diminishing learning rate by using a moving average instead of a cumulative sum. The effective learning rate for each parameter is inversely proportional to the RMS of recent gradients — parameters with large gradients get smaller steps, and vice versa.
Hinton, G. (2012). Neural Networks for Machine Learning, Coursera, Lecture 6e (unpublished but widely cited).
Adam is the most widely used optimizer in deep learning. It combines the ideas of momentum (tracking the first moment of gradients) and RMSProp (tracking the second moment), with bias correction to handle the initialization at zero.
Step 1 — Update first moment estimate (momentum-like):
$$\bm{m}_t = \beta_1 \, \bm{m}_{t-1} + (1 - \beta_1) \, \bm{g}_t$$

Step 2 — Update second moment estimate (RMSProp-like):

$$\bm{v}_t = \beta_2 \, \bm{v}_{t-1} + (1 - \beta_2) \, \bm{g}_t \odot \bm{g}_t$$

Step 3 — Bias correction (compensates for zero initialization):

$$\hat{\bm{m}}_t = \frac{\bm{m}_t}{1 - \beta_1^t}, \qquad \hat{\bm{v}}_t = \frac{\bm{v}_t}{1 - \beta_2^t}$$

Step 4 — Update parameters:

$$\bm{\theta}_{t+1} = \bm{\theta}_t - \eta \cdot \frac{\hat{\bm{m}}_t}{\sqrt{\hat{\bm{v}}_t} + \epsilon}$$

| Sub-operation | Computation | FLOPs |
|---|---|---|
| Scale $m_{t-1}$ | $\beta_1 \bm{m}_{t-1}$ | $P$ |
| Scale gradient | $(1-\beta_1)\bm{g}_t$ | $P$ |
| Add for $\bm{m}_t$ | $+$ | $P$ |
| Square gradient | $\bm{g}_t \odot \bm{g}_t$ | $P$ |
| Scale $v_{t-1}$ | $\beta_2 \bm{v}_{t-1}$ | $P$ |
| Scale squared gradient | $(1-\beta_2)\bm{g}_t^2$ | $P$ |
| Add for $\bm{v}_t$ | $+$ | $P$ |
| Bias correct $\hat{m}$ | $\bm{m}_t / (1-\beta_1^t)$ | $P + 1$ |
| Bias correct $\hat{v}$ | $\bm{v}_t / (1-\beta_2^t)$ | $P + 1$ |
| Square root | $\sqrt{\hat{\bm{v}}_t}$ | $P$ |
| Add epsilon | $\sqrt{\hat{\bm{v}}_t} + \epsilon$ | $P$ |
| Division | $\hat{\bm{m}}_t / (\sqrt{\hat{\bm{v}}_t} + \epsilon)$ | $P$ |
| Scale by LR | $\times \eta$ | $P$ |
| Update params | $\bm{\theta}_t -$ | $P$ |
| Total per step | | $\approx 14P$ FLOPs |
Memory: $2 \times 4P = 8P$ bytes extra (for $\bm{m}$ and $\bm{v}$). Including parameters and gradients: $\mathbf{16P}$ bytes total.
Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta$ (problem-dependent, often $10^{-3}$).
At step $t=1$: $\bm{m}_1 = (1-\beta_1)\bm{g}_1$. Without correction, this is $(1-0.9)\bm{g}_1 = 0.1\bm{g}_1$ — the first moment estimate is biased toward zero by a factor of 10. The correction divides by $(1-0.9^1) = 0.1$, recovering $\hat{\bm{m}}_1 = \bm{g}_1$.
For $\bm{v}$: at $t=1$, $\bm{v}_1 = 0.001 \bm{g}_1^2$. Without correction, the effective learning rate is $\eta / \sqrt{0.001 \bm{g}_1^2} = \eta / (0.0316 |\bm{g}_1|)$ — about 30× too large. Correction divides by $1-0.999^1 = 0.001$, giving $\hat{\bm{v}}_1 = \bm{g}_1^2$ — the correct scale.
Bias correction costs $2P + 2$ FLOPs (negligible) but prevents catastrophic early updates.
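The full update as a NumPy sketch (names illustrative; `t` is 1-indexed so the bias corrections behave as described):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: ~14P FLOPs, 8P bytes of float32 state for m and v."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: 3P FLOPs
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment: 4P FLOPs
    m_hat = m / (1 - beta1 ** t)                 # bias correction: P + O(1)
    v_hat = v / (1 - beta2 ** t)                 # bias correction: P + O(1)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # update: ~4P FLOPs
    return theta, m, v
```

At $t=1$ the corrections divide by $0.1$ and $0.001$, exactly undoing the zero-initialization bias worked through above.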
[3] Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015.
The only difference from Adam is in the parameter update step. Instead of adding L2 regularization to the loss (which gets entangled with the adaptive learning rate), weight decay is applied directly:
$$\bm{\theta}_{t+1} = \bm{\theta}_t - \eta\left(\frac{\hat{\bm{m}}_t}{\sqrt{\hat{\bm{v}}_t} + \epsilon} + \lambda \bm{\theta}_t\right)$$

where $\lambda$ is the weight decay coefficient. The key insight is that in Adam, L2 regularization and weight decay are not equivalent (unlike in SGD), because Adam's adaptive scaling distorts the regularization gradient. Decoupling them restores proper weight decay behavior.
| Metric | Value |
|---|---|
| Extra FLOPs vs. Adam | $+2P$ (scale $\theta$ by $\lambda$, add to update) |
| Extra memory vs. Adam | 0 |
| Extra hyperparameters | $\lambda$ (weight decay, typically $10^{-2}$) |
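Relative to the Adam sketch above, only the final update line changes (a minimal sketch; `weight_decay` plays the role of $\lambda$):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """AdamW: identical moments and bias correction; decay applied to theta directly."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay (+2P FLOPs): lambda * theta bypasses the adaptive scaling.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```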
[4] Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
| Optimizer | Extra Memory | Per-Step FLOPs | Hyperparams | Key Property |
|---|---|---|---|---|
| SGD | 0 | $2P$ | $\eta$ | Simplest, lowest memory |
| SGD + Momentum | $4P$ bytes | $4P$ | $\eta, \mu$ | Accelerated convergence |
| Nesterov | $4P$ bytes | $5P$ | $\eta, \mu$ | Look-ahead correction |
| AdaGrad | $4P$ bytes | $7P$ | $\eta, \epsilon$ | Per-param LR; decays to 0 |
| RMSProp | $4P$ bytes | $8P$ | $\eta, \beta, \epsilon$ | Fixes AdaGrad decay |
| Adam | $\mathbf{8P}$ bytes | $14P$ | $\eta, \beta_1, \beta_2, \epsilon$ | Combined momentum + RMSProp |
| AdamW | $8P$ bytes | $16P$ | $\eta, \beta_1, \beta_2, \epsilon, \lambda$ | Proper weight decay for Adam |
Concrete example (ResNet-50, 25.6M parameters, float32):

| Optimizer | State Memory | Per-Step Optimizer FLOPs | % of Backprop FLOPs |
|---|---|---|---|
| SGD | 0 MB | 51.2M | 0.67% |
| Momentum | 97.7 MB | 102.4M | 1.35% |
| Adam | 195.3 MB | 358.4M | 4.72% |
ResNet-50 backward pass: ~7.6G FLOPs. The optimizer update adds at most 4.7% extra FLOPs — the optimizer is never the computational bottleneck. However, the extra memory (195 MB for Adam) can be a constraint for large models on limited GPU memory.
[3] Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015.
[4] Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
[5] Sutskever, I. et al. (2013). On the importance of initialization and momentum in deep learning. ICML 2013.
[6] Duchi, J. et al. (2011). Adaptive Subgradient Methods. JMLR, 12, 2121–2159.
[7] Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747.
[1] Goodfellow et al. (2016). Deep Learning, Ch. 8: Optimization for Training Deep Models.
[8] Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.
Regularization prevents overfitting by constraining the model's capacity. From a computational efficiency perspective, the key question is: what is the per-step cost of each regularization technique, and how does it affect the total number of training steps needed?
$$L_{\text{reg}} = L_{\text{data}} + \lambda \sum_{i=1}^{P} |w_i|$$

Gradient contribution: $\pd{L_{\text{reg}}}{w_i} = \pd{L_{\text{data}}}{w_i} + \lambda \cdot \text{sign}(w_i)$
| Metric | Value |
|---|---|
| Extra forward FLOPs | $P$ (absolute values) + $P$ (sum) = $2P$ |
| Extra gradient FLOPs | $P$ (sign) + $P$ (scale by $\lambda$) + $P$ (add to gradient) = $3P$ |
| Extra memory | 0 |
Key property: L1 promotes sparsity — it drives many weights exactly to zero. Sparse weights can theoretically be exploited for computational savings at inference (via pruning), though in practice this requires hardware/software support for sparse operations.
$$L_{\text{reg}} = L_{\text{data}} + \frac{\lambda}{2} \sum_{i=1}^{P} w_i^2$$

Gradient contribution: $\pd{L_{\text{reg}}}{w_i} = \pd{L_{\text{data}}}{w_i} + \lambda w_i$
| Metric | Value |
|---|---|
| Extra gradient FLOPs | $P$ (scale $w_i$ by $\lambda$) + $P$ (add to gradient) = $2P$ |
| Extra memory | 0 (uses existing parameter values) |
In SGD, L2 regularization is equivalent to weight decay: $w_i \leftarrow w_i(1 - \eta\lambda) - \eta \pd{L}{w_i}$. The factor $(1-\eta\lambda)$ simply shrinks each weight by a constant fraction per step. This costs $P$ extra multiplications — negligible.
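Both penalties amount to one extra term added to the gradient; a minimal sketch (illustrative names), with the SGD weight-decay equivalence restated in the final comment:

```python
import numpy as np

def add_l1_grad(grad, w, lam):
    """L1 contribution: lambda * sign(w); ~3P extra FLOPs (sign, scale, add)."""
    return grad + lam * np.sign(w)

def add_l2_grad(grad, w, lam):
    """L2 contribution: lambda * w; ~2P extra FLOPs (scale, add)."""
    return grad + lam * w

# Under SGD the L2 form is exactly weight decay:
#   w - lr * (g + lam * w)  ==  w * (1 - lr * lam) - lr * g
```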
During training, for each activation $a_j$ in a layer with dropout rate $p$ (probability of dropping):
$$\text{mask}_j \sim \text{Bernoulli}(1-p)$$

$$\tilde{a}_j = \frac{a_j \cdot \text{mask}_j}{1-p}$$

During inference: no dropout applied, no scaling needed (because of the $1/(1-p)$ factor during training — this is "inverted dropout").
For a layer with $n$ activations, batch size $B$:
| Operation | FLOPs (Training) | FLOPs (Inference) |
|---|---|---|
| Generate random mask | $Bn$ random number generations | 0 |
| Threshold to binary | $Bn$ comparisons | 0 |
| Multiply by mask | $Bn$ multiplications | 0 |
| Scale by $1/(1-p)$ | $Bn$ multiplications | 0 |
| Total | $\approx 4Bn$ FLOPs + RNG cost | 0 (free at inference) |
Extra memory (training): $Bn$ values for the binary mask — needed for the backward pass to apply the same mask to gradients.
Backward pass: $\pd{L}{\bm{a}} = \pd{L}{\tilde{\bm{a}}} \odot \frac{\text{mask}}{1-p}$ — same element-wise operation, $2Bn$ FLOPs.
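A compact sketch of inverted dropout (names illustrative); the $1/(1-p)$ scale is folded into the cached mask, so the backward pass is the same element-wise multiply:

```python
import numpy as np

def dropout_forward(a, p, training=True, rng=None):
    """Inverted dropout: scale by 1/(1-p) at train time so inference is a no-op."""
    if not training:
        return a, None                              # zero cost at inference
    rng = rng or np.random.default_rng()
    mask = (rng.random(a.shape) >= p) / (1 - p)     # Bernoulli(1-p) mask, pre-scaled
    return a * mask, mask                           # cache the mask for backprop

def dropout_backward(grad_out, mask):
    """Backward: apply the same mask-and-scale to the upstream gradient (2Bn FLOPs)."""
    return grad_out * mask
```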
Dropout adds $\approx 4Bn$ FLOPs per layer during training. For a hidden layer of size 4096 with $B=32$: $4 \times 32 \times 4096 = 524{,}288$ FLOPs. The matrix multiplication for that same layer (assuming 4096→4096): $2 \times 32 \times 4096 \times 4096 = 1.07$G FLOPs. Dropout overhead: 0.05% — completely negligible.
The random number generation is typically the bottleneck (cuRAND on GPU), not the arithmetic. But even this is small relative to matmul cost.
At inference, dropout costs exactly zero — this makes dropout "free" at deployment time.
[9] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15, 1929–1958.
Data augmentation (flipping, rotation, cropping, color jitter, mixup, cutout, etc.) is applied to input data on the CPU during data loading, typically in a separate process/thread that runs in parallel with GPU computation.
| Metric | Value |
|---|---|
| Extra GPU FLOPs | 0 (CPU preprocessing) |
| Extra GPU memory | 0 |
| Training throughput impact | Typically none, provided the data pipeline is not the bottleneck |
| Extra training data | Effectively infinite (stochastic transforms) |
Data augmentation is unique among regularization techniques: it costs virtually nothing computationally but provides significant regularization benefit. This makes it one of the most FLOP-efficient ways to improve generalization.
[9] Srivastava et al. (2014). Dropout. JMLR.
[1] Goodfellow et al. (2016). Deep Learning, Ch. 7: Regularization for Deep Learning.
[10] DeVries, T. & Taylor, G. W. (2017). Improved Regularization of Convolutional Neural Networks with Cutout. arXiv:1708.04552.
[11] Zhang, H. et al. (2018). mixup: Beyond Empirical Risk Minimization. ICLR 2018.
Normalization layers standardize intermediate activations, which stabilizes training, enables larger learning rates, and reduces the number of iterations needed to converge. They add both parameters and per-step FLOPs, but the reduction in training iterations typically more than compensates. Understanding their exact computational cost is critical for efficiency analysis.
For activations $\bm{x} \in \R^{B \times d}$ (or $\R^{B \times C \times H \times W}$ for CNNs, where statistics are computed per-channel across $B$, $H$, $W$):
Step 1 — Batch mean:
$$\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}$$

where $N = B$ for FC layers, $N = B \times H \times W$ for conv layers.
Step 2 — Batch variance:
$$\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N}(x_{ij} - \mu_j)^2$$

Step 3 — Normalize:

$$\hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$$

Step 4 — Scale and shift (learnable):

$$y_{ij} = \gamma_j \hat{x}_{ij} + \beta_j$$

Let $d$ = number of features (or channels for CNNs), $N$ = number of elements per feature contributing to the statistic ($B$ for FC, $BHW$ for conv).
| Step | Operation | FLOPs |
|---|---|---|
| 1. Mean | Sum $N$ values, divide | $Nd + d \approx Nd$ |
| 2. Variance | $(x-\mu)^2$: $Nd$ subs + $Nd$ squarings; sum; divide | $3Nd$ |
| 3. Normalize | $(x-\mu)$: already computed; $\sqrt{\sigma^2+\epsilon}$: $d$ sqrts; divide: $Nd$ | $Nd + d$ |
| 4. Scale+shift | $\gamma \hat{x} + \beta$: $Nd$ mults + $Nd$ adds | $2Nd$ |
| Total forward | | $\approx 7Nd + 2d$ |
| Running stats update | EMA for $\mu$, $\sigma^2$: $2d$ mults + $2d$ adds | $4d$ |
For a CNN conv layer: $N = B \times H \times W$, $d = C$ (channels).
Learnable parameters: $\gamma, \beta \in \R^C$ → $2C$ parameters
Running statistics (inference only): running mean + running variance → $2C$ values (not learnable, not parameters)
Memory for backprop: Must cache $\hat{\bm{x}}$ (normalized values), $\mu$, $\sigma^2$ → $Nd + 2d$ elements.
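The four forward steps as a NumPy sketch over an $(N, d)$ activation matrix (illustrative names; conv inputs would be reshaped from $(B, C, H, W)$ to $(BHW, C)$ first):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """BatchNorm forward, ~7Nd FLOPs; caches what the backward pass needs."""
    mu = x.mean(axis=0)                        # step 1: ~Nd FLOPs
    var = ((x - mu) ** 2).mean(axis=0)         # step 2: ~3Nd FLOPs
    x_hat = (x - mu) / np.sqrt(var + eps)      # step 3: ~Nd FLOPs
    y = gamma * x_hat + beta                   # step 4: ~2Nd FLOPs
    cache = (x_hat, mu, var, gamma, eps)       # Nd + 2d elements, plus params
    return y, cache
```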
Given upstream gradient $\pd{L}{y_{ij}}$, we need $\pd{L}{x_{ij}}$, $\pd{L}{\gamma_j}$, and $\pd{L}{\beta_j}$.
(a) Gradient w.r.t. $\gamma$:
$$\pd{L}{\gamma_j} = \sum_{i=1}^{N} \pd{L}{y_{ij}} \cdot \hat{x}_{ij}$$

FLOPs: $Nd$ multiplications + $Nd$ additions = $2Nd$
(b) Gradient w.r.t. $\beta$:
$$\pd{L}{\beta_j} = \sum_{i=1}^{N} \pd{L}{y_{ij}}$$

FLOPs: $Nd$ additions
(c) Gradient w.r.t. input $x$ (the complex part):
$$\pd{L}{\hat{x}_{ij}} = \pd{L}{y_{ij}} \cdot \gamma_j$$

$$\pd{L}{\sigma_j^2} = \sum_{i=1}^{N}\pd{L}{\hat{x}_{ij}} \cdot (x_{ij} - \mu_j) \cdot \left(-\frac{1}{2}\right)(\sigma_j^2+\epsilon)^{-3/2}$$

$$\pd{L}{\mu_j} = \sum_{i=1}^{N}\pd{L}{\hat{x}_{ij}} \cdot \frac{-1}{\sqrt{\sigma_j^2+\epsilon}} + \pd{L}{\sigma_j^2}\cdot\frac{-2}{N}\sum_{i=1}^{N}(x_{ij}-\mu_j)$$

$$\pd{L}{x_{ij}} = \pd{L}{\hat{x}_{ij}}\cdot\frac{1}{\sqrt{\sigma_j^2+\epsilon}} + \pd{L}{\sigma_j^2}\cdot\frac{2(x_{ij}-\mu_j)}{N} + \pd{L}{\mu_j}\cdot\frac{1}{N}$$

Total backward FLOPs: approximately $\mathbf{10Nd}$ plus lower-order terms.
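A sketch of the backward pass pairing with the forward sketch above (it recovers $x - \mu$ from the cached $\hat{x}$ rather than re-caching $x$):

```python
import numpy as np

def batchnorm_backward(dy, cache):
    """BatchNorm backward implementing the equations above: ~10Nd FLOPs."""
    x_hat, mu, var, gamma, eps = cache
    N = x_hat.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)
    x_mu = x_hat / inv_std                      # x - mu, recovered from the cache

    dgamma = np.sum(dy * x_hat, axis=0)         # 2Nd FLOPs
    dbeta = np.sum(dy, axis=0)                  # Nd FLOPs

    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * x_mu, axis=0) * (-0.5) * inv_std ** 3
    dmu = np.sum(dx_hat, axis=0) * (-inv_std) + dvar * (-2.0 / N) * np.sum(x_mu, axis=0)
    dx = dx_hat * inv_std + dvar * 2.0 * x_mu / N + dmu / N
    return dx, dgamma, dbeta
```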
Conv layer output: $B=32$, $C=256$, $H=W=56$.
$N = 32 \times 56 \times 56 = 100{,}352$. $d = 256$.
Forward: $7 \times 100{,}352 \times 256 \approx 180$M FLOPs
Backward: $\approx 10 \times 100{,}352 \times 256 \approx 257$M FLOPs
Compared to the conv layer itself: If the preceding 3×3 conv (128→256) has ~14.8G FLOPs, then BatchNorm forward adds 180M / 14.8G ≈ 1.2% overhead. BatchNorm is computationally cheap relative to convolution.
Parameters: $2 \times 256 = 512$ — negligible vs. the conv layer's ~295K params.
| | Training | Inference |
|---|---|---|
| Uses | Batch statistics ($\mu_B$, $\sigma_B^2$) | Running statistics ($\mu_{\text{run}}$, $\sigma_{\text{run}}^2$) |
| Compute per element | ~7 FLOPs (forward) | ~4 FLOPs (normalize + scale/shift) |
| Batch dependence | Yes — stats depend on entire batch | No — uses fixed running stats |
| Can fuse with conv | No (stats depend on data) | Yes — BN can be folded into preceding conv weights! |
BN Fusion at inference: Since inference uses fixed $\mu$, $\sigma$, $\gamma$, $\beta$, the entire BN operation is a linear transform: $y = \frac{\gamma}{\sigma}x + (\beta - \frac{\gamma\mu}{\sigma})$. This can be folded into the preceding conv/FC layer's weights and biases, making BN zero-cost at inference.
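A sketch of the fold, assuming a conv with weights of shape $(C_{\text{out}}, C_{\text{in}}, k_H, k_W)$ and a bias (names illustrative):

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold inference-time BN into the preceding conv: BN becomes zero-cost."""
    scale = gamma / np.sqrt(var + eps)          # per-output-channel scale, shape (C_out,)
    W_fused = W * scale[:, None, None, None]    # absorb scale into the conv weights
    b_fused = beta + (b - mu) * scale           # absorb shift into the conv bias
    return W_fused, b_fused
```

After folding, `conv(x, W_fused) + b_fused` reproduces `BN(conv(x, W) + b)` exactly, so the BN layer can be deleted from the inference graph.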
[12] Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
Instead of normalizing across the batch (per feature), LayerNorm normalizes across features (per sample):
$$\mu_i = \frac{1}{d}\sum_{j=1}^{d} x_{ij}, \qquad \sigma_i^2 = \frac{1}{d}\sum_{j=1}^{d}(x_{ij} - \mu_i)^2$$

$$y_{ij} = \gamma_j \cdot \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta_j$$

| Metric | Value |
|---|---|
| Normalization axis | Features (per sample) |
| Parameters | $2d$ ($\gamma, \beta$) |
| Forward FLOPs | $\approx 7Bd$ (same order as BatchNorm) |
| Batch dependence | None — each sample normalized independently |
| Can fuse at inference | No (statistics are data-dependent per sample) |
LayerNorm is the standard normalization for Transformers and RNNs, where batch statistics are unreliable (variable sequence lengths, small effective batches). Its per-element computational cost is identical to BatchNorm, but it cannot be fused at inference — it always costs ~4 FLOPs per element.
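For contrast with the BatchNorm sketch above, only the reduction axis changes (a minimal sketch, illustrative names):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """LayerNorm over the feature axis: one (mu, var) pair per sample, ~7Bd FLOPs."""
    mu = x.mean(axis=-1, keepdims=True)         # per-sample mean
    var = x.var(axis=-1, keepdims=True)         # per-sample variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```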
Divides the $C$ channels into $G$ groups and normalizes within each group independently. For each sample, statistics are computed over a group's $C/G$ channels and all spatial positions $H \times W$, i.e., over $(C/G) \times H \times W$ elements per statistic.
| $G$ | Method | Normalize Over |
|---|---|---|
| $G = 1$ | Layer Normalization | All $C \times H \times W$ elements |
| $1 < G < C$ | Group Normalization | $(C/G) \times H \times W$ elements |
| $G = C$ | Instance Normalization | $H \times W$ elements per channel |
Parameters: $2C$ (same as BatchNorm). FLOPs: Same order as BatchNorm. Batch dependence: None.
[14] Wu, Y. & He, K. (2018). Group Normalization. ECCV 2018.
| Method | Normalize Over | Params | FLOPs | Batch Dep. | Fuse@Inf | Best For |
|---|---|---|---|---|---|---|
| BatchNorm | $B \times (H \times W)$ per channel | $2C$ | $\sim 7BCHW$ | Yes | Yes | CNNs (large batch) |
| LayerNorm | $d$ features per sample | $2d$ | $\sim 7Bd$ | No | No | Transformers, RNNs |
| GroupNorm | $(C/G) \times H \times W$ per sample | $2C$ | $\sim 7BCHW$ | No | No | CNNs (small batch) |
| InstanceNorm | $H \times W$ per channel per sample | $2C$ | $\sim 7BCHW$ | No | No | Style transfer |
[12] Ioffe, S. & Szegedy, C. (2015). Batch Normalization. ICML 2015.
[13] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450.
[14] Wu, Y. & He, K. (2018). Group Normalization. ECCV 2018.
[15] Santurkar, S. et al. (2018). How Does Batch Normalization Help Optimization? NeurIPS 2018.
The learning rate schedule determines how the learning rate $\eta$ changes over the course of training. While the per-step cost of computing a new learning rate is negligible ($O(1)$ FLOPs), the choice of schedule profoundly affects convergence speed and therefore total training FLOPs. A good schedule can halve the number of training epochs needed.
$$\eta_e = \eta_0 \cdot \gamma^{\lfloor e/S \rfloor}$$

where $\gamma \in (0,1)$ is the decay factor (typically 0.1), $e$ is the current epoch, and $S$ is the step interval (in epochs). Commonly: reduce LR by 10× every 30 epochs.
Computational cost: 1 floor division + 1 power + 1 multiplication = $O(1)$ per step.
$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$

where $T$ is the total number of training steps.
Computational cost: 1 cosine + 3 multiplications + 2 additions = $O(1)$.
Cosine annealing smoothly decreases the learning rate from $\eta_{\max}$ to $\eta_{\min}$, spending more time at moderate learning rates than step decay. Empirically, it often gives slightly better final accuracy.
[16] Loshchilov, I. & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017.
Linear warmup ramps the learning rate from near zero to its target over the first $T_{\text{warmup}}$ steps: $\eta_t = \eta_{\text{target}} \cdot t / T_{\text{warmup}}$ for $t \le T_{\text{warmup}}$. After warmup: transition to the main schedule (cosine, step decay, etc.).
Warmup prevents large, destabilizing updates in the early stages of training when: (a) the network is far from any reasonable solution, (b) batch statistics (for BatchNorm) are unreliable, or (c) the Adam second moment estimates are unreliable (bias correction partially addresses this, but warmup helps further).
[17] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.
Warmup adds a small number of low-learning-rate steps at the start (typically 5–10 epochs out of 90+). These steps produce small weight updates and may seem "wasted." However, without warmup, large initial LR can cause divergence, requiring the entire training run to be restarted. Warmup is therefore a form of insurance: $\approx$5% extra steps to avoid $\gg$100% wasted steps from divergence.
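A sketch of the warmup-then-cosine combination described above (illustrative names; $O(1)$ work per call):

```python
import math

def lr_at_step(t, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup for the first warmup_steps, then cosine annealing to lr_min."""
    if t < warmup_steps:
        return lr_max * (t + 1) / warmup_steps            # linear ramp from ~0
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```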
The learning rate follows a cycle:
Phase 1 (warmup, ~30% of training): LR increases linearly from $\eta_{\text{low}}$ to $\eta_{\text{max}}$
Phase 2 (annealing, ~70% of training): LR decreases via cosine from $\eta_{\text{max}}$ to $\eta_{\text{low}}/10$
The key finding: using a much larger $\eta_{\text{max}}$ than typical (found via LR range test) enables "super-convergence" — training to the same accuracy in 5–10× fewer steps than standard schedules.
Computational implication: If one-cycle reaches target accuracy in 30 epochs instead of 90, the total training FLOPs are reduced by 3×. The per-step cost is identical; only the number of steps differs.
[18] Smith, L. N. & Topin, N. (2019). Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. AISTATS 2019.
[16] Loshchilov, I. & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017.
[17] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.
[18] Smith, L. N. & Topin, N. (2019). Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. AISTATS 2019.
| Component | FLOPs per Step | Memory (bytes) | % of Training Cost |
|---|---|---|---|
| Forward pass | $F$ | Activations: $O(BL\bar{d})$ | ~33% |
| Backward pass | $\approx 2F$ | Gradients: $4P$ | ~66% |
| Loss computation | $O(BC)$ | $O(BC)$ | <0.01% |
| Optimizer (SGD) | $2P$ | 0 | <1% |
| Optimizer (Adam) | $14P$ | $8P$ | ~2–5% |
| L2 regularization | $2P$ | 0 | <0.5% |
| Dropout (per layer) | $\approx 4Bn$ | $4Bn$ (mask) | <0.1% |
| BatchNorm (per layer, fwd) | $\approx 7Nd$ | $4(Nd + 2d)$ | ~1–2% |
| BatchNorm (per layer, bwd) | $\approx 10Nd$ | — | ~2–3% |
| LR schedule | $O(1)$ | 0 | $\approx 0$ |
The forward and backward passes through the network layers (matrix multiplications and convolutions) account for ~95–98% of total training FLOPs. All other components — optimizer updates, normalization, regularization, loss computation — are collectively less than 5%. Therefore:
(1) Reducing model FLOPs (efficient architectures, pruning, quantization) is the primary lever for training efficiency.
(2) Reducing the number of training steps (better optimizers, LR schedules, initialization) is the secondary lever — it multiplies the total by a smaller constant.
(3) Optimizer memory (not FLOPs) is often the binding constraint — Adam's $8P$ extra bytes can be the difference between fitting a model on a GPU or not.
$$\text{Memory} \approx \underbrace{4P}_{\text{params}} + \underbrace{4P}_{\text{grads}} + \underbrace{4 M_{\text{opt}} P}_{\text{optimizer state}} + \underbrace{4 \sum_{l} \text{act}_l(B)}_{\text{activations}} \ \text{bytes (float32)}$$

where $M_{\text{opt}} = 0$ (SGD), $1$ (Momentum), $2$ (Adam), and $\text{act}_l(B)$ is the activation tensor size at layer $l$ for batch size $B$.
For ResNet-50 (25.6M params), Adam, B=32, 224×224 input:
Params: 97.7 MB | Grads: 97.7 MB | Adam state: 195.3 MB | Activations: ~2.5 GB | Total: ~2.9 GB
Activations dominate for this configuration — a common pattern in modern CNNs.
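A small calculator for the memory formula above (the activation-element count is back-solved from the ~2.5 GB figure, so it is an assumption, not a measurement):

```python
def training_memory_bytes(P, M_opt, activation_elems, dtype_bytes=4):
    """Params + grads + optimizer state + activations, float32 by default."""
    return dtype_bytes * (P * (2 + M_opt) + activation_elems)

# ResNet-50 with Adam (M_opt = 2), B=32, ~0.67e9 activation elements:
total = training_memory_bytes(P=25.6e6, M_opt=2, activation_elems=0.67e9)
print(f"{total / 2**30:.1f} GiB")   # ~2.9 GB, matching the breakdown above
```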
[1] Goodfellow et al. (2016). Deep Learning, Ch. 6.2.2 (Cost Functions), Ch. 8 (Optimization).
[2] Lin, T. et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017.
[3] Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015.
[4] Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
[5] Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. ICML 2013.
[6] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR, 12, 2121–2159.
[7] Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747.
[8] Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Comp. Math. and Math. Physics, 4(5), 1–17.
[9] Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15, 1929–1958.
[10] DeVries, T. & Taylor, G. W. (2017). Improved Regularization with Cutout. arXiv:1708.04552.
[11] Zhang, H. et al. (2018). mixup: Beyond Empirical Risk Minimization. ICLR 2018.
[12] Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training. ICML 2015.
[13] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450.
[14] Wu, Y. & He, K. (2018). Group Normalization. ECCV 2018.
[15] Santurkar, S. et al. (2018). How Does Batch Normalization Help Optimization? NeurIPS 2018.
[16] Loshchilov, I. & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017.
[17] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.
[18] Smith, L. N. & Topin, N. (2019). Super-Convergence. AISTATS 2019.
[19] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12), 2295–2329.