Part V of VI

Computational Efficiency Analysis

FLOP counting methodology, memory budgets, hardware-aware computation, the roofline model, inference optimization, mixed precision, gradient checkpointing, and scaling laws — the complete toolkit for analyzing and optimizing neural network efficiency

Contents

§17 FLOP Counting Methodology

17.1 FLOPs vs. MACs vs. FLOP/s — Disambiguation

Key Definitions

FLOP (Floating-Point Operation): A single arithmetic operation on floating-point numbers — one addition, one multiplication, one comparison, etc.

MAC (Multiply-Accumulate): One multiplication followed by one addition: $a \leftarrow a + b \times c$. This counts as 2 FLOPs.

FLOPs (plural): The total count of floating-point operations in a computation. (Confusingly, some papers use "FLOPs" to mean MACs.)

FLOP/s (FLOPs per second): A measure of hardware throughput, not a count. Measured in GFLOP/s (giga) or TFLOP/s (tera).

Critical Disambiguation

There is widespread inconsistency in the literature. When a paper reports "X GFLOPs" for a model, it may mean either $X$ billion FLOPs or $X$ billion MACs (= $2X$ billion FLOPs). Always check the definition used. In this document, we consistently use FLOPs = total floating-point operations, where one MAC = 2 FLOPs. This is the convention used by Sze et al. (2017) and most hardware specifications.

For matrix multiplication of $(m \times n) \times (n \times p)$:

Our convention: $2mnp$ FLOPs (each of the $mp$ outputs requires $n$ MACs = $2n$ FLOPs)

Some papers: $mnp$ "multiply-adds" or "MACs"

17.2 Counting FLOPs for Complete Networks

Master FLOP Formulas (Forward Pass, Per Sample)
Layer Type FLOPs Dominant Term
Fully Connected ($n_{\text{in}} \to n_{\text{out}}$) $2 n_{\text{in}} n_{\text{out}}$ MatMul
Conv2D ($C_i \to C_o$, $k \times k$) $2 C_o H_o W_o C_i k^2$ MatMul (via im2col)
Depthwise Conv ($C$, $k \times k$) $2 C H_o W_o k^2$ Per-channel conv
1×1 Conv ($C_i \to C_o$) $2 C_o H W C_i$ MatMul
DW Separable ($C_i \to C_o$, $k \times k$) $2 H_o W_o C_i(k^2 + C_o)$ Pointwise dominates
BatchNorm ($C$ channels, spatial $H \times W$) $\approx 7 C H W$ Element-wise
ReLU ($n$ elements) $n$ Comparison
Softmax ($C$ classes) $\approx 5C$ Exp + division
Residual Addition ($n$ elements) $n$ Element-wise add
Max Pool ($k \times k$) $(k^2-1) C H_o W_o$ Comparisons
Global Average Pool ($C$ channels) $CHW$ Additions
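The per-layer formulas above translate directly into a few lines of code. Below is a minimal Python sketch (function and variable names are illustrative, not from any particular library) that counts forward-pass FLOPs under the MAC = 2 FLOPs convention used in this document.

```python
# Minimal FLOP-counting helpers following the table above (1 MAC = 2 FLOPs).
# Function and variable names are illustrative; shapes are per sample (no batch dim).

def fc_flops(n_in, n_out):
    """Fully connected layer: 2 * n_in * n_out."""
    return 2 * n_in * n_out

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """Standard Conv2D: 2 * C_o * H_o * W_o * C_i * k^2."""
    return 2 * c_out * h_out * w_out * c_in * k * k

def depthwise_separable_flops(c_in, c_out, k, h_out, w_out):
    """Depthwise (2 * H_o * W_o * C_i * k^2) + pointwise 1x1 (2 * H_o * W_o * C_i * C_o)."""
    return 2 * h_out * w_out * c_in * (k * k + c_out)

def batchnorm_flops(c, h, w):
    """Element-wise: roughly 7 FLOPs per element."""
    return 7 * c * h * w

if __name__ == "__main__":
    # Example: one 3x3 conv + BN block, 256 -> 256 channels at 56x56 output resolution.
    f = conv2d_flops(256, 256, 3, 56, 56) + batchnorm_flops(256, 56, 56)
    print(f"Block forward FLOPs: {f / 1e9:.2f} GFLOPs")  # the conv term dominates
```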
Where Do the FLOPs Go?

For typical CNNs, the distribution of FLOPs across operation types is approximately:

Operation Type % of Forward FLOPs Examples
Matrix Multiplication (Conv + FC) 90–99% Conv layers, FC layers
Normalization (BN/LN) 1–5% BatchNorm, LayerNorm
Activation Functions 0.1–1% ReLU, GELU
Pooling 0.1–0.5% MaxPool, GAP
Residual Connections 0.01–0.1% Skip addition
Softmax + Loss <0.01% Output layer

This extreme concentration means that optimizing matrix multiplication is virtually synonymous with optimizing neural network computation.

17.3 The 3× Rule: Forward, Backward, and Training

The Fundamental Training Cost Identity

For any neural network layer performing a linear operation $\bm{Y} = \bm{X}\bm{W}^\top$:

Forward: 1 matmul → $2mn_{\text{in}}n_{\text{out}}$ FLOPs

Backward (weight gradient): $\frac{\partial L}{\partial \bm{W}} = \bm{\delta}^\top\bm{X}$ → 1 matmul → same order FLOPs

Backward (input gradient): $\frac{\partial L}{\partial \bm{X}} = \bm{\delta}\bm{W}$ → 1 matmul → same order FLOPs

$$\boxed{\text{FLOPs}_{\text{backward}} \approx 2 \times \text{FLOPs}_{\text{forward}}}$$ $$\boxed{\text{FLOPs}_{\text{training step}} \approx 3 \times \text{FLOPs}_{\text{forward}}}$$

This holds for every linear layer (FC, Conv, Attention) and therefore for the entire network.

17.4 Total Training FLOPs Formula

Total Training Computation
$$\text{FLOPs}_{\text{total training}} = 3 \times F_{\text{forward}} \times N_{\text{samples}} \times E$$

where $F_{\text{forward}}$ = forward FLOPs per sample, $N_{\text{samples}}$ = dataset size, $E$ = number of epochs.

Equivalently, using total tokens/samples seen:

$$\text{FLOPs}_{\text{total}} = 3 \times F_{\text{forward}} \times D$$

where $D = N_{\text{samples}} \times E$ = total samples processed during training.

Worked Example — ResNet-50 on ImageNet

Forward FLOPs per image: 3.8G (≈ 1.9G MACs)

Dataset: 1.28M images

Epochs: 90 (standard)

$$\text{FLOPs}_{\text{total}} = 3 \times 3.8 \times 10^9 \times 1.28 \times 10^6 \times 90$$ $$= 3 \times 3.8 \times 1.28 \times 90 \times 10^{15} = 1.32 \times 10^{18} \text{ FLOPs}$$ $$\approx 1.32 \text{ EFLOPs (exaFLOPs)}$$

Training time estimate: On NVIDIA V100 (125 TFLOP/s FP16, ~50% utilization):

$$t = \frac{1.32 \times 10^{18}}{125 \times 10^{12} \times 0.5} \approx 21{,}100 \text{ seconds} \approx 5.9 \text{ hours}$$

(In practice, ~6–8 hours is typical for FP32 data-parallel training on 8× V100 — consistent with this arithmetic, since the aggregate 8 × 15.7 ≈ 125 TFLOP/s matches the single-GPU FP16 figure used above.)
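The ResNet-50 arithmetic above is easy to script. A small sketch follows, using the values stated in the example; the peak throughput and utilization figures are assumptions you would substitute for your own hardware.

```python
# Total training compute and a rough wall-clock estimate (ResNet-50 / ImageNet numbers
# from the worked example above; peak throughput and utilization are assumptions).

def training_flops(forward_flops_per_sample, n_samples, epochs):
    """FLOPs_total = 3 * F_forward * N_samples * E (backward costs ~2x forward)."""
    return 3 * forward_flops_per_sample * n_samples * epochs

def training_time_seconds(total_flops, peak_flops_per_s, utilization):
    """T = C / (hardware FLOP/s * utilization)."""
    return total_flops / (peak_flops_per_s * utilization)

if __name__ == "__main__":
    C = training_flops(3.8e9, 1.28e6, 90)        # ~1.3e18 FLOPs
    t = training_time_seconds(C, 125e12, 0.5)    # V100 FP16 peak at 50% utilization
    print(f"Total training compute: {C:.2e} FLOPs")
    print(f"Estimated training time: {t / 3600:.1f} hours")
```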

References for §17

[1] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks. Proc. IEEE, 105(12), 2295–2329.

[2] Canziani, A., Paszke, A., & Culurciello, E. (2017). An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678.


§18 Memory Analysis

Memory is frequently the binding constraint in neural network training — not FLOPs. A model that fits on a GPU can be trained; one that doesn't fit cannot, regardless of how many FLOPs the GPU can execute. Understanding memory composition in detail is therefore critical for practical efficiency research.

18.1 The Five Components of Training Memory

Training Memory Decomposition
$$\text{Memory}_{\text{training}} = M_{\text{params}} + M_{\text{gradients}} + M_{\text{optimizer}} + M_{\text{activations}} + M_{\text{temp}}$$
Component Size (float32) Depends On Scales With
Parameters $\bm{\theta}$ $4P$ bytes Architecture only Model size
Gradients $\nabla\bm{\theta}$ $4P$ bytes Architecture only Model size
Optimizer state $4M_{\text{opt}}P$ bytes Architecture + optimizer choice Model size
Activations (cached for backprop) $4 \times \sum_l |A_l|$ bytes Architecture + batch size Model size × batch
Temporary buffers Variable Implementation Largest single operation

$M_{\text{opt}}$: 0 (SGD), 1 (Momentum), 2 (Adam)

Training Memory Breakdown — ResNet-50 (25.6M params, Adam, B=32): Parameters 97.7 MB (3.3%), Gradients 97.7 MB (3.3%), Adam state ($m$ + $v$) 195.3 MB (6.7%), Cached activations ~2.5 GB (85.5% — every intermediate tensor from every layer × batch size), Temporary buffers ~50 MB (1.7%). Total: ~2.9 GB.
Figure 18.1. Training memory breakdown for ResNet-50 with Adam optimizer and batch size 32. Cached activations dominate at 85.5% of total memory — this is the primary target for memory optimization techniques like gradient checkpointing and mixed precision.
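A back-of-the-envelope memory budget can be scripted directly from this decomposition. The sketch below reproduces the ResNet-50 figure; the cached-activation count (~20M elements per sample, i.e. ~640M at B=32 as quoted in §18.2) and the temp-buffer allowance are treated as given inputs rather than derived quantities.

```python
# Training-memory budget sketch (bytes), following the decomposition above.
# Inputs are assumptions you supply: parameter count, optimizer multiple M_opt,
# cached activation elements per sample, batch size, and a temp-buffer allowance.

def training_memory(n_params, m_opt, act_elems_per_sample, batch_size,
                    bytes_per_value=4, temp_bytes=50e6):
    params      = bytes_per_value * n_params
    grads       = bytes_per_value * n_params
    optimizer   = bytes_per_value * m_opt * n_params          # M_opt: 0 SGD, 1 momentum, 2 Adam
    activations = bytes_per_value * act_elems_per_sample * batch_size
    total = params + grads + optimizer + activations + temp_bytes
    return {"params": params, "grads": grads, "optimizer": optimizer,
            "activations": activations, "temp": temp_bytes, "total": total}

if __name__ == "__main__":
    # ResNet-50: 25.6M params, Adam, ~20M cached activation elements per sample
    # (~640M at B=32, as quoted in Section 18.2), batch size 32.
    for name, size in training_memory(25.6e6, 2, 20e6, 32).items():
        print(f"{name:>12}: {size / 1e9:6.2f} GB")
```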

18.2 Activation Memory — The Dominant Cost

Activations must be stored during the forward pass and kept in memory until the corresponding backward pass layer processes them. For a network with $L$ layers, the activation at layer $l$ is freed only after layer $l$'s backward pass completes. The peak activation memory is therefore approximately the sum of all layer activations.

Activation Memory by Layer Type
Layer What Must Be Cached Size (elements per sample)
Conv ($C_o$, $H_o \times W_o$) Input tensor (for weight grad) $C_i \times H_i \times W_i$
ReLU Binary mask (1 bit per element) $C \times H \times W / 8$ bytes (packed)
BatchNorm Normalized values, $\mu$, $\sigma^2$ $C \times H \times W + 2C$
FC ($n_{\text{out}}$) Input vector $n_{\text{in}}$
Dropout Binary mask $n / 8$ bytes (packed)
Max Pool Argmax indices $C \times H_o \times W_o$ (int)
Residual Add Both inputs (skip + main branch) $2 \times C \times H \times W$

For a deep network, these accumulate: ResNet-50 with B=32 stores ~640M float32 values ≈ 2.5 GB for activations alone.

Activation Memory Scaling
$$M_{\text{act}} \propto B \times \sum_{l=1}^{L} n_l$$

where $n_l$ is the activation tensor size at layer $l$. Key observations:

Scales linearly with batch size $B$: halving $B$ halves activation memory.

Scales linearly with depth $L$: for fixed-width networks, doubling depth doubles activation memory.

Scales quadratically with resolution: doubling input resolution (e.g., 224→448) quadruples $H \times W$ at every layer → 4× activation memory. This is why high-resolution training is so memory-hungry.

18.3 Gradient Checkpointing (Rematerialization)

Gradient checkpointing is the most important memory optimization technique for training. It trades computation for memory by discarding some activations during the forward pass and recomputing them during the backward pass.

Gradient Checkpointing — Theory

For a network with $L$ layers, divide it into $K$ segments of $L/K$ layers each. During the forward pass, store only the activations at segment boundaries (checkpoints). During the backward pass, when processing segment $k$, recompute that segment's forward pass from the nearest checkpoint.

Optimal checkpoint strategy ($\sqrt{L}$ checkpoints):

$$K = \sqrt{L}$$ $$\text{Memory}_{\text{act}} = O(\sqrt{L}) \text{ instead of } O(L)$$ $$\text{Extra FLOPs} = +33\% \text{ (one extra forward pass per segment, amortized)}$$

More precisely, the extra computation equals one additional forward pass through the entire network, meaning:

$$\text{FLOPs}_{\text{training}} = \underbrace{F}_{\text{fwd}} + \underbrace{F}_{\text{recompute}} + \underbrace{2F}_{\text{bwd}} = 4F \text{ instead of } 3F$$ $$\text{Overhead} = \frac{4F - 3F}{3F} = 33\%$$
[Figure 18.2 diagram: standard training stores all $L$ activation blocks — memory $O(L)$; checkpointed training stores only $\sqrt{L}$ checkpoint activations and recomputes within each segment — memory $O(\sqrt{L})$. Trade-off: $O(L) \to O(\sqrt{L})$ memory for +33% compute; for $L = 100$, roughly 10× less activation memory.]
Figure 18.2. Gradient checkpointing. Standard training stores all $L$ layer activations. Checkpointed training stores only $\sqrt{L}$ checkpoint activations and recomputes the rest during the backward pass. For a 100-layer network, this reduces activation memory by 10× at a cost of 33% more FLOPs — an extremely favorable trade-off when memory is the bottleneck.
Gradient Checkpointing — ResNet-50 Numbers
Standard Checkpointed (4 segments)
Activation memory ~2.5 GB ~0.6 GB
Training FLOPs per step $3F$ (standard) $4F$ (+33%)
Max batch size (16 GB GPU) B ≈ 160 B ≈ 640
Effective throughput Baseline +300% batch size, -25% per-step speed

The larger batch size enabled by checkpointing can actually increase throughput despite the extra computation, because larger batches utilize GPU hardware more efficiently.
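In PyTorch, segment-wise checkpointing of a sequential stack is available via `torch.utils.checkpoint.checkpoint_sequential`. The sketch below uses a toy model with an illustrative segment count; the `use_reentrant=False` flag follows recent PyTorch versions. Nothing else about the training loop changes.

```python
# Gradient checkpointing sketch in PyTorch: keep only segment-boundary activations
# and recompute inside each segment during the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy deep stack standing in for a real backbone.
blocks = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(64)]
model = nn.Sequential(*blocks)
x = torch.randn(32, 1024, requires_grad=True)

# Plain forward: caches all 64 intermediate activations for the backward pass.
y_full = model(x)

# Checkpointed forward: caches only 8 (~sqrt(L)) boundary activations and
# recomputes within each segment during backward (+~33% FLOPs).
y_ckpt = checkpoint_sequential(model, 8, x, use_reentrant=False)

y_ckpt.sum().backward()  # gradients match the uncheckpointed version
```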

Reference

[3] Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.

18.4 Mixed Precision Training

Mixed Precision Training

Store activations and compute forward/backward passes in FP16 (16-bit floating point, 2 bytes per value) while maintaining a FP32 master copy of weights for the optimizer update. This approach:

(1) Halves activation memory (2 bytes instead of 4 per element)

(2) Increases arithmetic throughput — at least 2× FP16 vs. FP32 on standard units, and roughly 8–16× when tensor cores are used (see §19.3–19.4)

(3) Halves memory bandwidth usage (move half as many bytes)

Mixed Precision — Memory Savings
Component FP32 Mixed Precision Savings
Parameters (master copy) $4P$ $4P$ (keep FP32) 0%
Parameters (FP16 copy for fwd/bwd) — $2P$ (extra) — (added cost)
Gradients $4P$ $2P$ (FP16) 50%
Adam $m$ $4P$ $4P$ (FP32) 0%
Adam $v$ $4P$ $4P$ (FP32) 0%
Activations $4 \sum|A_l|$ $2 \sum|A_l|$ (FP16) 50%

The main saving is in activations (which dominate). For ResNet-50: activation memory drops from ~2.5 GB to ~1.25 GB.

Total parameter memory (Adam, mixed precision): $4P + 2P + 2P + 4P + 4P = 16P$ bytes — same as FP32 Adam (the FP16 weight copy offsets the gradient savings).

Loss Scaling — Essential for FP16 Stability

FP16 has a limited dynamic range: values below the smallest normal number ($\sim 6 \times 10^{-5}$) lose precision as subnormals, and values below $\sim 6 \times 10^{-8}$ underflow to zero. Many gradient values in deep networks fall into this region. Loss scaling multiplies the loss by a large constant $S$ (e.g., $S = 1024$) before backpropagation, which scales all gradients by $S$, keeping them within FP16 range. Before the optimizer update, gradients are divided by $S$ and converted to FP32.

Dynamic loss scaling: starts with a large $S$, halves it whenever a NaN/Inf gradient is detected (skipping that update), and periodically increases it again after a run of overflow-free steps. Cost: $O(P)$ per step for the check — negligible.

Micikevicius, P. et al. (2018), "Mixed Precision Training", ICLR 2018
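PyTorch's automatic mixed precision (AMP) implements exactly this recipe: FP16 compute under `autocast`, FP32 master weights, and dynamic loss scaling via `GradScaler`. A minimal training-step sketch, assuming a CUDA device; the model and data here are placeholders.

```python
# Mixed precision training sketch with dynamic loss scaling (PyTorch AMP).
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 10).cuda()          # FP32 master weights live in the model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()              # dynamic loss scaling (adjusts S automatically)

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():               # forward (and backward) math in FP16 where safe
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                 # multiply loss by S before backprop
    scaler.step(optimizer)                        # unscale grads; skip the step on inf/NaN
    scaler.update()                               # shrink S after overflow, grow it otherwise
    return loss.detach()
```

On A100-class hardware, `torch.cuda.amp.autocast(dtype=torch.bfloat16)` uses BF16 instead, which removes the need for loss scaling (see §19.4).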

18.5 Gradient Accumulation

Gradient Accumulation

Run $K$ forward-backward passes with micro-batch size $B_{\text{micro}}$, accumulating gradients, then perform one optimizer update. The effective batch size is:

$$B_{\text{effective}} = K \times B_{\text{micro}}$$

Memory: Only $B_{\text{micro}}$ samples in memory at once → activation memory scales with $B_{\text{micro}}$, not $B_{\text{effective}}$.

FLOPs: Identical to training with $B_{\text{effective}}$ directly (same number of forward/backward operations).

Speed: Slightly slower than true large-batch training due to sequential processing and reduced GPU parallelism, but enables arbitrary effective batch sizes on limited memory.

Gradient Accumulation — When 32 GB Isn't Enough

Suppose you need effective batch size 256 for convergence, but your GPU only fits batch size 16:

$K = 256 / 16 = 16$ accumulation steps per optimizer update.

Memory used: Same as B=16 (not B=256).

Training FLOPs: Same as B=256 (same total samples processed).

Wall-clock time: ~16× slower than 16 GPUs doing B=16 each in parallel (data parallelism), but free in terms of hardware cost.
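The accumulation loop itself is a few lines. Below is a sketch for this B=256-from-B=16 scenario; `model`, `loader`, and `loss_fn` are placeholders, and the loss is divided by $K$ so the accumulated gradient matches a true $B_{\text{effective}}$ batch.

```python
# Gradient accumulation sketch: effective batch 256 built from micro-batches of 16.
import torch

K = 16  # accumulation steps (256 / 16)

def train_epoch(model, loader, optimizer, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    for step, (x, y) in enumerate(loader):        # loader yields micro-batches of 16 samples
        loss = loss_fn(model(x), y) / K           # divide so the accumulated gradient is an average
        loss.backward()                           # gradients accumulate in the .grad buffers
        if (step + 1) % K == 0:                   # one optimizer update per K micro-batches
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```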

References for §18

[3] Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.

[4] Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018.

[1] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks.


§19 Hardware-Aware Computation

FLOPs alone do not determine execution speed. Two operations with the same FLOP count can differ by 100× in actual runtime depending on how well they map to hardware. This section bridges the gap between theoretical FLOPs and real-world performance.

19.1 The Roofline Model

Arithmetic Intensity

The arithmetic intensity (AI) of an operation is the ratio of computation to data movement:

$$\text{AI} = \frac{\text{FLOPs}}{\text{Bytes accessed from/to memory}}$$

Units: FLOPs/byte. This is the fundamental quantity determining whether an operation is compute-bound or memory-bound.

The Roofline Model

Achievable performance is bounded by:

$$\text{Performance (FLOP/s)} \leq \min\left(\text{Peak compute (FLOP/s)},\;\; \text{AI} \times \text{Memory bandwidth (bytes/s)}\right)$$

The "roofline" is the plot of achievable FLOP/s vs. arithmetic intensity. It rises linearly with AI (memory-bound regime) until it hits the peak compute (compute-bound regime). The crossover point is:

$$\text{AI}_{\text{ridge}} = \frac{\text{Peak FLOP/s}}{\text{Memory bandwidth}}$$

Operations with $\text{AI} < \text{AI}_{\text{ridge}}$ are memory-bound; operations with $\text{AI} > \text{AI}_{\text{ridge}}$ are compute-bound.

[Figure 19.1 plot: achievable FLOP/s vs. arithmetic intensity (FLOPs/byte) on log-log axes. Performance rises linearly with AI in the memory-bound region, then flattens at peak FLOP/s beyond the ridge point $\text{AI}_{\text{ridge}}$. Element-wise ops (ReLU, BN, bias add) sit far left in the memory-bound region; small convolutions fall near the ridge; large matmuls and 3×3 convolutions lie on the compute-bound plateau.]
Figure 19.1. The Roofline Model. Matrix multiplications and large convolutions have high arithmetic intensity and are compute-bound — they fully utilize the GPU's peak FLOP/s. Element-wise operations (ReLU, BatchNorm, bias addition) have low arithmetic intensity and are memory-bound — they are limited by how fast data can be moved, not how fast it can be processed.

19.2 Compute-Bound vs. Memory-Bound Operations

Arithmetic Intensity of Key Operations
Operation FLOPs Bytes Moved AI (FLOP/byte) Bound
MatMul $(M \times K) \times (K \times N)$, large $2MKN$ $4(MK + KN + MN)$ $\approx \frac{M}{6}$ (for $M=K=N$) Compute
Conv 3×3 (256→256, 56×56) $\sim$1.5G $\sim$15M bytes ~100 Compute
ReLU (1M elements) 1M $4 \times 2 \times 1$M = 8M bytes 0.125 Memory
BatchNorm (1M elements) $\sim$7M $\sim$20M bytes 0.35 Memory
Bias addition (1M elements) 1M 8M bytes 0.125 Memory
Softmax (1000 classes) ~5K ~12K bytes 0.42 Memory

For an NVIDIA A100 GPU: $\text{AI}_{\text{ridge}} = \frac{312 \text{ TFLOP/s}}{2039 \text{ GB/s}} \approx 153$ FLOP/byte. Only large matrix multiplications and convolutions exceed this threshold.

Why FLOPs ≠ Speed

Consider two hypothetical operations, each with 1G FLOPs:

Operation A (one large matmul): AI ≈ 200 FLOP/byte → compute-bound at 312 TFLOP/s → time = 1G / 312T ≈ 3.2 μs

Operation B (1000 element-wise ops on 1M elements each): AI ≈ 0.25 FLOP/byte → memory-bound at 2039 GB/s → time = 4G bytes / 2039 GB/s ≈ 2 ms

Same FLOPs, but operation B is 625× slower because it's bandwidth-limited. This is why operator fusion (combining multiple element-wise ops into one kernel to avoid round-trips to memory) is critical for inference optimization.
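The roofline bound is simple enough to script. The sketch below uses the A100-class peak numbers quoted above and reproduces the two hypothetical operations and their several-hundred-fold runtime gap.

```python
# Roofline-model time estimates: achievable FLOP/s = min(peak compute, AI * bandwidth).

PEAK_FLOPS = 312e12   # A100 FP16 tensor-core peak (FLOP/s)
BANDWIDTH  = 2039e9   # A100 HBM bandwidth (bytes/s)

def roofline_time(flops, bytes_moved):
    """Lower-bound runtime: limited by compute or by memory traffic, whichever is slower."""
    ai = flops / bytes_moved                         # arithmetic intensity (FLOP/byte)
    achievable = min(PEAK_FLOPS, ai * BANDWIDTH)     # roofline bound on FLOP/s
    return flops / achievable

if __name__ == "__main__":
    ridge = PEAK_FLOPS / BANDWIDTH                   # ~153 FLOP/byte
    t_a = roofline_time(1e9, 5e6)                    # op A: AI = 200, compute-bound
    t_b = roofline_time(1e9, 4e9)                    # op B: AI = 0.25, memory-bound
    print(f"ridge point: {ridge:.0f} FLOP/byte")
    print(f"op A: {t_a * 1e6:.1f} us   op B: {t_b * 1e3:.2f} ms   ratio: {t_b / t_a:.0f}x")
```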

19.3 GPU Hardware and Tensor Cores

GPU Architecture — Key Numbers
GPU FP32 TFLOP/s FP16 TFLOP/s INT8 TOPS HBM (GB) BW (GB/s)
V100 (2017) 15.7 125 — 32 900
A100 (2020) 19.5 312 624 40/80 2,039
H100 (2022) 51 990 1,979 80 3,350

Tensor cores are specialized hardware units that compute small matrix multiplications (e.g., 4×4×4) in a single cycle. They provide the FP16/BF16 throughput numbers above. Key: tensor cores only accelerate matrix multiplication, not element-wise operations — this is another reason matmul dominates and element-wise ops are the bottleneck.

Utilization: Real workloads typically achieve 30–60% of peak TFLOP/s due to memory transfers, kernel launch overhead, synchronization, and imperfect tiling. Well-optimized large matmuls can reach 70–85%.

19.4 Data Types and Throughput

Numerical Formats in Deep Learning
Format Bits Bytes Range Precision Use Relative Throughput
FP32 32 4 $\pm 3.4 \times 10^{38}$ ~7 decimal digits Training (master weights) 1× (baseline)
FP16 16 2 $\pm 6.5 \times 10^4$ ~3.3 digits Training (mixed precision) 2× (plain FP16), 8–16× (tensor cores)
BF16 16 2 $\pm 3.4 \times 10^{38}$ ~2.4 digits Training (robust to overflow) Same as FP16
INT8 8 1 $-128$ to $127$ Integer only Inference quantization 4× (A100)
INT4 4 0.5 $-8$ to $7$ Very low Weight quantization 8× (theoretical)

Key insight: Moving from FP32 to FP16/BF16 gives: (a) 2× memory savings for activations, (b) 2–16× compute throughput improvement, (c) 2× memory bandwidth improvement. This is why mixed precision is nearly universal for training.

BF16 vs. FP16: BF16 has the same exponent range as FP32 (no overflow risk) but lower precision. FP16 has higher precision but requires loss scaling to handle small gradients. BF16 is preferred on newer hardware (A100+) because it eliminates the need for loss scaling.

19.5 Batch Size and Hardware Utilization

Batch Size → Utilization Relationship

Matrix multiplication of $(\text{B} \times n_{\text{in}}) \times (n_{\text{in}} \times n_{\text{out}})$ has arithmetic intensity:

$$\text{AI} = \frac{2B \cdot n_{\text{in}} \cdot n_{\text{out}}}{4(B \cdot n_{\text{in}} + n_{\text{in}} \cdot n_{\text{out}} + B \cdot n_{\text{out}})}$$

For small $B$: $\text{AI} \approx \frac{B}{2}$ — directly proportional to batch size. Doubling $B$ doubles AI, moving the operation further into the compute-bound regime.

For large $B$ (where $B \gg n$): $\text{AI} \approx \frac{n}{4}$ (for $n_{\text{in}} = n_{\text{out}} = n$) — saturates at the layer width.
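Plugging numbers into the AI expression above makes the small-batch penalty concrete; the layer width of 4096 below is chosen arbitrarily for illustration.

```python
# Arithmetic intensity of a (B x n) @ (n x n) fully connected layer vs. batch size.

def fc_arithmetic_intensity(batch, n_in, n_out, bytes_per_value=4):
    flops = 2 * batch * n_in * n_out
    traffic = bytes_per_value * (batch * n_in + n_in * n_out + batch * n_out)
    return flops / traffic

if __name__ == "__main__":
    for B in [1, 8, 64, 512, 4096, 32768]:
        ai = fc_arithmetic_intensity(B, 4096, 4096)
        print(f"B={B:6d}  AI={ai:7.1f} FLOP/byte")   # grows ~B/2, saturates near n/4
```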

Practical Batch Size Guidelines

B too small (e.g., 1–4): Matrix multiplications become memory-bound. GPU utilization <10%. Most time spent moving data, not computing.

B moderate (e.g., 32–256): Good compute-bound regime for typical layer sizes. GPU utilization 40–70%.

B large (e.g., 1024+): Near-peak utilization, but: (a) may require more epochs to converge (generalization gap), (b) requires linear scaling of learning rate (Goyal et al., 2017), (c) may exceed GPU memory.

The optimal batch size balances GPU utilization against convergence behavior. This is fundamentally a hardware-algorithm co-design problem.

References for §19

[5] Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM, 52(4), 65–76.

[4] Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018.

[6] Jia, Z. et al. (2019). Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826.

[7] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD. arXiv:1706.02677.


§20 Inference Optimization

Inference has fundamentally different constraints from training: no backpropagation (so no activation storage), typically smaller batch sizes (often B=1 for real-time applications), and a premium on latency rather than throughput. These constraints call for different optimization strategies.

20.1 Operator Fusion

Operator Fusion

Combining multiple sequential operations into a single GPU kernel, eliminating intermediate memory reads/writes. Common fusions:

Fused Operation Components Memory Savings
Conv-BN-ReLU Convolution + BatchNorm + ReLU Eliminates 2 intermediate tensors
Linear-GELU FC layer + GELU activation Eliminates 1 intermediate tensor
Residual-BN-ReLU-Add BN + ReLU + skip addition Eliminates 2 intermediate tensors

FLOP count: Unchanged — fusion does not reduce arithmetic, only memory traffic.

Speedup: 1.2–2× for memory-bound operation chains (which includes most non-matmul operations).

At inference, BatchNorm can be fully folded into the preceding conv layer (as shown in Part IV §15.1), making BN completely free.
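As a concrete instance of fusion, the BN-folding step mentioned above can be written directly against PyTorch module parameters. The sketch below uses the standard folding algebra (not tied to the Part IV derivation's notation); the sanity check at the end is illustrative.

```python
# Folding a BatchNorm2d into the preceding Conv2d for inference:
# W' = (gamma / sigma) * W,  b' = (gamma / sigma) * (b - mu) + beta.
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)        # gamma / sigma, per channel
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    fused.weight.copy_(conv.weight * scale.view(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

# Sanity check: the fused conv matches conv -> BN in eval mode.
conv, bn = nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32)
with torch.no_grad():                              # give BN non-trivial statistics
    bn.running_mean.uniform_(-1, 1)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-0.5, 0.5)
bn.eval()
x = torch.randn(1, 16, 8, 8)
assert torch.allclose(fold_bn_into_conv(conv, bn)(x), bn(conv(x)), atol=1e-4)
```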

20.2 Quantization

Post-Training Quantization (PTQ)

Convert trained FP32 weights and activations to lower precision (INT8, INT4) for inference:

$$q = \text{round}\!\left(\frac{x - z}{s}\right), \qquad x \approx s \cdot q + z$$

where $s$ is the scale factor and $z$ is the zero-point.
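A toy sketch of INT8 quantize/dequantize for a weight tensor, following the affine formula above with a single per-tensor scale; real toolchains typically use per-channel scales and calibrated activation ranges.

```python
# Post-training quantization sketch: affine INT8 quantize / dequantize of a tensor,
# using the formula above with a single per-tensor scale.
import numpy as np

def quantize_int8(x):
    """q = round((x - z) / s), mapping [min(x), max(x)] onto [-128, 127]."""
    lo, hi = float(x.min()), float(x.max())
    s = (hi - lo) / 255.0 or 1.0            # scale (guard against a constant tensor)
    z = lo + 128.0 * s                      # zero-point chosen so lo -> -128 and hi -> 127
    q = np.clip(np.round((x - z) / s), -128, 127).astype(np.int8)
    return q, s, z

def dequantize(q, s, z):
    """x_hat = s * q + z (the approximation used at inference time)."""
    return s * q.astype(np.float32) + z

if __name__ == "__main__":
    w = np.random.randn(256, 256).astype(np.float32)
    q, s, z = quantize_int8(w)
    err = np.abs(w - dequantize(q, s, z)).max()
    print(f"scale={s:.4g}  zero-point={z:.4g}  max abs error={err:.4g}")  # error <= ~s/2
```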

Quantization — Efficiency Impact
Metric FP32 FP16 INT8 INT4
Bytes per value 4 2 1 0.5
Model size (relative) 1× 0.5× 0.25× 0.125×
Memory traffic (relative) 1× 0.5× 0.25× 0.125×
Compute throughput (vs FP32) 1× 2–16× 4–32× 8–64× (emerging)
Typical accuracy loss 0% (reference) <0.1% 0.1–0.5% 0.5–3%

INT8 quantization is the current sweet spot: 4× model compression, 4× bandwidth improvement, and <0.5% accuracy loss for most models. Frameworks like TensorRT, ONNX Runtime, and PyTorch all support INT8 inference natively.

20.3 Pruning

Weight Pruning

Unstructured pruning: Set individual weights to zero. Sparsity of 80–95% is typical.

$$\text{FLOPs}_{\text{theoretical}} = (1 - s) \times \text{FLOPs}_{\text{dense}}$$

where $s$ is sparsity fraction. But: unstructured sparsity is hard to accelerate on GPUs (irregular memory access patterns). Real speedup is often only 1.5–2× even at 90% sparsity, versus the theoretical 10×.

Structured pruning: Remove entire channels, filters, or layers. This produces dense, regular tensors that map directly to existing hardware.

$$\text{FLOPs}_{\text{structured}} = \text{FLOPs of smaller dense model}$$

No special hardware support needed — equivalent to training a smaller model. Speedup is proportional to FLOP reduction.
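A sketch contrasting the two FLOP accountings — a magnitude-based unstructured mask versus removing whole output channels from the Conv2D formula of §17.2. The sparsity level, threshold rule, and shapes are illustrative.

```python
# Pruning FLOP accounting sketch: theoretical (unstructured) vs. realizable (structured).
import numpy as np

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    return 2 * c_out * h_out * w_out * c_in * k * k

def unstructured_theoretical_flops(weights, dense_flops, sparsity=0.90):
    """Zero the smallest `sparsity` fraction of weights by magnitude; FLOPs scale with
    the surviving fraction, but this is rarely realized without sparse-kernel support."""
    threshold = np.quantile(np.abs(weights), sparsity)
    keep_fraction = float(np.mean(np.abs(weights) >= threshold))
    return keep_fraction * dense_flops

def structured_flops(c_in, c_out, k, h_out, w_out, channels_removed):
    """Remove whole output channels: simply a smaller dense convolution."""
    return conv2d_flops(c_in, c_out - channels_removed, k, h_out, w_out)

if __name__ == "__main__":
    dense = conv2d_flops(256, 256, 3, 56, 56)
    w = np.random.randn(256, 256, 3, 3)
    print(f"dense:        {dense / 1e9:.2f} GFLOPs")
    print(f"unstructured: {unstructured_theoretical_flops(w, dense) / 1e9:.2f} GFLOPs (theoretical)")
    print(f"structured:   {structured_flops(256, 256, 3, 56, 56, 128) / 1e9:.2f} GFLOPs (realizable)")
```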

20.4 Knowledge Distillation

Knowledge Distillation

Train a small "student" model to mimic the outputs (soft labels) of a large "teacher" model:

$$L_{\text{distill}} = \alpha \cdot L_{\text{CE}}(\hat{y}_{\text{student}}, y_{\text{true}}) + (1-\alpha) \cdot T^2 \cdot D_{\text{KL}}\!\left(\text{softmax}\!\left(\frac{z_{\text{student}}}{T}\right) \;\Big\|\; \text{softmax}\!\left(\frac{z_{\text{teacher}}}{T}\right)\right)$$

where $T$ is the temperature (typically 3–20) and $\alpha$ balances hard and soft labels.

Metric Value
Training cost Cost of running teacher (once) + training student (standard)
Inference cost Only the student model — teacher is discarded
Typical results Student achieves 95–99% of teacher accuracy at 3–10× fewer FLOPs

Hinton, G. et al. (2015), "Distilling the Knowledge in a Neural Network"
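The combined loss above maps onto a few PyTorch calls. A sketch with the usual conventions — the teacher distribution is the KL target, and the $T$ and $\alpha$ values are illustrative, not prescriptive.

```python
# Knowledge-distillation loss sketch: hard-label CE plus temperature-softened KL term.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, targets)
    # kl_div(student log-probs, teacher probs) gives KL(teacher || student),
    # the usual target direction for distillation.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean")
    return alpha * hard + (1.0 - alpha) * (T * T) * soft   # T^2 restores the gradient scale

# Usage with random logits: batch of 8, 100 classes.
student = torch.randn(8, 100, requires_grad=True)
teacher = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
distillation_loss(student, teacher, labels).backward()
```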

References for §20

[8] Jacob, B. et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR 2018.

[9] Han, S., Mao, H., & Dally, W. J. (2016). Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR 2016.

[10] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. NIPS Workshop 2015.


§21 Scaling Laws

21.1 How FLOPs Scale with Model Dimensions

FLOP Scaling for MLPs and CNNs
Scaling Dimension Parameter Scaling FLOP Scaling (per sample) FLOP Scaling (training)
Width × $\alpha$ (all layers) $\propto \alpha^2$ $\propto \alpha^2$ $\propto \alpha^2$
Depth × $\alpha$ (add layers) $\propto \alpha$ $\propto \alpha$ $\propto \alpha$
Resolution × $\alpha$ (input size) unchanged (CNN) $\propto \alpha^2$ (spatial dimensions) $\propto \alpha^2$
Batch size × $\alpha$ unchanged unchanged (per sample) unchanged (same total samples)
Epochs × $\alpha$ unchanged unchanged $\propto \alpha$

Width is the most expensive dimension to scale — doubling all layer widths quadruples both parameters and FLOPs. Depth scaling is linear in both. Resolution scaling is quadratic in FLOPs but doesn't add parameters (for CNNs with weight sharing).

This is precisely why EfficientNet's compound scaling ($d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$ with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$) is designed to balance these different scaling rates.
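A quick numeric check of the compound-scaling constraint; the coefficients below ($\alpha=1.2$, $\beta=1.1$, $\gamma=1.15$) are the values reported for EfficientNet-B0, used here purely for illustration.

```python
# Compound-scaling check: FLOPs grow as (alpha * beta^2 * gamma^2)^phi, designed ~ 2^phi.
alpha, beta, gamma = 1.2, 1.1, 1.15      # depth, width, resolution coefficients

def flop_multiplier(phi):
    """Depth contributes linearly to FLOPs; width and resolution quadratically."""
    return (alpha * beta ** 2 * gamma ** 2) ** phi

for phi in range(5):
    print(f"phi={phi}: depth x{alpha ** phi:.2f}, width x{beta ** phi:.2f}, "
          f"resolution x{gamma ** phi:.2f}, FLOPs x{flop_multiplier(phi):.2f}")
```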

21.2 Neural Scaling Laws

Kaplan et al. (2020) — Scaling Laws for Neural Language Models

For Transformer language models, test loss follows power laws in three variables:

$$L(N) \propto N^{-\alpha_N}, \qquad L(D) \propto D^{-\alpha_D}, \qquad L(C) \propto C^{-\alpha_C}$$

where $N$ = model parameters, $D$ = dataset size (tokens), $C$ = training compute (FLOPs).

Empirically: $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.050$.

Key findings:

(1) Performance improves predictably as a power law in compute — no surprises, no plateaus (within the studied range).

(2) Given a fixed compute budget $C$, the optimal allocation is to use a large model trained on a moderate amount of data, rather than a small model trained on a lot of data ("big model, few epochs" is more efficient).

(3) The compute-optimal model size scales as $N_{\text{opt}} \propto C^{0.73}$.

Hoffmann et al. (2022) — Chinchilla Scaling Laws

Updated the Kaplan findings with the "Chinchilla" result: for compute-optimal training, the model size $N$ and dataset size $D$ should be scaled equally:

$$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$$

This means the optimal number of training tokens is approximately $D \approx 20N$ (20 tokens per parameter). Many large language models before Chinchilla were undertrained — they used too few tokens for their size.

Practical implication: Given a compute budget, the Chinchilla law tells you exactly how big your model should be and how much data you need. This converts the model design problem into a pure compute-allocation problem.

Scaling Law Application — Compute Budget Planning

Given a compute budget of $C = 10^{21}$ FLOPs:

Chinchilla-optimal:

$N \approx \sqrt{C/120} \approx \sqrt{8.3 \times 10^{18}} \approx 2.9$ billion parameters (from $C \approx 6ND$ with $D \approx 20N$, i.e., $C \approx 120N^2$)

$D \approx 20N \approx 58$ billion tokens

$\text{Training FLOPs check: } 6ND = 6 \times 2.9 \times 10^9 \times 58 \times 10^9 \approx 1.0 \times 10^{21}$ FLOPs ✓

(The factor of 6 in $C \approx 6ND$ is the approximate FLOP cost per parameter per token in a Transformer: 2 FLOPs per parameter for the forward pass, times 3 for the full training step.)
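The allocation rule is a one-liner once the budget is fixed. A sketch using $C \approx 6ND$ and $D \approx 20N$ (so $C \approx 120N^2$); both constants are the approximations used above, not exact values.

```python
# Chinchilla-style compute allocation: given a FLOP budget C, pick N and D.
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """C ~= 6*N*D with D ~= 20*N  =>  N ~= sqrt(C / 120)."""
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    return n_params, tokens_per_param * n_params

if __name__ == "__main__":
    for C in [1e21, 1e22, 1e23, 1e24]:
        N, D = chinchilla_allocation(C)
        print(f"C={C:.0e}: N ~ {N / 1e9:6.1f}B params, D ~ {D / 1e9:7.0f}B tokens")
```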

Relevance to CNN Efficiency Research

While the Kaplan and Chinchilla scaling laws were derived for Transformers, the fundamental principle applies to CNNs: for a fixed compute budget, there exists an optimal model size and training duration. Scaling laws provide a principled framework for answering "should I train a bigger model for fewer epochs or a smaller model for more epochs?" — a central question in computational efficiency research.

For CNNs specifically, the compound scaling principle of EfficientNet ($\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$) is a scaling law designed for the multi-dimensional CNN design space (depth × width × resolution).

References for §21

[11] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.

[12] Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. (Chinchilla)

[13] Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling. ICML 2019.

[14] Rosenfeld, J. S. et al. (2020). A Constructive Prediction of the Generalization Error Across Scales. ICLR 2020.


§ Master Efficiency Reference

Memory Optimization Techniques — Summary

Technique Memory Savings FLOP Overhead Accuracy Impact Complexity
Mixed Precision (FP16/BF16) ~50% activations ~0% (faster on tensor cores) <0.1% Low
Gradient Checkpointing $O(L) \to O(\sqrt{L})$ activations +33% 0% Low
Gradient Accumulation $B_{\text{eff}} \to B_{\text{micro}}$ activations 0% 0% Trivial
INT8 Quantization (inference) 75% model + activations 0% (faster on INT8 hw) 0.1–0.5% Medium
Pruning (structured) Proportional to sparsity Proportional reduction 0.5–2% High
Knowledge Distillation Student model size Student model FLOPs 1–5% vs teacher Medium
BN Fusion (inference) Eliminates BN memory Eliminates BN FLOPs 0% Trivial
Operator Fusion Eliminates intermediates 0% 0% Framework-level

Quick Reference — Total Training Compute

The Five Numbers You Need

For any neural network training run, compute these five quantities:

1. Forward FLOPs per sample $F$: Sum FLOP formulas for all layers

2. Total training FLOPs: $C = 3 \times F \times N_{\text{samples}} \times E$

3. Parameter memory: $M_P = (4 + 4 + 4M_{\text{opt}}) \times P$ bytes

4. Activation memory: $M_A = \text{bytes\_per\_element} \times B \times \sum_l |A_l|$

5. Training time: $T = C / (\text{Hardware FLOP/s} \times \text{utilization})$
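These five quantities can be consolidated into a single helper; all inputs are assumptions you supply (forward FLOPs per sample from the layer formulas, activation element counts from the architecture, hardware peak and utilization from your setup).

```python
# Consolidated estimator for the five quantities above; all inputs are assumptions you supply.

def efficiency_summary(fwd_flops_per_sample, n_samples, epochs, n_params, m_opt,
                       act_elems_per_sample, batch_size, bytes_per_act,
                       peak_flops, utilization):
    total_flops = 3 * fwd_flops_per_sample * n_samples * epochs
    return {
        "1. forward FLOPs / sample": fwd_flops_per_sample,
        "2. total training FLOPs":   total_flops,
        "3. parameter memory (B)":   (4 + 4 + 4 * m_opt) * n_params,
        "4. activation memory (B)":  bytes_per_act * batch_size * act_elems_per_sample,
        "5. training time (s)":      total_flops / (peak_flops * utilization),
    }
```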


§ Complete References

Core Survey

[1] Sze, V., Chen, Y., Yang, T., & Emer, J. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12), 2295–2329.

Model Analysis

[2] Canziani, A. et al. (2017). An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678.

Memory Optimization

[3] Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.

[4] Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018.

Hardware & Performance

[5] Williams, S. et al. (2009). Roofline: An Insightful Visual Performance Model. Commun. ACM, 52(4), 65–76.

[6] Jia, Z. et al. (2019). Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826.

[7] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD. arXiv:1706.02677.

Inference Optimization

[8] Jacob, B. et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR 2018.

[9] Han, S. et al. (2016). Deep Compression. ICLR 2016.

[10] Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NIPS Workshop 2015.

Scaling Laws

[11] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.

[12] Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022.

[13] Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling. ICML 2019.

[14] Rosenfeld, J. S. et al. (2020). A Constructive Prediction of the Generalization Error Across Scales. ICLR 2020.
