FLOP (Floating-Point Operation): A single arithmetic operation on floating-point numbers — one addition, one multiplication, one comparison, etc.
MAC (Multiply-Accumulate): One multiplication followed by one addition: $a \leftarrow a + b \times c$. This counts as 2 FLOPs.
FLOPs (plural): The total count of floating-point operations in a computation. (Confusingly, some papers use "FLOPs" to mean MACs.)
FLOP/s (FLOPs per second): A measure of hardware throughput, not a count. Measured in GFLOP/s (giga) or TFLOP/s (tera).
There is widespread inconsistency in the literature. When a paper reports "X GFLOPs" for a model, it may mean either $X$ billion FLOPs or $X$ billion MACs (= $2X$ billion FLOPs). Always check the definition used. In this document, we consistently use FLOPs = total floating-point operations, where one MAC = 2 FLOPs. This is the convention used by Sze et al. (2017) and most hardware specifications.
For matrix multiplication of $(m \times n) \times (n \times p)$:
Our convention: $2mnp$ FLOPs (each of the $mp$ outputs requires $n$ MACs = $2n$ FLOPs)
Some papers: $mnp$ "multiply-adds" or "MACs"
| Layer Type | FLOPs | Dominant Term |
|---|---|---|
| Fully Connected ($n_{\text{in}} \to n_{\text{out}}$) | $2 n_{\text{in}} n_{\text{out}}$ | MatMul |
| Conv2D ($C_i \to C_o$, $k \times k$) | $2 C_o H_o W_o C_i k^2$ | MatMul (via im2col) |
| Depthwise Conv ($C$, $k \times k$) | $2 C H_o W_o k^2$ | Per-channel conv |
| 1×1 Conv ($C_i \to C_o$) | $2 C_o H W C_i$ | MatMul |
| DW Separable ($C_i \to C_o$, $k \times k$) | $2 H_o W_o C_i(k^2 + C_o)$ | Pointwise dominates |
| BatchNorm ($C$ channels, spatial $H \times W$) | $\approx 7 C H W$ | Element-wise |
| ReLU ($n$ elements) | $n$ | Comparison |
| Softmax ($C$ classes) | $\approx 5C$ | Exp + division |
| Residual Addition ($n$ elements) | $n$ | Element-wise add |
| Max Pool ($k \times k$) | $(k^2-1) C H_o W_o$ | Comparisons |
| Global Average Pool ($C$ channels) | $CHW$ | Additions |
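The per-layer formulas in the table can be collected into small helper functions for quick budgeting. A minimal sketch, using our convention of 1 MAC = 2 FLOPs (the example dimensions are illustrative):

```python
def fc_flops(n_in, n_out):
    """Fully connected layer: one MAC (= 2 FLOPs) per weight."""
    return 2 * n_in * n_out

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """Standard convolution: each output element needs c_in * k^2 MACs."""
    return 2 * c_out * h_out * w_out * c_in * k * k

def depthwise_conv_flops(c, k, h_out, w_out):
    """Depthwise convolution: one k x k filter per channel."""
    return 2 * c * h_out * w_out * k * k

def dw_separable_flops(c_in, c_out, k, h_out, w_out):
    """Depthwise k x k followed by pointwise 1 x 1."""
    return 2 * h_out * w_out * c_in * (k * k + c_out)

# Example: a 3x3 conv, 256 -> 256 channels, 56x56 output
print(f"{conv2d_flops(256, 256, 3, 56, 56) / 1e9:.2f} GFLOPs")
```

Summing these helpers over every layer of a network gives the forward FLOPs per sample used throughout this section.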
For typical CNNs, the distribution of FLOPs across operation types is approximately:
| Operation Type | % of Forward FLOPs | Examples |
|---|---|---|
| Matrix Multiplication (Conv + FC) | 90–99% | Conv layers, FC layers |
| Normalization (BN/LN) | 1–5% | BatchNorm, LayerNorm |
| Activation Functions | 0.1–1% | ReLU, GELU |
| Pooling | 0.1–0.5% | MaxPool, GAP |
| Residual Connections | 0.01–0.1% | Skip addition |
| Softmax + Loss | <0.01% | Output layer |
This extreme concentration means that optimizing matrix multiplication is virtually synonymous with optimizing neural network computation.
For any neural network layer performing a linear operation $\bm{Y} = \bm{X}\bm{W}\T$ on a batch of $m$ samples:
Forward: 1 matmul → $2mn_{\text{in}}n_{\text{out}}$ FLOPs
Backward (weight gradient): $\pd{L}{\bm{W}} = \bm{\delta}\T\bm{X}$ → 1 matmul → same order FLOPs
Backward (input gradient): $\pd{L}{\bm{X}} = \bm{\delta}\bm{W}$ → 1 matmul → same order FLOPs
$$\boxed{\text{FLOPs}_{\text{backward}} \approx 2 \times \text{FLOPs}_{\text{forward}}}$$ $$\boxed{\text{FLOPs}_{\text{training step}} \approx 3 \times \text{FLOPs}_{\text{forward}}}$$This holds for every linear layer (FC, Conv, Attention) and therefore for the entire network.
$$\text{FLOPs}_{\text{total}} = 3 \times F_{\text{forward}} \times N_{\text{samples}} \times E$$where $F_{\text{forward}}$ = forward FLOPs per sample, $N_{\text{samples}}$ = dataset size, $E$ = number of epochs.
Equivalently, using total tokens/samples seen:
$$\text{FLOPs}_{\text{total}} = 3 \times F_{\text{forward}} \times D$$where $D = N_{\text{samples}} \times E$ = total samples processed during training.
Forward FLOPs per image: 3.8G (≈ 1.9G MACs)
Dataset: 1.28M images
Epochs: 90 (standard)
$$\text{FLOPs}_{\text{total}} = 3 \times 3.8 \times 10^9 \times 1.28 \times 10^6 \times 90$$ $$= 3 \times 3.8 \times 1.28 \times 90 \times 10^{15} \approx 1.31 \times 10^{18} \text{ FLOPs}$$ $$\approx 1.31 \text{ EFLOPs (exaFLOPs)}$$Training time estimate: On an NVIDIA V100 (125 TFLOP/s FP16, ~50% utilization):
$$t = \frac{1.31 \times 10^{18}}{125 \times 10^{12} \times 0.5} \approx 21{,}000 \text{ seconds} \approx 5.8 \text{ hours}$$(In practice, ~6–8 hours on 8× V100 with data parallelism, consistent with this estimate.)
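The arithmetic above reduces to a two-line estimator. The sketch below reproduces the ResNet-50 numbers; the hardware peak and 50% utilization are the assumptions stated in the text:

```python
def training_flops(forward_flops_per_sample, n_samples, epochs):
    """Total training FLOPs: backward ~= 2x forward, so ~3x forward per sample."""
    return 3 * forward_flops_per_sample * n_samples * epochs

def training_time_hours(total_flops, peak_flops_per_sec, utilization=0.5):
    """Wall-clock estimate at a given sustained fraction of peak throughput."""
    return total_flops / (peak_flops_per_sec * utilization) / 3600

# ResNet-50 on ImageNet: 3.8 GFLOPs/image, 1.28M images, 90 epochs, V100 FP16
C = training_flops(3.8e9, 1.28e6, 90)
t = training_time_hours(C, 125e12, utilization=0.5)
print(f"{C:.3g} FLOPs, {t:.1f} hours")
```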
[1] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks. Proc. IEEE, 105(12), 2295–2329.
[2] Canziani, A., Paszke, A., & Culurciello, E. (2017). An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678.
Memory is frequently the binding constraint in neural network training — not FLOPs. A model that fits on a GPU can be trained; one that doesn't fit cannot, regardless of how many FLOPs the GPU can execute. Understanding memory composition in detail is therefore critical for practical efficiency research.
| Component | Size (float32) | Depends On | Scales With |
|---|---|---|---|
| Parameters $\bm{\theta}$ | $4P$ bytes | Architecture only | Model size |
| Gradients $\nabla\bm{\theta}$ | $4P$ bytes | Architecture only | Model size |
| Optimizer state | $4M_{\text{opt}}P$ bytes | Architecture + optimizer choice | Model size |
| Activations (cached for backprop) | $4 \times \sum_l |A_l|$ bytes | Architecture + batch size | Model size × batch |
| Temporary buffers | Variable | Implementation | Largest single operation |
$M_{\text{opt}}$: 0 (SGD), 1 (Momentum), 2 (Adam)
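The parameter-related rows of the table can be computed directly. A sketch assuming float32 storage throughout; the ~25.6M parameter count used in the example is the usual figure for ResNet-50:

```python
def parameter_memory_bytes(P, optimizer="adam", bytes_per_param=4):
    """Parameters + gradients + optimizer state, all in float32 by default.

    Total = 4P (params) + 4P (grads) + 4 * M_opt * P (optimizer slots).
    """
    slots = {"sgd": 0, "momentum": 1, "adam": 2}[optimizer]  # M_opt
    return (2 + slots) * bytes_per_param * P

# ResNet-50 (~25.6M parameters) trained with Adam: 16 bytes per parameter
gb = parameter_memory_bytes(25.6e6, "adam") / 1e9
print(f"{gb:.2f} GB")
```

Note that this covers only the first three rows of the table; activations (next subsection) usually dominate.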
Activations must be stored during the forward pass and kept in memory until the corresponding layer's backward pass consumes them. For a network with $L$ layers, the activation at layer $l$ is freed only after layer $l$'s backward pass completes, so the peak activation memory is approximately the sum of all layer activations.
| Layer | What Must Be Cached | Size (per sample) |
|---|---|---|
| Conv ($C_o$, $H_o \times W_o$) | Input tensor (for weight grad) | $C_i \times H_i \times W_i$ |
| ReLU | Binary mask (1 bit per element) | $C \times H \times W / 8$ bytes (packed) |
| BatchNorm | Normalized values, $\mu$, $\sigma^2$ | $C \times H \times W + 2C$ |
| FC ($n_{\text{out}}$) | Input vector | $n_{\text{in}}$ |
| Dropout | Binary mask | $n / 8$ bytes (packed) |
| Max Pool | Argmax indices | $C \times H_o \times W_o$ (int) |
| Residual Add | Both inputs (skip + main branch) | $2 \times C \times H \times W$ |
For a deep network, these accumulate: ResNet-50 with B=32 stores ~640M float32 values ≈ 2.5 GB for activations alone.
$$M_{\text{act}} \approx 4B \sum_{l=1}^{L} n_l \text{ bytes (float32)}$$where $n_l$ is the activation tensor size (elements per sample) at layer $l$. Key observations:
Scales linearly with batch size $B$: halving $B$ halves activation memory.
Scales linearly with depth $L$: for fixed-width networks, doubling depth doubles activation memory.
Scales quadratically with resolution: doubling input resolution (e.g., 224→448) quadruples $H \times W$ at every layer → 4× activation memory. This is why high-resolution training is so memory-hungry.
Gradient checkpointing is the most important memory optimization technique for training. It trades computation for memory by discarding some activations during the forward pass and recomputing them during the backward pass.
For a network with $L$ layers, divide it into $K$ segments of $L/K$ layers each. During the forward pass, store only the activations at segment boundaries (checkpoints). During the backward pass, when processing segment $k$, recompute that segment's forward pass from the nearest checkpoint.
Optimal checkpoint strategy ($\sqrt{L}$ checkpoints):
$$K = \sqrt{L}$$ $$\text{Memory}_{\text{act}} = O(\sqrt{L}) \text{ instead of } O(L)$$ $$\text{Extra FLOPs} = +33\% \text{ (one extra forward pass per segment, amortized)}$$More precisely, the extra computation equals one additional forward pass through the entire network, meaning:
$$\text{FLOPs}_{\text{training}} = \underbrace{F}_{\text{fwd}} + \underbrace{F}_{\text{recompute}} + \underbrace{2F}_{\text{bwd}} = 4F \text{ instead of } 3F$$ $$\text{Overhead} = \frac{4F - 3F}{3F} \approx 33\%$$

| | Standard | Checkpointed (4 segments) |
|---|---|---|
| Activation memory | ~2.5 GB | ~0.6 GB |
| Training FLOPs per step | $3F$ (standard) | $4F$ (+33%) |
| Max batch size (16 GB GPU) | B ≈ 160 | B ≈ 640 |
| Effective throughput | Baseline | +300% batch size, -25% per-step speed |
The larger batch size enabled by checkpointing can actually increase throughput despite the extra computation, because larger batches utilize GPU hardware more efficiently.
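A toy cost model of the $\sqrt{L}$ strategy for a chain of $L$ identical layers. This is a sketch of the accounting only, under the simplification that peak storage is the checkpoints plus one in-flight segment:

```python
import math

def checkpoint_costs(L):
    """Compare stored activations and forward work for a chain of L layers:
    standard backprop vs sqrt(L)-segment gradient checkpointing."""
    k = max(1, round(math.sqrt(L)))     # segment length ~ sqrt(L)
    checkpoints = math.ceil(L / k)      # activations kept during forward
    return {
        "standard_stored": L,                    # every activation kept
        "checkpointed_stored": checkpoints + k,  # checkpoints + one segment
        "extra_forward_layers": L,               # ~ one extra full forward pass
    }

print(checkpoint_costs(100))  # peak storage drops from 100 to ~20
```

At $k = \sqrt{L}$ the peak $L/k + k$ is minimized at $2\sqrt{L}$, which is where the $O(\sqrt{L})$ memory claim comes from.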
[3] Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.
Store activations and compute forward/backward passes in FP16 (16-bit floating point, 2 bytes per value) while maintaining an FP32 master copy of the weights for the optimizer update. This approach:
(1) Halves activation memory (2 bytes instead of 4 per element)
(2) Doubles arithmetic throughput on tensor cores (2× FP16 FLOP/s vs FP32)
(3) Halves memory bandwidth usage (move half as many bytes)
| Component | FP32 | Mixed Precision | Savings |
|---|---|---|---|
| Parameters (master copy) | $4P$ | $4P$ (keep FP32) | 0% |
| Parameters (FP16 copy for fwd/bwd) | — | $2P$ (extra) | — |
| Gradients | $4P$ | $2P$ (FP16) | 50% |
| Adam $m$ | $4P$ | $4P$ (FP32) | 0% |
| Adam $v$ | $4P$ | $4P$ (FP32) | 0% |
| Activations | $4 \sum|A_l|$ | $2 \sum|A_l|$ (FP16) | 50% |
The main saving is in activations (which dominate). For ResNet-50: activation memory drops from ~2.5 GB to ~1.25 GB.
Total parameter memory (Adam, mixed precision): $4P + 2P + 2P + 4P + 4P = 16P$ bytes — same as FP32 Adam (the FP16 weight copy offsets the gradient savings).
FP16 has a limited dynamic range: values below the smallest normal number, $\sim 6 \times 10^{-5}$, lose precision and eventually underflow to zero. Many gradient values in deep networks fall below this threshold. Loss scaling multiplies the loss by a large constant $S$ (e.g., $S = 1024$) before backpropagation, which scales all gradients by $S$ and keeps them within FP16 range. Before the optimizer update, gradients are divided by $S$ and converted to FP32.
Dynamic loss scaling: start with a large $S$, halve it whenever a NaN/Inf gradient is detected (skipping that optimizer update), and periodically increase it again after a sustained run of overflow-free steps. Cost: $O(P)$ per step for the overflow check, which is negligible.
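The control logic of dynamic loss scaling can be sketched in a few lines. The initial scale and growth interval below are typical defaults, not values from the text:

```python
class DynamicLossScaler:
    """Sketch of dynamic loss scaling: halve S on overflow,
    grow S after a run of stable steps."""
    def __init__(self, init_scale=2.0**15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf_or_nan):
        """Returns False when the optimizer update should be skipped."""
        if found_inf_or_nan:
            self.scale /= 2          # gradients overflowed: shrink S
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2          # stable for a while: try a larger S
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
steps = [scaler.update(f) for f in (False, False, True, False, False, False)]
print(scaler.scale, steps)
```

Training code would multiply the loss by `scaler.scale` before backprop and divide the gradients by it before the update.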
Micikevicius, P. et al. (2018), "Mixed Precision Training", ICLR 2018
Run $K$ forward-backward passes with micro-batch size $B_{\text{micro}}$, accumulating gradients, then perform one optimizer update. The effective batch size is:
$$B_{\text{effective}} = K \times B_{\text{micro}}$$Memory: Only $B_{\text{micro}}$ samples in memory at once → activation memory scales with $B_{\text{micro}}$, not $B_{\text{effective}}$.
FLOPs: Identical to training with $B_{\text{effective}}$ directly (same number of forward/backward operations).
Speed: Slightly slower than true large-batch training due to sequential processing and reduced GPU parallelism, but enables arbitrary effective batch sizes on limited memory.
Suppose you need effective batch size 256 for convergence, but your GPU only fits batch size 16:
$K = 256 / 16 = 16$ accumulation steps per optimizer update.
Memory used: Same as B=16 (not B=256).
Training FLOPs: Same as B=256 (same total samples processed).
Wall-clock time: ~16× slower than 16 GPUs doing B=16 each in parallel (data parallelism), but free in terms of hardware cost.
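Why accumulation is exact: with equal-sized micro-batches, the average of micro-batch gradients equals the full-batch gradient. A numpy check with a synthetic linear model and MSE loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 8)), rng.normal(size=(256,))
w = rng.normal(size=(8,))

def grad(Xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)^2)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full batch of 256 vs. K = 16 accumulated micro-batches of 16
g_full = grad(X, y, w)
K, B = 16, 16
g_acc = sum(grad(X[i*B:(i+1)*B], y[i*B:(i+1)*B], w) for i in range(K)) / K
print(np.allclose(g_full, g_acc))  # True
```

The only caveat in practice is batch-dependent layers such as BatchNorm, whose statistics see $B_{\text{micro}}$ samples, not $B_{\text{effective}}$.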
[3] Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.
[4] Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018.
[1] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks.
FLOPs alone do not determine execution speed. Two operations with the same FLOP count can differ by 100× in actual runtime depending on how well they map to hardware. This section bridges the gap between theoretical FLOPs and real-world performance.
The arithmetic intensity (AI) of an operation is the ratio of computation to data movement:
$$\text{AI} = \frac{\text{FLOPs}}{\text{Bytes accessed from/to memory}}$$Units: FLOPs/byte. This is the fundamental quantity determining whether an operation is compute-bound or memory-bound.
Achievable performance is bounded by:
$$\text{Performance (FLOP/s)} \leq \min\left(\text{Peak compute (FLOP/s)},\;\; \text{AI} \times \text{Memory bandwidth (bytes/s)}\right)$$The "roofline" is the plot of achievable FLOP/s vs. arithmetic intensity. It rises linearly with AI (memory-bound regime) until it hits the peak compute (compute-bound regime). The crossover point is:
$$\text{AI}_{\text{ridge}} = \frac{\text{Peak FLOP/s}}{\text{Memory bandwidth}}$$Operations with $\text{AI} < \text{AI}_{\text{ridge}}$ are memory-bound; operations with $\text{AI} > \text{AI}_{\text{ridge}}$ are compute-bound.
| Operation | FLOPs | Bytes Moved | AI (FLOP/byte) | Bound |
|---|---|---|---|---|
| MatMul $(M \times K) \times (K \times N)$, large | $2MKN$ | $4(MK + KN + MN)$ | $\approx \frac{N}{6}$ (for $M=K=N$) | Compute |
| Conv 3×3 (256→256, 56×56) | $\sim$3.7G | $\sim$9M bytes | ~400 | Compute |
| ReLU (1M elements) | 1M | $4 \times 2 \times 1$M = 8M bytes | 0.125 | Memory |
| BatchNorm (1M elements) | $\sim$7M | $\sim$20M bytes | 0.35 | Memory |
| Bias addition (1M elements) | 1M | 8M bytes | 0.125 | Memory |
| Softmax (1000 classes) | ~5K | ~12K bytes | 0.42 | Memory |
For an NVIDIA A100 GPU: $\text{AI}_{\text{ridge}} = \frac{312 \text{ TFLOP/s}}{2039 \text{ GB/s}} \approx 153$ FLOP/byte. Only large matrix multiplications and convolutions exceed this threshold.
Consider two hypothetical operations, each with 1G FLOPs:
Operation A (one large matmul): AI ≈ 200 FLOP/byte → compute-bound at 312 TFLOP/s → time = 1G / 312T ≈ 3.2 μs
Operation B (1000 element-wise ops on 1M elements each): AI ≈ 0.25 FLOP/byte → memory-bound at 2039 GB/s → time = 4G bytes / 2039 GB/s ≈ 2 ms
Same FLOPs, but operation B is 625× slower because it's bandwidth-limited. This is why operator fusion (combining multiple element-wise ops into one kernel to avoid round-trips to memory) is critical for inference optimization.
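The roofline bound is easy to evaluate numerically. A sketch using the A100 peak numbers quoted above as defaults, reproducing operations A and B:

```python
def roofline_time(flops, bytes_moved, peak_flops=312e12, bw=2039e9):
    """Lower-bound execution time from the roofline model (A100 defaults)."""
    ai = flops / bytes_moved
    attainable = min(peak_flops, ai * bw)   # FLOP/s ceiling at this AI
    return flops / attainable, ai

t_a, ai_a = roofline_time(1e9, 5e6)   # one large matmul: AI = 200, compute-bound
t_b, ai_b = roofline_time(1e9, 4e9)   # 1000 element-wise kernels: AI = 0.25
print(f"A: {t_a*1e6:.1f} us (AI={ai_a:.0f}), B: {t_b*1e3:.2f} ms (AI={ai_b})")
```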
| GPU | FP32 TFLOP/s | FP16 TFLOP/s | INT8 TOPS | HBM (GB) | BW (GB/s) |
|---|---|---|---|---|---|
| V100 (2017) | 15.7 | 125 | — | 32 | 900 |
| A100 (2020) | 19.5 | 312 | 624 | 40/80 | 2,039 |
| H100 SXM (2022) | 67 | 990 | 1,979 | 80 | 3,350 |
Tensor cores are specialized hardware units that compute small matrix multiplications (e.g., 4×4×4) in a single cycle. They provide the FP16/BF16 throughput numbers above. Key: tensor cores only accelerate matrix multiplication, not element-wise operations — this is another reason matmul dominates and element-wise ops are the bottleneck.
Utilization: Real workloads typically achieve 30–60% of peak TFLOP/s due to memory transfers, kernel launch overhead, synchronization, and imperfect tiling. Well-optimized large matmuls can reach 70–85%.
| Format | Bits | Bytes | Range | Precision | Use | Relative Throughput |
|---|---|---|---|---|---|---|
| FP32 | 32 | 4 | $\pm 3.4 \times 10^{38}$ | ~7 decimal digits | Training (master weights) | 1× |
| FP16 | 16 | 2 | $\pm 6.5 \times 10^4$ | ~3.3 digits | Training (mixed precision) | 2× (V100), 16× (A100 TC) |
| BF16 | 16 | 2 | $\pm 3.4 \times 10^{38}$ | ~2.4 digits | Training (robust to overflow) | Same as FP16 |
| INT8 | 8 | 1 | $-128$ to $127$ | Integer only | Inference quantization | 4× (A100) |
| INT4 | 4 | 0.5 | $-8$ to $7$ | Very low | Weight quantization | 8× (theoretical) |
Key insight: Moving from FP32 to FP16/BF16 gives: (a) 2× memory savings for activations, (b) 2–16× compute throughput improvement, (c) 2× memory bandwidth improvement. This is why mixed precision is nearly universal for training.
BF16 vs. FP16: BF16 has the same exponent range as FP32 (no overflow risk) but lower precision. FP16 has higher precision but requires loss scaling to handle small gradients. BF16 is preferred on newer hardware (A100+) because it eliminates the need for loss scaling.
Matrix multiplication of $(\text{B} \times n_{\text{in}}) \times (n_{\text{in}} \times n_{\text{out}})$ has arithmetic intensity:
$$\text{AI} = \frac{2B \cdot n_{\text{in}} \cdot n_{\text{out}}}{4(B \cdot n_{\text{in}} + n_{\text{in}} \cdot n_{\text{out}} + B \cdot n_{\text{out}})}$$For small $B$: $\text{AI} \approx \frac{B}{2}$ — directly proportional to batch size. Doubling $B$ doubles AI, moving the operation further into the compute-bound regime.
For large $B$ (where $B \gg n$): $\text{AI} \approx \frac{n_{\text{in}} n_{\text{out}}}{2(n_{\text{in}} + n_{\text{out}})}$, which saturates at roughly $\frac{n}{4}$ for a square layer with $n_{\text{in}} = n_{\text{out}} = n$, independent of batch size.
B too small (e.g., 1–4): Matrix multiplications become memory-bound. GPU utilization <10%. Most time spent moving data, not computing.
B moderate (e.g., 32–256): Good compute-bound regime for typical layer sizes. GPU utilization 40–70%.
B large (e.g., 1024+): Near-peak utilization, but: (a) may require more epochs to converge (generalization gap), (b) requires linear scaling of learning rate (Goyal et al., 2017), (c) may exceed GPU memory.
The optimal batch size balances GPU utilization against convergence behavior. This is fundamentally a hardware-algorithm co-design problem.
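The AI formula above can be evaluated directly to see both regimes; a sketch with an illustrative 4096-wide layer:

```python
def matmul_ai(B, n_in, n_out, bytes_per_el=4):
    """Arithmetic intensity of a (B x n_in) @ (n_in x n_out) matmul."""
    flops = 2 * B * n_in * n_out
    bytes_moved = bytes_per_el * (B * n_in + n_in * n_out + B * n_out)
    return flops / bytes_moved

# Small B: AI ~ B/2; large B: AI saturates near n/4
for B in (1, 8, 64, 512, 4096):
    print(B, round(matmul_ai(B, 4096, 4096), 1))
```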
[5] Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM, 52(4), 65–76.
[4] Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018.
[6] Jia, Z. et al. (2019). Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826.
[7] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD. arXiv:1706.02677.
Inference has fundamentally different constraints from training: no backpropagation (so no activation storage), typically smaller batch sizes (often B=1 for real-time applications), and a premium on latency rather than throughput. These constraints call for different optimization strategies.
Combining multiple sequential operations into a single GPU kernel, eliminating intermediate memory reads/writes. Common fusions:
| Fused Operation | Components | Memory Savings |
|---|---|---|
| Conv-BN-ReLU | Convolution + BatchNorm + ReLU | Eliminates 2 intermediate tensors |
| Linear-GELU | FC layer + GELU activation | Eliminates 1 intermediate tensor |
| Residual-BN-ReLU-Add | BN + ReLU + skip addition | Eliminates 2 intermediate tensors |
FLOP count: Unchanged — fusion does not reduce arithmetic, only memory traffic.
Speedup: 1.2–2× for memory-bound operation chains (which includes most non-matmul operations).
At inference, BatchNorm can be fully folded into the preceding conv layer (as shown in Part IV §15.1), making BN completely free.
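BN folding is a one-line rescaling of the preceding layer's weights. A numpy sketch for a fully connected layer (per-channel folding into a conv works identically, with one scale factor per output channel):

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold inference-time BatchNorm into the preceding linear weights:
    y = gamma * (W @ x + b - mu) / sqrt(var + eps) + beta."""
    scale = gamma / np.sqrt(var + eps)          # one factor per output channel
    return W * scale[:, None], (b - mu) * scale + beta

rng = np.random.default_rng(1)
W, b = rng.normal(size=(16, 32)), rng.normal(size=(16,))
gamma, beta = rng.normal(size=(16,)), rng.normal(size=(16,))
mu, var = rng.normal(size=(16,)), rng.uniform(0.5, 2.0, size=(16,))

x = rng.normal(size=(32,))
y_unfused = gamma * (W @ x + b - mu) / np.sqrt(var + 1e-5) + beta
Wf, bf = fold_bn(W, b, gamma, beta, mu, var)
print(np.allclose(y_unfused, Wf @ x + bf))  # True
```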
Convert trained FP32 weights and activations to lower precision (INT8, INT4) for inference:
$$q = \text{round}\!\left(\frac{x - z}{s}\right), \qquad x \approx s \cdot q + z$$where $s$ is the scale factor and $z$ is the zero-point.
| Metric | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| Bytes per value | 4 | 2 | 1 | 0.5 |
| Model size (relative) | 1× | 0.5× | 0.25× | 0.125× |
| Memory bandwidth (relative) | 1× | 2× | 4× | 8× |
| Compute throughput (vs FP32) | 1× | 2–16× | 4–32× | 8–64× (emerging) |
| Typical accuracy loss | — | <0.1% | 0.1–0.5% | 0.5–3% |
INT8 quantization is the current sweet spot: 4× model compression, 4× bandwidth improvement, and <0.5% accuracy loss for most models. Frameworks like TensorRT, ONNX Runtime, and PyTorch all support INT8 inference natively.
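A minimal round-trip sketch of INT8 quantization, using the symmetric special case (zero-point $z = 0$), which is common for weights:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization (zero-point z = 0)."""
    s = np.abs(x).max() / 127.0                     # scale so max maps to 127
    q = np.clip(np.round(x / s), -128, 127).astype(np.int8)
    return q, s

rng = np.random.default_rng(2)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = s * q.astype(np.float32)                    # dequantize
rel_err = np.abs(w - w_hat).max() / np.abs(w).max()
print(f"max relative error: {rel_err:.4f}")
```

The worst-case error is half a quantization step, about $1/254$ of the tensor's range, which is why accuracy loss stays small for well-behaved weight distributions.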
Unstructured pruning: Set individual weights to zero. Sparsity of 80–95% is typical.
$$\text{FLOPs}_{\text{theoretical}} = (1 - s) \times \text{FLOPs}_{\text{dense}}$$where $s$ is sparsity fraction. But: unstructured sparsity is hard to accelerate on GPUs (irregular memory access patterns). Real speedup is often only 1.5–2× even at 90% sparsity, versus the theoretical 10×.
Structured pruning: Remove entire channels, filters, or layers. This produces dense, regular tensors that map directly to existing hardware.
$$\text{FLOPs}_{\text{structured}} = \text{FLOPs of smaller dense model}$$No special hardware support needed — equivalent to training a smaller model. Speedup is proportional to FLOP reduction.
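The contrast between the two pruning styles can be made concrete. A numpy sketch: unstructured magnitude pruning leaves a sparse matrix of the original shape, while structured channel pruning (here by L1 norm, one common criterion) yields a smaller dense matrix:

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Unstructured: zero the smallest-|w| fraction; shape is unchanged."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= thresh, W, 0.0)

def channel_prune(W, keep):
    """Structured: keep the `keep` rows (output channels) with largest
    L1 norm, producing a smaller dense matrix."""
    idx = np.argsort(np.abs(W).sum(axis=1))[-keep:]
    return W[np.sort(idx)]

rng = np.random.default_rng(3)
W = rng.normal(size=(512, 512))
Ws = magnitude_prune(W, 0.9)   # 90% zeros, same (512, 512) shape
Wc = channel_prune(W, 256)     # dense (256, 512): speedup needs no sparse kernels
print((Ws == 0).mean(), Wc.shape)
```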
Train a small "student" model to mimic the outputs (soft labels) of a large "teacher" model:
$$L_{\text{distill}} = \alpha \cdot L_{\text{CE}}(\hat{y}_{\text{student}}, y_{\text{true}}) + (1-\alpha) \cdot T^2 \cdot D_{\text{KL}}\!\left(\text{softmax}\!\left(\frac{z_{\text{teacher}}}{T}\right) \;\Big\|\; \text{softmax}\!\left(\frac{z_{\text{student}}}{T}\right)\right)$$where $T$ is the temperature (typically 3–20) and $\alpha$ balances the hard-label and soft-label terms.
| Metric | Value |
|---|---|
| Training cost | Cost of running teacher (once) + training student (standard) |
| Inference cost | Only the student model — teacher is discarded |
| Typical results | Student achieves 95–99% of teacher accuracy at 3–10× fewer FLOPs |
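A numpy sketch of the distillation loss, using the conventional KL(teacher ‖ student) direction; the logits and labels are synthetic:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(z_student, z_teacher, y_true, T=4.0, alpha=0.5):
    """Hard cross-entropy + temperature-softened KL to the teacher."""
    p_s = softmax(z_student)
    ce = -np.log(p_s[np.arange(len(y_true)), y_true]).mean()
    ps_T, pt_T = softmax(z_student / T), softmax(z_teacher / T)
    kl = (pt_T * (np.log(pt_T) - np.log(ps_T))).sum(axis=-1).mean()
    return alpha * ce + (1 - alpha) * T**2 * kl

rng = np.random.default_rng(4)
zs, zt = rng.normal(size=(8, 10)), rng.normal(size=(8, 10))
y = rng.integers(0, 10, size=8)
print(distill_loss(zs, zt, y))
```

The $T^2$ factor keeps the soft-label gradient magnitude roughly constant as the temperature changes.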
Hinton, G. et al. (2015), "Distilling the Knowledge in a Neural Network"
[8] Jacob, B. et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR 2018.
[9] Han, S., Mao, H., & Dally, W. J. (2016). Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR 2016.
[10] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. NIPS Workshop 2015.
| Scaling Dimension | Parameter Scaling | FLOP Scaling (per sample) | FLOP Scaling (training) |
|---|---|---|---|
| Width × $\alpha$ (all layers) | $\propto \alpha^2$ | $\propto \alpha^2$ | $\propto \alpha^2$ |
| Depth × $\alpha$ (add layers) | $\propto \alpha$ | $\propto \alpha$ | $\propto \alpha$ |
| Resolution × $\alpha$ (input size) | unchanged (CNN) | $\propto \alpha^2$ (spatial dimensions) | $\propto \alpha^2$ |
| Batch size × $\alpha$ | unchanged | unchanged (per sample) | unchanged (same total samples) |
| Epochs × $\alpha$ | unchanged | unchanged | $\propto \alpha$ |
Width is the most expensive dimension to scale — doubling all layer widths quadruples both parameters and FLOPs. Depth scaling is linear in both. Resolution scaling is quadratic in FLOPs but doesn't add parameters (for CNNs with weight sharing).
This is precisely why EfficientNet's compound scaling ($d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$ with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$) is designed to balance these different scaling rates.
For Transformer language models, test loss follows power laws in three variables:
$$L(N) \propto N^{-\alpha_N}, \qquad L(D) \propto D^{-\alpha_D}, \qquad L(C) \propto C^{-\alpha_C}$$where $N$ = model parameters, $D$ = dataset size (tokens), $C$ = training compute (FLOPs).
Empirically: $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.050$.
Key findings:
(1) Performance improves predictably as a power law in compute — no surprises, no plateaus (within the studied range).
(2) Given a fixed compute budget $C$, the optimal allocation is to use a large model trained on a moderate amount of data, rather than a small model trained on a lot of data ("big model, few epochs" is more efficient).
(3) The compute-optimal model size scales as $N_{\text{opt}} \propto C^{0.73}$.
Hoffmann et al. (2022) updated the Kaplan findings with the "Chinchilla" result: for compute-optimal training, the model size $N$ and dataset size $D$ should be scaled equally:
$$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$$This means the optimal number of training tokens is approximately $D \approx 20N$ (20 tokens per parameter). Many large language models before Chinchilla were undertrained — they used too few tokens for their size.
Practical implication: Given a compute budget, the Chinchilla law tells you exactly how big your model should be and how much data you need. This converts the model design problem into a pure compute-allocation problem.
Given a compute budget of $C = 10^{21}$ FLOPs:
Chinchilla-optimal:
$N \approx \sqrt{C/120} \approx \sqrt{8.3 \times 10^{18}} \approx 2.9$ billion parameters (from $C \approx 6ND$ with $D \approx 20N$, so $C \approx 120N^2$)
$D \approx 20N \approx 58$ billion tokens
$\text{Training FLOPs check: } 6ND = 6 \times 2.9 \times 10^9 \times 58 \times 10^9 \approx 1.0 \times 10^{21}$ FLOPs, matching the budget.
(The factor of 6 in $C \approx 6ND$ is the approximate FLOP cost per token per parameter in a Transformer: 2 FLOPs per parameter for the forward pass × 3 for a full training step.)
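This allocation can be computed mechanically from the two relations $C \approx 6ND$ and $D \approx 20N$:

```python
def chinchilla_allocation(C, tokens_per_param=20.0, flops_per_token_factor=6.0):
    """Compute-optimal N and D from C ~ 6*N*D with D ~ 20*N, i.e. C ~ 120*N^2."""
    N = (C / (flops_per_token_factor * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

N, D = chinchilla_allocation(1e21)
print(f"N = {N/1e9:.1f}B params, D = {D/1e9:.0f}B tokens")
```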
While the Kaplan and Chinchilla scaling laws were derived for Transformers, the fundamental principle applies to CNNs: for a fixed compute budget, there exists an optimal model size and training duration. Scaling laws provide a principled framework for answering "should I train a bigger model for fewer epochs or a smaller model for more epochs?" — a central question in computational efficiency research.
For CNNs specifically, the compound scaling principle of EfficientNet ($\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$) is a scaling law designed for the multi-dimensional CNN design space (depth × width × resolution).
[11] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
[12] Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. (Chinchilla)
[13] Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling. ICML 2019.
[14] Rosenfeld, J. S. et al. (2020). A Constructive Prediction of the Generalization Error Across Scales. ICLR 2020.
| Technique | Memory Savings | FLOP Overhead | Accuracy Impact | Complexity |
|---|---|---|---|---|
| Mixed Precision (FP16/BF16) | ~50% activations | ~0% (faster on tensor cores) | <0.1% | Low |
| Gradient Checkpointing | $O(L) \to O(\sqrt{L})$ activations | +33% | 0% | Low |
| Gradient Accumulation | $B_{\text{eff}} \to B_{\text{micro}}$ activations | 0% | 0% | Trivial |
| INT8 Quantization (inference) | 75% model + activations | 0% (faster on INT8 hw) | 0.1–0.5% | Medium |
| Pruning (structured) | Proportional to sparsity | Proportional reduction | 0.5–2% | High |
| Knowledge Distillation | Student model size | Student model FLOPs | 1–5% vs teacher | Medium |
| BN Fusion (inference) | Eliminates BN memory | Eliminates BN FLOPs | 0% | Trivial |
| Operator Fusion | Eliminates intermediates | 0% | 0% | Framework-level |
For any neural network training run, compute these five quantities:
1. Forward FLOPs per sample $F$: Sum FLOP formulas for all layers
2. Total training FLOPs: $C = 3 \times F \times N_{\text{samples}} \times E$
3. Parameter memory: $M_P = (4 + 4 + 4M_{\text{opt}}) \times P$ bytes
4. Activation memory: $M_A = \text{bytes\_per\_element} \times B \times \sum_l |A_l|$
5. Training time: $T = C / (\text{Hardware FLOP/s} \times \text{utilization})$
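The five-step checklist fits in one function. A sketch; the ResNet-50 numbers in the example (25.6M parameters, ~20M activation elements per sample, A100 at 50% utilization) are illustrative values consistent with the earlier worked examples:

```python
def training_estimate(forward_flops, P, n_samples, epochs, batch_size,
                      act_elements_per_sample, m_opt=2, bytes_per_el=4,
                      hw_flops=312e12, utilization=0.5):
    """Back-of-envelope training estimate following the five-step checklist."""
    C = 3 * forward_flops * n_samples * epochs                      # step 2
    mem_params = (4 + 4 + 4 * m_opt) * P                            # step 3
    mem_acts = bytes_per_el * batch_size * act_elements_per_sample  # step 4
    seconds = C / (hw_flops * utilization)                          # step 5
    return {"total_flops": C, "param_mem_GB": mem_params / 1e9,
            "act_mem_GB": mem_acts / 1e9, "hours": seconds / 3600}

# ResNet-50 / ImageNet / Adam / A100, B = 32
est = training_estimate(3.8e9, 25.6e6, 1.28e6, 90, 32, 20e6)
print(est)
```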
[1] Sze, V., Chen, Y., Yang, T., & Emer, J. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12), 2295–2329.
[2] Canziani, A. et al. (2017). An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678.
[3] Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.
[4] Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018.
[5] Williams, S. et al. (2009). Roofline: An Insightful Visual Performance Model. Commun. ACM, 52(4), 65–76.
[6] Jia, Z. et al. (2019). Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826.
[7] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD. arXiv:1706.02677.
[8] Jacob, B. et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR 2018.
[9] Han, S. et al. (2016). Deep Compression. ICLR 2016.
[10] Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NIPS Workshop 2015.
[11] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
[12] Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022.
[13] Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling. ICML 2019.
[14] Rosenfeld, J. S. et al. (2020). A Constructive Prediction of the Generalization Error Across Scales. ICLR 2020.