FLOP (Floating-Point Operation): A single arithmetic operation on floating-point numbers — one addition, one multiplication, one comparison, etc.
MAC (Multiply-Accumulate): One multiplication followed by one addition: $a \leftarrow a + b \times c$. This counts as 2 FLOPs.
FLOPs (plural): The total count of floating-point operations in a computation. (Confusingly, some papers use "FLOPs" to mean MACs.)
FLOP/s (FLOPs per second): A measure of hardware throughput, not a count. Measured in GFLOP/s (giga) or TFLOP/s (tera).
There is widespread inconsistency in the literature. When a paper reports "X GFLOPs" for a model, it may mean either $X$ billion FLOPs or $X$ billion MACs (= $2X$ billion FLOPs). Always check the definition used. In this document, we consistently use FLOPs = total floating-point operations, where one MAC = 2 FLOPs. This is the convention used by Sze et al. (2017) and most hardware specifications.
For matrix multiplication of $(m \times n) \times (n \times p)$:
Our convention: $2mnp$ FLOPs (each of the $mp$ outputs requires $n$ MACs = $2n$ FLOPs)
Some papers: $mnp$ "multiply-adds" or "MACs"
| Layer Type | FLOPs | Dominant Term |
|---|---|---|
| Fully Connected ($n_{\text{in}} \to n_{\text{out}}$) | $2 n_{\text{in}} n_{\text{out}}$ | MatMul |
| Conv2D ($C_i \to C_o$, $k \times k$) | $2 C_o H_o W_o C_i k^2$ | MatMul (via im2col) |
| Depthwise Conv ($C$, $k \times k$) | $2 C H_o W_o k^2$ | Per-channel conv |
| 1×1 Conv ($C_i \to C_o$) | $2 C_o H W C_i$ | MatMul |
| DW Separable ($C_i \to C_o$, $k \times k$) | $2 H_o W_o C_i(k^2 + C_o)$ | Pointwise dominates |
| BatchNorm ($C$ channels, spatial $H \times W$) | $\approx 7 C H W$ | Element-wise |
| ReLU ($n$ elements) | $n$ | Comparison |
| Softmax ($C$ classes) | $\approx 5C$ | Exp + division |
| Residual Addition ($n$ elements) | $n$ | Element-wise add |
| Max Pool ($k \times k$) | $(k^2-1) C H_o W_o$ | Comparisons |
| Global Average Pool ($C$ channels) | $CHW$ | Additions |
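The per-layer formulas in the table can be collected into small helper functions for quick budgeting. A minimal sketch, using our convention of 1 MAC = 2 FLOPs (the example dimensions are illustrative):

```python
def fc_flops(n_in, n_out):
    """Fully connected layer: one MAC (= 2 FLOPs) per weight."""
    return 2 * n_in * n_out

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """Standard convolution: each output element needs c_in * k^2 MACs."""
    return 2 * c_out * h_out * w_out * c_in * k * k

def depthwise_conv_flops(c, k, h_out, w_out):
    """Depthwise convolution: one k x k filter per channel."""
    return 2 * c * h_out * w_out * k * k

def dw_separable_flops(c_in, c_out, k, h_out, w_out):
    """Depthwise k x k followed by pointwise 1 x 1."""
    return 2 * h_out * w_out * c_in * (k * k + c_out)

# Example: a 3x3 conv, 256 -> 256 channels, 56x56 output
print(f"{conv2d_flops(256, 256, 3, 56, 56) / 1e9:.2f} GFLOPs")
```

Summing these helpers over every layer of a network gives the forward FLOPs per sample used throughout this section.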
For typical CNNs, the distribution of FLOPs across operation types is approximately:
| Operation Type | % of Forward FLOPs | Examples |
|---|---|---|
| Matrix Multiplication (Conv + FC) | 90–99% | Conv layers, FC layers |
| Normalization (BN/LN) | 1–5% | BatchNorm, LayerNorm |
| Activation Functions | 0.1–1% | ReLU, GELU |
| Pooling | 0.1–0.5% | MaxPool, GAP |
| Residual Connections | 0.01–0.1% | Skip addition |
| Softmax + Loss | <0.01% | Output layer |
This extreme concentration means that optimizing matrix multiplication is virtually synonymous with optimizing neural network computation.
For any neural network layer performing a linear operation $\bm{Y} = \bm{X}\bm{W}\T$ on a batch of $m$ samples:
Forward: 1 matmul → $2mn_{\text{in}}n_{\text{out}}$ FLOPs
Backward (weight gradient): $\pd{L}{\bm{W}} = \bm{\delta}\T\bm{X}$ → 1 matmul → same order FLOPs
Backward (input gradient): $\pd{L}{\bm{X}} = \bm{\delta}\bm{W}$ → 1 matmul → same order FLOPs
$$\boxed{\text{FLOPs}_{\text{backward}} \approx 2 \times \text{FLOPs}_{\text{forward}}}$$ $$\boxed{\text{FLOPs}_{\text{training step}} \approx 3 \times \text{FLOPs}_{\text{forward}}}$$This holds for every linear layer (FC, Conv, Attention) and therefore for the entire network.
$$\text{FLOPs}_{\text{total}} = 3 \times F_{\text{forward}} \times N_{\text{samples}} \times E$$where $F_{\text{forward}}$ = forward FLOPs per sample, $N_{\text{samples}}$ = dataset size, $E$ = number of epochs.
Equivalently, using total tokens/samples seen:
$$\text{FLOPs}_{\text{total}} = 3 \times F_{\text{forward}} \times D$$where $D = N_{\text{samples}} \times E$ = total samples processed during training.
Forward FLOPs per image: 3.8G (≈ 1.9G MACs)
Dataset: 1.28M images
Epochs: 90 (standard)
$$\text{FLOPs}_{\text{total}} = 3 \times 3.8 \times 10^9 \times 1.28 \times 10^6 \times 90$$ $$= 3 \times 3.8 \times 1.28 \times 90 \times 10^{15} \approx 1.31 \times 10^{18} \text{ FLOPs}$$ $$\approx 1.31 \text{ EFLOPs (exaFLOPs)}$$Training time estimate: On an NVIDIA V100 (125 TFLOP/s FP16, ~50% utilization):
$$t = \frac{1.31 \times 10^{18}}{125 \times 10^{12} \times 0.5} \approx 21{,}000 \text{ seconds} \approx 5.8 \text{ hours}$$(In practice, ~6–8 hours on 8× V100 with data parallelism, consistent with this estimate.)
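The arithmetic above reduces to a two-line estimator. The sketch below reproduces the ResNet-50 numbers; the hardware peak and 50% utilization are the assumptions stated in the text:

```python
def training_flops(forward_flops_per_sample, n_samples, epochs):
    """Total training FLOPs: backward ~= 2x forward, so ~3x forward per sample."""
    return 3 * forward_flops_per_sample * n_samples * epochs

def training_time_hours(total_flops, peak_flops_per_sec, utilization=0.5):
    """Wall-clock estimate at a given sustained fraction of peak throughput."""
    return total_flops / (peak_flops_per_sec * utilization) / 3600

# ResNet-50 on ImageNet: 3.8 GFLOPs/image, 1.28M images, 90 epochs, V100 FP16
C = training_flops(3.8e9, 1.28e6, 90)
t = training_time_hours(C, 125e12, utilization=0.5)
print(f"{C:.3g} FLOPs, {t:.1f} hours")
```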
[1] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks. Proc. IEEE, 105(12), 2295–2329.
[2] Canziani, A., Paszke, A., & Culurciello, E. (2017). An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678.
Memory is frequently the binding constraint in neural network training — not FLOPs. A model that fits on a GPU can be trained; one that doesn't fit cannot, regardless of how many FLOPs the GPU can execute. Understanding memory composition in detail is therefore critical for practical efficiency research.
| Component | Size (float32) | Depends On | Scales With |
|---|---|---|---|
| Parameters $\bm{\theta}$ | $4P$ bytes | Architecture only | Model size |
| Gradients $\nabla\bm{\theta}$ | $4P$ bytes | Architecture only | Model size |
| Optimizer state | $4M_{\text{opt}}P$ bytes | Architecture + optimizer choice | Model size |
| Activations (cached for backprop) | $4 \times \sum_l |A_l|$ bytes | Architecture + batch size | Model size × batch |
| Temporary buffers | Variable | Implementation | Largest single operation |
$M_{\text{opt}}$: 0 (SGD), 1 (Momentum), 2 (Adam)
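The parameter-related rows of the table can be computed directly. A sketch assuming float32 storage throughout; the ~25.6M parameter count used in the example is the usual figure for ResNet-50:

```python
def parameter_memory_bytes(P, optimizer="adam", bytes_per_param=4):
    """Parameters + gradients + optimizer state, all in float32 by default.

    Total = 4P (params) + 4P (grads) + 4 * M_opt * P (optimizer slots).
    """
    slots = {"sgd": 0, "momentum": 1, "adam": 2}[optimizer]  # M_opt
    return (2 + slots) * bytes_per_param * P

# ResNet-50 (~25.6M parameters) trained with Adam: 16 bytes per parameter
gb = parameter_memory_bytes(25.6e6, "adam") / 1e9
print(f"{gb:.2f} GB")
```

Note that this covers only the first three rows of the table; activations (next subsection) usually dominate.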
Activations must be stored during the forward pass and kept in memory until the corresponding layer's backward pass consumes them. For a network with $L$ layers, the activation at layer $l$ is freed only after layer $l$'s backward pass completes, so the peak activation memory is approximately the sum of all layer activations.
| Layer | What Must Be Cached | Size (per sample) |
|---|---|---|
| Conv ($C_o$, $H_o \times W_o$) | Input tensor (for weight grad) | $C_i \times H_i \times W_i$ |
| ReLU | Binary mask (1 bit per element) | $C \times H \times W / 8$ bytes (packed) |
| BatchNorm | Normalized values, $\mu$, $\sigma^2$ | $C \times H \times W + 2C$ |
| FC ($n_{\text{out}}$) | Input vector | $n_{\text{in}}$ |
| Dropout | Binary mask | $n / 8$ bytes (packed) |
| Max Pool | Argmax indices | $C \times H_o \times W_o$ (int) |
| Residual Add | Both inputs (skip + main branch) | $2 \times C \times H \times W$ |
For a deep network, these accumulate: ResNet-50 with B=32 stores ~640M float32 values ≈ 2.5 GB for activations alone.
$$M_{\text{act}} \approx 4B \sum_{l=1}^{L} n_l \text{ bytes (float32)}$$where $n_l$ is the activation tensor size (elements per sample) at layer $l$. Key observations:
Scales linearly with batch size $B$: halving $B$ halves activation memory.
Scales linearly with depth $L$: for fixed-width networks, doubling depth doubles activation memory.
Scales quadratically with resolution: doubling input resolution (e.g., 224→448) quadruples $H \times W$ at every layer → 4× activation memory. This is why high-resolution training is so memory-hungry.
Gradient checkpointing is the most important memory optimization technique for training. It trades computation for memory by discarding some activations during the forward pass and recomputing them during the backward pass.
For a network with $L$ layers, divide it into $K$ segments of $L/K$ layers each. During the forward pass, store only the activations at segment boundaries (checkpoints). During the backward pass, when processing segment $k$, recompute that segment's forward pass from the nearest checkpoint.
Optimal checkpoint strategy ($\sqrt{L}$ checkpoints):
$$K = \sqrt{L}$$ $$\text{Memory}_{\text{act}} = O(\sqrt{L}) \text{ instead of } O(L)$$ $$\text{Extra FLOPs} = +33\% \text{ (one extra forward pass per segment, amortized)}$$More precisely, the extra computation equals one additional forward pass through the entire network, meaning:
$$\text{FLOPs}_{\text{training}} = \underbrace{F}_{\text{fwd}} + \underbrace{F}_{\text{recompute}} + \underbrace{2F}_{\text{bwd}} = 4F \text{ instead of } 3F$$ $$\text{Overhead} = \frac{4F - 3F}{3F} \approx 33\%$$

| | Standard | Checkpointed (4 segments) |
|---|---|---|
| Activation memory | ~2.5 GB | ~0.6 GB |
| Training FLOPs per step | $3F$ (standard) | $4F$ (+33%) |
| Max batch size (16 GB GPU) | B ≈ 160 | B ≈ 640 |
| Effective throughput | Baseline | +300% batch size, -25% per-step speed |
The larger batch size enabled by checkpointing can actually increase throughput despite the extra computation, because larger batches utilize GPU hardware more efficiently.
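A toy cost model of the $\sqrt{L}$ strategy for a chain of $L$ identical layers. This is a sketch of the accounting only, under the simplification that peak storage is the checkpoints plus one in-flight segment:

```python
import math

def checkpoint_costs(L):
    """Compare stored activations and forward work for a chain of L layers:
    standard backprop vs sqrt(L)-segment gradient checkpointing."""
    k = max(1, round(math.sqrt(L)))     # segment length ~ sqrt(L)
    checkpoints = math.ceil(L / k)      # activations kept during forward
    return {
        "standard_stored": L,                    # every activation kept
        "checkpointed_stored": checkpoints + k,  # checkpoints + one segment
        "extra_forward_layers": L,               # ~ one extra full forward pass
    }

print(checkpoint_costs(100))  # peak storage drops from 100 to ~20
```

At $k = \sqrt{L}$ the peak $L/k + k$ is minimized at $2\sqrt{L}$, which is where the $O(\sqrt{L})$ memory claim comes from.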
[3] Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.
Store activations and compute forward/backward passes in FP16 (16-bit floating point, 2 bytes per value) while maintaining an FP32 master copy of the weights for the optimizer update. This approach:
(1) Halves activation memory (2 bytes instead of 4 per element)
(2) Doubles arithmetic throughput on tensor cores (2× FP16 FLOP/s vs FP32)
(3) Halves memory bandwidth usage (move half as many bytes)
| Component | FP32 | Mixed Precision | Savings |
|---|---|---|---|
| Parameters (master copy) | $4P$ | $4P$ (keep FP32) | 0% |
| Parameters (FP16 copy for fwd/bwd) | — | $2P$ (extra) | — |
| Gradients | $4P$ | $2P$ (FP16) | 50% |
| Adam $m$ | $4P$ | $4P$ (FP32) | 0% |
| Adam $v$ | $4P$ | $4P$ (FP32) | 0% |
| Activations | $4 \sum|A_l|$ | $2 \sum|A_l|$ (FP16) | 50% |
The main saving is in activations (which dominate). For ResNet-50: activation memory drops from ~2.5 GB to ~1.25 GB.
Total parameter memory (Adam, mixed precision): $4P + 2P + 2P + 4P + 4P = 16P$ bytes — same as FP32 Adam (the FP16 weight copy offsets the gradient savings).
FP16 has a limited dynamic range: values below the smallest normal number, $\sim 6 \times 10^{-5}$, lose precision and eventually underflow to zero. Many gradient values in deep networks fall below this threshold. Loss scaling multiplies the loss by a large constant $S$ (e.g., $S = 1024$) before backpropagation, which scales all gradients by $S$ and keeps them within FP16 range. Before the optimizer update, gradients are divided by $S$ and converted to FP32.
Dynamic loss scaling: start with a large $S$, halve it whenever a NaN/Inf gradient is detected (skipping that optimizer update), and periodically increase it again after a sustained run of overflow-free steps. Cost: $O(P)$ per step for the overflow check, which is negligible.
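The control logic of dynamic loss scaling can be sketched in a few lines. The initial scale and growth interval below are typical defaults, not values from the text:

```python
class DynamicLossScaler:
    """Sketch of dynamic loss scaling: halve S on overflow,
    grow S after a run of stable steps."""
    def __init__(self, init_scale=2.0**15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf_or_nan):
        """Returns False when the optimizer update should be skipped."""
        if found_inf_or_nan:
            self.scale /= 2          # gradients overflowed: shrink S
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2          # stable for a while: try a larger S
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
steps = [scaler.update(f) for f in (False, False, True, False, False, False)]
print(scaler.scale, steps)
```

Training code would multiply the loss by `scaler.scale` before backprop and divide the gradients by it before the update.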
Micikevicius, P. et al. (2018), "Mixed Precision Training", ICLR 2018
Run $K$ forward-backward passes with micro-batch size $B_{\text{micro}}$, accumulating gradients, then perform one optimizer update. The effective batch size is:
$$B_{\text{effective}} = K \times B_{\text{micro}}$$Memory: Only $B_{\text{micro}}$ samples in memory at once → activation memory scales with $B_{\text{micro}}$, not $B_{\text{effective}}$.
FLOPs: Identical to training with $B_{\text{effective}}$ directly (same number of forward/backward operations).
Speed: Slightly slower than true large-batch training due to sequential processing and reduced GPU parallelism, but enables arbitrary effective batch sizes on limited memory.
Suppose you need effective batch size 256 for convergence, but your GPU only fits batch size 16:
$K = 256 / 16 = 16$ accumulation steps per optimizer update.
Memory used: Same as B=16 (not B=256).
Training FLOPs: Same as B=256 (same total samples processed).
Wall-clock time: ~16× slower than 16 GPUs doing B=16 each in parallel (data parallelism), but free in terms of hardware cost.
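Why accumulation is exact: with equal-sized micro-batches, the average of micro-batch gradients equals the full-batch gradient. A numpy check with a synthetic linear model and MSE loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 8)), rng.normal(size=(256,))
w = rng.normal(size=(8,))

def grad(Xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)^2)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full batch of 256 vs. K = 16 accumulated micro-batches of 16
g_full = grad(X, y, w)
K, B = 16, 16
g_acc = sum(grad(X[i*B:(i+1)*B], y[i*B:(i+1)*B], w) for i in range(K)) / K
print(np.allclose(g_full, g_acc))  # True
```

The only caveat in practice is batch-dependent layers such as BatchNorm, whose statistics see $B_{\text{micro}}$ samples, not $B_{\text{effective}}$.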
[3] Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.
[4] Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018.
[1] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks.
FLOPs alone do not determine execution speed. Two operations with the same FLOP count can differ by 100× in actual runtime depending on how well they map to hardware. This section bridges the gap between theoretical FLOPs and real-world performance.
The arithmetic intensity (AI) of an operation is the ratio of computation to data movement:
$$\text{AI} = \frac{\text{FLOPs}}{\text{Bytes accessed from/to memory}}$$Units: FLOPs/byte. This is the fundamental quantity determining whether an operation is compute-bound or memory-bound.
Achievable performance is bounded by:
$$\text{Performance (FLOP/s)} \leq \min\left(\text{Peak compute (FLOP/s)},\;\; \text{AI} \times \text{Memory bandwidth (bytes/s)}\right)$$The "roofline" is the plot of achievable FLOP/s vs. arithmetic intensity. It rises linearly with AI (memory-bound regime) until it hits the peak compute (compute-bound regime). The crossover point is:
$$\text{AI}_{\text{ridge}} = \frac{\text{Peak FLOP/s}}{\text{Memory bandwidth}}$$Operations with $\text{AI} < \text{AI}_{\text{ridge}}$ are memory-bound; operations with $\text{AI} > \text{AI}_{\text{ridge}}$ are compute-bound.
| Operation | FLOPs | Bytes Moved | AI (FLOP/byte) | Bound |
|---|---|---|---|---|
| MatMul $(M \times K) \times (K \times N)$, large | $2MKN$ | $4(MK + KN + MN)$ | $\approx \frac{N}{6}$ (for $M=K=N$) | Compute |
| Conv 3×3 (256→256, 56×56) | $\sim$3.7G | $\sim$9M bytes | ~400 | Compute |
| ReLU (1M elements) | 1M | $4 \times 2 \times 1$M = 8M bytes | 0.125 | Memory |
| BatchNorm (1M elements) | $\sim$7M | $\sim$20M bytes | 0.35 | Memory |
| Bias addition (1M elements) | 1M | 8M bytes | 0.125 | Memory |
| Softmax (1000 classes) | ~5K | ~12K bytes | 0.42 | Memory |
For an NVIDIA A100 GPU: $\text{AI}_{\text{ridge}} = \frac{312 \text{ TFLOP/s}}{2039 \text{ GB/s}} \approx 153$ FLOP/byte. Only large matrix multiplications and convolutions exceed this threshold.
Consider two hypothetical operations, each with 1G FLOPs:
Operation A (one large matmul): AI ≈ 200 FLOP/byte → compute-bound at 312 TFLOP/s → time = 1G / 312T ≈ 3.2 μs
Operation B (1000 element-wise ops on 1M elements each): AI ≈ 0.25 FLOP/byte → memory-bound at 2039 GB/s → time = 4G bytes / 2039 GB/s ≈ 2 ms
Same FLOPs, but operation B is 625× slower because it's bandwidth-limited. This is why operator fusion (combining multiple element-wise ops into one kernel to avoid round-trips to memory) is critical for inference optimization.
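The roofline bound is easy to evaluate numerically. A sketch using the A100 peak numbers quoted above as defaults, reproducing operations A and B:

```python
def roofline_time(flops, bytes_moved, peak_flops=312e12, bw=2039e9):
    """Lower-bound execution time from the roofline model (A100 defaults)."""
    ai = flops / bytes_moved
    attainable = min(peak_flops, ai * bw)   # FLOP/s ceiling at this AI
    return flops / attainable, ai

t_a, ai_a = roofline_time(1e9, 5e6)   # one large matmul: AI = 200, compute-bound
t_b, ai_b = roofline_time(1e9, 4e9)   # 1000 element-wise kernels: AI = 0.25
print(f"A: {t_a*1e6:.1f} us (AI={ai_a:.0f}), B: {t_b*1e3:.2f} ms (AI={ai_b})")
```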
| GPU | FP32 TFLOP/s | FP16 TFLOP/s | INT8 TOPS | HBM (GB) | BW (GB/s) |
|---|---|---|---|---|---|
| V100 (2017) | 15.7 | 125 | — | 32 | 900 |
| A100 (2020) | 19.5 | 312 | 624 | 40/80 | 2,039 |
| H100 SXM (2022) | 67 | 990 | 1,979 | 80 | 3,350 |
Tensor cores are specialized hardware units that compute small matrix multiplications (e.g., 4×4×4) in a single cycle. They provide the FP16/BF16 throughput numbers above. Key: tensor cores only accelerate matrix multiplication, not element-wise operations — this is another reason matmul dominates and element-wise ops are the bottleneck.
Utilization: Real workloads typically achieve 30–60% of peak TFLOP/s due to memory transfers, kernel launch overhead, synchronization, and imperfect tiling. Well-optimized large matmuls can reach 70–85%.
| Format | Bits | Bytes | Range | Precision | Use | Relative Throughput |
|---|---|---|---|---|---|---|
| FP32 | 32 | 4 | $\pm 3.4 \times 10^{38}$ | ~7 decimal digits | Training (master weights) | 1× |
| FP16 | 16 | 2 | $\pm 6.5 \times 10^4$ | ~3.3 digits | Training (mixed precision) | 2× (V100), 16× (A100 TC) |
| BF16 | 16 | 2 | $\pm 3.4 \times 10^{38}$ | ~2.4 digits | Training (robust to overflow) | Same as FP16 |
| INT8 | 8 | 1 | $-128$ to $127$ | Integer only | Inference quantization | 4× (A100) |
| INT4 | 4 | 0.5 | $-8$ to $7$ | Very low | Weight quantization | 8× (theoretical) |
Key insight: Moving from FP32 to FP16/BF16 gives: (a) 2× memory savings for activations, (b) 2–16× compute throughput improvement, (c) 2× memory bandwidth improvement. This is why mixed precision is nearly universal for training.
BF16 vs. FP16: BF16 has the same exponent range as FP32 (no overflow risk) but lower precision. FP16 has higher precision but requires loss scaling to handle small gradients. BF16 is preferred on newer hardware (A100+) because it eliminates the need for loss scaling.
Matrix multiplication of $(\text{B} \times n_{\text{in}}) \times (n_{\text{in}} \times n_{\text{out}})$ has arithmetic intensity:
$$\text{AI} = \frac{2B \cdot n_{\text{in}} \cdot n_{\text{out}}}{4(B \cdot n_{\text{in}} + n_{\text{in}} \cdot n_{\text{out}} + B \cdot n_{\text{out}})}$$For small $B$: $\text{AI} \approx \frac{B}{2}$ — directly proportional to batch size. Doubling $B$ doubles AI, moving the operation further into the compute-bound regime.
For large $B$ (where $B \gg n$): $\text{AI} \approx \frac{n_{\text{in}} n_{\text{out}}}{2(n_{\text{in}} + n_{\text{out}})}$, which saturates at roughly $\frac{n}{4}$ for a square layer with $n_{\text{in}} = n_{\text{out}} = n$, independent of batch size.
B too small (e.g., 1–4): Matrix multiplications become memory-bound. GPU utilization <10%. Most time spent moving data, not computing.
B moderate (e.g., 32–256): Good compute-bound regime for typical layer sizes. GPU utilization 40–70%.
B large (e.g., 1024+): Near-peak utilization, but: (a) may require more epochs to converge (generalization gap), (b) requires linear scaling of learning rate (Goyal et al., 2017), (c) may exceed GPU memory.
The optimal batch size balances GPU utilization against convergence behavior. This is fundamentally a hardware-algorithm co-design problem.
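The AI formula above can be evaluated directly to see both regimes; a sketch with an illustrative 4096-wide layer:

```python
def matmul_ai(B, n_in, n_out, bytes_per_el=4):
    """Arithmetic intensity of a (B x n_in) @ (n_in x n_out) matmul."""
    flops = 2 * B * n_in * n_out
    bytes_moved = bytes_per_el * (B * n_in + n_in * n_out + B * n_out)
    return flops / bytes_moved

# Small B: AI ~ B/2; large B: AI saturates near n/4
for B in (1, 8, 64, 512, 4096):
    print(B, round(matmul_ai(B, 4096, 4096), 1))
```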
[5] Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM, 52(4), 65–76.
[4] Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018.
[6] Jia, Z. et al. (2019). Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826.
[7] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD. arXiv:1706.02677.
Inference has fundamentally different constraints from training: no backpropagation (so no activation storage), typically smaller batch sizes (often B=1 for real-time applications), and a premium on latency rather than throughput. These constraints call for different optimization strategies.
Combining multiple sequential operations into a single GPU kernel, eliminating intermediate memory reads/writes. Common fusions:
| Fused Operation | Components | Memory Savings |
|---|---|---|
| Conv-BN-ReLU | Convolution + BatchNorm + ReLU | Eliminates 2 intermediate tensors |
| Linear-GELU | FC layer + GELU activation | Eliminates 1 intermediate tensor |
| Residual-BN-ReLU-Add | BN + ReLU + skip addition | Eliminates 2 intermediate tensors |
FLOP count: Unchanged — fusion does not reduce arithmetic, only memory traffic.
Speedup: 1.2–2× for memory-bound operation chains (which includes most non-matmul operations).
At inference, BatchNorm can be fully folded into the preceding conv layer (as shown in Part IV §15.1), making BN completely free.
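BN folding is a one-line rescaling of the preceding layer's weights. A numpy sketch for a fully connected layer (per-channel folding into a conv works identically, with one scale factor per output channel):

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold inference-time BatchNorm into the preceding linear weights:
    y = gamma * (W @ x + b - mu) / sqrt(var + eps) + beta."""
    scale = gamma / np.sqrt(var + eps)          # one factor per output channel
    return W * scale[:, None], (b - mu) * scale + beta

rng = np.random.default_rng(1)
W, b = rng.normal(size=(16, 32)), rng.normal(size=(16,))
gamma, beta = rng.normal(size=(16,)), rng.normal(size=(16,))
mu, var = rng.normal(size=(16,)), rng.uniform(0.5, 2.0, size=(16,))

x = rng.normal(size=(32,))
y_unfused = gamma * (W @ x + b - mu) / np.sqrt(var + 1e-5) + beta
Wf, bf = fold_bn(W, b, gamma, beta, mu, var)
print(np.allclose(y_unfused, Wf @ x + bf))  # True
```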
Convert trained FP32 weights and activations to lower precision (INT8, INT4) for inference:
$$q = \text{round}\!\left(\frac{x - z}{s}\right), \qquad x \approx s \cdot q + z$$where $s$ is the scale factor and $z$ is the zero-point.
| Metric | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| Bytes per value | 4 | 2 | 1 | 0.5 |
| Model size (relative) | 1× | 0.5× | 0.25× | 0.125× |
| Memory bandwidth (relative) | 1× | 2× | 4× | 8× |
| Compute throughput (vs FP32) | 1× | 2–16× | 4–32× | 8–64× (emerging) |
| Typical accuracy loss | — | <0.1% | 0.1–0.5% | 0.5–3% |
INT8 quantization is the current sweet spot: 4× model compression, 4× bandwidth improvement, and <0.5% accuracy loss for most models. Frameworks like TensorRT, ONNX Runtime, and PyTorch all support INT8 inference natively.
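A minimal round-trip sketch of INT8 quantization, using the symmetric special case (zero-point $z = 0$), which is common for weights:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization (zero-point z = 0)."""
    s = np.abs(x).max() / 127.0                     # scale so max maps to 127
    q = np.clip(np.round(x / s), -128, 127).astype(np.int8)
    return q, s

rng = np.random.default_rng(2)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = s * q.astype(np.float32)                    # dequantize
rel_err = np.abs(w - w_hat).max() / np.abs(w).max()
print(f"max relative error: {rel_err:.4f}")
```

The worst-case error is half a quantization step, about $1/254$ of the tensor's range, which is why accuracy loss stays small for well-behaved weight distributions.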
Unstructured pruning: Set individual weights to zero. Sparsity of 80–95% is typical.
$$\text{FLOPs}_{\text{theoretical}} = (1 - s) \times \text{FLOPs}_{\text{dense}}$$where $s$ is sparsity fraction. But: unstructured sparsity is hard to accelerate on GPUs (irregular memory access patterns). Real speedup is often only 1.5–2× even at 90% sparsity, versus the theoretical 10×.
Structured pruning: Remove entire channels, filters, or layers. This produces dense, regular tensors that map directly to existing hardware.
$$\text{FLOPs}_{\text{structured}} = \text{FLOPs of smaller dense model}$$No special hardware support needed — equivalent to training a smaller model. Speedup is proportional to FLOP reduction.
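The contrast between the two pruning styles can be made concrete. A numpy sketch: unstructured magnitude pruning leaves a sparse matrix of the original shape, while structured channel pruning (here by L1 norm, one common criterion) yields a smaller dense matrix:

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Unstructured: zero the smallest-|w| fraction; shape is unchanged."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= thresh, W, 0.0)

def channel_prune(W, keep):
    """Structured: keep the `keep` rows (output channels) with largest
    L1 norm, producing a smaller dense matrix."""
    idx = np.argsort(np.abs(W).sum(axis=1))[-keep:]
    return W[np.sort(idx)]

rng = np.random.default_rng(3)
W = rng.normal(size=(512, 512))
Ws = magnitude_prune(W, 0.9)   # 90% zeros, same (512, 512) shape
Wc = channel_prune(W, 256)     # dense (256, 512): speedup needs no sparse kernels
print((Ws == 0).mean(), Wc.shape)
```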
Train a small "student" model to mimic the outputs (soft labels) of a large "teacher" model:
$$L_{\text{distill}} = \alpha \cdot L_{\text{CE}}(\hat{y}_{\text{student}}, y_{\text{true}}) + (1-\alpha) \cdot T^2 \cdot D_{\text{KL}}\!\left(\text{softmax}\!\left(\frac{z_{\text{teacher}}}{T}\right) \;\Big\|\; \text{softmax}\!\left(\frac{z_{\text{student}}}{T}\right)\right)$$where $T$ is the temperature (typically 3–20) and $\alpha$ balances the hard-label and soft-label terms.
| Metric | Value |
|---|---|
| Training cost | Cost of running teacher (once) + training student (standard) |
| Inference cost | Only the student model — teacher is discarded |
| Typical results | Student achieves 95–99% of teacher accuracy at 3–10× fewer FLOPs |
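A numpy sketch of the distillation loss, using the conventional KL(teacher ‖ student) direction; the logits and labels are synthetic:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(z_student, z_teacher, y_true, T=4.0, alpha=0.5):
    """Hard cross-entropy + temperature-softened KL to the teacher."""
    p_s = softmax(z_student)
    ce = -np.log(p_s[np.arange(len(y_true)), y_true]).mean()
    ps_T, pt_T = softmax(z_student / T), softmax(z_teacher / T)
    kl = (pt_T * (np.log(pt_T) - np.log(ps_T))).sum(axis=-1).mean()
    return alpha * ce + (1 - alpha) * T**2 * kl

rng = np.random.default_rng(4)
zs, zt = rng.normal(size=(8, 10)), rng.normal(size=(8, 10))
y = rng.integers(0, 10, size=8)
print(distill_loss(zs, zt, y))
```

The $T^2$ factor keeps the soft-label gradient magnitude roughly constant as the temperature changes.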
Hinton, G. et al. (2015), "Distilling the Knowledge in a Neural Network"
[8] Jacob, B. et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR 2018.
[9] Han, S., Mao, H., & Dally, W. J. (2016). Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR 2016.
[10] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. NIPS Workshop 2015.
| Scaling Dimension | Parameter Scaling | FLOP Scaling (per sample) | FLOP Scaling (training) |
|---|---|---|---|
| Width × $\alpha$ (all layers) | $\propto \alpha^2$ | $\propto \alpha^2$ | $\propto \alpha^2$ |
| Depth × $\alpha$ (add layers) | $\propto \alpha$ | $\propto \alpha$ | $\propto \alpha$ |
| Resolution × $\alpha$ (input size) | unchanged (CNN) | $\propto \alpha^2$ (spatial dimensions) | $\propto \alpha^2$ |
| Batch size × $\alpha$ | unchanged | unchanged (per sample) | unchanged (same total samples) |
| Epochs × $\alpha$ | unchanged | unchanged | $\propto \alpha$ |
Width is the most expensive dimension to scale — doubling all layer widths quadruples both parameters and FLOPs. Depth scaling is linear in both. Resolution scaling is quadratic in FLOPs but doesn't add parameters (for CNNs with weight sharing).
This is precisely why EfficientNet's compound scaling ($d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$ with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$) is designed to balance these different scaling rates.
For Transformer language models, test loss follows power laws in three variables:
$$L(N) \propto N^{-\alpha_N}, \qquad L(D) \propto D^{-\alpha_D}, \qquad L(C) \propto C^{-\alpha_C}$$where $N$ = model parameters, $D$ = dataset size (tokens), $C$ = training compute (FLOPs).
Empirically: $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.050$.
Key findings:
(1) Performance improves predictably as a power law in compute — no surprises, no plateaus (within the studied range).
(2) Given a fixed compute budget $C$, the optimal allocation is to use a large model trained on a moderate amount of data, rather than a small model trained on a lot of data ("big model, few epochs" is more efficient).
(3) The compute-optimal model size scales as $N_{\text{opt}} \propto C^{0.73}$.
Hoffmann et al. (2022) updated the Kaplan findings with the "Chinchilla" result: for compute-optimal training, the model size $N$ and dataset size $D$ should be scaled equally:
$$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$$This means the optimal number of training tokens is approximately $D \approx 20N$ (20 tokens per parameter). Many large language models before Chinchilla were undertrained — they used too few tokens for their size.
Practical implication: Given a compute budget, the Chinchilla law tells you exactly how big your model should be and how much data you need. This converts the model design problem into a pure compute-allocation problem.
Given a compute budget of $C = 10^{21}$ FLOPs:
Chinchilla-optimal:
$N \approx \sqrt{C/120} \approx \sqrt{8.3 \times 10^{18}} \approx 2.9$ billion parameters (from $C \approx 6ND$ with $D \approx 20N$, so $C \approx 120N^2$)
$D \approx 20N \approx 58$ billion tokens
$\text{Training FLOPs check: } 6ND = 6 \times 2.9 \times 10^9 \times 58 \times 10^9 \approx 1.0 \times 10^{21}$ FLOPs, matching the budget.
(The factor of 6 in $C \approx 6ND$ is the approximate FLOP cost per token per parameter in a Transformer: 2 FLOPs per parameter for the forward pass × 3 for a full training step.)
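This allocation can be computed mechanically from the two relations $C \approx 6ND$ and $D \approx 20N$:

```python
def chinchilla_allocation(C, tokens_per_param=20.0, flops_per_token_factor=6.0):
    """Compute-optimal N and D from C ~ 6*N*D with D ~ 20*N, i.e. C ~ 120*N^2."""
    N = (C / (flops_per_token_factor * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

N, D = chinchilla_allocation(1e21)
print(f"N = {N/1e9:.1f}B params, D = {D/1e9:.0f}B tokens")
```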
While the Kaplan and Chinchilla scaling laws were derived for Transformers, the fundamental principle applies to CNNs: for a fixed compute budget, there exists an optimal model size and training duration. Scaling laws provide a principled framework for answering "should I train a bigger model for fewer epochs or a smaller model for more epochs?" — a central question in computational efficiency research.
For CNNs specifically, the compound scaling principle of EfficientNet ($\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$) is a scaling law designed for the multi-dimensional CNN design space (depth × width × resolution).
[11] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
[12] Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. (Chinchilla)
[13] Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling. ICML 2019.
[14] Rosenfeld, J. S. et al. (2020). A Constructive Prediction of the Generalization Error Across Scales. ICLR 2020.
| Technique | Memory Savings | FLOP Overhead | Accuracy Impact | Complexity |
|---|---|---|---|---|
| Mixed Precision (FP16/BF16) | ~50% activations | ~0% (faster on tensor cores) | <0.1% | Low |
| Gradient Checkpointing | $O(L) \to O(\sqrt{L})$ activations | +33% | 0% | Low |
| Gradient Accumulation | $B_{\text{eff}} \to B_{\text{micro}}$ activations | 0% | 0% | Trivial |
| INT8 Quantization (inference) | 75% model + activations | 0% (faster on INT8 hw) | 0.1–0.5% | Medium |
| Pruning (structured) | Proportional to sparsity | Proportional reduction | 0.5–2% | High |
| Knowledge Distillation | Student model size | Student model FLOPs | 1–5% vs teacher | Medium |
| BN Fusion (inference) | Eliminates BN memory | Eliminates BN FLOPs | 0% | Trivial |
| Operator Fusion | Eliminates intermediates | 0% | 0% | Framework-level |
For any neural network training run, compute these five quantities:
1. Forward FLOPs per sample $F$: Sum FLOP formulas for all layers
2. Total training FLOPs: $C = 3 \times F \times N_{\text{samples}} \times E$
3. Parameter memory: $M_P = (4 + 4 + 4M_{\text{opt}}) \times P$ bytes
4. Activation memory: $M_A = \text{bytes\_per\_element} \times B \times \sum_l |A_l|$
5. Training time: $T = C / (\text{Hardware FLOP/s} \times \text{utilization})$
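The five-step checklist fits in one function. A sketch; the ResNet-50 numbers in the example (25.6M parameters, ~20M activation elements per sample, A100 at 50% utilization) are illustrative values consistent with the earlier worked examples:

```python
def training_estimate(forward_flops, P, n_samples, epochs, batch_size,
                      act_elements_per_sample, m_opt=2, bytes_per_el=4,
                      hw_flops=312e12, utilization=0.5):
    """Back-of-envelope training estimate following the five-step checklist."""
    C = 3 * forward_flops * n_samples * epochs                      # step 2
    mem_params = (4 + 4 + 4 * m_opt) * P                            # step 3
    mem_acts = bytes_per_el * batch_size * act_elements_per_sample  # step 4
    seconds = C / (hw_flops * utilization)                          # step 5
    return {"total_flops": C, "param_mem_GB": mem_params / 1e9,
            "act_mem_GB": mem_acts / 1e9, "hours": seconds / 3600}

# ResNet-50 / ImageNet / Adam / A100, B = 32
est = training_estimate(3.8e9, 25.6e6, 1.28e6, 90, 32, 20e6)
print(est)
```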
[1] Sze, V., Chen, Y., Yang, T., & Emer, J. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12), 2295–2329.
[2] Canziani, A. et al. (2017). An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678.
[3] Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.
[4] Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR 2018.
[5] Williams, S. et al. (2009). Roofline: An Insightful Visual Performance Model. Commun. ACM, 52(4), 65–76.
[6] Jia, Z. et al. (2019). Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826.
[7] Goyal, P. et al. (2017). Accurate, Large Minibatch SGD. arXiv:1706.02677.
[8] Jacob, B. et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR 2018.
[9] Han, S. et al. (2016). Deep Compression. ICLR 2016.
[10] Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NIPS Workshop 2015.
[11] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
[12] Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022.
[13] Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling. ICML 2019.
[14] Rosenfeld, J. S. et al. (2020). A Constructive Prediction of the Generalization Error Across Scales. ICLR 2020.