The neuron is the atomic unit of neural computation. Every MLP, CNN, RNN, and Transformer is ultimately built from neurons connected in various patterns. We begin by fully characterizing a single neuron's computation.
A neuron with $n$ inputs computes:
$$z = \sum_{i=1}^{n} w_i x_i + b = \bm{w}\T \bm{x} + b$$ $$a = \sigma(z)$$where $\bm{x} \in \R^n$ is the input vector, $\bm{w} \in \R^n$ is the weight vector, $b \in \R$ is the bias scalar, $z \in \R$ is the pre-activation (logit), $\sigma$ is the activation function, and $a \in \R$ is the output (activation).
| Metric | Value | Formula |
|---|---|---|
| Parameters | $n + 1$ | $n$ weights + 1 bias |
| Memory (params, float32) | $(n+1) \times 4$ bytes | |
| Forward FLOPs | $2n + 1 + c_\sigma$ | dot product + bias + activation |
| Backward FLOPs | $\approx 3n + c_{\sigma'}$ | $\pd{L}{w_i}$, $\pd{L}{x_i}$, $\pd{L}{b}$ |
| Memory for backprop | cache $z$ or $a$ | 1 scalar per neuron |
where $c_\sigma$ is the activation function cost (1 for ReLU, ~4 for sigmoid) and $c_{\sigma'}$ is its derivative cost.
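To make these counts concrete, here is a minimal single-neuron sketch in NumPy, assuming a ReLU activation and a scalar upstream gradient `dL_da` supplied by whatever consumes the neuron's output (all names are illustrative):

```python
import numpy as np

def neuron_forward(w, x, b):
    """One neuron: z = w.x + b (2n + 1 FLOPs), a = ReLU(z) (~1 FLOP)."""
    z = w @ x + b            # n multiplies + n adds, including the bias add
    a = max(z, 0.0)          # ReLU; any sigma could be swapped in here
    return a, z              # cache z for the backward pass

def neuron_backward(w, x, z, dL_da):
    """Gradients w.r.t. w, b, x given the upstream scalar dL/da
    (~2n multiplies; ~3n FLOPs once gradient accumulation is counted)."""
    dL_dz = dL_da * (1.0 if z > 0 else 0.0)  # ReLU' from the cached z
    dL_dw = dL_dz * x        # n multiplies
    dL_db = dL_dz            # free: dz/db = 1
    dL_dx = dL_dz * w        # n multiplies, needed to keep propagating backward
    return dL_dw, dL_db, dL_dx

rng = np.random.default_rng(0)
w, x, b = rng.normal(size=8), rng.normal(size=8), 0.1
a, z = neuron_forward(w, x, b)
dw, db, dx = neuron_backward(w, x, z, dL_da=1.0)
```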
The equation $\bm{w}\T\bm{x} + b = 0$ defines a hyperplane in $\R^n$. The weight vector $\bm{w}$ is the normal to this hyperplane, and $-b / \|\bm{w}\|$ is the signed distance from the origin to the hyperplane, measured along $\bm{w}$. The neuron's activation function then maps one side of the hyperplane to "active" and the other to "inactive" (in the case of a step function or ReLU).
A single neuron can therefore separate inputs that are linearly separable — those that can be divided by a hyperplane. This directly leads to its fundamental limitation.
Minsky and Papert (1969) proved that a single perceptron (neuron with a step activation) cannot learn the XOR function, because XOR is not linearly separable in $\R^2$. The solution requires at least one hidden layer — this motivated the development of the multilayer perceptron. The computational cost of solving XOR with an MLP is trivially small but conceptually profound: you need at minimum 2 hidden neurons plus 1 output neuron, with a total of 9 parameters.
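The XOR network is small enough to write out in full. Below is a hand-chosen 2-2-1 ReLU solution with exactly 9 parameters; this particular weight assignment follows the worked example in Goodfellow et al. (2016), Ch. 6.1, and is one of many valid solutions:

```python
import numpy as np

# 2 hidden ReLU neurons + 1 linear output neuron: 9 parameters total.
W1 = np.array([[1., 1.],
               [1., 1.]])      # hidden weights (2x2)
b1 = np.array([0., -1.])       # hidden biases
w2 = np.array([1., -2.])       # output weights
b2 = 0.0                       # output bias

def xor_mlp(x):
    h = np.maximum(W1 @ x + b1, 0.0)   # hidden layer with ReLU
    return w2 @ h + b2                 # linear output

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", xor_mlp(np.array(x, dtype=float)))
# [0, 0] -> 0.0, [0, 1] -> 1.0, [1, 0] -> 1.0, [1, 1] -> 0.0
```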
[1] Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386–408.
[2] Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
[3] Goodfellow et al. (2016). Deep Learning, Ch. 6.1: Example — Learning XOR.
An MLP with $L$ layers (not counting the input) is described by:
$n^{(0)}$ = input dimension (number of features)
$n^{(l)}$ = number of neurons in layer $l$, for $l = 1, 2, \ldots, L$
$\bm{W}^{(l)} \in \R^{n^{(l)} \times n^{(l-1)}}$ = weight matrix of layer $l$
$\bm{b}^{(l)} \in \R^{n^{(l)}}$ = bias vector of layer $l$
$\bm{z}^{(l)} \in \R^{n^{(l)}}$ = pre-activation at layer $l$: $\bm{z}^{(l)} = \bm{W}^{(l)}\bm{a}^{(l-1)} + \bm{b}^{(l)}$
$\bm{a}^{(l)} \in \R^{n^{(l)}}$ = activation at layer $l$: $\bm{a}^{(l)} = \sigma(\bm{z}^{(l)})$
$\bm{a}^{(0)} = \bm{x}$ = the input
$B$ = batch size
Layer $l$:
$$\text{params}^{(l)} = n^{(l)} \times n^{(l-1)} + n^{(l)}= n^{(l)}(n^{(l-1)} + 1)$$Total network:
$$P = \sum_{l=1}^{L} n^{(l)}(n^{(l-1)} + 1)$$Memory for all parameters (float32): $P \times 4$ bytes
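As a quick sketch, the count and its float32 footprint for any layer-size list $[n^{(0)}, n^{(1)}, \ldots, n^{(L)}]$:

```python
def count_params(sizes):
    """P = sum over layers of n_l * (n_{l-1} + 1)."""
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(sizes, sizes[1:]))

sizes = [784, 256, 128, 10]     # the MNIST example analyzed below
P = count_params(sizes)
print(P, P * 4)                 # 235146 parameters, 940584 bytes in float32
```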
We now derive the forward pass for a full batch of $B$ samples. The input is a matrix $\bm{X} \in \R^{B \times n^{(0)}}$ where each row is one sample.
Step 1 — Linear transformation:
$$\bm{Z}^{(l)} = \bm{A}^{(l-1)} \bm{W}^{(l)\top} + \bm{b}^{(l)}$$where $\bm{A}^{(l-1)} \in \R^{B \times n^{(l-1)}}$, $\bm{W}^{(l)} \in \R^{n^{(l)} \times n^{(l-1)}}$, and $\bm{b}^{(l)} \in \R^{n^{(l)}}$ is broadcast across the batch.
Result: $\bm{Z}^{(l)} \in \R^{B \times n^{(l)}}$
Step 2 — Activation:
$$\bm{A}^{(l)} = \sigma\!\left(\bm{Z}^{(l)}\right)$$Applied element-wise. Result: $\bm{A}^{(l)} \in \R^{B \times n^{(l)}}$
Initialization: $\bm{A}^{(0)} = \bm{X}$
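A minimal NumPy implementation of these two steps, assuming ReLU hidden layers and a linear (pre-softmax) output layer; weights follow the $n^{(l)} \times n^{(l-1)}$ shape convention above, hence the `W.T` in the matmul:

```python
import numpy as np

def forward(X, Ws, bs):
    """Batched forward pass. Ws[l]: (n_l, n_{l-1}), bs[l]: (n_l,).
    Returns logits plus the (A_prev, Z) caches that backprop will need."""
    A, caches = X, []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        Z = A @ W.T + b                    # (B, n_{l-1}) @ (n_{l-1}, n_l), bias broadcast
        caches.append((A, Z))              # cache layer input and pre-activation
        A = np.maximum(Z, 0.0) if l < len(Ws) - 1 else Z  # ReLU on hidden layers only
    return A, caches

# Shape check on random data with the MNIST-example dimensions (0.05 is an
# arbitrary illustrative weight scale; initialization is covered later):
rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]
Ws = [rng.normal(0, 0.05, (n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
bs = [np.zeros(n_out) for n_out in sizes[1:]]
logits, caches = forward(rng.normal(size=(32, 784)), Ws, bs)
print(logits.shape)                        # (32, 10)
```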
| Sub-operation | Computation | FLOPs | Output Shape |
|---|---|---|---|
| Matrix multiply | $\bm{A}^{(l-1)} \bm{W}^{(l)\top}$ | $2 \cdot B \cdot n^{(l-1)} \cdot n^{(l)}$ | $B \times n^{(l)}$ |
| Bias addition | $+ \bm{b}^{(l)}$ (broadcast) | $B \cdot n^{(l)}$ | $B \times n^{(l)}$ |
| Activation | $\sigma(\bm{Z}^{(l)})$ | $c_\sigma \cdot B \cdot n^{(l)}$ | $B \times n^{(l)}$ |
| Total per layer | — | $(2n^{(l-1)} + 1 + c_\sigma) \cdot B \cdot n^{(l)}$ | $B \times n^{(l)}$ |
The matrix multiply dominates: for $n^{(l-1)} \gg 1$, the total is approximately $2Bn^{(l-1)}n^{(l)}$ FLOPs.
Memory to cache for backprop: $\bm{A}^{(l-1)}$ (input to layer) and $\bm{Z}^{(l)}$ or $\bm{A}^{(l)}$ (for activation derivative). Total: $B \cdot n^{(l-1)} + B \cdot n^{(l)}$ elements per layer.
Memory (activations for backprop):
$$\text{Activation memory} = B \sum_{l=0}^{L} n^{(l)} \times \text{bytes\_per\_element}$$This is often the dominant memory cost during training, exceeding the parameter memory for large batch sizes.
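Both quantities are easy to evaluate for a given architecture. A sketch (it applies $c_\sigma$ uniformly, so it slightly underestimates a softmax output layer):

```python
def forward_flops(sizes, B, c_sigma=1):
    """(2*n_in + 1 + c_sigma) * B * n_out per layer, per the table above."""
    return sum((2 * n_in + 1 + c_sigma) * B * n_out
               for n_in, n_out in zip(sizes, sizes[1:]))

def activation_memory_bytes(sizes, B, bytes_per_element=4):
    """B * sum(n_l) cached elements, float32 by default."""
    return B * sum(sizes) * bytes_per_element

sizes, B = [784, 256, 128, 10], 32
print(f"{forward_flops(sizes, B):,}")      # ~15.0M FLOPs for the MNIST example
print(activation_memory_bytes(sizes, B))   # 150784 bytes ~= 0.14 MB
```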
Backpropagation is the heart of neural network training. We derive every gradient from first principles using the chain rule, tracking the exact computational cost of each step.
Backpropagation starts from the loss. For cross-entropy loss with softmax output:
$$\pd{L}{\bm{Z}^{(L)}} = \frac{1}{B}(\hat{\bm{Y}} - \bm{Y}) \in \R^{B \times n^{(L)}}$$Cost: $B \cdot n^{(L)}$ subtractions + $B \cdot n^{(L)}$ divisions by $B$ = $2B \cdot n^{(L)}$ FLOPs. (Often the $1/B$ is folded into the learning rate.)
At layer $l$, we receive the "upstream gradient" $\bm{\delta}^{(l)} = \pd{L}{\bm{Z}^{(l)}} \in \R^{B \times n^{(l)}}$ from the layer above. We need to compute three things:
(a) Gradient with respect to weights $\bm{W}^{(l)}$:
Since $\bm{Z}^{(l)} = \bm{A}^{(l-1)}\bm{W}^{(l)\top} + \bm{b}^{(l)}$, by the chain rule:
$$\pd{L}{\bm{W}^{(l)}} = \bm{\delta}^{(l)\top} \bm{A}^{(l-1)} \in \R^{n^{(l)} \times n^{(l-1)}}$$(the $1/B$ factor from the batch-mean loss is already contained in $\bm{\delta}^{(l)}$). This is a matrix multiplication of $\bm{\delta}^{(l)\top}$ (shape $n^{(l)} \times B$) with $\bm{A}^{(l-1)}$ (shape $B \times n^{(l-1)}$).
Cost: $2 \cdot n^{(l)} \cdot B \cdot n^{(l-1)}$ FLOPs — same order as the forward pass matmul.
(b) Gradient with respect to biases $\bm{b}^{(l)}$:
$$\pd{L}{\bm{b}^{(l)}} = \sum_{i=1}^{B} \bm{\delta}_i^{(l)} = \mathbf{1}\T \bm{\delta}^{(l)} \in \R^{n^{(l)}}$$This is a sum over the batch dimension.
Cost: $B \cdot n^{(l)}$ additions = $B \cdot n^{(l)}$ FLOPs.
(c) Gradient with respect to input $\bm{A}^{(l-1)}$ (to propagate backward):
$$\pd{L}{\bm{A}^{(l-1)}} = \bm{\delta}^{(l)} \bm{W}^{(l)} \in \R^{B \times n^{(l-1)}}$$Matrix multiplication of $\bm{\delta}^{(l)}$ (shape $B \times n^{(l)}$) with $\bm{W}^{(l)}$ (shape $n^{(l)} \times n^{(l-1)}$).
Cost: $2 \cdot B \cdot n^{(l)} \cdot n^{(l-1)}$ FLOPs — same order as the forward pass matmul.
(d) Gradient through activation function (for the next layer down):
$$\bm{\delta}^{(l-1)} = \pd{L}{\bm{Z}^{(l-1)}} = \pd{L}{\bm{A}^{(l-1)}} \odot \sigma'\!\left(\bm{Z}^{(l-1)}\right)$$Element-wise (Hadamard) product with the activation derivative.
Cost: $B \cdot n^{(l-1)}$ multiplications (+ cost of computing $\sigma'$) = $(1 + c_{\sigma'}) \cdot B \cdot n^{(l-1)}$ FLOPs.
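Putting (a) through (d) together: a minimal sketch that mirrors the forward pass above, seeded with the softmax + cross-entropy gradient $(\hat{\bm{Y}} - \bm{Y})/B$ and assuming ReLU hidden layers:

```python
def backward(Y_hat, Y, Ws, caches):
    """caches[l] = (A_prev, Z) from the forward pass; returns (dW, db) per layer."""
    delta = (Y_hat - Y) / Y.shape[0]       # dL/dZ at the output layer
    grads = []
    for l in reversed(range(len(Ws))):
        A_prev, Z = caches[l]
        dW = delta.T @ A_prev              # (a): (n_l, B) @ (B, n_{l-1}) matmul
        db = delta.sum(axis=0)             # (b): sum over the batch dimension
        grads.append((dW, db))
        if l > 0:
            dA_prev = delta @ Ws[l]        # (c): (B, n_l) @ (n_l, n_{l-1}) matmul
            delta = dA_prev * (caches[l - 1][1] > 0)  # (d): Hadamard with ReLU'(Z)
    return grads[::-1]                     # reorder to layer 1..L
```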
| Step | Operation | FLOPs |
|---|---|---|
| (a) $\pd{L}{\bm{W}^{(l)}}$ | $\bm{\delta}^{(l)\top}\bm{A}^{(l-1)}$ | $2 B n^{(l)} n^{(l-1)}$ |
| (b) $\pd{L}{\bm{b}^{(l)}}$ | $\mathbf{1}\T\bm{\delta}^{(l)}$ | $B n^{(l)}$ |
| (c) $\pd{L}{\bm{A}^{(l-1)}}$ | $\bm{\delta}^{(l)} \bm{W}^{(l)}$ | $2 B n^{(l)} n^{(l-1)}$ |
| (d) $\bm{\delta}^{(l-1)}$ | $\pd{L}{\bm{A}^{(l-1)}} \odot \sigma'(\bm{Z}^{(l-1)})$ | $(1+c_{\sigma'}) B n^{(l-1)}$ |
| Total | — | $\approx\mathbf{4Bn^{(l)}n^{(l-1)}}$ (matmul-dominated) |
Compare to the forward pass for layer $l$: $\approx 2Bn^{(l-1)}n^{(l)}$ FLOPs. The backward pass is indeed about 2× the forward pass, as predicted by automatic-differentiation theory (two matrix multiplications versus the forward pass's one).
Or equivalently: $\text{FLOPs}_{\text{total}} \approx 3 \times \text{FLOPs}_{\text{forward}}$
Peak memory during training (float32):
$$\text{Memory} = \underbrace{4P}_{\text{params}} + \underbrace{4P}_{\text{gradients}} + \underbrace{4 \cdot M_{\text{opt}} \cdot P}_{\text{optimizer}} + \underbrace{4B \sum_{l=0}^{L} n^{(l)}}_{\text{cached activations}}$$where $M_{\text{opt}} = 0$ for SGD, $1$ for SGD+momentum, $2$ for Adam.
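A quick calculator for this formula (a sketch; real frameworks add workspace buffers and allocator overhead on top of these four terms):

```python
def train_memory_bytes(sizes, B, m_opt=2, bytes_per_element=4):
    """Params + gradients + optimizer state + cached activations, in bytes."""
    P = sum(n_out * (n_in + 1) for n_in, n_out in zip(sizes, sizes[1:]))
    elements = (2 + m_opt) * P + B * sum(sizes)  # m_opt: 0 SGD, 1 momentum, 2 Adam
    return elements * bytes_per_element

print(train_memory_bytes([784, 256, 128, 10], 32) / 2**20)
# ~3.73, in line with the 3.74 MB table below (which rounds per row)
```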
Let us now compute every single metric for a concrete network. We consider a 3-layer MLP for MNIST digit classification:
Architecture: 784 → 256 → 128 → 10
$n^{(0)} = 784$ (28×28 pixels), $n^{(1)} = 256$, $n^{(2)} = 128$, $n^{(3)} = 10$ (digits 0-9)
Activation: ReLU for hidden layers, Softmax for output
Loss: Cross-entropy
Batch size: $B = 32$
| Layer | Weight Shape | Weights | Biases | Total Params |
|---|---|---|---|---|
| Layer 1 (784→256) | 256 × 784 | 200,704 | 256 | 200,960 |
| Layer 2 (256→128) | 128 × 256 | 32,768 | 128 | 32,896 |
| Layer 3 (128→10) | 10 × 128 | 1,280 | 10 | 1,290 |
| Total | | 234,752 | 394 | 235,146 |
Parameter memory (float32): $235{,}146 \times 4 = 940{,}584$ bytes $\approx$ 0.90 MB
| Layer | MatMul | Bias | Activation | Layer Total |
|---|---|---|---|---|
| Layer 1 | $2 \times 32 \times 784 \times 256 = 12{,}845{,}056$ | $32 \times 256 = 8{,}192$ | $32 \times 256 = 8{,}192$ (ReLU) | 12,861,440 |
| Layer 2 | $2 \times 32 \times 256 \times 128 = 2{,}097{,}152$ | $32 \times 128 = 4{,}096$ | $32 \times 128 = 4{,}096$ (ReLU) | 2,105,344 |
| Layer 3 | $2 \times 32 \times 128 \times 10 = 81{,}920$ | $32 \times 10 = 320$ | $5 \times 32 \times 10 = 1{,}600$ (Softmax) | 83,840 |
| Loss | — | — | cross-entropy: $2 \times 32 \times 10 = 640$ | 640 |
| Total Forward | | | | 15,051,264 ≈ 15.1M |
| Layer | (a) ∂L/∂W | (b) ∂L/∂b | (c) ∂L/∂A | (d) δ element-wise | Layer Total |
|---|---|---|---|---|---|
| Layer 3 | $2 \times 10 \times 32 \times 128 = 81{,}920$ | $32 \times 10 = 320$ | $2 \times 32 \times 10 \times 128 = 81{,}920$ | $32 \times 128 = 4{,}096$ | 168,256 |
| Layer 2 | $2 \times 128 \times 32 \times 256 = 2{,}097{,}152$ | $32 \times 128 = 4{,}096$ | $2 \times 32 \times 128 \times 256 = 2{,}097{,}152$ | $32 \times 256 = 8{,}192$ | 4,206,592 |
| Layer 1 | $2 \times 256 \times 32 \times 784 = 12{,}845{,}056$ | $32 \times 256 = 8{,}192$ | $2 \times 32 \times 256 \times 784 = 12{,}845{,}056$ | — (input layer) | 25,698,304 |
| Total Backward | | | | | 30,073,152 ≈ 30.1M |
Forward: 15.1M FLOPs. Backward: 30.1M FLOPs. Ratio: $30.1 / 15.1 \approx 2.0$
Total: $15.1 + 30.1 = 45.2$M FLOPs $\approx 3 \times 15.1$M ✓
This confirms the theoretical prediction that total training cost ≈ 3× forward pass.
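The same totals can be re-derived mechanically from the per-layer formulas, a useful habit when auditing cost tables. A quick tally using the conventions above:

```python
B, layers = 32, [(784, 256), (256, 128), (128, 10)]

fwd = sum(2*B*n_in*n_out + B*n_out for n_in, n_out in layers)  # matmuls + bias adds
fwd += B*256 + B*128        # ReLU on the two hidden layers
fwd += 5*B*10               # softmax (~5 FLOPs per class)
fwd += 2*B*10               # cross-entropy loss

bwd = sum(4*B*n_in*n_out + B*n_out for n_in, n_out in layers)  # (a)+(c) matmuls, (b) sums
bwd += B*256 + B*128        # (d) element-wise ReLU' masks

print(fwd, bwd, (fwd + bwd) / fwd)   # 15051264 30073152 ~3.0
```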
| Component | Elements | Bytes | MB |
|---|---|---|---|
| Parameters | 235,146 | 940,584 | 0.90 |
| Gradients (same size as params) | 235,146 | 940,584 | 0.90 |
| Adam optimizer state (m + v) | $2 \times 235{,}146 = 470{,}292$ | 1,881,168 | 1.79 |
| Cached activations: $\bm{A}^{(0)}$ (input) | $32 \times 784 = 25{,}088$ | 100,352 | 0.10 |
| Cached activations: $\bm{Z}^{(1)}$ or $\bm{A}^{(1)}$ | $32 \times 256 = 8{,}192$ | 32,768 | 0.03 |
| Cached activations: $\bm{Z}^{(2)}$ or $\bm{A}^{(2)}$ | $32 \times 128 = 4{,}096$ | 16,384 | 0.02 |
| Cached activations: $\bm{Z}^{(3)}$ (output logits) | $32 \times 10 = 320$ | 1,280 | ~0 |
| Total | | | ≈ 3.74 |
(1) Layer 1 dominates everything: it holds 85.5% of all parameters and accounts for about 85% of both forward and backward FLOPs. This is because it connects the high-dimensional input (784) to a wide hidden layer (256). In practice, the first layer of an MLP processing raw pixels is the bottleneck — this is precisely the problem that CNNs solve by using local connectivity.
(2) Adam quadruples parameter-related memory: storing $\bm{m}$ and $\bm{v}$ alongside the parameters and gradients means $4\times$ the parameter memory in total (params + gradients + 2× optimizer state). For this small network it's negligible, but for billion-parameter models this becomes a major constraint.
(3) Activation memory is small here because the batch size (32) and hidden sizes (256, 128) are modest. For larger networks and batch sizes, activation memory can far exceed parameter memory.
[3] Goodfellow et al. (2016). Deep Learning, Ch. 6: Deep Feedforward Networks. — The canonical reference for MLP forward/backward computation.
[4] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
[5] Nielsen, M. (2015). Neural Networks and Deep Learning, Ch. 2: How the backpropagation algorithm works. Online: neuralnetworksanddeeplearning.com.
[6] Sze, V., Chen, Y., Yang, T., & Emer, J. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12), 2295–2329.
Activation functions introduce the non-linearity that makes neural networks more powerful than a single linear transformation. Without them, any MLP would collapse to a single matrix multiplication regardless of depth: $\bm{W}^{(L)}\bm{W}^{(L-1)}\cdots\bm{W}^{(1)}\bm{x} = \bm{W}_{\text{equiv}}\bm{x}$. Each activation has distinct computational costs, gradient properties, and implications for training dynamics.
Sigmoid: $\sigma(z) = 1/(1+e^{-z})$. Output range: $(0, 1)$
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
| Metric | Value |
|---|---|
| Forward: $\sigma(z) = 1/(1+e^{-z})$ | 1 negation + 1 exp + 1 add + 1 division ≈ 4 FLOPs/element |
| Backward: $\sigma'(z) = \sigma(z)(1-\sigma(z))$ | 1 subtraction + 1 multiplication ≈ 2 FLOPs/element (if $\sigma(z)$ cached) |
| Memory for backprop | Must cache $\sigma(z)$ = $\bm{a}^{(l)}$ — 1 float per element |
| Gradient property | $\sigma'(z) \in (0, 0.25]$ — always shrinks gradients |
| Output centering | Output $\in (0,1)$, mean ≈ 0.5 — not zero-centered |
After $L$ layers with sigmoid activations, the gradient at layer 1 is multiplied by the product $\prod_{l=1}^{L}\sigma'(z^{(l)})$. Since each $\sigma' \leq 0.25$:
$$\left|\prod_{l=1}^{L} \sigma'(z^{(l)})\right| \leq 0.25^L$$For $L = 10$: $0.25^{10} \approx 10^{-6}$. For $L = 20$: $0.25^{20} \approx 10^{-12}$. Gradients vanish exponentially, making deep networks with sigmoid essentially untrainable. This was the key obstacle identified by Hochreiter (1991) and Bengio et al. (1994).
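The collapse is easy to reproduce numerically. A sketch that samples pre-activations from $\mathcal{N}(0, 1)$ (the sampled product typically falls well below even the $0.25^L$ worst-case bound):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
for L in (10, 20):
    gain = np.prod(sigmoid_prime(rng.normal(size=L)))  # one path through L layers
    print(f"L={L}: bound {0.25**L:.1e}, sampled {gain:.1e}")
```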
Tanh: $\tanh(z) = (e^z - e^{-z})/(e^z + e^{-z})$. Output range: $(-1, 1)$ — zero-centered ✓
Derivative: $\tanh'(z) = 1 - \tanh^2(z)$
| Metric | Value |
|---|---|
| Forward FLOPs | 1 negation + 2 exp + 1 subtraction + 1 addition + 1 division ≈ 6 FLOPs/element |
| Backward FLOPs | 1 square + 1 subtraction = 2 FLOPs/element (if $\tanh(z)$ cached) |
| Max derivative | $\tanh'(0) = 1$ — better than sigmoid's 0.25 |
| Gradient property | $\tanh'(z) \in (0, 1]$ — still saturates for large $|z|$ |
Tanh is more expensive than sigmoid in the forward pass (6 vs 4 FLOPs) but has better gradient flow properties (max derivative 1 vs 0.25). The zero-centered output eliminates the zig-zag gradient update problem of sigmoid. Despite this, it still suffers from vanishing gradients in deep networks.
ReLU: $\text{ReLU}(z) = \max(0, z)$. Output range: $[0, \infty)$
Derivative: $\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}$ (undefined at $z=0$, typically set to 0)
| Metric | Value |
|---|---|
| Forward FLOPs | 1 comparison + 1 conditional select ≈ 1 FLOP/element |
| Backward FLOPs | 1 comparison ≈ 1 FLOP/element |
| Memory for backprop | Only need sign of $z$: 1 bit per element (theoretically) |
| Gradient property | Exactly 1 for $z > 0$, exactly 0 for $z < 0$ — no shrinkage, no saturation |
| Sparsity | ~50% of outputs are zero (for zero-mean inputs) — natural sparsity |
Comparing to sigmoid and tanh: ReLU is 4–6× cheaper in the forward pass and requires drastically less memory for backprop (1 bit vs 32 bits to cache). But the primary advantage is gradient flow: the derivative is exactly 1 for positive inputs, meaning gradients propagate through ReLU layers without any decay. For a network with $L$ ReLU layers where all pre-activations are positive:
$$\prod_{l=1}^{L} \text{ReLU}'(z^{(l)}) = 1^L = 1$$No vanishing gradient, regardless of depth. This is the key insight of Nair & Hinton (2010) and the reason ReLU enabled training of much deeper networks.
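A sketch of the ReLU forward/backward pair, caching only the sign pattern (note that NumPy booleans occupy one byte each; `np.packbits` would be needed to reach the theoretical one bit per element):

```python
import numpy as np

def relu_forward(Z):
    mask = Z > 0                  # all the backward pass needs: 1 bit of info/element
    return Z * mask, mask

def relu_backward(dA, mask):
    return dA * mask              # derivative is exactly 1 where z > 0, else 0

Z = np.random.default_rng(0).normal(size=(32, 256))
A, mask = relu_forward(Z)
print(mask.mean())                # ~0.5: roughly half the units are active
```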
If a neuron's pre-activation $z$ becomes permanently negative (due to a large negative bias or an unlucky weight update), its output and gradient are permanently zero. The neuron is "dead" — it will never be updated again. In practice, 10–40% of neurons can die during training. This motivates the variants below.
Leaky ReLU: $f(z) = \max(\alpha z, z)$, where $\alpha$ is a fixed small constant (typically 0.01)
PReLU: $\alpha$ is a learnable parameter (one per channel or one global)
Derivative: $f'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z \leq 0 \end{cases}$
| Metric | Value |
|---|---|
| Forward FLOPs | 1 comparison + 1 conditional multiply ≈ 2 FLOPs/element |
| Backward FLOPs | 1 comparison + 1 conditional ≈ 2 FLOPs/element |
| Extra parameters (PReLU) | 1 per channel (if per-channel) or 1 total (if shared) |
| Gradient for $z < 0$ | $\alpha \neq 0$ — no dead neurons |
The small slope $\alpha$ for negative inputs ensures gradients always flow, fixing the dying ReLU problem at the cost of 1 extra FLOP per element. The overall computational overhead compared to ReLU is negligible.
ELU: $f(z) = z$ for $z > 0$, and $\alpha(e^z - 1)$ for $z \leq 0$ (typically $\alpha = 1$). Forward cost: 1 comparison + (conditionally) 1 exp + 1 subtraction + 1 multiplication ≈ 2–4 FLOPs/element
Advantage: Smooth for $z < 0$, zero-centered mean activations, no dying neurons.
Clevert, Unterthiner & Hochreiter (2016), "Fast and Accurate Deep Network Learning by Exponential Linear Units"
GELU: $\text{GELU}(z) = z\,\Phi(z)$, where $\Phi$ is the standard normal CDF. Practical approximation:
$$\text{GELU}(z) \approx 0.5z\left(1 + \tanh\!\left[\sqrt{2/\pi}\left(z + 0.044715z^3\right)\right]\right)$$Forward cost (approximate form): 1 cube + 2 multiplications + 1 addition + 1 tanh (~6 FLOPs) + 1 addition + 1 multiplication ≈ 12–14 FLOPs/element
Used in: BERT, GPT, most modern Transformers
Hendrycks & Gimpel (2016), "Gaussian Error Linear Units (GELUs)"
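The tanh approximation translates directly to code. A sketch (framework implementations that offer a tanh-approximation mode use the same constants):

```python
import numpy as np

def gelu_tanh(z):
    """Tanh approximation of GELU(z) = z * Phi(z), Hendrycks & Gimpel (2016)."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.linspace(-4.0, 4.0, 9)
print(np.round(gelu_tanh(z), 4))  # ~0 for large negative z, ~z for large positive z
```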
Swish/SiLU: $f(z) = z\,\sigma(z)$. Derivative: $f'(z) = \sigma(z) + z \cdot \sigma(z)(1 - \sigma(z)) = f(z) + \sigma(z)(1 - f(z))$
Forward cost: 1 sigmoid (~4 FLOPs) + 1 multiplication ≈ 5 FLOPs/element
Backward cost: ≈ 5 FLOPs/element
Ramachandran, Zoph & Le (2017), "Searching for Activation Functions" — found via automated search
Softmax is not technically an activation function for hidden layers — it is used at the output layer to convert logits into a probability distribution. Its full computational analysis was provided in Part I §3.2. Key recap:
Forward: $\sim 5C$ FLOPs. When fused with cross-entropy, the backward gradient simplifies to $\hat{\bm{y}} - \bm{y}$, costing only $C$ FLOPs.
| Activation | Forward FLOPs | Backward FLOPs | Cache Need | Max derivative | Vanishing? | Dead Neurons? |
|---|---|---|---|---|---|---|
| Sigmoid | 4 | 2 | $a$ (32 bit) | 0.25 | Yes ✗ | No |
| Tanh | 6 | 2 | $a$ (32 bit) | 1.0 | Yes ✗ | No |
| ReLU | 1 | 1 | sign (1 bit) | 1.0 | No ✓ | Yes ✗ |
| Leaky ReLU | 2 | 2 | sign (1 bit) | 1.0 | No ✓ | No ✓ |
| PReLU | 2 | 2 | $z$ (32 bit) | 1.0 | No ✓ | No ✓ |
| ELU | 2–4 | 2–3 | $z$ or $a$ (32 bit) | 1.0 | No ✓ | No ✓ |
| GELU | 12–14 | 12–14 | $z$ (32 bit) | ~1.08 | No ✓ | No ✓ |
| Swish/SiLU | 5 | 5 | $z$, $\sigma(z)$ (64 bit) | ~1.10 | No ✓ | No ✓ |
For computational efficiency research, the key trade-off is: ReLU is the cheapest by a wide margin (1 FLOP, 1 bit memory), but GELU and Swish often yield slightly better accuracy at 5–14× the computational cost per element. Since activation cost is typically 1–2% of total layer cost (dominated by matrix multiplication), this difference rarely matters for overall FLOP count. The gradient flow properties (avoiding vanishing gradients) have a much larger impact on total training cost because they affect the number of iterations required to converge.
[7] Nair, V. & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML 2010.
[8] Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. AISTATS 2011.
[9] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015. — PReLU and He initialization.
[10] Clevert, D., Unterthiner, T., & Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units. ICLR 2016.
[11] Hendrycks, D. & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
[12] Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv:1710.05941.
[13] Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich. — First identification of vanishing gradient.
[14] Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Trans. Neural Networks, 5(2), 157–166.
[3] Goodfellow et al. (2016). Deep Learning, Ch. 6.3: Hidden Units.
Weight initialization determines the starting point of optimization. A poor initialization can make the difference between a network that converges in 10 epochs and one that never converges at all — directly translating to orders of magnitude difference in total training FLOPs.
The fundamental question is: if the input to a layer has variance $\text{Var}(x)$, what is the variance of the output $\text{Var}(z)$ after the linear transformation $z_j = \sum_{i=1}^{n_{\text{in}}} w_{ji} x_i$?
Assuming inputs $x_i$ are i.i.d. with zero mean and variance $\text{Var}(x)$, and weights $w_{ji}$ are i.i.d. with zero mean and variance $\text{Var}(w)$, and inputs and weights are independent:
$$\text{Var}(z_j) = \text{Var}\!\left(\sum_{i=1}^{n_{\text{in}}} w_{ji} x_i\right) = \sum_{i=1}^{n_{\text{in}}} \text{Var}(w_{ji}) \cdot \text{Var}(x_i) = n_{\text{in}} \cdot \text{Var}(w) \cdot \text{Var}(x)$$For the variance to be preserved ($\text{Var}(z) = \text{Var}(x)$), we need:
$$\boxed{n_{\text{in}} \cdot \text{Var}(w) = 1 \quad \Longrightarrow \quad \text{Var}(w) = \frac{1}{n_{\text{in}}}}$$If $\text{Var}(w) > 1/n_{\text{in}}$, the variance grows at each layer: after $L$ layers, $\text{Var}(z) \propto (n \cdot \text{Var}(w))^L \to \infty$. Activations explode. If $\text{Var}(w) < 1/n_{\text{in}}$, the variance shrinks: $\text{Var}(z) \to 0$. Activations vanish. Both situations destroy training.
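The explode/vanish dichotomy is easy to observe empirically. A sketch that pushes a unit-variance batch through 30 purely linear layers at three weight scales:

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, B = 256, 30, 1000
x0 = rng.normal(size=(B, n))                 # unit-variance input

for scale in (0.5, 1.0, 2.0):                # Var(w) = scale / n
    x = x0
    for _ in range(L):
        W = rng.normal(0, np.sqrt(scale / n), size=(n, n))
        x = x @ W.T                          # linear layers only, no activation
    print(f"Var(w) = {scale}/n -> variance after {L} layers: {x.var():.3e}")
# scale 0.5 collapses toward 0, scale 1.0 stays O(1), scale 2.0 blows up (~2^30)
```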
Xavier (Glorot) initialization is designed for sigmoid and tanh activations, assuming operation in the approximately linear regime near zero (where $\tanh'(0) = 1$):
Uniform:
$$w_{ij} \sim U\!\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$$Normal:
$$w_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$The $n_{\text{in}} + n_{\text{out}}$ denominator is a compromise between preserving variance in the forward pass (which wants $\text{Var}(w) = 1/n_{\text{in}}$) and in the backward pass (which wants $1/n_{\text{out}}$): $2/(n_{\text{in}} + n_{\text{out}})$ is the harmonic mean of these two target variances.
He (Kaiming) initialization is designed for ReLU activations. ReLU zeros out approximately half of the outputs (those with $z \leq 0$), effectively halving the variance. To compensate:
$$w_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}}}\right)$$or equivalently, $\text{Var}(w) = 2/n_{\text{in}}$ — twice the Xavier variance to account for the factor-of-2 reduction from ReLU's zeroing.
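Both schemes are one-liners in NumPy. A sketch (frameworks expose these under names like `xavier_uniform_` and `kaiming_normal_`, whose defaults may differ slightly):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Glorot & Bengio (2010): Var(w) = 2 / (n_in + n_out)."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    """He et al. (2015): Var(w) = 2 / n_in, compensating for ReLU's halving."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

rng = np.random.default_rng(0)
W1 = he_normal(784, 256, rng)
print(W1.std())    # ~0.0505, matching the MNIST numbers below
```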
Initialization is a one-time cost at the start of training:
| Operation | Cost |
|---|---|
| Generate $P$ random numbers | $O(P)$ operations |
| Scale to desired variance | $O(P)$ multiplications |
| Total | $O(P)$ — negligible vs. training cost |
However, the indirect computational cost is enormous: poor initialization leads to slow or failed convergence, potentially wasting millions of times more FLOPs in unnecessary training iterations.
For our MNIST network (784→256→128→10):
Layer 1 (784→256), He init: $\text{Var}(w) = 2/784 = 0.00255$, so $\text{std}(w) = 0.0505$
Layer 1 (784→256), Xavier init: $\text{Var}(w) = 2/1040 = 0.00192$, so $\text{std}(w) = 0.0438$
Layer 2 (256→128), He init: $\text{Var}(w) = 2/256 = 0.00781$, so $\text{std}(w) = 0.0884$
Layer 3 (128→10), He init: $\text{Var}(w) = 2/128 = 0.01563$, so $\text{std}(w) = 0.125$
Layers with larger fan-in (more input connections) require smaller individual weights to prevent variance explosion.
[15] Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.
[9] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers. ICCV 2015.
[16] Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR 2014.
[3] Goodfellow et al. (2016). Deep Learning, Ch. 8.4: Parameter Initialization Strategies.
| Metric | Formula | MNIST Example (784→256, B=32) |
|---|---|---|
| Parameters | $n^{(l)}(n^{(l-1)} + 1)$ | $256 \times 785 = 200{,}960$ |
| Param memory (float32) | $\text{params} \times 4$ bytes | 803,840 bytes = 0.77 MB |
| Forward FLOPs (matmul) | $2Bn^{(l-1)}n^{(l)}$ | $2 \times 32 \times 784 \times 256 = 12.8$M |
| Forward FLOPs (total) | $(2n^{(l-1)} + 1 + c_\sigma)Bn^{(l)}$ | ≈ 12.9M |
| Backward FLOPs | $\approx 4Bn^{(l-1)}n^{(l)}$ | ≈ 25.7M |
| Forward + backward | $\approx 6Bn^{(l-1)}n^{(l)}$ | ≈ 38.5M |
| Cached activations | $Bn^{(l-1)} + Bn^{(l)}$ elements | $(32 \times 784) + (32 \times 256) = 33{,}280$ elements |
| Cached activation memory | $\text{elements} \times 4$ bytes | 133,120 bytes = 0.13 MB |
| Component | Size (elements) | Depends On |
|---|---|---|
| Parameters $\bm{\theta}$ | $P$ | Architecture only |
| Gradients $\nabla\bm{\theta}$ | $P$ | Architecture only |
| Adam first moment $\bm{m}$ | $P$ | Architecture only |
| Adam second moment $\bm{v}$ | $P$ | Architecture only |
| Cached activations | $B\sum_{l=0}^{L}n^{(l)}$ | Architecture + batch size |
| Temporary gradient tensors | $\max_l(Bn^{(l)})$ | Largest layer × batch |
| Total (Adam) | $\mathbf{4P + B\sum_{l=0}^{L} n^{(l)} + \text{temp}}$ | Architecture + batch size |
| Metric | Value |
|---|---|
| Architecture | 784 → 256 (ReLU) → 128 (ReLU) → 10 (Softmax) |
| Total parameters | 235,146 |
| Parameter memory | 0.90 MB |
| Total training memory (Adam, B=32) | 3.74 MB |
| Forward FLOPs (B=32) | 15.1M |
| Backward FLOPs (B=32) | 30.1M |
| Total per-step FLOPs | 45.2M |
| FLOPs per sample (forward) | ~471K |
| FLOPs per epoch (60K samples, B=32) | $1{,}875 \times 45.2\text{M} \approx 84.7$G FLOPs |
| Dominant cost center | Layer 1 (85% of all FLOPs) |
[1] Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386–408.
[2] Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
[4] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
[7] Nair, V. & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML 2010.
[8] Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. AISTATS 2011.
[10] Clevert, D., Unterthiner, T., & Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units. ICLR 2016.
[11] Hendrycks, D. & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
[12] Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv:1710.05941.
[13] Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich.
[14] Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Trans. Neural Networks, 5(2), 157–166.
[9] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015.
[15] Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.
[16] Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR 2014.
[3] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Ch. 6 (Deep Feedforward Networks), Ch. 8.4 (Initialization).
[5] Nielsen, M. (2015). Neural Networks and Deep Learning. Online: neuralnetworksanddeeplearning.com. Ch. 1–2.
[6] Sze, V., Chen, Y., Yang, T., & Emer, J. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE, 105(12), 2295–2329.