The neuron is the atomic unit of neural computation. Every MLP, CNN, RNN, and Transformer is ultimately built from neurons connected in various patterns. We begin by fully characterizing a single neuron's computation.
A neuron with $n$ inputs computes:
$$z = \sum_{i=1}^{n} w_i x_i + b = \bm{w}\T \bm{x} + b$$ $$a = \sigma(z)$$where $\bm{x} \in \R^n$ is the input vector, $\bm{w} \in \R^n$ is the weight vector, $b \in \R$ is the bias scalar, $z \in \R$ is the pre-activation (logit), $\sigma$ is the activation function, and $a \in \R$ is the output (activation).
| Metric | Value | Formula |
|---|---|---|
| Parameters | $n + 1$ | $n$ weights + 1 bias |
| Memory (params, float32) | $(n+1) \times 4$ bytes | |
| Forward FLOPs | $2n + 1 + c_\sigma$ | dot product + bias + activation |
| Backward FLOPs | $\approx 3n + c_{\sigma'}$ | $\pd{L}{w_i}$, $\pd{L}{x_i}$, $\pd{L}{b}$ |
| Memory for backprop | cache $z$ or $a$ | 1 scalar per neuron |
where $c_\sigma$ is the activation function cost (1 for ReLU, ~4 for sigmoid) and $c_{\sigma'}$ is its derivative cost.
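To make these counts concrete, here is a minimal single-neuron sketch in NumPy, assuming a ReLU activation and a scalar upstream gradient `dL_da` supplied by whatever consumes the neuron's output (all names are illustrative):

```python
import numpy as np

def neuron_forward(w, x, b):
    """One neuron: z = w.x + b (2n + 1 FLOPs), a = ReLU(z) (~1 FLOP)."""
    z = w @ x + b            # n multiplies + n adds, including the bias add
    a = max(z, 0.0)          # ReLU; any sigma could be swapped in here
    return a, z              # cache z for the backward pass

def neuron_backward(w, x, z, dL_da):
    """Gradients w.r.t. w, b, x given the upstream scalar dL/da
    (~2n multiplies; ~3n FLOPs once gradient accumulation is counted)."""
    dL_dz = dL_da * (1.0 if z > 0 else 0.0)  # ReLU' from the cached z
    dL_dw = dL_dz * x        # n multiplies
    dL_db = dL_dz            # free: dz/db = 1
    dL_dx = dL_dz * w        # n multiplies, needed to keep propagating backward
    return dL_dw, dL_db, dL_dx

rng = np.random.default_rng(0)
w, x, b = rng.normal(size=8), rng.normal(size=8), 0.1
a, z = neuron_forward(w, x, b)
dw, db, dx = neuron_backward(w, x, z, dL_da=1.0)
```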
The equation $\bm{w}\T\bm{x} + b = 0$ defines a hyperplane in $\R^n$. The weight vector $\bm{w}$ is the normal to this hyperplane, and $-b / \|\bm{w}\|$ is the signed distance from the origin to the hyperplane, measured along $\bm{w}$. The neuron's activation function then maps one side of the hyperplane to "active" and the other to "inactive" (in the case of a step function or ReLU).
A single neuron can therefore separate inputs that are linearly separable — those that can be divided by a hyperplane. This directly leads to its fundamental limitation.
Minsky and Papert (1969) proved that a single perceptron (neuron with a step activation) cannot learn the XOR function, because XOR is not linearly separable in $\R^2$. The solution requires at least one hidden layer — this motivated the development of the multilayer perceptron. The computational cost of solving XOR with an MLP is trivially small but conceptually profound: you need at minimum 2 hidden neurons plus 1 output neuron, with a total of 9 parameters.
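The XOR network is small enough to write out in full. Below is a hand-chosen 2-2-1 ReLU solution with exactly 9 parameters; this particular weight assignment follows the worked example in Goodfellow et al. (2016), Ch. 6.1, and is one of many valid solutions:

```python
import numpy as np

# 2 hidden ReLU neurons + 1 linear output neuron: 9 parameters total.
W1 = np.array([[1., 1.],
               [1., 1.]])      # hidden weights (2x2)
b1 = np.array([0., -1.])       # hidden biases
w2 = np.array([1., -2.])       # output weights
b2 = 0.0                       # output bias

def xor_mlp(x):
    h = np.maximum(W1 @ x + b1, 0.0)   # hidden layer with ReLU
    return w2 @ h + b2                 # linear output

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", xor_mlp(np.array(x, dtype=float)))
# [0, 0] -> 0.0, [0, 1] -> 1.0, [1, 0] -> 1.0, [1, 1] -> 0.0
```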
[1] Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386–408.
[2] Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
[3] Goodfellow et al. (2016). Deep Learning, Ch. 6.1: Example — Learning XOR.
An MLP with $L$ layers (not counting the input) is described by:
$n^{(0)}$ = input dimension (number of features)
$n^{(l)}$ = number of neurons in layer $l$, for $l = 1, 2, \ldots, L$
$\bm{W}^{(l)} \in \R^{n^{(l)} \times n^{(l-1)}}$ = weight matrix of layer $l$
$\bm{b}^{(l)} \in \R^{n^{(l)}}$ = bias vector of layer $l$
$\bm{z}^{(l)} \in \R^{n^{(l)}}$ = pre-activation at layer $l$: $\bm{z}^{(l)} = \bm{W}^{(l)}\bm{a}^{(l-1)} + \bm{b}^{(l)}$
$\bm{a}^{(l)} \in \R^{n^{(l)}}$ = activation at layer $l$: $\bm{a}^{(l)} = \sigma(\bm{z}^{(l)})$
$\bm{a}^{(0)} = \bm{x}$ = the input
$B$ = batch size
Layer $l$:
$$\text{params}^{(l)} = n^{(l)} \times n^{(l-1)} + n^{(l)}= n^{(l)}(n^{(l-1)} + 1)$$Total network:
$$P = \sum_{l=1}^{L} n^{(l)}(n^{(l-1)} + 1)$$Memory for all parameters (float32): $P \times 4$ bytes
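As a quick sketch, the count and its float32 footprint for any layer-size list $[n^{(0)}, n^{(1)}, \ldots, n^{(L)}]$:

```python
def count_params(sizes):
    """P = sum over layers of n_l * (n_{l-1} + 1)."""
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(sizes, sizes[1:]))

sizes = [784, 256, 128, 10]     # the MNIST example analyzed below
P = count_params(sizes)
print(P, P * 4)                 # 235146 parameters, 940584 bytes in float32
```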
We now derive the forward pass for a full batch of $B$ samples. The input is a matrix $\bm{X} \in \R^{B \times n^{(0)}}$ where each row is one sample.
Step 1 — Linear transformation:
$$\bm{Z}^{(l)} = \bm{A}^{(l-1)} \bm{W}^{(l)\top} + \bm{b}^{(l)}$$where $\bm{A}^{(l-1)} \in \R^{B \times n^{(l-1)}}$, $\bm{W}^{(l)} \in \R^{n^{(l)} \times n^{(l-1)}}$, and $\bm{b}^{(l)} \in \R^{n^{(l)}}$ is broadcast across the batch.
Result: $\bm{Z}^{(l)} \in \R^{B \times n^{(l)}}$
Step 2 — Activation:
$$\bm{A}^{(l)} = \sigma\!\left(\bm{Z}^{(l)}\right)$$Applied element-wise. Result: $\bm{A}^{(l)} \in \R^{B \times n^{(l)}}$
Initialization: $\bm{A}^{(0)} = \bm{X}$
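A minimal NumPy implementation of these two steps, assuming ReLU hidden layers and a linear (pre-softmax) output layer; weights follow the $n^{(l)} \times n^{(l-1)}$ shape convention above, hence the `W.T` in the matmul:

```python
import numpy as np

def forward(X, Ws, bs):
    """Batched forward pass. Ws[l]: (n_l, n_{l-1}), bs[l]: (n_l,).
    Returns logits plus the (A_prev, Z) caches that backprop will need."""
    A, caches = X, []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        Z = A @ W.T + b                    # (B, n_{l-1}) @ (n_{l-1}, n_l), bias broadcast
        caches.append((A, Z))              # cache layer input and pre-activation
        A = np.maximum(Z, 0.0) if l < len(Ws) - 1 else Z  # ReLU on hidden layers only
    return A, caches

# Shape check on random data with the MNIST-example dimensions (0.05 is an
# arbitrary illustrative weight scale; initialization is covered later):
rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]
Ws = [rng.normal(0, 0.05, (n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
bs = [np.zeros(n_out) for n_out in sizes[1:]]
logits, caches = forward(rng.normal(size=(32, 784)), Ws, bs)
print(logits.shape)                        # (32, 10)
```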
| Sub-operation | Computation | FLOPs | Output Shape |
|---|---|---|---|
| Matrix multiply | $\bm{A}^{(l-1)} \bm{W}^{(l)\top}$ | $2 \cdot B \cdot n^{(l-1)} \cdot n^{(l)}$ | $B \times n^{(l)}$ |
| Bias addition | $+ \bm{b}^{(l)}$ (broadcast) | $B \cdot n^{(l)}$ | $B \times n^{(l)}$ |
| Activation | $\sigma(\bm{Z}^{(l)})$ | $c_\sigma \cdot B \cdot n^{(l)}$ | $B \times n^{(l)}$ |
| Total per layer | — | $(2n^{(l-1)} + 1 + c_\sigma) \cdot B \cdot n^{(l)}$ | $B \times n^{(l)}$ |
The matrix multiply dominates: for $n^{(l-1)} \gg 1$, the total is approximately $2Bn^{(l-1)}n^{(l)}$ FLOPs.
Memory to cache for backprop: $\bm{A}^{(l-1)}$ (input to layer) and $\bm{Z}^{(l)}$ or $\bm{A}^{(l)}$ (for activation derivative). Total: $B \cdot n^{(l-1)} + B \cdot n^{(l)}$ elements per layer.
Memory (activations for backprop):
$$\text{Activation memory} = B \sum_{l=0}^{L} n^{(l)} \times \text{bytes\_per\_element}$$This is often the dominant memory cost during training, exceeding the parameter memory for large batch sizes.
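Both quantities are easy to evaluate for a given architecture. A sketch (it applies $c_\sigma$ uniformly, so it slightly underestimates a softmax output layer):

```python
def forward_flops(sizes, B, c_sigma=1):
    """(2*n_in + 1 + c_sigma) * B * n_out per layer, per the table above."""
    return sum((2 * n_in + 1 + c_sigma) * B * n_out
               for n_in, n_out in zip(sizes, sizes[1:]))

def activation_memory_bytes(sizes, B, bytes_per_element=4):
    """B * sum(n_l) cached elements, float32 by default."""
    return B * sum(sizes) * bytes_per_element

sizes, B = [784, 256, 128, 10], 32
print(f"{forward_flops(sizes, B):,}")      # ~15.0M FLOPs for the MNIST example
print(activation_memory_bytes(sizes, B))   # 150784 bytes ~= 0.14 MB
```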
Backpropagation is the heart of neural network training. We derive every gradient from first principles using the chain rule, tracking the exact computational cost of each step.
Backpropagation starts from the loss. For cross-entropy loss with softmax output:
$$\pd{L}{\bm{Z}^{(L)}} = \frac{1}{B}(\hat{\bm{Y}} - \bm{Y}) \in \R^{B \times n^{(L)}}$$Cost: $B \cdot n^{(L)}$ subtractions + $B \cdot n^{(L)}$ divisions by $B$ = $2B \cdot n^{(L)}$ FLOPs. (Often the $1/B$ is folded into the learning rate.)
At layer $l$, we receive the "upstream gradient" $\bm{\delta}^{(l)} = \pd{L}{\bm{Z}^{(l)}} \in \R^{B \times n^{(l)}}$ from the layer above. We need to compute three things:
(a) Gradient with respect to weights $\bm{W}^{(l)}$:
Since $\bm{Z}^{(l)} = \bm{A}^{(l-1)}\bm{W}^{(l)\top} + \bm{b}^{(l)}$, by the chain rule:
$$\pd{L}{\bm{W}^{(l)}} = \bm{\delta}^{(l)\top} \bm{A}^{(l-1)} \in \R^{n^{(l)} \times n^{(l-1)}}$$(the $1/B$ factor from the batch-mean loss is already contained in $\bm{\delta}^{(l)}$). This is a matrix multiplication of $\bm{\delta}^{(l)\top}$ (shape $n^{(l)} \times B$) with $\bm{A}^{(l-1)}$ (shape $B \times n^{(l-1)}$).
Cost: $2 \cdot n^{(l)} \cdot B \cdot n^{(l-1)}$ FLOPs — same order as the forward pass matmul.
(b) Gradient with respect to biases $\bm{b}^{(l)}$:
$$\pd{L}{\bm{b}^{(l)}} = \sum_{i=1}^{B} \bm{\delta}_i^{(l)} = \mathbf{1}\T \bm{\delta}^{(l)} \in \R^{n^{(l)}}$$This is a sum over the batch dimension.
Cost: $B \cdot n^{(l)}$ additions = $B \cdot n^{(l)}$ FLOPs.
(c) Gradient with respect to input $\bm{A}^{(l-1)}$ (to propagate backward):
$$\pd{L}{\bm{A}^{(l-1)}} = \bm{\delta}^{(l)} \bm{W}^{(l)} \in \R^{B \times n^{(l-1)}}$$Matrix multiplication of $\bm{\delta}^{(l)}$ (shape $B \times n^{(l)}$) with $\bm{W}^{(l)}$ (shape $n^{(l)} \times n^{(l-1)}$).
Cost: $2 \cdot B \cdot n^{(l)} \cdot n^{(l-1)}$ FLOPs — same order as the forward pass matmul.
(d) Gradient through activation function (for the next layer down):
$$\bm{\delta}^{(l-1)} = \pd{L}{\bm{Z}^{(l-1)}} = \pd{L}{\bm{A}^{(l-1)}} \odot \sigma'\!\left(\bm{Z}^{(l-1)}\right)$$Element-wise (Hadamard) product with the activation derivative.
Cost: $B \cdot n^{(l-1)}$ multiplications (+ cost of computing $\sigma'$) = $(1 + c_{\sigma'}) \cdot B \cdot n^{(l-1)}$ FLOPs.
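Putting (a) through (d) together: a minimal sketch that mirrors the forward pass above, seeded with the softmax + cross-entropy gradient $(\hat{\bm{Y}} - \bm{Y})/B$ and assuming ReLU hidden layers:

```python
def backward(Y_hat, Y, Ws, caches):
    """caches[l] = (A_prev, Z) from the forward pass; returns (dW, db) per layer."""
    delta = (Y_hat - Y) / Y.shape[0]       # dL/dZ at the output layer
    grads = []
    for l in reversed(range(len(Ws))):
        A_prev, Z = caches[l]
        dW = delta.T @ A_prev              # (a): (n_l, B) @ (B, n_{l-1}) matmul
        db = delta.sum(axis=0)             # (b): sum over the batch dimension
        grads.append((dW, db))
        if l > 0:
            dA_prev = delta @ Ws[l]        # (c): (B, n_l) @ (n_l, n_{l-1}) matmul
            delta = dA_prev * (caches[l - 1][1] > 0)  # (d): Hadamard with ReLU'(Z)
    return grads[::-1]                     # reorder to layer 1..L
```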
| Step | Operation | FLOPs |
|---|---|---|
| (a) $\pd{L}{\bm{W}^{(l)}}$ | $\bm{\delta}^{(l)\top}\bm{A}^{(l-1)}$ | $2 B n^{(l)} n^{(l-1)}$ |
| (b) $\pd{L}{\bm{b}^{(l)}}$ | $\mathbf{1}\T\bm{\delta}^{(l)}$ | $B n^{(l)}$ |
| (c) $\pd{L}{\bm{A}^{(l-1)}}$ | $\bm{\delta}^{(l)} \bm{W}^{(l)}$ | $2 B n^{(l)} n^{(l-1)}$ |
| (d) $\bm{\delta}^{(l-1)}$ | $\pd{L}{\bm{A}^{(l-1)}} \odot \sigma'(\bm{Z}^{(l-1)})$ | $(1+c_{\sigma'}) B n^{(l-1)}$ |
| Total | — | $\approx\mathbf{4Bn^{(l)}n^{(l-1)}}$ (matmul-dominated) |
Compare to the forward pass for layer $l$: $\approx 2Bn^{(l-1)}n^{(l)}$ FLOPs. The backward pass is indeed about 2× the forward pass, as predicted by automatic-differentiation theory (two matrix multiplications versus the forward pass's one).
Or equivalently: $\text{FLOPs}_{\text{total}} \approx 3 \times \text{FLOPs}_{\text{forward}}$
Peak memory during training (float32):
$$\text{Memory} = \underbrace{4P}_{\text{params}} + \underbrace{4P}_{\text{gradients}} + \underbrace{4 \cdot M_{\text{opt}} \cdot P}_{\text{optimizer}} + \underbrace{4B \sum_{l=0}^{L} n^{(l)}}_{\text{cached activations}}$$where $M_{\text{opt}} = 0$ for SGD, $1$ for SGD+momentum, $2$ for Adam.
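A quick calculator for this formula (a sketch; real frameworks add workspace buffers and allocator overhead on top of these four terms):

```python
def train_memory_bytes(sizes, B, m_opt=2, bytes_per_element=4):
    """Params + gradients + optimizer state + cached activations, in bytes."""
    P = sum(n_out * (n_in + 1) for n_in, n_out in zip(sizes, sizes[1:]))
    elements = (2 + m_opt) * P + B * sum(sizes)  # m_opt: 0 SGD, 1 momentum, 2 Adam
    return elements * bytes_per_element

print(train_memory_bytes([784, 256, 128, 10], 32) / 2**20)
# ~3.73, in line with the 3.74 MB table below (which rounds per row)
```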
Let us now compute every single metric for a concrete network. We consider a 3-layer MLP for MNIST digit classification:
Architecture: 784 → 256 → 128 → 10
$n^{(0)} = 784$ (28×28 pixels), $n^{(1)} = 256$, $n^{(2)} = 128$, $n^{(3)} = 10$ (digits 0-9)
Activation: ReLU for hidden layers, Softmax for output
Loss: Cross-entropy
Batch size: $B = 32$
| Layer | Weight Shape | Weights | Biases | Total Params |
|---|---|---|---|---|
| Layer 1 (784→256) | 256 × 784 | 200,704 | 256 | 200,960 |
| Layer 2 (256→128) | 128 × 256 | 32,768 | 128 | 32,896 |
| Layer 3 (128→10) | 10 × 128 | 1,280 | 10 | 1,290 |
| Total | | 234,752 | 394 | 235,146 |
Parameter memory (float32): $235{,}146 \times 4 = 940{,}584$ bytes $\approx$ 0.90 MB
| Layer | MatMul | Bias | Activation | Layer Total |
|---|---|---|---|---|
| Layer 1 | $2 \times 32 \times 784 \times 256 = 12{,}845{,}056$ | $32 \times 256 = 8{,}192$ | $32 \times 256 = 8{,}192$ (ReLU) | 12,861,440 |
| Layer 2 | $2 \times 32 \times 256 \times 128 = 2{,}097{,}152$ | $32 \times 128 = 4{,}096$ | $32 \times 128 = 4{,}096$ (ReLU) | 2,105,344 |
| Layer 3 | $2 \times 32 \times 128 \times 10 = 81{,}920$ | $32 \times 10 = 320$ | $5 \times 32 \times 10 = 1{,}600$ (Softmax) | 83,840 |
| Loss | — | — | cross-entropy: $2 \times 32 \times 10 = 640$ | 640 |
| Total Forward | | | | 15,051,264 ≈ 15.1M |
| Layer | (a) ∂L/∂W | (b) ∂L/∂b | (c) ∂L/∂A | (d) δ element-wise | Layer Total |
|---|---|---|---|---|---|
| Layer 3 | $2 \times 10 \times 32 \times 128 = 81{,}920$ | $32 \times 10 = 320$ | $2 \times 32 \times 10 \times 128 = 81{,}920$ | $32 \times 128 = 4{,}096$ | 168,256 |
| Layer 2 | $2 \times 128 \times 32 \times 256 = 2{,}097{,}152$ | $32 \times 128 = 4{,}096$ | $2 \times 32 \times 128 \times 256 = 2{,}097{,}152$ | $32 \times 256 = 8{,}192$ | 4,206,592 |
| Layer 1 | $2 \times 256 \times 32 \times 784 = 12{,}845{,}056$ | $32 \times 256 = 8{,}192$ | $2 \times 32 \times 256 \times 784 = 12{,}845{,}056$ | — (input layer) | 25,698,304 |
| Total Backward | | | | | 30,073,152 ≈ 30.1M |
Forward: 15.1M FLOPs. Backward: 30.1M FLOPs. Ratio: $30.1 / 15.1 \approx 2.0$
Total: $15.1 + 30.1 = 45.2$M FLOPs $\approx 3 \times 15.1$M ✓
This confirms the theoretical prediction that total training cost ≈ 3× forward pass.
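The same totals can be re-derived mechanically from the per-layer formulas, a useful habit when auditing cost tables. A quick tally using the conventions above:

```python
B, layers = 32, [(784, 256), (256, 128), (128, 10)]

fwd = sum(2*B*n_in*n_out + B*n_out for n_in, n_out in layers)  # matmuls + bias adds
fwd += B*256 + B*128        # ReLU on the two hidden layers
fwd += 5*B*10               # softmax (~5 FLOPs per class)
fwd += 2*B*10               # cross-entropy loss

bwd = sum(4*B*n_in*n_out + B*n_out for n_in, n_out in layers)  # (a)+(c) matmuls, (b) sums
bwd += B*256 + B*128        # (d) element-wise ReLU' masks

print(fwd, bwd, (fwd + bwd) / fwd)   # 15051264 30073152 ~3.0
```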
| Component | Elements | Bytes | MB |
|---|---|---|---|
| Parameters | 235,146 | 940,584 | 0.90 |
| Gradients (same size as params) | 235,146 | 940,584 | 0.90 |
| Adam optimizer state (m + v) | $2 \times 235{,}146 = 470{,}292$ | 1,881,168 | 1.79 |
| Cached activations: $\bm{A}^{(0)}$ (input) | $32 \times 784 = 25{,}088$ | 100,352 | 0.10 |
| Cached activations: $\bm{Z}^{(1)}$ or $\bm{A}^{(1)}$ | $32 \times 256 = 8{,}192$ | 32,768 | 0.03 |
| Cached activations: $\bm{Z}^{(2)}$ or $\bm{A}^{(2)}$ | $32 \times 128 = 4{,}096$ | 16,384 | 0.02 |
| Cached activations: $\bm{Z}^{(3)}$ (output logits) | $32 \times 10 = 320$ | 1,280 | ~0 |
| Total | | | ≈ 3.74 |
(1) Layer 1 dominates everything: it holds 85.5% of all parameters and accounts for about 85% of both forward and backward FLOPs. This is because it connects the high-dimensional input (784) to a wide hidden layer (256). In practice, the first layer of an MLP processing raw pixels is the bottleneck — this is precisely the problem that CNNs solve by using local connectivity.
(2) Adam quadruples parameter-related memory: storing $\bm{m}$ and $\bm{v}$ alongside the parameters and gradients means $4\times$ the parameter memory in total (params + gradients + 2× optimizer state). For this small network it's negligible, but for billion-parameter models this becomes a major constraint.
(3) Activation memory is small here because the batch size (32) and hidden sizes (256, 128) are modest. For larger networks and batch sizes, activation memory can far exceed parameter memory.
[3] Goodfellow et al. (2016). Deep Learning, Ch. 6: Deep Feedforward Networks. — The canonical reference for MLP forward/backward computation.
[4] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
[5] Nielsen, M. (2015). Neural Networks and Deep Learning, Ch. 2: How the backpropagation algorithm works. Online: neuralnetworksanddeeplearning.com.
[6] Sze, V., Chen, Y., Yang, T., & Emer, J. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12), 2295–2329.
Activation functions introduce the non-linearity that makes neural networks more powerful than a single linear transformation. Without them, any MLP would collapse to a single matrix multiplication regardless of depth: $\bm{W}^{(L)}\bm{W}^{(L-1)}\cdots\bm{W}^{(1)}\bm{x} = \bm{W}_{\text{equiv}}\bm{x}$. Each activation has distinct computational costs, gradient properties, and implications for training dynamics.
Sigmoid: $\sigma(z) = 1/(1+e^{-z})$. Output range: $(0, 1)$
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
| Metric | Value |
|---|---|
| Forward: $\sigma(z) = 1/(1+e^{-z})$ | 1 negation + 1 exp + 1 add + 1 division ≈ 4 FLOPs/element |
| Backward: $\sigma'(z) = \sigma(z)(1-\sigma(z))$ | 1 subtraction + 1 multiplication ≈ 2 FLOPs/element (if $\sigma(z)$ cached) |
| Memory for backprop | Must cache $\sigma(z)$ = $\bm{a}^{(l)}$ — 1 float per element |
| Gradient property | $\sigma'(z) \in (0, 0.25]$ — always shrinks gradients |
| Output centering | Output $\in (0,1)$, mean ≈ 0.5 — not zero-centered |
After $L$ layers with sigmoid activations, the gradient at layer 1 is multiplied by the product $\prod_{l=1}^{L}\sigma'(z^{(l)})$. Since each $\sigma' \leq 0.25$:
$$\left|\prod_{l=1}^{L} \sigma'(z^{(l)})\right| \leq 0.25^L$$For $L = 10$: $0.25^{10} \approx 10^{-6}$. For $L = 20$: $0.25^{20} \approx 10^{-12}$. Gradients vanish exponentially, making deep networks with sigmoid essentially untrainable. This was the key obstacle identified by Hochreiter (1991) and Bengio et al. (1994).
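The collapse is easy to reproduce numerically. A sketch that samples pre-activations from $\mathcal{N}(0, 1)$ (the sampled product typically falls well below even the $0.25^L$ worst-case bound):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
for L in (10, 20):
    gain = np.prod(sigmoid_prime(rng.normal(size=L)))  # one path through L layers
    print(f"L={L}: bound {0.25**L:.1e}, sampled {gain:.1e}")
```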
Tanh: $\tanh(z) = (e^z - e^{-z})/(e^z + e^{-z})$. Output range: $(-1, 1)$ — zero-centered ✓
Derivative: $\tanh'(z) = 1 - \tanh^2(z)$
| Metric | Value |
|---|---|
| Forward FLOPs | 1 negation + 2 exp + 1 subtraction + 1 addition + 1 division ≈ 6 FLOPs/element |
| Backward FLOPs | 1 square + 1 subtraction = 2 FLOPs/element (if $\tanh(z)$ cached) |
| Max derivative | $\tanh'(0) = 1$ — better than sigmoid's 0.25 |
| Gradient property | $\tanh'(z) \in (0, 1]$ — still saturates for large $|z|$ |
Tanh is more expensive than sigmoid in the forward pass (6 vs 4 FLOPs) but has better gradient flow properties (max derivative 1 vs 0.25). The zero-centered output eliminates the zig-zag gradient update problem of sigmoid. Despite this, it still suffers from vanishing gradients in deep networks.
ReLU: $\text{ReLU}(z) = \max(0, z)$. Output range: $[0, \infty)$
Derivative: $\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}$ (undefined at $z=0$, typically set to 0)
| Metric | Value |
|---|---|
| Forward FLOPs | 1 comparison + 1 conditional select ≈ 1 FLOP/element |
| Backward FLOPs | 1 comparison ≈ 1 FLOP/element |
| Memory for backprop | Only need sign of $z$: 1 bit per element (theoretically) |
| Gradient property | Exactly 1 for $z > 0$, exactly 0 for $z < 0$ — no shrinkage, no saturation |
| Sparsity | ~50% of outputs are zero (for zero-mean inputs) — natural sparsity |
Comparing to sigmoid and tanh: ReLU is 4–6× cheaper in the forward pass and requires drastically less memory for backprop (1 bit vs 32 bits to cache). But the primary advantage is gradient flow: the derivative is exactly 1 for positive inputs, meaning gradients propagate through ReLU layers without any decay. For a network with $L$ ReLU layers where all pre-activations are positive:
$$\prod_{l=1}^{L} \text{ReLU}'(z^{(l)}) = 1^L = 1$$No vanishing gradient, regardless of depth. This is the key insight of Nair & Hinton (2010) and the reason ReLU enabled training of much deeper networks.
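A sketch of the ReLU forward/backward pair, caching only the sign pattern (note that NumPy booleans occupy one byte each; `np.packbits` would be needed to reach the theoretical one bit per element):

```python
import numpy as np

def relu_forward(Z):
    mask = Z > 0                  # all the backward pass needs: 1 bit of info/element
    return Z * mask, mask

def relu_backward(dA, mask):
    return dA * mask              # derivative is exactly 1 where z > 0, else 0

Z = np.random.default_rng(0).normal(size=(32, 256))
A, mask = relu_forward(Z)
print(mask.mean())                # ~0.5: roughly half the units are active
```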
If a neuron's pre-activation $z$ becomes permanently negative (due to a large negative bias or an unlucky weight update), its output and gradient are permanently zero. The neuron is "dead" — it will never be updated again. In practice, 10–40% of neurons can die during training. This motivates the variants below.
Leaky ReLU: $f(z) = \max(\alpha z, z)$, where $\alpha$ is a fixed small constant (typically 0.01)
PReLU: $\alpha$ is a learnable parameter (one per channel or one global)
Derivative: $f'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z \leq 0 \end{cases}$
| Metric | Value |
|---|---|
| Forward FLOPs | 1 comparison + 1 conditional multiply ≈ 2 FLOPs/element |
| Backward FLOPs | 1 comparison + 1 conditional ≈ 2 FLOPs/element |
| Extra parameters (PReLU) | 1 per channel (if per-channel) or 1 total (if shared) |
| Gradient for $z < 0$ | $\alpha \neq 0$ — no dead neurons |
The small slope $\alpha$ for negative inputs ensures gradients always flow, fixing the dying ReLU problem at the cost of 1 extra FLOP per element. The overall computational overhead compared to ReLU is negligible.
ELU: $f(z) = z$ for $z > 0$, and $\alpha(e^z - 1)$ for $z \leq 0$ (typically $\alpha = 1$). Forward cost: 1 comparison + (conditionally) 1 exp + 1 subtraction + 1 multiplication ≈ 2–4 FLOPs/element
Advantage: Smooth for $z < 0$, zero-centered mean activations, no dying neurons.
Clevert, Unterthiner & Hochreiter (2016), "Fast and Accurate Deep Network Learning by Exponential Linear Units"
GELU: $\text{GELU}(z) = z\,\Phi(z)$, where $\Phi$ is the standard normal CDF. Practical approximation:
$$\text{GELU}(z) \approx 0.5z\left(1 + \tanh\!\left[\sqrt{2/\pi}\left(z + 0.044715z^3\right)\right]\right)$$Forward cost (approximate form): 1 cube + 2 multiplications + 1 addition + 1 tanh (~6 FLOPs) + 1 addition + 1 multiplication ≈ 12–14 FLOPs/element
Used in: BERT, GPT, most modern Transformers
Hendrycks & Gimpel (2016), "Gaussian Error Linear Units (GELUs)"
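The tanh approximation translates directly to code. A sketch (framework implementations that offer a tanh-approximation mode use the same constants):

```python
import numpy as np

def gelu_tanh(z):
    """Tanh approximation of GELU(z) = z * Phi(z), Hendrycks & Gimpel (2016)."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.linspace(-4.0, 4.0, 9)
print(np.round(gelu_tanh(z), 4))  # ~0 for large negative z, ~z for large positive z
```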
Swish/SiLU: $f(z) = z\,\sigma(z)$. Derivative: $f'(z) = \sigma(z) + z \cdot \sigma(z)(1 - \sigma(z)) = f(z) + \sigma(z)(1 - f(z))$
Forward cost: 1 sigmoid (~4 FLOPs) + 1 multiplication ≈ 5 FLOPs/element
Backward cost: ≈ 5 FLOPs/element
Ramachandran, Zoph & Le (2017), "Searching for Activation Functions" — found via automated search
Softmax is not technically an activation function for hidden layers — it is used at the output layer to convert logits into a probability distribution. Its full computational analysis was provided in Part I §3.2. Key recap:
Forward: $\sim 5C$ FLOPs. When fused with cross-entropy, the backward gradient simplifies to $\hat{\bm{y}} - \bm{y}$, costing only $C$ FLOPs.
| Activation | Forward FLOPs | Backward FLOPs | Cache Need | Max derivative | Vanishing? | Dead Neurons? |
|---|---|---|---|---|---|---|
| Sigmoid | 4 | 2 | $a$ (32 bit) | 0.25 | Yes ✗ | No |
| Tanh | 6 | 2 | $a$ (32 bit) | 1.0 | Yes ✗ | No |
| ReLU | 1 | 1 | sign (1 bit) | 1.0 | No ✓ | Yes ✗ |
| Leaky ReLU | 2 | 2 | sign (1 bit) | 1.0 | No ✓ | No ✓ |
| PReLU | 2 | 2 | $z$ (32 bit) | 1.0 | No ✓ | No ✓ |
| ELU | 2–4 | 2–3 | $z$ or $a$ (32 bit) | 1.0 | No ✓ | No ✓ |
| GELU | 12–14 | 12–14 | $z$ (32 bit) | ~1.08 | No ✓ | No ✓ |
| Swish/SiLU | 5 | 5 | $z$, $\sigma(z)$ (64 bit) | ~1.10 | No ✓ | No ✓ |
For computational efficiency research, the key trade-off is: ReLU is the cheapest by a wide margin (1 FLOP, 1 bit memory), but GELU and Swish often yield slightly better accuracy at 5–14× the computational cost per element. Since activation cost is typically 1–2% of total layer cost (dominated by matrix multiplication), this difference rarely matters for overall FLOP count. The gradient flow properties (avoiding vanishing gradients) have a much larger impact on total training cost because they affect the number of iterations required to converge.
[7] Nair, V. & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML 2010.
[8] Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. AISTATS 2011.
[9] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015. — PReLU and He initialization.
[10] Clevert, D., Unterthiner, T., & Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units. ICLR 2016.
[11] Hendrycks, D. & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
[12] Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv:1710.05941.
[13] Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich. — First identification of vanishing gradient.
[14] Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Trans. Neural Networks, 5(2), 157–166.
[3] Goodfellow et al. (2016). Deep Learning, Ch. 6.3: Hidden Units.
Weight initialization determines the starting point of optimization. A poor initialization can make the difference between a network that converges in 10 epochs and one that never converges at all — directly translating to orders of magnitude difference in total training FLOPs.
The fundamental question is: if the input to a layer has variance $\text{Var}(x)$, what is the variance of the output $\text{Var}(z)$ after the linear transformation $z_j = \sum_{i=1}^{n_{\text{in}}} w_{ji} x_i$?
Assuming inputs $x_i$ are i.i.d. with zero mean and variance $\text{Var}(x)$, and weights $w_{ji}$ are i.i.d. with zero mean and variance $\text{Var}(w)$, and inputs and weights are independent:
$$\text{Var}(z_j) = \text{Var}\!\left(\sum_{i=1}^{n_{\text{in}}} w_{ji} x_i\right) = \sum_{i=1}^{n_{\text{in}}} \text{Var}(w_{ji}) \cdot \text{Var}(x_i) = n_{\text{in}} \cdot \text{Var}(w) \cdot \text{Var}(x)$$For the variance to be preserved ($\text{Var}(z) = \text{Var}(x)$), we need:
$$\boxed{n_{\text{in}} \cdot \text{Var}(w) = 1 \quad \Longrightarrow \quad \text{Var}(w) = \frac{1}{n_{\text{in}}}}$$If $\text{Var}(w) > 1/n_{\text{in}}$, the variance grows at each layer: after $L$ layers, $\text{Var}(z) \propto (n \cdot \text{Var}(w))^L \to \infty$. Activations explode. If $\text{Var}(w) < 1/n_{\text{in}}$, the variance shrinks: $\text{Var}(z) \to 0$. Activations vanish. Both situations destroy training.
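The explode/vanish dichotomy is easy to observe empirically. A sketch that pushes a unit-variance batch through 30 purely linear layers at three weight scales:

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, B = 256, 30, 1000
x0 = rng.normal(size=(B, n))                 # unit-variance input

for scale in (0.5, 1.0, 2.0):                # Var(w) = scale / n
    x = x0
    for _ in range(L):
        W = rng.normal(0, np.sqrt(scale / n), size=(n, n))
        x = x @ W.T                          # linear layers only, no activation
    print(f"Var(w) = {scale}/n -> variance after {L} layers: {x.var():.3e}")
# scale 0.5 collapses toward 0, scale 1.0 stays O(1), scale 2.0 blows up (~2^30)
```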
Xavier (Glorot) initialization is designed for sigmoid and tanh activations, assuming operation in the approximately linear regime near zero (where $\tanh'(0) = 1$):
Uniform:
$$w_{ij} \sim U\!\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$$Normal:
$$w_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$The $n_{\text{in}} + n_{\text{out}}$ denominator is a compromise between preserving variance in the forward pass (which wants $\text{Var}(w) = 1/n_{\text{in}}$) and in the backward pass (which wants $1/n_{\text{out}}$): $2/(n_{\text{in}} + n_{\text{out}})$ is the harmonic mean of these two target variances.
He (Kaiming) initialization is designed for ReLU activations. ReLU zeros out approximately half of the outputs (those with $z \leq 0$), effectively halving the variance. To compensate:
$$w_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}}}\right)$$or equivalently, $\text{Var}(w) = 2/n_{\text{in}}$ — twice the Xavier variance to account for the factor-of-2 reduction from ReLU's zeroing.
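Both schemes are one-liners in NumPy. A sketch (frameworks expose these under names like `xavier_uniform_` and `kaiming_normal_`, whose defaults may differ slightly):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Glorot & Bengio (2010): Var(w) = 2 / (n_in + n_out)."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    """He et al. (2015): Var(w) = 2 / n_in, compensating for ReLU's halving."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

rng = np.random.default_rng(0)
W1 = he_normal(784, 256, rng)
print(W1.std())    # ~0.0505, matching the MNIST numbers below
```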
Initialization is a one-time cost at the start of training:
| Operation | Cost |
|---|---|
| Generate $P$ random numbers | $O(P)$ operations |
| Scale to desired variance | $O(P)$ multiplications |
| Total | $O(P)$ — negligible vs. training cost |
However, the indirect computational cost is enormous: poor initialization leads to slow or failed convergence, potentially wasting millions of times more FLOPs in unnecessary training iterations.
For our MNIST network (784→256→128→10):
Layer 1 (784→256), He init: $\text{Var}(w) = 2/784 = 0.00255$, so $\text{std}(w) = 0.0505$
Layer 1 (784→256), Xavier init: $\text{Var}(w) = 2/1040 = 0.00192$, so $\text{std}(w) = 0.0438$
Layer 2 (256→128), He init: $\text{Var}(w) = 2/256 = 0.00781$, so $\text{std}(w) = 0.0884$
Layer 3 (128→10), He init: $\text{Var}(w) = 2/128 = 0.01563$, so $\text{std}(w) = 0.125$
Layers with larger fan-in (more input connections) require smaller individual weights to prevent variance explosion.
[15] Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.
[9] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers. ICCV 2015.
[16] Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR 2014.
[3] Goodfellow et al. (2016). Deep Learning, Ch. 8.4: Parameter Initialization Strategies.
| Metric | Formula | MNIST Example (784→256, B=32) |
|---|---|---|
| Parameters | $n^{(l)}(n^{(l-1)} + 1)$ | $256 \times 785 = 200{,}960$ |
| Param memory (float32) | $\text{params} \times 4$ bytes | 803,840 bytes = 0.77 MB |
| Forward FLOPs (matmul) | $2Bn^{(l-1)}n^{(l)}$ | $2 \times 32 \times 784 \times 256 = 12.8$M |
| Forward FLOPs (total) | $(2n^{(l-1)} + 1 + c_\sigma)Bn^{(l)}$ | ≈ 12.9M |
| Backward FLOPs | $\approx 4Bn^{(l-1)}n^{(l)}$ | ≈ 25.7M |
| Forward + backward | $\approx 6Bn^{(l-1)}n^{(l)}$ | ≈ 38.5M |
| Cached activations | $Bn^{(l-1)} + Bn^{(l)}$ elements | $(32 \times 784) + (32 \times 256) = 33{,}280$ elements |
| Cached activation memory | $\text{elements} \times 4$ bytes | 133,120 bytes = 0.13 MB |
| Component | Size (elements) | Depends On |
|---|---|---|
| Parameters $\bm{\theta}$ | $P$ | Architecture only |
| Gradients $\nabla\bm{\theta}$ | $P$ | Architecture only |
| Adam first moment $\bm{m}$ | $P$ | Architecture only |
| Adam second moment $\bm{v}$ | $P$ | Architecture only |
| Cached activations | $B\sum_{l=0}^{L}n^{(l)}$ | Architecture + batch size |
| Temporary gradient tensors | $\max_l(Bn^{(l)})$ | Largest layer × batch |
| Total (Adam) | $\mathbf{4P + B\sum_{l=0}^{L} n^{(l)} + \text{temp}}$ | Architecture + batch size |
| Metric | Value |
|---|---|
| Architecture | 784 → 256 (ReLU) → 128 (ReLU) → 10 (Softmax) |
| Total parameters | 235,146 |
| Parameter memory | 0.90 MB |
| Total training memory (Adam, B=32) | 3.74 MB |
| Forward FLOPs (B=32) | 15.1M |
| Backward FLOPs (B=32) | 30.1M |
| Total per-step FLOPs | 45.2M |
| FLOPs per sample (forward) | ~471K |
| FLOPs per epoch (60K samples, B=32) | $1{,}875 \times 45.2\text{M} \approx 84.7$G FLOPs |
| Dominant cost center | Layer 1 (85% of all FLOPs) |
[1] Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386–408.
[2] Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
[4] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
[7] Nair, V. & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML 2010.
[8] Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. AISTATS 2011.
[10] Clevert, D., Unterthiner, T., & Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units. ICLR 2016.
[11] Hendrycks, D. & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
[12] Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv:1710.05941.
[13] Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich.
[14] Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Trans. Neural Networks, 5(2), 157–166.
[9] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015.
[15] Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.
[16] Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR 2014.
[3] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Ch. 6 (Deep Feedforward Networks), Ch. 8.4 (Initialization).
[5] Nielsen, M. (2015). Neural Networks and Deep Learning. Online: neuralnetworksanddeeplearning.com. Ch. 1–2.
[6] Sze, V., Chen, Y., Yang, T., & Emer, J. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE, 105(12), 2295–2329.