Part I of VI

Mathematical Foundations for Deep Neural Networks

A research-depth treatment of the linear algebra, calculus, and probability theory underlying neural network computation — with full computational cost analysis

Contents

§1 Linear Algebra for Neural Networks
§2 Calculus and Automatic Differentiation
§3 Probability and Information Theory
§ Computational Cost Summary — All Part I Operations
§ Complete References

§1 Linear Algebra for Neural Networks

Every computation in a neural network — from the forward pass to backpropagation to the optimizer update — ultimately reduces to linear algebra operations on multi-dimensional arrays of numbers. A thorough understanding of these operations and their computational costs is the foundation upon which all efficiency analysis rests.

1.1 Scalars, Vectors, Matrices, and Tensors

Definitions

A scalar is a single number, $x \in \R$. A vector is an ordered array of $n$ scalars, $\bm{x} \in \R^n$. A matrix is a 2D array of scalars, $\bm{A} \in \R^{m \times n}$, with $m$ rows and $n$ columns. A tensor is the generalization to any number of dimensions (or rank): a rank-$k$ tensor has $k$ indices.

In deep learning, the fundamental data structure is the tensor. Every input, output, weight, gradient, and activation in a neural network is stored as a tensor. The shape of a tensor fully determines its memory footprint:

Memory Cost

For a tensor with shape $(d_1, d_2, \ldots, d_k)$ stored in a given numerical precision:

$$\text{Memory (bytes)} = \left(\prod_{i=1}^{k} d_i\right) \times \text{bytes\_per\_element}$$

Common precisions: float32 = 4 bytes, float16 / bfloat16 = 2 bytes, float64 = 8 bytes, int8 = 1 byte.

Worked Example — Tensor Memory

Consider a batch of 32 RGB images of size 224×224, represented as a tensor $\bm{X} \in \R^{32 \times 3 \times 224 \times 224}$:

$$\text{Elements} = 32 \times 3 \times 224 \times 224 = 4{,}816{,}896$$ $$\text{Memory (float32)} = 4{,}816{,}896 \times 4 = 19{,}267{,}584 \text{ bytes} \approx 18.4 \text{ MB}$$ $$\text{Memory (float16)} = 4{,}816{,}896 \times 2 = 9{,}633{,}792 \text{ bytes} \approx 9.2 \text{ MB}$$

This immediately illustrates why mixed-precision training halves the activation memory.
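The formula is simple enough to script. Below is a minimal NumPy sketch that reproduces the numbers above (the helper name `tensor_memory_bytes` is our own):

```python
import numpy as np

def tensor_memory_bytes(shape, bytes_per_element=4):
    """Memory of a dense tensor: product of dims × element size."""
    return int(np.prod(shape)) * bytes_per_element

shape = (32, 3, 224, 224)                     # batch of 32 RGB 224×224 images
print(tensor_memory_bytes(shape, 4) / 2**20)  # float32 → 18.375 ≈ 18.4 MB
print(tensor_memory_bytes(shape, 2) / 2**20)  # float16 → 9.1875 ≈ 9.2 MB
```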

The rank (number of dimensions) of tensors commonly encountered in deep learning:

| Rank | Name | Shape Convention | Example |
|------|------|------------------|---------|
| 0 | Scalar | $()$ | Loss value $L$ |
| 1 | Vector | $(n,)$ | Bias vector $\bm{b} \in \R^{n}$ |
| 2 | Matrix | $(m, n)$ | Weight matrix $\bm{W} \in \R^{m \times n}$ |
| 3 | 3-Tensor | $(B, T, d)$ | Batch of sequences (NLP) |
| 4 | 4-Tensor | $(B, C, H, W)$ | Batch of feature maps (CNN) |

References for §1.1

[1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 2: Linear Algebra. MIT Press. Available at deeplearningbook.org.

[2] Strang, G. (2016). Introduction to Linear Algebra, 5th Ed. Wellesley-Cambridge Press. Chapters 1–3.

1.2 Core Matrix Operations and Their Computational Cost

Before dissecting neural network layers, we must establish the exact cost of the atomic operations from which they are built. In computational analysis, we count FLOPs (floating-point operations), where one FLOP is typically one addition or one multiplication. A multiply-accumulate (MAC) operation — $a \leftarrow a + b \times c$ — counts as 2 FLOPs.

Vector-Vector Operations

Dot Product

Given $\bm{x}, \bm{y} \in \R^n$, the dot product is:

$$\bm{x} \cdot \bm{y} = \bm{x}^\top \bm{y} = \sum_{i=1}^{n} x_i y_i$$

This requires $n$ multiplications and $(n-1)$ additions = $\mathbf{2n - 1}$ FLOPs. In practice, this is often rounded to $2n$ FLOPs (counting one MAC per element).

Element-wise (Hadamard) Product

$\bm{x} \odot \bm{y}$: multiply corresponding elements. For vectors of length $n$: exactly $n$ FLOPs (multiplications only). Output is a vector of length $n$.

Outer Product

$\bm{x} \bm{y}^\top$ for $\bm{x} \in \R^m$, $\bm{y} \in \R^n$: produces an $m \times n$ matrix. Cost: $mn$ multiplications = $mn$ FLOPs. Memory for result: $mn$ elements.

Vector Addition and Scaling

Vector addition $\bm{x} + \bm{y}$ for $\bm{x}, \bm{y} \in \R^n$: exactly $n$ FLOPs. Scalar-vector multiplication $\alpha \bm{x}$: exactly $n$ FLOPs. These are trivially cheap, but they appear billions of times during training, so their cost is not negligible in aggregate.
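For concreteness, here is a short NumPy sketch of these vector primitives, with their FLOP counts (per the conventions above) noted as comments; the variable names are illustrative:

```python
import numpy as np

n = 1024
x = np.random.randn(n)
y = np.random.randn(n)

d = x @ y           # dot product: n multiplies + (n-1) adds ≈ 2n FLOPs → scalar
h = x * y           # Hadamard product: exactly n multiplies → length-n vector
O = np.outer(x, y)  # outer product: n·n multiplies → n×n matrix (n² memory)
s = x + y           # vector addition: exactly n FLOPs
a = 0.5 * x         # scalar-vector multiply: exactly n FLOPs
```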

1.3 Matrix Multiplication — The Central Operation of Deep Learning

Matrix multiplication is, without exaggeration, the core operation of neural network computation. Every fully-connected layer, every convolution (via im2col), every attention mechanism — all reduce to matrix multiplication at the hardware level. Understanding its cost is essential.

Definition — Matrix Multiplication

Given $\bm{A} \in \R^{m \times n}$ and $\bm{B} \in \R^{n \times p}$, the product $\bm{C} = \bm{A}\bm{B} \in \R^{m \times p}$ is defined as:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

Each element $C_{ij}$ is a dot product of row $i$ of $\bm{A}$ with column $j$ of $\bm{B}$.

Figure 1.1. Matrix multiplication $\bm{C} = \bm{A}\bm{B}$. Element $C_{ij}$ is the dot product of the $i$-th row of $\bm{A}$ with the $j$-th column of $\bm{B}$, requiring $2n$ FLOPs.
Computational Cost — Matrix Multiplication

For $\bm{A} \in \R^{m \times n}$ and $\bm{B} \in \R^{n \times p}$:

$$\text{FLOPs} = 2mnp$$

(Each of the $mp$ output elements requires $n$ multiplications and $n$ additions = $2n$ FLOPs.)

$$\text{Memory for result: } m \times p \text{ elements}$$

This is the single most important formula in computational efficiency analysis of neural networks.

Worked Example — Fully Connected Layer as Matrix Multiply

A fully-connected layer with 784 inputs and 256 outputs, processing a batch of 32 samples:

Input: $\bm{X} \in \R^{32 \times 784}$, Weight: $\bm{W} \in \R^{256 \times 784}$

Output: $\bm{Z} = \bm{X}\bm{W}^\top \in \R^{32 \times 256}$

$$\text{FLOPs} = 2 \times 32 \times 784 \times 256 = 12{,}845{,}056 \approx 12.8\text{M FLOPs}$$ $$\text{Output memory (float32)} = 32 \times 256 \times 4 = 32{,}768 \text{ bytes} = 32 \text{ KB}$$

The bias addition $\bm{Z} + \bm{b}$ adds another $32 \times 256 = 8{,}192$ FLOPs — negligible compared to the matmul.
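A quick sketch makes the accounting concrete. The following NumPy snippet runs the layer and recomputes the same counts from the $2mnp$ rule (variable names are ours):

```python
import numpy as np

B, d_in, d_out = 32, 784, 256
X = np.random.randn(B, d_in).astype(np.float32)
W = np.random.randn(d_out, d_in).astype(np.float32)
b = np.random.randn(d_out).astype(np.float32)

Z = X @ W.T + b                      # forward pass of the FC layer

matmul_flops = 2 * B * d_in * d_out  # 2mnp rule → 12,845,056
bias_flops = B * d_out               # broadcast addition → 8,192
print(matmul_flops, bias_flops)
print(Z.shape, Z.nbytes)             # (32, 256), 32,768 bytes = 32 KB
```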

Matrix-Vector Multiplication

The special case $\bm{y} = \bm{A}\bm{x}$ where $\bm{A} \in \R^{m \times n}$ and $\bm{x} \in \R^n$ costs $2mn$ FLOPs and produces $\bm{y} \in \R^m$. This is the computation for a single sample through a fully-connected layer (without batching).

Why Matrix Multiplication Dominates

For a network with weight matrices of dimension $d$, the matrix multiplication cost scales as $O(d^2)$ per sample (or $O(Bd^2)$ per batch), while element-wise operations like activations scale as $O(d)$ (or $O(Bd)$). For typical hidden sizes ($d = 256$ to $d = 4096$), the matmul is 2–3 orders of magnitude more expensive than element-wise ops. This is why GPU hardware is designed with tensor cores that accelerate matrix multiplication specifically (NVIDIA, 2017).

References for §1.2–1.3

[1] Goodfellow et al. (2016). Deep Learning, Ch. 2.3: Multiplying Matrices and Vectors.

[3] Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations, 4th Ed. Johns Hopkins University Press. — The definitive reference on computational cost of matrix operations.

[3*] Sze, V., Chen, Y., Yang, T., & Emer, J. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE, 105(12), 2295–2329. — Table II provides FLOP formulas for all DNN layer types.

1.4 Special Matrix Operations

Transpose

The transpose $\bm{A}^\top$ of a matrix $\bm{A} \in \R^{m \times n}$ satisfies $(\bm{A}^\top)_{ij} = A_{ji}$. In modern deep learning frameworks (PyTorch, TensorFlow), the transpose is a zero-cost operation — it simply changes the metadata (strides) describing how the data is laid out in memory, without moving any data. This costs $O(1)$ in time and $O(1)$ in extra memory. However, a transposed matrix may have non-contiguous memory access patterns, which can slow down subsequent operations due to cache inefficiency.
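This zero-copy behavior is easy to verify. The NumPy sketch below (PyTorch behaves analogously) shows that the transpose shares the same buffer and merely swaps strides:

```python
import numpy as np

A = np.arange(6, dtype=np.float32).reshape(2, 3)
At = A.T                        # no data is copied — only stride metadata changes

print(A.strides, At.strides)    # (12, 4) vs (4, 12): same buffer, swapped strides
print(np.shares_memory(A, At))  # True — O(1) time and O(1) extra memory
print(At.flags['C_CONTIGUOUS']) # False — later ops may need a contiguous copy
```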

Matrix Inverse

The inverse $\bm{A}^{-1}$ of a square matrix $\bm{A} \in \R^{n \times n}$ costs $O(n^3)$ FLOPs to compute (using LU decomposition or Gaussian elimination). In deep learning, we almost never compute matrix inverses directly. The operations that might seem to require an inverse — such as solving linear systems or computing the Hessian inverse — are handled by iterative methods (conjugate gradient) or approximations (L-BFGS, natural gradient). When the Hessian $\bm{H} \in \R^{P \times P}$ has $P$ = millions of parameters, storing it alone would require $P^2 \times 4$ bytes, which is completely infeasible.

Why the Hessian is Infeasible

For a small network with $P = 10^6$ parameters (1 million):

$$\text{Hessian memory} = (10^6)^2 \times 4 \text{ bytes} = 4 \times 10^{12} \text{ bytes} = 4 \text{ TB}$$

And merely forming it would require evaluating $O(P^2) = 10^{12}$ second-derivative entries. This is why first-order methods (SGD, Adam) dominate deep learning — they only need $O(P)$ memory for gradients.

Eigendecomposition and Singular Value Decomposition (SVD)

Eigendecomposition factors a square matrix as $\bm{A} = \bm{Q}\bm{\Lambda}\bm{Q}^{-1}$, where $\bm{\Lambda}$ is diagonal with eigenvalues. SVD factors any $m \times n$ matrix as $\bm{A} = \bm{U}\bm{\Sigma}\bm{V}^\top$. Both cost $O(n^3)$ for an $n \times n$ matrix.

In deep learning, these are used for: (a) analyzing the condition number $\kappa = \sigma_{\max}/\sigma_{\min}$ of weight matrices and the Hessian, which affects optimization convergence; (b) low-rank approximation for model compression — truncating to the top $r$ singular values reduces a layer from $mn$ to $r(m+n)$ parameters; (c) understanding weight initialization — ensuring the singular values of weight matrices stay near 1 prevents gradient explosion/vanishing (Saxe et al., 2014).
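As a sketch of point (b), the snippet below builds a rank-$r$ approximation of a weight matrix via truncated SVD and compares parameter counts; the matrix is random and the sizes are arbitrary, for illustration only:

```python
import numpy as np

m, n, r = 512, 256, 32
W = np.random.randn(m, n)

U, S, Vt = np.linalg.svd(W, full_matrices=False)  # O(mn·min(m,n)) cost
W_r = (U[:, :r] * S[:r]) @ Vt[:r]                 # rank-r approximation

params_full = m * n              # 131,072 parameters for the dense layer
params_lowrank = r * (m + n)     # 24,576: store U_r·diag(S_r) and V_r factors
print(params_full, params_lowrank)
print(np.linalg.norm(W - W_r) / np.linalg.norm(W))  # relative approximation error
```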

References for §1.4

[3] Golub & Van Loan (2013). Matrix Computations, Ch. 5 (Eigenvalues), Ch. 8 (SVD).

[4] Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR 2014.

1.5 Norms and Their Computational Role

Norms measure the "size" of vectors and matrices. They appear throughout deep learning in regularization, gradient clipping, normalization layers, and convergence analysis.

Vector Norms

For a vector $\bm{x} \in \R^n$:

$L^1$ norm: $\|\bm{x}\|_1 = \sum_{i=1}^n |x_i|$ — Cost: $n$ absolute values + $(n-1)$ additions ≈ $2n$ FLOPs

$L^2$ (Euclidean) norm: $\|\bm{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$ — Cost: $n$ multiplications + $(n-1)$ additions + 1 square root ≈ $2n + 1$ FLOPs

$L^p$ norm: $\|\bm{x}\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$ — Cost: $n$ powers + $(n-1)$ additions + 1 root ≈ $O(n)$ but with expensive per-element ops

$L^\infty$ norm: $\|\bm{x}\|_\infty = \max_i |x_i|$ — Cost: $n$ absolute values + $(n-1)$ comparisons ≈ $2n$ FLOPs

Role in Neural Networks

L2 regularization adds $\frac{\lambda}{2}\|\bm{W}\|_F^2 = \frac{\lambda}{2}\sum_{ij} W_{ij}^2$ to the loss. The Frobenius norm $\|\bm{W}\|_F$ for a matrix $\bm{W} \in \R^{m \times n}$ costs $mn$ multiplications + $(mn-1)$ additions ≈ $2mn$ FLOPs. But in practice, the gradient is simply $\lambda \bm{W}$, which costs only $mn$ multiplications to add to the existing gradient — so the per-step cost of L2 regularization is $O(P)$ where $P$ is the total parameter count.

Gradient clipping computes $\|\bm{g}\|_2$ for the full gradient vector $\bm{g} \in \R^P$ (cost: $2P$ FLOPs), then scales $\bm{g} \leftarrow \bm{g} \cdot \min(1, \tau/\|\bm{g}\|_2)$ (cost: $P$ FLOPs for the scaling). Total: $3P$ FLOPs — the same order as one gradient step.
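A minimal sketch of clipping by global norm (Pascanu et al., 2013), assuming a flattened gradient vector; the small epsilon guard against division by zero is our addition:

```python
import numpy as np

def clip_by_global_norm(g, tau):
    """Scale g so its L2 norm does not exceed tau (≈3P FLOPs total)."""
    norm = np.sqrt(np.sum(g * g))          # ≈2P FLOPs for the norm
    scale = min(1.0, tau / (norm + 1e-12)) # epsilon guard (our choice)
    return g * scale                       # P FLOPs for the rescale

g = np.random.randn(10_000)
g_clipped = clip_by_global_norm(g, tau=1.0)
print(np.linalg.norm(g), np.linalg.norm(g_clipped))  # second value ≤ 1.0
```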

| Norm | FLOPs | Use in Deep Learning |
|------|-------|----------------------|
| $\|\bm{x}\|_1$ (vector, length $n$) | $2n$ | L1 regularization (sparsity) |
| $\|\bm{x}\|_2$ (vector, length $n$) | $2n + 1$ | Gradient clipping, normalization |
| $\|\bm{W}\|_F$ (matrix $m \times n$) | $2mn + 1$ | L2 / weight decay regularization |
| $\|\bm{x}\|_\infty$ (vector, length $n$) | $2n$ | Convergence analysis, adversarial robustness |

References for §1.5

[1] Goodfellow et al. (2016). Deep Learning, Ch. 2.5: Norms.

[5] Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. ICML 2013. — Introduces gradient clipping and its mathematical justification.

1.6 Broadcasting

Broadcasting is a set of rules that allows operations between tensors of different shapes by virtually "expanding" the smaller tensor to match the larger one, without actually copying data in memory. This is critical for understanding both the computational cost and memory cost of neural network operations.

Broadcasting in a FC Layer — Bias Addition

In a fully-connected layer, after the matrix multiplication $\bm{Z} = \bm{X}\bm{W}^\top$ where $\bm{Z} \in \R^{B \times n}$, we add the bias $\bm{b} \in \R^{n}$:

$$\bm{Z} + \bm{b} \quad \text{where } \bm{Z} \text{ has shape } (B, n) \text{ and } \bm{b} \text{ has shape } (n,)$$

Broadcasting "virtually replicates" $\bm{b}$ across the batch dimension to shape $(B, n)$. The computation costs $Bn$ additions (same as if $\bm{b}$ were a full $B \times n$ matrix), but the memory for $\bm{b}$ remains just $n$ elements — not $Bn$. This is a memory saving factor of $B$.

In Batch Normalization, broadcasting is used extensively: the learned parameters $\gamma, \beta \in \R^C$ (one per channel) are broadcast across the entire batch and spatial dimensions $(B, C, H, W)$, saving a factor of $B \times H \times W$ in memory.

Computational Efficiency Insight

Broadcasting means that the FLOPs of an operation are determined by the output shape (after broadcasting), while the memory for the operands is determined by their original shapes. This distinction is crucial: an operation can have high FLOPs but low parameter memory (like bias addition), or high memory but low FLOPs (like storing activations).
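Both cases are visible in a few lines of NumPy. The bias participates in $Bn$ additions but is stored once; the BatchNorm-style per-channel scale touches every element of the $(B, C, H, W)$ activation while occupying only $C$ elements:

```python
import numpy as np

B, n = 32, 256
Z = np.random.randn(B, n)   # matmul output, shape (B, n)
b = np.random.randn(n)      # bias, shape (n,): stored once, n elements

out = Z + b                 # b is virtually expanded to (B, n): B·n additions,
                            # but no (B, n) copy of b is ever materialized

X = np.random.randn(32, 3, 224, 224)             # (B, C, H, W) feature maps
gamma = np.random.randn(3).reshape(1, 3, 1, 1)   # per-channel scale, as in BatchNorm
Y = gamma * X               # broadcast over batch and spatial dims:
                            # 3 stored elements instead of 32·3·224·224
```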


§2 Calculus and Automatic Differentiation

If linear algebra is the language of neural network computation, calculus is the engine of learning. Every weight update in every neural network ever trained is computed via derivatives. The specific mechanism — backpropagation — is a direct application of the chain rule combined with an algorithmic insight about the order of computation. This section derives the mathematical machinery from first principles, with full computational cost analysis at each step.

2.1 Derivatives and Gradients

Definitions

For a function $f: \R \to \R$, the derivative is: $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

For a function $f: \R^n \to \R$, the partial derivative with respect to $x_i$ is: $\pd{f}{x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_n)}{h}$

The gradient is the vector of all partial derivatives: $\nabla_{\bm{x}} f = \left[\pd{f}{x_1}, \pd{f}{x_2}, \ldots, \pd{f}{x_n}\right]^\top \in \R^n$

The gradient $\nabla_{\bm{x}} f$ points in the direction of steepest ascent of $f$ at point $\bm{x}$. Gradient descent moves in the opposite direction: $\bm{x} \leftarrow \bm{x} - \eta \nabla_{\bm{x}} f$, where $\eta$ is the learning rate. The cost of one gradient descent step is the cost of computing $\nabla_{\bm{x}} f$ (which we will analyze in detail) plus $O(n)$ for the subtraction.

Gradient of Common Scalar Functions

These appear constantly in loss function computation and activation function derivatives:

| Function $f(x)$ | Derivative $f'(x)$ | FLOPs per element |
|-----------------|--------------------|-------------------|
| $x^2$ | $2x$ | 1 (multiply by constant) |
| $e^x$ | $e^x$ | ~1 (reuse cached $e^x$) |
| $\ln(x)$ | $1/x$ | 1 (division) |
| $\frac{1}{1+e^{-x}}$ (sigmoid) | $\sigma(x)(1-\sigma(x))$ | 2 (if $\sigma(x)$ cached) |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | 2 (if $\tanh(x)$ cached) |
| $\max(0, x)$ (ReLU) | $\mathbb{1}[x > 0]$ | 1 (comparison) |

The "FLOPs per element" column assumes that the forward pass value has been cached. This is standard practice: during the forward pass, activations are stored, and the backward pass reuses them to cheaply compute derivatives. This is a time-memory trade-off — we spend memory to store activations in order to save FLOPs during backpropagation.

2.2 The Chain Rule — Foundation of Backpropagation

The Chain Rule (Scalar Case)

If $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx} = f'(g(x)) \cdot g'(x)$$

For a composition of $L$ functions $y = f_L(f_{L-1}(\cdots f_1(x) \cdots))$:

$$\frac{dy}{dx} = \frac{df_L}{df_{L-1}} \cdot \frac{df_{L-1}}{df_{L-2}} \cdots \frac{df_2}{df_1} \cdot \frac{df_1}{dx} = \prod_{l=1}^{L} f_l'$$

This product of $L$ derivatives is exactly what happens during backpropagation through an $L$-layer network. Each layer contributes one factor to the product. If any factor is consistently less than 1 (sigmoid, tanh in saturated regime), the product vanishes exponentially — this is the vanishing gradient problem. If any factor is consistently greater than 1, the product explodes — the exploding gradient problem.

The Chain Rule (Vector Case)

If $\bm{y} = f(\bm{g}(\bm{x}))$ where $\bm{x} \in \R^n$, $\bm{g}: \R^n \to \R^m$, and $f: \R^m \to \R$ (scalar loss), then:

$$\pd{f}{\bm{x}} = \left(\pd{\bm{g}}{\bm{x}}\right)^\top \pd{f}{\bm{g}}$$

where $\pd{\bm{g}}{\bm{x}} \in \R^{m \times n}$ is the Jacobian matrix and $\pd{f}{\bm{g}} \in \R^m$ is the gradient of $f$ with respect to $\bm{g}$.

In neural network terms: the loss $L$ is a scalar, and we want $\pd{L}{\bm{x}}$ for each layer's inputs. We multiply the incoming gradient $\pd{L}{\bm{g}} \in \R^m$ (from the layer above) by the transposed Jacobian of the current layer. This is a matrix-vector product costing $O(mn)$ FLOPs.
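A small PyTorch example, assuming the linear map $\bm{g}(\bm{x}) = \bm{W}\bm{x}$ whose Jacobian is exactly $\bm{W}$, confirms that `backward()` computes precisely the transposed-Jacobian product above:

```python
import torch

n, m = 5, 3
x = torch.randn(n, requires_grad=True)
W = torch.randn(m, n)

g = W @ x                  # g: R^n -> R^m, Jacobian ∂g/∂x = W
f = g.pow(2).sum()         # scalar loss f: R^m -> R, with ∂f/∂g = 2g

f.backward()               # reverse mode: x.grad = (∂g/∂x)^T ∂f/∂g
print(torch.allclose(x.grad, W.T @ (2 * g.detach())))  # True
```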

Figure 2.1. Forward and backward pass through a single layer. The forward pass caches intermediate values ($\bm{x}$, $\bm{z}$) that are reused during the backward pass. Each backward step involves a matrix-vector product (chain rule application) and element-wise operations.

2.3 Jacobians and Hessians

Jacobian Matrix

For a function $\bm{f}: \R^n \to \R^m$, the Jacobian $\bm{J} \in \R^{m \times n}$ is the matrix of all first-order partial derivatives:

$$J_{ij} = \pd{f_i}{x_j}$$ $$\bm{J} = \begin{bmatrix} \pd{f_1}{x_1} & \cdots & \pd{f_1}{x_n} \\ \vdots & \ddots & \vdots \\ \pd{f_m}{x_1} & \cdots & \pd{f_m}{x_n} \end{bmatrix}$$

Storage cost: $m \times n$ elements. Computation cost: depends on the function, but generally $O(mn)$ FLOPs.

Jacobians in Neural Networks

Every layer in a neural network defines a function from its inputs to its outputs. The Jacobian of this function captures how each output changes with respect to each input. For a fully-connected layer $\bm{z} = \bm{W}\bm{x} + \bm{b}$, the Jacobian with respect to $\bm{x}$ is simply $\bm{W}$ itself (since the function is linear). For an element-wise activation $\bm{a} = \sigma(\bm{z})$, the Jacobian is the diagonal matrix $\text{diag}(\sigma'(\bm{z}))$.

Key Insight — Diagonal Jacobians Are Cheap

For element-wise functions like activations, the Jacobian is diagonal. Multiplying by a diagonal matrix is equivalent to element-wise multiplication: $\text{diag}(\bm{d}) \cdot \bm{v} = \bm{d} \odot \bm{v}$, costing only $n$ FLOPs instead of $n^2$ for a full matrix-vector product. This is why activation function backward passes are cheap — they use Hadamard products, not full matrix multiplications.

Hessian Matrix

For a scalar function $f: \R^n \to \R$, the Hessian $\bm{H} \in \R^{n \times n}$ is the matrix of second-order partial derivatives:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Storage: $n^2$ elements ($n(n+1)/2$ if symmetric). Computation: $O(n^2)$ for all elements.

The Hessian encodes the curvature of the loss surface. Its eigenvalues determine whether a critical point is a local minimum (all positive), saddle point (mixed), or local maximum (all negative). As noted earlier, for a network with $P$ parameters, the Hessian is a $P \times P$ matrix — far too large to store or compute. However, Hessian-vector products $\bm{H}\bm{v}$ can be computed in $O(P)$ time using a technique called "Pearlmutter's trick" (Pearlmutter, 1994), which requires only one additional forward and backward pass. This enables methods like conjugate gradient without ever forming the full Hessian.
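In PyTorch, a Hessian-vector product can be sketched with a double backward pass — mathematically equivalent in cost to Pearlmutter's result, since no $P \times P$ matrix is ever formed. The helper and test function below are illustrative:

```python
import torch

def hvp(f, x, v):
    """Hessian-vector product H·v in O(P) time; never forms the P×P Hessian."""
    grad = torch.autograd.grad(f(x), x, create_graph=True)[0]  # ∇f, one backward
    return torch.autograd.grad(grad @ v, x)[0]                 # ∇(∇f·v) = H v

x = torch.randn(100, requires_grad=True)
v = torch.randn(100)
f = lambda x: (x ** 4).sum()     # for this test function, H = diag(12 x²)
print(torch.allclose(hvp(f, x, v), 12 * x.detach() ** 2 * v))  # True
```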

References for §2.1–2.3

[1] Goodfellow et al. (2016). Deep Learning, Ch. 4: Numerical Computation (gradients, Hessians, optimization).

[6] Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147–160.

[7] Magnus, J. R. & Neudecker, H. (2019). Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd Ed. Wiley. — Rigorous reference for matrix calculus.

2.4 Forward-Mode vs. Reverse-Mode Automatic Differentiation

This section explains why backpropagation works the way it does — and why it is computationally optimal for neural networks. The insight is purely about the order in which we apply the chain rule, but it has enormous computational implications.

Consider a function $f: \R^n \to \R^m$ composed of elementary operations. We want to compute the Jacobian $\bm{J} \in \R^{m \times n}$.

Two Modes of Automatic Differentiation

Forward mode: Computes one column of the Jacobian per pass. We pick an input direction $\bm{v} \in \R^n$ and compute the Jacobian-vector product $\bm{J}\bm{v} \in \R^m$ in one forward sweep. To get the full Jacobian, we need $n$ passes (one per input dimension).

$$\text{Cost for full Jacobian} = O(n) \times \text{cost of one forward pass}$$

Reverse mode: Computes one row of the Jacobian per pass. We pick an output direction $\bm{u} \in \R^m$ and compute the vector-Jacobian product $\bm{u}^\top\bm{J} \in \R^n$ in one backward sweep. To get the full Jacobian, we need $m$ passes (one per output dimension).

$$\text{Cost for full Jacobian} = O(m) \times \text{cost of one backward pass}$$

The Critical Insight for Neural Networks

In neural network training, the function we differentiate is the loss function $L: \R^P \to \R$, which maps $P$ parameters to a single scalar loss. Here $m = 1$ and $n = P$ (which can be millions or billions).

Forward mode would require $P$ passes → completely infeasible.

Reverse mode requires $m = 1$ pass → one backward pass computes the entire gradient $\nabla L \in \R^P$.

This is why backpropagation (reverse-mode AD) costs only about 2–3× the cost of a single forward pass, regardless of the number of parameters. This is the fundamental reason deep learning is computationally tractable.
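The asymmetry is directly observable with PyTorch's functional AD interface: one reverse-mode call (`vjp`) returns the full gradient, while one forward-mode call (`jvp`) returns only a single directional derivative:

```python
import torch
from torch.autograd.functional import jvp, vjp

def f(x):                          # toy scalar "loss" f: R^n -> R
    return (x.sin() ** 2).sum()

x = torch.randn(1000)

# Reverse mode: ONE vjp with u = 1 yields the entire gradient ∇f ∈ R^n.
_, grad = vjp(f, x, torch.tensor(1.0))

# Forward mode: one jvp with v = e_1 yields a single directional derivative;
# recovering the full gradient this way would take n = 1000 passes.
v = torch.zeros(1000)
v[0] = 1.0
_, dir_deriv = jvp(f, x, v)
print(torch.allclose(grad[0], dir_deriv))  # True: jvp gives one coordinate of ∇f
```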

Figure 2.2. Forward-mode vs. reverse-mode automatic differentiation. For a scalar loss function with $P$ parameters, reverse mode computes the entire gradient in a single backward pass — making it $P$ times more efficient than forward mode.
Computational Cost of Backpropagation

The cost of one backward pass through the entire network is:

$$\text{FLOPs}_{\text{backward}} \approx 2 \times \text{FLOPs}_{\text{forward}}$$

This factor of 2 arises because each layer requires two matrix multiplications in the backward pass (one for $\partial L / \partial \bm{W}$ and one for $\partial L / \partial \bm{x}$), compared to one in the forward pass. The total cost of one training step is:

$$\text{FLOPs}_{\text{total}} = \text{FLOPs}_{\text{forward}} + \text{FLOPs}_{\text{backward}} \approx 3 \times \text{FLOPs}_{\text{forward}}$$

This is a well-established empirical rule confirmed by profiling studies (Sze et al., 2017).

References for §2.4

[8] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. — The seminal backpropagation paper.

[9] Griewank, A. & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd Ed. SIAM. — The comprehensive reference on automatic differentiation.

[10] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research, 18(153), 1–43.

2.5 Optimization Landscape Geometry

Understanding the geometry of the loss surface is essential for understanding why certain optimizers converge faster (requiring fewer total FLOPs to reach a good solution) and why certain architectures are easier to train.

Critical Points

A critical point of $L(\bm{\theta})$ satisfies $\nabla L = \bm{0}$. The nature of the critical point is determined by the Hessian $\bm{H}$:

| Eigenvalues of $\bm{H}$ | Type | Behavior |
|-------------------------|------|----------|
| All positive (positive definite) | Local minimum | Loss increases in all directions |
| All negative (negative definite) | Local maximum | Loss decreases in all directions |
| Mixed positive and negative | Saddle point | Loss increases in some directions, decreases in others |

The Saddle Point Problem

A critical result by Dauphin et al. (2014) showed that in high-dimensional loss surfaces (typical for neural networks), saddle points vastly outnumber local minima. For a function in $P$ dimensions, at a random critical point, each Hessian eigenvalue is independently positive or negative with roughly equal probability. The probability of all $P$ eigenvalues being positive (a local minimum) is approximately $2^{-P}$ — exponentially small. This means that for networks with millions of parameters, the optimizer almost never encounters true local minima; instead, it must navigate past saddle points.

Condition Number and Convergence

Condition Number

The condition number of the Hessian is $\kappa = \lambda_{\max} / \lambda_{\min}$, where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest positive eigenvalues. For gradient descent on a quadratic loss (locally approximating any smooth loss):

$$\text{Convergence rate} \propto \left(\frac{\kappa - 1}{\kappa + 1}\right)^t$$

A condition number of $\kappa = 100$ means convergence takes roughly $100\times$ as many steps as $\kappa = 1$. Since each step has a fixed FLOP cost, the condition number directly determines the total FLOPs required for training.

This is why techniques like Batch Normalization (which reduces the condition number — Santurkar et al., 2018), proper weight initialization (which starts the optimization in a well-conditioned region), and adaptive learning rate methods like Adam (which implicitly estimate curvature) are so important from a computational efficiency perspective — they reduce the total number of training steps needed.
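The effect of conditioning on step count can be simulated in a few lines. The sketch below runs gradient descent on the quadratic $f(\bm{x}) = \frac{1}{2}(x_1^2 + \kappa x_2^2)$ with the optimal fixed step $\eta = 2/(\lambda_{\min} + \lambda_{\max})$; the tolerance and starting point are arbitrary choices:

```python
import numpy as np

def gd_steps_to_converge(kappa, tol=1e-6):
    """Steps for GD on f(x) = 0.5·(x1² + κ·x2²) to reach ||x|| < tol."""
    H = np.diag([1.0, kappa])          # Hessian with eigenvalues 1 and κ
    eta = 2.0 / (1.0 + kappa)          # optimal fixed step: 2/(λmin + λmax)
    x = np.array([1.0, 1.0])
    for t in range(1, 200_000):
        x = x - eta * (H @ x)          # one gradient descent step
        if np.linalg.norm(x) < tol:
            return t
    return None

print(gd_steps_to_converge(1.0))    # 1 step: perfectly conditioned
print(gd_steps_to_converge(100.0))  # ~700 steps: rate (κ-1)/(κ+1) ≈ 0.98
```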

References for §2.5

[11] Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NeurIPS 2014.

[12] Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How Does Batch Normalization Help Optimization? NeurIPS 2018.

[13] Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). Visualizing the Loss Landscape of Neural Nets. NeurIPS 2018.

[1] Goodfellow et al. (2016). Deep Learning, Ch. 4.3: Gradient-Based Optimization, Ch. 8.2: Challenges in Neural Network Optimization.


§3 Probability and Information Theory

Probability theory provides the framework for understanding what neural networks are actually learning (a probability distribution), and information theory provides the natural loss functions for measuring how well they learn. Every classification network trained with cross-entropy loss is performing maximum likelihood estimation — this connection is fundamental and has direct computational implications.

3.1 Probability Distributions in Neural Networks

Bernoulli Distribution (Binary Classification)

For a single binary output $y \in \{0, 1\}$ with predicted probability $\hat{y} = p$:

$$P(y \mid p) = p^y (1-p)^{1-y}$$

A neural network performing binary classification outputs a single value $\hat{y} = \sigma(z) \in (0, 1)$ through a sigmoid activation. The output is interpreted as the Bernoulli parameter $p$.

Categorical Distribution (Multi-class Classification)

For a classification problem with $C$ classes, the network output $\hat{\bm{y}} = \text{softmax}(\bm{z}) \in \R^C$ defines a categorical distribution:

$$P(y = c \mid \hat{\bm{y}}) = \hat{y}_c = \frac{e^{z_c}}{\sum_{j=1}^{C} e^{z_j}}$$
Softmax — Computational Analysis

For logits $\bm{z} \in \R^C$:

Step 1: Compute $\max(\bm{z})$ for numerical stability — $C - 1$ comparisons

Step 2: Subtract max: $z_c \leftarrow z_c - \max(\bm{z})$ — $C$ subtractions

Step 3: Exponentiate: $e^{z_c}$ for each $c$ — $C$ exponentials

Step 4: Sum: $S = \sum_c e^{z_c}$ — $C - 1$ additions

Step 5: Normalize: $\hat{y}_c = e^{z_c} / S$ — $C$ divisions

Total: $\approx 5C$ FLOPs (dominated by $C$ exponentials, which are the most expensive elementary operations)

Memory: $C$ elements for the output $\hat{\bm{y}}$ (stored for backpropagation)
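The five steps translate directly into code. Below is a minimal NumPy sketch of the numerically stable softmax; without the max subtraction (Steps 1–2), the large logits in the example would overflow `exp`:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)     # Steps 1-2: subtract max for numerical stability
    e = np.exp(z)         # Step 3: C exponentials (the dominant cost)
    return e / e.sum()    # Steps 4-5: sum and C divisions

z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(z) would overflow here
print(softmax(z))                        # stable: ≈ [0.090, 0.245, 0.665]
```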

Gaussian Distribution (Regression)

For regression tasks, the network output can be interpreted as the mean $\mu$ (and optionally variance $\sigma^2$) of a Gaussian: $P(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$. Under maximum likelihood with fixed variance, this leads directly to the MSE loss.

3.2 Information-Theoretic Loss Functions

Entropy

The entropy of a discrete probability distribution $p$ measures its uncertainty:

$$H(p) = -\sum_{x} p(x) \log p(x)$$

Computational cost: $C$ logarithms + $C$ multiplications + $(C-1)$ additions + 1 negation $\approx 3C$ FLOPs for $C$ classes.

Entropy is maximized for the uniform distribution ($H = \log C$) and minimized (= 0) for a point mass.

Cross-Entropy

The cross-entropy between the true distribution $p$ and the predicted distribution $q$ is:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

In classification, $p$ is typically a one-hot vector (with $p(y_{\text{true}}) = 1$ and all other entries 0), so:

$$H(p, q) = -\log q(y_{\text{true}})$$

Computational cost with one-hot labels: Only 1 logarithm + 1 negation = 2 FLOPs per sample (!)

For soft labels (e.g., label smoothing, knowledge distillation): $\approx 3C$ FLOPs per sample.

KL Divergence

The KL divergence from distribution $p$ to distribution $q$ is:

$$D_{\text{KL}}(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$$

Since $H(p)$ is a constant (determined by the data, not the model), minimizing cross-entropy $H(p, q)$ is equivalent to minimizing KL divergence $D_{\text{KL}}(p \| q)$. This means the computational cost of training with cross-entropy loss or KL divergence loss is identical.
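The decomposition can be checked numerically in a few lines (the example distributions are arbitrary):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])      # "true" distribution
q = np.array([0.5, 0.3, 0.2])      # model's predicted distribution

H_p  = -np.sum(p * np.log(p))      # entropy of p (constant w.r.t. the model)
H_pq = -np.sum(p * np.log(q))      # cross-entropy
kl   =  np.sum(p * np.log(p / q))  # KL divergence

print(np.isclose(H_pq, H_p + kl))  # True: H(p,q) = H(p) + D_KL(p‖q)
```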

Figure 3.1. The decomposition of cross-entropy into entropy and KL divergence. Since $H(p)$ is a property of the data distribution (constant during training), minimizing cross-entropy is equivalent to minimizing KL divergence. This has zero extra computational cost — same loss function, same gradients.

3.3 Maximum Likelihood Estimation and the Connection to Cross-Entropy

This connection is one of the most important theoretical results underpinning neural network training. It answers the question: why do we use cross-entropy as the loss function?

Maximum Likelihood ↔ Cross-Entropy Equivalence

Given a dataset $\{(\bm{x}_i, y_i)\}_{i=1}^N$ and a model parameterized by $\bm{\theta}$ that outputs probability $P_\theta(y \mid \bm{x})$, the maximum likelihood estimator is:

$$\hat{\bm{\theta}}_{\text{MLE}} = \arg\max_{\bm{\theta}} \prod_{i=1}^N P_\theta(y_i \mid \bm{x}_i)$$

Taking the log and negating (converting max to min):

$$\hat{\bm{\theta}}_{\text{MLE}} = \arg\min_{\bm{\theta}} \left(-\frac{1}{N}\sum_{i=1}^N \log P_\theta(y_i \mid \bm{x}_i)\right)$$

This is exactly the cross-entropy loss. The negative log-likelihood (NLL) averaged over the dataset equals the cross-entropy between the empirical data distribution and the model's predicted distribution.

The Softmax + Cross-Entropy Gradient Simplification

One of the most computationally significant results in neural network implementation is the simplification of the gradient when softmax output is combined with cross-entropy loss.

Derivation — Softmax + Cross-Entropy Gradient

Let $\hat{y}_c = \text{softmax}(z_c) = \frac{e^{z_c}}{\sum_j e^{z_j}}$ and $L = -\log \hat{y}_k$ where $k$ is the true class.

We need $\pd{L}{z_c}$ for all $c$. Using the softmax Jacobian $\pd{\hat{y}_k}{z_c} = \hat{y}_k(\delta_{kc} - \hat{y}_c)$ and the chain rule:

Case 1: $c = k$ (the true class):

$$\pd{L}{z_k} = -\frac{1}{\hat{y}_k} \cdot \hat{y}_k(1 - \hat{y}_k) = -(1 - \hat{y}_k) = \hat{y}_k - 1$$

Case 2: $c \neq k$ (all other classes):

$$\pd{L}{z_c} = -\frac{1}{\hat{y}_k} \cdot (-\hat{y}_k \hat{y}_c) = \hat{y}_c$$

Unified:

$$\pd{L}{\bm{z}} = \hat{\bm{y}} - \bm{y}$$

where $\bm{y}$ is the one-hot encoded true label vector.

Computational cost of this gradient: $C$ subtractions = $C$ FLOPs. No logarithms, no divisions, no Jacobian computation — just a simple vector subtraction. This is dramatically cheaper than computing the full softmax Jacobian ($C^2$ elements) and multiplying by the loss gradient.

Why This Matters for Computational Efficiency

Without this simplification, the backward pass through a softmax layer with $C$ classes would require computing the full $C \times C$ Jacobian matrix of softmax (cost: $O(C^2)$ FLOPs and memory), then multiplying it by the gradient vector (cost: $O(C^2)$ FLOPs). With the simplification, the cost drops to $O(C)$. For ImageNet ($C = 1000$), this is a $1000\times$ reduction. For language models ($C = 50{,}000+$ vocabulary tokens), this is a $50{,}000\times$ reduction. All modern deep learning frameworks implement this fused softmax-cross-entropy backward pass.
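The identity is easy to verify against autograd. The sketch below compares PyTorch's fused `F.cross_entropy` gradient with the hand-derived $\hat{\bm{y}} - \bm{y}$; the class count and true label are arbitrary:

```python
import torch
import torch.nn.functional as F

C, k = 10, 3
z = torch.randn(C, requires_grad=True)
loss = F.cross_entropy(z.unsqueeze(0), torch.tensor([k]))  # fused softmax + NLL
loss.backward()

y_hat = torch.softmax(z.detach(), dim=0)
y = F.one_hot(torch.tensor(k), C).float()                  # one-hot true label
print(torch.allclose(z.grad, y_hat - y, atol=1e-6))        # True: ∂L/∂z = ŷ − y
```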

References for §3

[1] Goodfellow et al. (2016). Deep Learning, Ch. 3: Probability and Information Theory; Ch. 5.5: Maximum Likelihood Estimation; Ch. 6.2.2: Cost Functions.

[14] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Ch. 1.6 (Information Theory), Ch. 4.3.4 (Softmax).

[15] Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory, 2nd Ed. Wiley. — The definitive reference on information theory.

[16] Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Ch. 5 (Decision Theory), Ch. 10 (MLE). Freely available online.


§ Computational Cost Summary — All Part I Operations

The following table consolidates every computational cost derived in this document. This serves as a quick reference for analyzing the efficiency of any neural network component built from these primitives.

| Operation | FLOPs | Memory (elements) | Notes |
|-----------|-------|-------------------|-------|
| Dot product ($\bm{x}^\top\bm{y}$, length $n$) | $2n$ | 1 (scalar output) | Building block of matmul |
| Element-wise multiply ($\bm{x} \odot \bm{y}$, length $n$) | $n$ | $n$ | Activation backward pass |
| Outer product ($\bm{x}\bm{y}^\top$, $m \times n$) | $mn$ | $mn$ | Weight gradient computation |
| Matrix-vector ($\bm{A}\bm{x}$, $m \times n$) | $2mn$ | $m$ | Single-sample FC layer |
| Matrix multiply ($\bm{A}\bm{B}$, $m \times n \times p$) | $2mnp$ | $mp$ | Core operation of DL |
| Transpose | $O(1)$ | 0 (metadata only) | Stride manipulation |
| $L^2$ norm (length $n$) | $2n + 1$ | 1 | Gradient clipping |
| Frobenius norm ($m \times n$ matrix) | $2mn + 1$ | 1 | L2 regularization |
| Softmax ($C$ classes) | $\sim 5C$ | $C$ | Includes numerical stability |
| Cross-entropy (one-hot) | $2$ | 1 | Only 1 log needed |
| Softmax + CE gradient (fused) | $C$ | $C$ | $\hat{\bm{y}} - \bm{y}$ |
| Entropy ($C$ classes) | $\sim 3C$ | 1 | $C$ logs + $C$ multiplies |
| KL divergence ($C$ classes) | $\sim 5C$ | 1 | $C$ logs + $C$ divides + $C$ multiplies |
| Full Jacobian ($f: \R^n \to \R^m$) | $O(mn)$ | $mn$ | Rarely computed in full |
| Hessian ($f: \R^n \to \R$) | $O(n^2)$ | $n^2$ | Infeasible for large $n$ |
| Hessian-vector product | $O(n)$ | $O(n)$ | Pearlmutter's trick |
| Forward pass (full network) | $F$ | All activations | Defined by architecture |
| Backward pass (full network) | $\sim 2F$ | Gradients + cached activations | Reverse-mode AD |
| Forward + backward | $\sim 3F$ | Params + grads + activations | One training iteration |

§ Complete References

Textbooks

[1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 2 (Linear Algebra), 3 (Probability), 4 (Numerical Computation). Available at deeplearningbook.org.

[2] Strang, G. (2016). Introduction to Linear Algebra, 5th Ed. Wellesley-Cambridge Press.

[3] Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations, 4th Ed. Johns Hopkins University Press.

[7] Magnus, J. R. & Neudecker, H. (2019). Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd Ed. Wiley.

[9] Griewank, A. & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd Ed. SIAM.

[14] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

[15] Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory, 2nd Ed. Wiley.

[16] Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.

Research Papers

[4] Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR 2014.

[5] Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. ICML 2013.

[6] Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147–160.

[8] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.

[10] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. JMLR, 18(153), 1–43.

[11] Dauphin, Y. N., et al. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NeurIPS 2014.

[12] Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How Does Batch Normalization Help Optimization? NeurIPS 2018.

[13] Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). Visualizing the Loss Landscape of Neural Nets. NeurIPS 2018.

Surveys

[3*] Sze, V., Chen, Y., Yang, T., & Emer, J. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE, 105(12), 2295–2329. — Comprehensive FLOP and memory formulas for all DNN layer types.
