Part III of VI

Convolutional Neural Networks — Complete Dissection

The convolution operation, every layer type, receptive field theory, landmark architectures, and full computational cost analysis — from first principles to EfficientNet

Contents

§8 The Convolution Operation · §9 CNN Layer Types — Detailed Computation · §10 Receptive Field Analysis · §11 Landmark CNN Architectures — Computational Analysis · Computational Summary — All CNN Layer Types · Complete References

§8 The Convolution Operation

The convolution operation is the defining computational primitive of CNNs. It replaces the general matrix multiplication of fully-connected layers with a structured, sparse computation that exploits the spatial structure of visual data. Understanding exactly how convolution works — and how it maps to hardware — is essential for computational efficiency analysis.

8.1 Mathematical Definition

Continuous Convolution
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
Discrete 2D Cross-Correlation (Used in Deep Learning)

Given an input feature map $\bm{I} \in \R^{H_{\text{in}} \times W_{\text{in}}}$ and a kernel $\bm{K} \in \R^{k_h \times k_w}$:

$$(\bm{I} * \bm{K})(i, j) = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} \bm{I}(i + m,\; j + n) \cdot \bm{K}(m, n)$$

Technically this is cross-correlation, not convolution (true convolution flips the kernel: $\bm{K}(k_h - 1 - m, k_w - 1 - n)$). Deep learning uses cross-correlation but universally calls it "convolution." Since kernels are learned, the flip is irrelevant — the network simply learns the flipped version.

Multi-Channel Convolution (Full Form)

In practice, inputs have $C_{\text{in}}$ channels and we produce $C_{\text{out}}$ output channels. The full convolution for one output pixel at position $(i,j)$ of output channel $c_{\text{out}}$:

$$\bm{O}(c_{\text{out}}, i, j) = \sum_{c=0}^{C_{\text{in}}-1} \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} \bm{I}(c, i \cdot s + m, j \cdot s + n) \cdot \bm{K}(c_{\text{out}}, c, m, n) + b_{c_{\text{out}}}$$

where $s$ is the stride and $\bm{K} \in \R^{C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w}$ is the 4D kernel tensor.

[Figure: a 5×5 binary input convolved with a 3×3 kernel (stride 1, no padding) yields a 3×3 output; each output element costs $k^2 = 9$ multiplications and 8 additions, i.e. $2k^2 - 1 \approx 2k^2$ FLOPs.]
Figure 8.1. 2D convolution (cross-correlation) of a 5×5 input with a 3×3 kernel, stride=1, no padding. Output is 3×3. Each output element is a dot product of the kernel with a local patch of the input, costing $2k^2$ FLOPs.
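
To make the definition concrete, here is a minimal NumPy sketch of stride-1, valid-mode (no padding) cross-correlation, run on the example from Figure 8.1. The naive double loop is for illustration only; frameworks compute convolution very differently (§8.3), and the function name cross_correlate2d is ours.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode, stride-1 2D cross-correlation (the equation above)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product of the kernel with one k x k input patch.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# The 5x5 input and 3x3 kernel from Figure 8.1.
I = np.array([[1,0,1,0,1],
              [0,1,0,1,0],
              [1,1,1,0,0],
              [0,0,1,1,0],
              [0,1,0,1,1]], dtype=float)
K = np.array([[1,0,1],
              [0,1,0],
              [1,0,1]], dtype=float)
print(cross_correlate2d(I, K))   # 3x3 output, one dot product per element
```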

8.2 Three Key Properties: Sparse Connectivity, Weight Sharing, Equivariance

These three properties explain why CNNs are dramatically more parameter-efficient and computationally efficient than fully-connected networks for spatial data.

Property 1 — Sparse Connectivity (Local Receptive Fields)

Each output neuron connects to only a $k \times k$ local patch of the input, not the entire input. For an input of size $H \times W$:

FC layer connections per output neuron: $H \times W \times C_{\text{in}}$

Conv layer connections per output neuron: $k^2 \times C_{\text{in}}$

Ratio: $\frac{k^2}{H \times W}$. For a 3×3 kernel on 224×224 input: $\frac{9}{50{,}176} = 0.018\%$

This alone reduces parameters by a factor of ~5,500 for this example.

Property 2 — Weight Sharing

The same kernel is applied at every spatial position. One filter has $k^2 \times C_{\text{in}}$ parameters regardless of input resolution.

FC layer producing a single output from a 224×224×3 input: $224 \times 224 \times 3 = 150{,}528$ parameters per output neuron

Conv 3×3 filter for same: $3 \times 3 \times 3 = 27$ parameters per output channel

Weight sharing also makes CNNs naturally resolution-independent — the same learned filters work on any input size.

Property 3 — Translation Equivariance

If the input shifts by $(dx, dy)$, the output shifts by the same amount (adjusted for stride). Formally:

$$f(\text{shift}(\bm{x})) = \text{shift}(f(\bm{x}))$$

This means the network doesn't need to learn separate detectors for a feature at every location — one kernel detects the feature everywhere. This property comes directly from weight sharing and has no additional computational cost.
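
Equivariance is easy to verify numerically. The sketch below (reusing cross_correlate2d from the §8.1 example) shifts the input down by one row and checks that the valid-mode output shifts identically away from the boundary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 8))
K = rng.normal(size=(3, 3))

# Shift the input down by one row (zeros enter at the top edge).
X_shift = np.zeros_like(X)
X_shift[1:] = X[:-1]

Y = cross_correlate2d(X, K)
Y_shift = cross_correlate2d(X_shift, K)
# Away from the boundary, the output is shifted by exactly the same amount.
assert np.allclose(Y_shift[1:], Y[:-1])
```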

CNN vs. FC — Parameter Efficiency Example

Processing a 224×224×3 image to produce 64 feature maps:

| Approach | Parameters | Ratio |
|---|---|---|
| Fully connected (flatten → 64) | $224 \times 224 \times 3 \times 64 = 9{,}633{,}792$ | 1× (baseline) |
| Conv 3×3 (3 → 64 channels) | $3 \times 3 \times 3 \times 64 + 64 = 1{,}792$ | 5,376× fewer |
| Conv 7×7 (3 → 64 channels) | $7 \times 7 \times 3 \times 64 + 64 = 9{,}472$ | 1,017× fewer |

8.3 Convolution as Matrix Multiplication (im2col)

Although convolution is defined as a sliding-window dot product, in practice it is rarely computed that way on GPUs. Instead, the input is typically rearranged into a matrix using the im2col (image-to-column) transformation, converting convolution into a single large matrix multiplication. This is the classic approach used by cuDNN, MKL-DNN, and the major deep learning frameworks.

The im2col Transformation

For input $\bm{I} \in \R^{C_{\text{in}} \times H_{\text{in}} \times W_{\text{in}}}$ and kernel size $k \times k$, stride $s$:

Step 1 — im2col: Extract every $k \times k \times C_{\text{in}}$ patch from $\bm{I}$ and flatten each into a column vector. This produces a matrix:

$$\bm{I}_{\text{col}} \in \R^{(C_{\text{in}} \cdot k^2) \times (H_{\text{out}} \cdot W_{\text{out}})}$$

Step 2 — Reshape kernel: Reshape $\bm{K} \in \R^{C_{\text{out}} \times C_{\text{in}} \times k \times k}$ into:

$$\bm{K}_{\text{row}} \in \R^{C_{\text{out}} \times (C_{\text{in}} \cdot k^2)}$$

Step 3 — Matrix multiply:

$$\bm{O}_{\text{col}} = \bm{K}_{\text{row}} \cdot \bm{I}_{\text{col}} \in \R^{C_{\text{out}} \times (H_{\text{out}} \cdot W_{\text{out}})}$$

Step 4 — Reshape output: Reshape $\bm{O}_{\text{col}}$ back to $\R^{C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}}$
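
The four steps fit in a few lines of NumPy. A sketch assuming square kernels, no padding, and our own illustrative names (im2col, conv2d_gemm); production kernels tile and fuse these steps rather than materializing the full matrix:

```python
import numpy as np

def im2col(X, k, s=1):
    """Unroll every k x k patch of X (C_in, H, W) into one column.
    The result has shape (C_in * k * k, H_out * W_out): this buffer is the
    k^2-fold memory overhead discussed below."""
    C, H, W = X.shape
    H_out, W_out = (H - k) // s + 1, (W - k) // s + 1
    cols = np.empty((C * k * k, H_out * W_out))
    col = 0
    for i in range(H_out):
        for j in range(W_out):
            cols[:, col] = X[:, i*s:i*s+k, j*s:j*s+k].ravel()
            col += 1
    return cols, H_out, W_out

def conv2d_gemm(X, K, s=1):
    C_out, C_in, k, _ = K.shape
    cols, H_out, W_out = im2col(X, k, s)        # Step 1
    K_row = K.reshape(C_out, C_in * k * k)      # Step 2
    O_col = K_row @ cols                        # Step 3: the GEMM
    return O_col.reshape(C_out, H_out, W_out)   # Step 4

# Check output channel 0 against the direct definition from the §8.1 sketch.
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 8, 8))
K = rng.normal(size=(4, 3, 3, 3))
O = conv2d_gemm(X, K)                           # shape (4, 6, 6)
ref = sum(cross_correlate2d(X[c], K[0, c]) for c in range(3))
assert np.allclose(O[0], ref)
```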

[Figure: the input $\bm{I}$ is unrolled by im2col into $\bm{I}_{\text{col}}$ (one patch per column); the kernel tensor is reshaped into $\bm{K}_{\text{row}}$ (one filter per row); their product $\bm{O}_{\text{col}}$ is reshaped into the output feature map.]
Figure 8.2. The im2col transformation converts convolution into a matrix multiplication. Each $k \times k \times C_{\text{in}}$ input patch is unrolled into a column of $\bm{I}_{\text{col}}$. The FLOP count is identical to direct convolution, but the operation can now leverage highly optimized GEMM routines. The trade-off is $O(k^2)$ memory overhead.
im2col Memory Overhead

Original input memory: $C_{\text{in}} \times H_{\text{in}} \times W_{\text{in}}$ elements

im2col matrix memory: $C_{\text{in}} \times k^2 \times H_{\text{out}} \times W_{\text{out}}$ elements

For stride-1 convolution ($H_{\text{out}} \approx H_{\text{in}}$), the overhead is a factor of approximately $k^2$. For a 3×3 kernel, im2col uses ~9× more memory than the original input. For 7×7: ~49×. This is the fundamental memory-for-speed trade-off in CNN computation.

8.4 Alternative Convolution Algorithms

| Algorithm | FLOPs | Extra Memory | Best For |
|---|---|---|---|
| Direct (nested loops) | $2 C_{\text{out}} C_{\text{in}} k^2 H_{\text{out}} W_{\text{out}}$ | None | Small layers, memory-limited settings |
| im2col + GEMM | Same as direct | $O(k^2 \times \text{input size})$ | Most cases (the standard approach) |
| FFT-based | $O(C_{\text{out}} C_{\text{in}} H W \log(HW))$ | $O(HW)$ per channel | Large kernels ($k \geq 7$) |
| Winograd | Up to 2.25× fewer multiplications for 3×3 | Moderate transform overhead | 3×3 kernels specifically |

The Winograd algorithm (Lavin & Gray, 2016) is particularly important for modern CNNs that use almost exclusively 3×3 kernels. For $F(2 \times 2, 3 \times 3)$ (computing a 2×2 output tile with a 3×3 kernel), it reduces the multiplications from 36 to 16 per tile — a 2.25× reduction. cuDNN automatically selects the best algorithm for each layer.
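
The 2D algorithm nests two 1D transforms, so the 1D case $F(2, 3)$ shows the whole trick: two outputs of a 3-tap filter computed with 4 multiplications instead of the direct 6. A sketch with the standard transform constants:

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 1D 3-tap cross-correlation over a
    length-4 input tile d, using 4 multiplications instead of 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```

The filter transform (the terms involving only $g$) is computed once per filter and amortized over the whole feature map; only the input and output transforms are paid per tile.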

References for §8

[1] Goodfellow et al. (2016). Deep Learning, Ch. 9: Convolutional Networks.

[2] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proc. IEEE, 86(11), 2278–2324.

[3] Chellapilla, K., Puri, S., & Simard, P. (2006). High Performance Convolutional Neural Networks for Document Processing. IWFHR 2006. — First use of im2col for CNNs.

[4] Mathieu, M., Henaff, M., & LeCun, Y. (2014). Fast Training of Convolutional Networks through FFTs. ICLR 2014.

[5] Lavin, A. & Gray, S. (2016). Fast Algorithms for Convolutional Neural Networks. CVPR 2016. — Winograd for CNNs.


§9 CNN Layer Types — Detailed Computation

9.1 Standard Convolutional Layer

Standard Conv2D — Complete Specification

Input: $\bm{X} \in \R^{B \times C_{\text{in}} \times H_{\text{in}} \times W_{\text{in}}}$

Kernel: $\bm{W} \in \R^{C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w}$

Bias: $\bm{b} \in \R^{C_{\text{out}}}$

Hyperparameters: stride $s$, padding $p$, dilation $d$

Output Dimensions
$$H_{\text{out}} = \left\lfloor\frac{H_{\text{in}} + 2p - d(k_h - 1) - 1}{s}\right\rfloor + 1$$ $$W_{\text{out}} = \left\lfloor\frac{W_{\text{in}} + 2p - d(k_w - 1) - 1}{s}\right\rfloor + 1$$

For the common case of $d=1$ and $k_h = k_w = k$:

$$H_{\text{out}} = \left\lfloor\frac{H_{\text{in}} + 2p - k}{s}\right\rfloor + 1$$
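
The formula is worth wrapping in a helper; integer division implements the floor (a trivial sketch, conv_out_size is our name):

```python
def conv_out_size(n_in, k, s=1, p=0, d=1):
    """Output spatial size along one dimension (the floor formula above)."""
    return (n_in + 2 * p - d * (k - 1) - 1) // s + 1

print(conv_out_size(56, k=3, s=1, p=1))   # 56  ("same" padding)
print(conv_out_size(227, k=11, s=4))      # 55  (AlexNet's first layer)
```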
Standard Conv2D — Complete Computational Cost

| Metric | Formula |
|---|---|
| Parameters | $C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w + C_{\text{out}}$ |
| Parameter memory (float32) | $\text{params} \times 4$ bytes |
| FLOPs per output element | $2 \times C_{\text{in}} \times k_h \times k_w$ (one MAC per kernel element) |
| Total output elements | $B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}$ |
| Total forward FLOPs | $\boxed{2 \times B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}} \times C_{\text{in}} \times k_h \times k_w}$ |
| Output activation memory | $B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}} \times 4$ bytes |
| Input cache for backprop | $B \times C_{\text{in}} \times H_{\text{in}} \times W_{\text{in}} \times 4$ bytes |
| Backward FLOPs | $\approx 2 \times$ forward FLOPs |
Worked Example — Conv2D Layer

A conv layer: 64 input channels, 128 output channels, 3×3 kernel, stride 1, padding 1, input 56×56, batch 32.

| Metric | Computation | Value |
|---|---|---|
| Output size | $\lfloor(56 + 2 - 3)/1\rfloor + 1$ | 56 × 56 |
| Parameters | $128 \times 64 \times 3 \times 3 + 128$ | 73,856 |
| Param memory | $73{,}856 \times 4$ bytes | 295 KB |
| FLOPs (forward) | $2 \times 32 \times 128 \times 56 \times 56 \times 64 \times 9$ | 14.8 GFLOPs |
| Output memory | $32 \times 128 \times 56 \times 56 \times 4$ bytes | 51.4 MB |
| Input cache | $32 \times 64 \times 56 \times 56 \times 4$ bytes | 25.7 MB |

Note: 73,856 parameters produce 14.8 billion FLOPs, a ratio of roughly 200,000 FLOPs per parameter (about 6,300 per sample; the batch of 32 accounts for the rest). This extreme ratio, versus ~2 FLOPs per parameter per sample for FC layers, is due to weight sharing: each kernel weight contributes to $H_{\text{out}} \times W_{\text{out}} = 3{,}136$ output positions.
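
The table's arithmetic is mechanical enough to script. A small sketch (conv2d_cost is our name) that reproduces the worked example:

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, batch=1):
    """Params and forward FLOPs for a standard Conv2D, following the
    formulas above (2 FLOPs per MAC, float32 activations)."""
    params = c_out * c_in * k * k + c_out
    flops = 2 * batch * c_out * h_out * w_out * c_in * k * k
    act_mb = batch * c_out * h_out * w_out * 4 / 1e6
    return params, flops, act_mb

params, flops, act_mb = conv2d_cost(64, 128, 3, 56, 56, batch=32)
print(params)         # 73856
print(flops / 1e9)    # ~14.8 GFLOPs
print(act_mb)         # ~51.4 MB of output activations
```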

9.2 Padding, Stride, and Dilation

Padding

| Type | Value of $p$ | Effect on Output Size | Use Case |
|---|---|---|---|
| Valid (no padding) | $p = 0$ | Output shrinks by $k-1$ in each dimension | When output size reduction is acceptable |
| Same | $p = \lfloor k/2 \rfloor$ (odd $k$) | $H_{\text{out}} = H_{\text{in}}$ (for $s=1$) | Most common; preserves spatial dimensions |
| Full | $p = k - 1$ | Output grows by $k-1$ in each dimension | Transposed convolution |

Zero-padding has essentially zero computational cost — it is implemented via memory indexing rather than actually writing zeros.

Stride

Stride $s > 1$ reduces the output spatial dimension by a factor of $s$ in each dimension, reducing the total FLOPs by a factor of $s^2$:

$$\frac{\text{FLOPs}(s>1)}{\text{FLOPs}(s=1)} \approx \frac{1}{s^2}$$

Stride-2 convolution is often used as an alternative to pooling for downsampling (Springenberg et al., 2015), with the advantage that it learns the downsampling rather than using a fixed rule.

Dilation (Atrous Convolution)

Dilation $d$ inserts $d-1$ zeros between kernel elements, creating an effective kernel size of $k_{\text{eff}} = k + (k-1)(d-1)$ without adding any parameters. For a 3×3 kernel with $d=2$, the effective receptive field is 5×5 using only 9 parameters (vs. 25 for an actual 5×5 kernel).

Dilation — Computational Impact

Parameters: unchanged — $C_{\text{out}} \times C_{\text{in}} \times k^2$

FLOPs per output element: unchanged — $2 \times C_{\text{in}} \times k^2$ (same number of multiplications)

Effective receptive field: $k_{\text{eff}} = k + (k-1)(d-1)$

Dilation is computationally "free" in terms of FLOPs — it expands the receptive field without increasing computation.
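
A two-line helper makes the growth explicit (sketch):

```python
def effective_kernel(k, d):
    """Effective kernel extent under dilation d; parameter count is unchanged."""
    return k + (k - 1) * (d - 1)

for d in (1, 2, 4):
    print(d, effective_kernel(3, d))   # 3 -> 3, 5, 9: RF grows, FLOPs do not
```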

9.3 Pooling Layers

Pooling — Zero Learnable Parameters

Pooling layers have no learnable parameters. They reduce spatial dimensions by applying a fixed function over local windows.

Max Pooling

Max Pool Computation
$$\text{MaxPool}(i, j) = \max_{m,n \in [0, k)} \bm{X}(i \cdot s + m, j \cdot s + n)$$
| Metric | Formula |
|---|---|
| Parameters | 0 |
| FLOPs (forward) | $(k^2 - 1)$ comparisons per output element $\times\, B \times C \times H_{\text{out}} \times W_{\text{out}}$ |
| Backward | Gradient routed only to the max element; the argmax index of each window must be stored |
| Index memory | $B \times C \times H_{\text{out}} \times W_{\text{out}}$ integers (typically 4 bytes each) |

For the standard 2×2 max pool with stride 2: 3 comparisons per output element. With $s = 2$, the output has $1/4$ the spatial elements, so pooling reduces subsequent layer FLOPs by 4×.
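
The forward/backward asymmetry (compute the max, cache the argmax, route the gradient) is easy to see in code. A vectorized NumPy sketch of the standard 2×2/stride-2 case, assuming even H and W (function names are ours):

```python
import numpy as np

def maxpool2x2_forward(X):
    """Max pool over (C, H, W) with 2x2 windows, stride 2.
    Returns the output plus the argmax indices cached for backprop."""
    C, H, W = X.shape
    patches = (X.reshape(C, H // 2, 2, W // 2, 2)
                .transpose(0, 1, 3, 2, 4)
                .reshape(C, H // 2, W // 2, 4))
    idx = patches.argmax(axis=-1)    # the cached index memory from the table
    out = np.take_along_axis(patches, idx[..., None], axis=-1)[..., 0]
    return out, idx

def maxpool2x2_backward(dOut, idx, in_shape):
    """Route each upstream gradient to the stored max position; every other
    input position receives zero gradient."""
    C, H, W = in_shape
    dPatches = np.zeros((C, H // 2, W // 2, 4))
    np.put_along_axis(dPatches, idx[..., None], dOut[..., None], axis=-1)
    return (dPatches.reshape(C, H // 2, W // 2, 2, 2)
                    .transpose(0, 1, 3, 2, 4)
                    .reshape(C, H, W))
```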

Average Pooling

Average Pool Computation
$$\text{AvgPool}(i, j) = \frac{1}{k^2}\sum_{m,n \in [0, k)} \bm{X}(i \cdot s + m, j \cdot s + n)$$

FLOPs: $(k^2 - 1)$ additions + 1 division per output element. Backward: gradient distributed equally ($1/k^2$) to all input elements in the window — no index storage needed.

Global Average Pooling (GAP)

GAP — Replaces Fully-Connected Layers

Averages each channel over the entire spatial extent:

$$\text{GAP}(c) = \frac{1}{H \times W}\sum_{i=0}^{H-1}\sum_{j=0}^{W-1}\bm{X}(c, i, j)$$

Output: $B \times C \times 1 \times 1$ — one value per channel per sample

FLOPs: $B \times C \times (HW - 1)$ additions ≈ $B \times C \times HW$

Key insight: GAP eliminates the need for large FC layers at the end of the network, saving enormous parameter counts. For example, replacing VGG-16's FC layers with GAP saves ~123M parameters.

Lin et al. (2014), "Network in Network"

9.4 1×1 Convolutions (Pointwise Convolutions)

1×1 Convolution

A convolution with $k = 1$. Kernel: $\bm{W} \in \R^{C_{\text{out}} \times C_{\text{in}} \times 1 \times 1}$. This applies a fully-connected layer independently at each spatial position — it mixes channels without any spatial interaction.

1×1 Conv — Computational Cost
| Metric | Formula |
|---|---|
| Parameters | $C_{\text{out}} \times C_{\text{in}} + C_{\text{out}}$ |
| FLOPs | $2 \times B \times C_{\text{out}} \times H \times W \times C_{\text{in}}$ |
| Output size | $B \times C_{\text{out}} \times H \times W$ (spatial dims unchanged) |

1×1 convolutions are used for channel dimension reduction before expensive 3×3 or 5×5 convolutions. This is the key idea behind the Inception module and ResNet bottleneck blocks.

1×1 Conv as Dimensionality Reduction

Reducing from 256 to 64 channels before a 3×3 conv, feature map 56×56:

Without 1×1: a single 3×3 conv from 256→256: $2 \times 256 \times 56 \times 56 \times 256 \times 9 = 3.70$G FLOPs

With 1×1 bottleneck:

1×1 conv 256→64: $2 \times 64 \times 56 \times 56 \times 256 = 102.8$M FLOPs

3×3 conv 64→64: $2 \times 64 \times 56 \times 56 \times 64 \times 9 = 231.2$M FLOPs

1×1 conv 64→256: $2 \times 256 \times 56 \times 56 \times 64 = 102.8$M FLOPs

Total with bottleneck: 436.8M FLOPs, an 8.5× reduction!
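
Using the conv2d_cost helper sketched in §9.1, the comparison takes four lines (per sample, batch of 1):

```python
# FLOPs of one plain 3x3 conv at 256 channels vs. the 1x1 -> 3x3 -> 1x1 stack.
plain = conv2d_cost(256, 256, 3, 56, 56)[1]
bottleneck = (conv2d_cost(256, 64, 1, 56, 56)[1]
              + conv2d_cost(64, 64, 3, 56, 56)[1]
              + conv2d_cost(64, 256, 1, 56, 56)[1])
print(plain / 1e9, bottleneck / 1e6)   # ~3.70 GFLOPs vs ~437 MFLOPs
print(plain / bottleneck)              # ~8.5x
```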

9.5 Depthwise Separable Convolutions

This is one of the most important efficiency innovations in CNN design. It factorizes a standard convolution into two steps: a depthwise convolution (spatial filtering per channel) and a pointwise convolution (channel mixing via 1×1 conv).

Depthwise Separable Convolution

Step 1 — Depthwise Conv: Apply one $k \times k$ filter per input channel independently. Each channel is convolved separately.

Kernel: $\bm{W}_{\text{dw}} \in \R^{C_{\text{in}} \times 1 \times k \times k}$ — one filter per channel, not $C_{\text{out}}$ filters across all channels.

Output: $B \times C_{\text{in}} \times H_{\text{out}} \times W_{\text{out}}$ (same number of channels as input)

Step 2 — Pointwise Conv (1×1): Mix channels to produce $C_{\text{out}}$ output channels.

Kernel: $\bm{W}_{\text{pw}} \in \R^{C_{\text{out}} \times C_{\text{in}} \times 1 \times 1}$

Depthwise Separable — Full Cost Analysis
| Metric | Standard Conv | Depthwise | Pointwise | Separable Total |
|---|---|---|---|---|
| Parameters | $C_o C_i k^2$ | $C_i k^2$ | $C_o C_i$ | $C_i k^2 + C_o C_i$ |
| FLOPs | $2 C_o H_o W_o C_i k^2$ | $2 C_i H_o W_o k^2$ | $2 C_o H_o W_o C_i$ | $2 H_o W_o C_i(k^2 + C_o)$ |

Reduction ratio (separable ÷ standard):

$$\frac{C_i(k^2 + C_o)}{C_o \cdot C_i \cdot k^2} = \frac{1}{C_o} + \frac{1}{k^2}$$

For $k = 3$, $C_o = 128$: $\frac{1}{128} + \frac{1}{9} \approx 0.119$ — an 8.4× reduction in both params and FLOPs.

For $k = 3$, $C_o = 256$: $\frac{1}{256} + \frac{1}{9} \approx 0.115$ — an 8.7× reduction.

Numerical Comparison — 128→128 channels, 3×3 kernel, 56×56 feature map

| Metric | Standard | Depthwise Separable | Ratio |
|---|---|---|---|
| Parameters | $128 \times 128 \times 9 = 147{,}456$ | $128 \times 9 + 128 \times 128 = 17{,}536$ | 8.4× fewer |
| FLOPs | $2 \times 128 \times 3136 \times 128 \times 9 \approx 925$M | $2 \times 3136 \times 128 \times (9 + 128) \approx 110$M | 8.4× fewer |
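
A sketch that reproduces both rows of the table (separable_cost is our name; bias terms omitted, as in the formulas above):

```python
def separable_cost(c_in, c_out, k, h_out, w_out):
    """Params/FLOPs for depthwise separable vs. a standard convolution."""
    std_p = c_out * c_in * k * k
    std_f = 2 * c_out * h_out * w_out * c_in * k * k
    sep_p = c_in * k * k + c_out * c_in            # depthwise + pointwise
    sep_f = 2 * h_out * w_out * c_in * (k * k + c_out)
    return (std_p, sep_p, std_p / sep_p), (std_f, sep_f, std_f / sep_f)

(p_std, p_sep, p_ratio), (f_std, f_sep, f_ratio) = separable_cost(128, 128, 3, 56, 56)
print(p_std, p_sep, round(p_ratio, 1))   # 147456 17536 8.4
print(f_std, f_sep, round(f_ratio, 1))   # ~925M  ~110M 8.4
```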

References for §9.5

[6] Howard, A. G. et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861.

[7] Chollet, F. (2017). Xception: Deep Learning with Depthwise Separable Convolutions. CVPR 2017.

9.6 Grouped Convolutions

Grouped Convolution

Divide $C_{\text{in}}$ into $G$ groups of $C_{\text{in}}/G$ channels each. Each group is convolved independently to produce $C_{\text{out}}/G$ output channels. Results are concatenated.

Parameters: $\frac{C_{\text{out}} \times C_{\text{in}}}{G} \times k^2$ — reduced by factor $G$

FLOPs: reduced by factor $G$

Special cases: $G = 1$ → standard conv; $G = C_{\text{in}}$ → depthwise conv

Originally used by Krizhevsky et al. (2012) with $G=2$ to split computation across two GPUs.

9.7 Transposed Convolution (Deconvolution)

Transposed Convolution

Used for upsampling (decoder networks, GANs, segmentation). Inserts $s-1$ zeros between input elements, then performs a standard convolution.

$$H_{\text{out}} = (H_{\text{in}} - 1) \times s - 2p + k + p_{\text{out}}$$

Parameters: same as standard conv of same kernel size: $C_{\text{out}} \times C_{\text{in}} \times k^2$

FLOPs: similar order to standard conv on the (larger) output size

Dumoulin & Visin (2016), "A guide to convolution arithmetic for deep learning"

9.8 CNN Backward Pass

Backpropagation Through a Conv Layer

Given upstream gradient $\pd{L}{\bm{O}} \in \R^{B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}}$:

(a) Gradient w.r.t. kernel weights:

$$\pd{L}{\bm{W}} = \text{conv2d}\!\left(\bm{X},\; \pd{L}{\bm{O}}\right)$$

Correlation of the input $\bm{X}$ with the upstream gradient. Uses the cached input from the forward pass.

FLOPs: $2 \times B \times C_{\text{out}} \times C_{\text{in}} \times k^2 \times H_{\text{out}} \times W_{\text{out}}$ — same as forward pass.

(b) Gradient w.r.t. biases:

$$\pd{L}{b_{c}} = \sum_{i,j} \pd{L}{O_{c,i,j}} \quad \text{(sum over spatial dims and batch)}$$

FLOPs: $B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}$ — negligible.

(c) Gradient w.r.t. input (for propagation to previous layer):

$$\pd{L}{\bm{X}} = \text{full\_conv2d}\!\left(\pd{L}{\bm{O}},\; \text{flip}(\bm{W})\right)$$

A "full" convolution (with padding $k-1$) of the upstream gradient with the rotated (180°) kernel. This is mathematically equivalent to the transposed convolution.

FLOPs: $2 \times B \times C_{\text{in}} \times H_{\text{in}} \times W_{\text{in}} \times C_{\text{out}} \times k^2$ — approximately same as forward.

Total backward ≈ 2× forward FLOPs, consistent with the general rule for neural networks.
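
For the single-channel, stride-1, no-padding case, gradients (a) and (c) fit in a few lines of NumPy (reusing cross_correlate2d from §8.1) and can be checked against finite differences; since the loss below is linear in the weights, the check is exact up to float rounding:

```python
import numpy as np

def conv_backward(X, W, dO):
    """Gradients for O = cross_correlate2d(X, W), single channel, stride 1."""
    k = W.shape[0]
    # (a) dL/dW: correlate the cached input with the upstream gradient.
    dW = cross_correlate2d(X, dO)
    # (c) dL/dX: "full" correlation of dO with the 180-degree-rotated kernel.
    dO_full = np.pad(dO, k - 1)      # zero-pad by k-1 on every side
    dX = cross_correlate2d(dO_full, W[::-1, ::-1])
    return dW, dX

rng = np.random.default_rng(2)
X, W = rng.normal(size=(6, 6)), rng.normal(size=(3, 3))
dO = np.ones((4, 4))                 # take L = sum(O), so dL/dO = 1 everywhere
dW, dX = conv_backward(X, W, dO)

eps, Wp = 1e-6, W.copy()
Wp[0, 0] += eps                      # perturb one weight
num = (cross_correlate2d(X, Wp).sum() - cross_correlate2d(X, W).sum()) / eps
assert np.isclose(dW[0, 0], num, rtol=1e-4)
```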

References for §9

[1] Goodfellow et al. (2016). Deep Learning, Ch. 9.1–9.5.

[8] Lin, M., Chen, Q., & Yan, S. (2014). Network in Network. ICLR 2014. — 1×1 convolutions and GAP.

[9] Springenberg, J. T. et al. (2015). Striving for Simplicity: The All Convolutional Net. ICLR Workshop 2015.

[10] Yu, F. & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. ICLR 2016.

[11] Dumoulin, V. & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285. — Essential reference for all convolution types.

[12] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12).


§10 Receptive Field Analysis

10.1 Theoretical Receptive Field

Receptive Field (RF)

The receptive field of a neuron in layer $l$ is the region of the original input that can influence that neuron's value. It grows with depth as each layer aggregates information from a local neighborhood.

Receptive Field Growth Formula

For a stack of convolutional layers $l = 1, 2, \ldots, L$, each with kernel size $k_l$ and stride $s_l$:

$$\text{RF}_l = \text{RF}_{l-1} + (k_l - 1) \times \prod_{i=1}^{l-1} s_i$$

with $\text{RF}_0 = 1$. The product term $j_l = \prod_{i=1}^{l-1} s_i$ is called the jump — it represents how many input pixels correspond to one step in layer $l$'s output.

For layers with identical $k$ and $s=1$:

$$\text{RF}_L = 1 + L(k - 1) = L(k-1) + 1$$

For all 3×3 layers with stride 1: $\text{RF}_L = 2L + 1$

Receptive Field Growth — 3×3 Stack
| Layers | RF | Parameters (per $C \to C$ stack) |
|---|---|---|
| 1 × (3×3) | 3 × 3 | $9C^2$ |
| 2 × (3×3) | 5 × 5 | $18C^2$ |
| 3 × (3×3) | 7 × 7 | $27C^2$ |
| 5 × (3×3) | 11 × 11 | $45C^2$ |

Compare: one 7×7 kernel has RF = 7×7 but uses $49C^2$ parameters, while three 3×3 kernels achieve the same RF with only $27C^2$ parameters — 45% fewer parameters, plus 3 non-linearities instead of 1. This is the core insight of VGGNet (Simonyan & Zisserman, 2015).
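
The recurrence is a five-line function. A sketch (receptive_field is our name) that reproduces the table's values from a list of (kernel, stride) pairs:

```python
def receptive_field(layers):
    """Theoretical RF via RF_l = RF_{l-1} + (k_l - 1) * jump_{l-1}."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 3))              # 7:  three 3x3, stride 1
print(receptive_field([(3, 1)] * 5))              # 11: five 3x3, stride 1
print(receptive_field([(3, 2), (3, 1), (3, 1)]))  # 11: stride 2 widens the jump
```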

10.2 Effective Receptive Field

The theoretical receptive field assumes that every pixel within the RF contributes equally. In practice, Luo et al. (2016) showed that the effective receptive field is much smaller — it follows a Gaussian distribution, with the center pixels contributing far more than the periphery. Typically, the effective RF is only about 30–50% of the theoretical RF in each dimension. This has important implications for architecture design: simply making the network deeper doesn't linearly increase the effective RF.

10.3 Small Kernels vs. Large Kernels — The VGGNet Insight

Quantitative Comparison for Same Receptive Field
| Configuration | RF | Params | FLOPs (per position) | Non-linearities |
|---|---|---|---|---|
| 1 × (5×5) | 5 | $25C^2$ | $50C^2$ | 1 |
| 2 × (3×3) | 5 | $18C^2$ | $36C^2$ | 2 |
| 1 × (7×7) | 7 | $49C^2$ | $98C^2$ | 1 |
| 3 × (3×3) | 7 | $27C^2$ | $54C^2$ | 3 |
| 1 × (11×11) | 11 | $121C^2$ | $242C^2$ | 1 |
| 5 × (3×3) | 11 | $45C^2$ | $90C^2$ | 5 |

In every case, the stack of 3×3 convolutions uses fewer parameters, fewer FLOPs, and adds more non-linearity. This is why virtually all modern CNN architectures use 3×3 kernels exclusively.

References for §10

[13] Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015. (VGGNet)

[14] Luo, W. et al. (2016). Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. NeurIPS 2016.


§11 Landmark CNN Architectures — Computational Analysis

We now trace the evolution of CNN architectures, analyzing each for parameter count, FLOPs, and the key computational innovations it introduced. All FLOP counts are for a single forward pass on one image at the standard input resolution. A caveat on units: published papers often count one multiply-accumulate (MAC) as a single "FLOP", so figures quoted from them (e.g., for GoogLeNet, ResNet, MobileNet, and EfficientNet) are roughly half of the 2-FLOPs-per-MAC counts used in our own derivations.

11.1 LeNet-5 (LeCun et al., 1998)

LeNet-5 Architecture

Input: 32×32×1 (grayscale)

Layers: Conv(6, 5×5, s=1) → AvgPool(2×2, s=2) → Conv(16, 5×5, s=1) → AvgPool(2×2, s=2) → FC(120) → FC(84) → FC(10)

| Layer | Output Size | Params | FLOPs |
|---|---|---|---|
| Conv1 (1→6, 5×5) | 28×28×6 | 156 | 235K |
| AvgPool (2×2) | 14×14×6 | 0 | 4.7K |
| Conv2 (6→16, 5×5) | 10×10×16 | 2,416 | 480K |
| AvgPool (2×2) | 5×5×16 | 0 | 1.6K |
| FC1 (400→120) | 120 | 48,120 | 96K |
| FC2 (120→84) | 84 | 10,164 | 20K |
| FC3 (84→10) | 10 | 850 | 1.7K |
| Total | | ~61K | ~840K |

Historical significance: First successful CNN application (handwritten digit recognition). Established the Conv→Pool→Conv→Pool→FC pattern.

11.2 AlexNet (Krizhevsky et al., 2012)

AlexNet Architecture

Input: 227×227×3

Key innovations: ReLU activation, dropout, GPU training, data augmentation

| Layer | Output Size | Params | FLOPs |
|---|---|---|---|
| Conv1 (3→96, 11×11, s=4) | 55×55×96 | 34,944 | 211M |
| MaxPool (3×3, s=2) | 27×27×96 | 0 | |
| Conv2 (96→256, 5×5, p=2) | 27×27×256 | 614,656 | 896M |
| MaxPool (3×3, s=2) | 13×13×256 | 0 | |
| Conv3 (256→384, 3×3, p=1) | 13×13×384 | 885,120 | 299M |
| Conv4 (384→384, 3×3, p=1) | 13×13×384 | 1,327,488 | 449M |
| Conv5 (384→256, 3×3, p=1) | 13×13×256 | 884,992 | 299M |
| MaxPool (3×3, s=2) | 6×6×256 | 0 | |
| FC6 (9216→4096) | 4096 | 37,752,832 | 75.5M |
| FC7 (4096→4096) | 4096 | 16,781,312 | 33.6M |
| FC8 (4096→1000) | 1000 | 4,097,000 | 8.2M |
| Total | | ~62.4M | ~2.27G |

The totals above are for the single-stream (ungrouped) variant tabulated here; the original two-GPU model used grouped convolutions ($G=2$) in Conv2/4/5, bringing the cost down to the commonly cited ≈724M MACs (≈1.45G FLOPs).

Key observation: ~94% of parameters are in FC layers (FC6 alone: 37.7M), but ~95% of FLOPs are in conv layers. This mismatch, with FC layers dominating parameters while conv layers dominate compute, is the central tension of classical CNN design.

11.3 VGGNet (Simonyan & Zisserman, 2015)

VGG-16 Architecture

Input: 224×224×3

Key innovation: Exclusively 3×3 convolutions, demonstrating that depth with small kernels outperforms shallow networks with large kernels.

| Block | Layers | Output Size | Params | FLOPs |
|---|---|---|---|---|
| Block 1 | 2 × Conv(64, 3×3) + MaxPool | 112×112×64 | 38.7K | 3.9G |
| Block 2 | 2 × Conv(128, 3×3) + MaxPool | 56×56×128 | 221.4K | 5.5G |
| Block 3 | 3 × Conv(256, 3×3) + MaxPool | 28×28×256 | 1.48M | 9.2G |
| Block 4 | 3 × Conv(512, 3×3) + MaxPool | 14×14×512 | 5.90M | 9.2G |
| Block 5 | 3 × Conv(512, 3×3) + MaxPool | 7×7×512 | 7.08M | 2.8G |
| FC layers | FC(25088→4096→4096→1000) | 1000 | 123.6M | 0.25G |
| Total | | | ~138.4M | ~30.9G (≈15.5G MACs) |

Key observation: FC layers account for 89% of parameters but under 1% of FLOPs. VGG showed that going deeper works, but at enormous computational cost.

11.4 GoogLeNet / Inception (Szegedy et al., 2015)

Inception Module — The Key Innovation

The Inception module runs multiple convolution branches in parallel (1×1, 3×3, 5×5, and pooling) and concatenates results. Critically, 1×1 convolutions are used before the 3×3 and 5×5 convolutions to reduce channel dimensionality.

Without 1×1 reduction (naïve): Very high FLOPs from 5×5 convolutions on many channels.

With 1×1 reduction: Reduces channels before expensive operations, dramatically cutting FLOPs.

| Metric | GoogLeNet | vs. VGG-16 |
|---|---|---|
| Parameters | ~6.8M | 20× fewer |
| FLOPs | ~1.5G multiply-adds | ~10× fewer (counted like-for-like) |
| Top-5 error (ILSVRC 2014) | 6.67% | Similar accuracy with 10–20× less compute |

This was the first architecture to demonstrate that clever design can dramatically reduce compute while maintaining accuracy.

11.5 ResNet (He et al., 2016)

The Residual Connection

Instead of learning a mapping $\bm{y} = H(\bm{x})$ directly, learn the residual $F(\bm{x}) = H(\bm{x}) - \bm{x}$ and compute $\bm{y} = F(\bm{x}) + \bm{x}$. The skip connection $+\bm{x}$ costs:

Extra parameters: 0 (just an addition)

Extra FLOPs: 1 addition per element — negligible

Extra memory: 0 (input is already stored)

The skip connection provides a direct gradient path that bypasses the nonlinear layers, enabling training of networks with 100+ layers where previously ~20 layers was the practical limit.

ResNet Bottleneck Block — Computational Analysis

The bottleneck block uses 1×1 → 3×3 → 1×1 convolutions with channel reduction:

Input: $H \times W \times 256$

1×1 conv: 256 → 64 channels (reduce)

3×3 conv: 64 → 64 channels (spatial processing at reduced dims)

1×1 conv: 64 → 256 channels (expand back)

Skip connection: add original input

| Sub-layer | Params | FLOPs (at 56×56) |
|---|---|---|
| 1×1 (256→64) | 16,448 | 103M |
| 3×3 (64→64) | 36,928 | 231M |
| 1×1 (64→256) | 16,640 | 104M |
| Skip addition | 0 | 0.8M |
| Bottleneck total | 70,016 | 439M |
| Equivalent plain (two 3×3, 256→256) | 1,180,160 | 7.4G |

The bottleneck block uses 16.9× fewer parameters and 16.9× fewer FLOPs than the equivalent non-bottleneck design, while achieving better accuracy due to greater depth.

ResNet Family — Summary
| Model | Layers | Params | FLOPs | Top-1 Acc |
|---|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 1.8G | 69.8% |
| ResNet-34 | 34 | 21.8M | 3.7G | 73.3% |
| ResNet-50 | 50 | 25.6M | 3.8G | 76.1% |
| ResNet-101 | 101 | 44.5M | 7.6G | 77.4% |
| ResNet-152 | 152 | 60.2M | 11.5G | 78.3% |

Note: ResNet-50 (bottleneck) has fewer FLOPs than ResNet-34 (plain) despite being deeper, because bottleneck blocks are more FLOP-efficient.

11.6 Efficient Architectures: MobileNet, ShuffleNet, EfficientNet

MobileNet v1 (Howard et al., 2017)

MobileNet v1

Replaces all standard convolutions with depthwise separable convolutions. Introduces two scaling hyperparameters:

Width multiplier $\alpha$: scales channel count by $\alpha$ at every layer

Resolution multiplier $\rho$: scales input resolution by $\rho$

FLOP scaling: FLOPs $\propto \alpha^2 \rho^2$ — quadratic in both!

| Config | Params | FLOPs | Top-1 |
|---|---|---|---|
| MobileNet 1.0 (224) | 4.2M | 569M | 70.6% |
| MobileNet 0.75 (224) | 2.6M | 325M | 68.4% |
| MobileNet 0.5 (160) | 1.3M | 76M | 60.2% |

MobileNet v2 (Sandler et al., 2018)

MobileNet v2 — Inverted Residual Block

Key innovation: inverted residuals with linear bottlenecks.

Standard bottleneck: wide → narrow → wide (reduce, process, expand)

Inverted: narrow → wide → narrow (expand, depthwise, project)

The expansion factor $t$ (typically 6) expands channels before the depthwise conv, then projects back down. The skip connection connects the narrow representations.

| Sub-layer | Operation |
|---|---|
| 1×1 expand | $C \to tC$ (pointwise, e.g., 24→144) |
| 3×3 depthwise | $tC \to tC$ (spatial filtering, no channel mixing) |
| 1×1 project | $tC \to C'$ (pointwise, linear: no activation!) |
| Skip | Add input (when $C = C'$ and stride = 1) |

MobileNetV2 (1.0, 224): 3.4M params, 300M FLOPs, 72.0% top-1

EfficientNet (Tan & Le, 2019)

EfficientNet — Compound Scaling

Key insight: depth, width, and resolution should be scaled together, not independently. The compound scaling formula:

$$\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi$$ $$\text{subject to: } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$$

The constraint ensures that when $\phi$ increases by 1, total FLOPs roughly double. The base network (EfficientNet-B0) was found via neural architecture search (NAS).
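
In code the scheme is tiny. A sketch using the coefficients reported in the paper ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$, found by a small grid search on B0):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15
print(alpha * beta**2 * gamma**2)       # ~1.92, i.e. approximately 2

def compound_scale(phi):
    """Depth, width, and resolution multipliers for a given phi."""
    return alpha**phi, beta**phi, gamma**phi

d, w, r = compound_scale(3)             # roughly the B0 -> B3 jump
print(d * w**2 * r**2, 2**3)            # FLOPs multiplier ~7.1 vs target 8
```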

| Model | Params | FLOPs | Top-1 | vs. ResNet-50 (76.1%, 3.8G) |
|---|---|---|---|---|
| EfficientNet-B0 | 5.3M | 390M | 77.1% | +1.0% acc, ~10× fewer FLOPs |
| EfficientNet-B3 | 12M | 1.8G | 81.6% | +5.5% acc, ~2× fewer FLOPs |
| EfficientNet-B7 | 66M | 37G | 84.3% | +8.2% acc, ~10× the FLOPs |

References for §11

[2] LeCun et al. (1998). Gradient-Based Learning Applied to Document Recognition.

[15] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012. (AlexNet)

[13] Simonyan & Zisserman (2015). Very Deep Convolutional Networks. ICLR 2015. (VGGNet)

[16] Szegedy, C. et al. (2015). Going Deeper with Convolutions. CVPR 2015. (GoogLeNet/Inception)

[17] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. (ResNet)

[6] Howard et al. (2017). MobileNets.

[18] Sandler, M. et al. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. CVPR 2018.

[19] Zhang, X. et al. (2018). ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CVPR 2018.

[20] Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019.

[21] Canziani, A., Paszke, A., & Culurciello, E. (2017). An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678. — Comprehensive comparison.

11.7 Grand Comparison Table

(FLOP figures are quoted as commonly reported for each model; many papers count one multiply-add as a single FLOP, so entries are only roughly comparable across rows.)

| Architecture | Year | Params | FLOPs | Top-1 (%) | FLOPs/Param | Key Innovation |
|---|---|---|---|---|---|---|
| LeNet-5 | 1998 | 61K | 0.8M | n/a | ~14 | First CNN |
| AlexNet | 2012 | 62.4M | 2.27G | 63.3 | 36.4 | ReLU, GPU, dropout |
| VGG-16 | 2014 | 138.4M | 15.5G | 74.4 | 112 | 3×3 only, depth |
| GoogLeNet | 2014 | 6.8M | 1.5G | 74.8 | 221 | Inception, 1×1 reduction |
| ResNet-50 | 2015 | 25.6M | 3.8G | 76.1 | 148 | Skip connections, bottleneck |
| MobileNetV2 | 2018 | 3.4M | 300M | 72.0 | 88 | Inv. residual, depthwise sep. |
| EfficientNet-B0 | 2019 | 5.3M | 390M | 77.1 | 74 | Compound scaling, NAS |
| EfficientNet-B7 | 2019 | 66M | 37G | 84.3 | 561 | Compound scaling (max) |
Key Trend — Efficiency Frontier

The evolution of CNN architectures is fundamentally a story of extracting more accuracy from fewer FLOPs. From AlexNet to EfficientNet-B0, the community achieved +14% accuracy improvement while using 6× fewer FLOPs. The main techniques driving this:

(1) 1×1 convolutions for channel reduction (Inception, 2014)

(2) Residual connections for depth without degradation (ResNet, 2015)

(3) Depthwise separable convolutions for 8–9× FLOP reduction per layer (MobileNet, 2017)

(4) Compound scaling for optimal width/depth/resolution balance (EfficientNet, 2019)


§ Computational Summary — All CNN Layer Types

| Layer Type | Parameters | FLOPs (forward, per sample) | Output Size |
|---|---|---|---|
| Conv2D | $C_o C_i k^2 + C_o$ | $2 C_o H_o W_o C_i k^2$ | $C_o \times H_o \times W_o$ |
| Depthwise Conv | $C_i k^2 + C_i$ | $2 C_i H_o W_o k^2$ | $C_i \times H_o \times W_o$ |
| Pointwise (1×1) | $C_o C_i + C_o$ | $2 C_o H W C_i$ | $C_o \times H \times W$ |
| DW Separable | $C_i(k^2 + C_o)$ | $2 H_o W_o C_i(k^2 + C_o)$ | $C_o \times H_o \times W_o$ |
| Grouped Conv ($G$ groups) | $C_o C_i k^2 / G$ | $2 C_o H_o W_o C_i k^2 / G$ | $C_o \times H_o \times W_o$ |
| Max Pool | 0 | $(k^2-1)\, C H_o W_o$ | $C \times H_o \times W_o$ |
| Avg Pool | 0 | $k^2\, C H_o W_o$ | $C \times H_o \times W_o$ |
| Global Avg Pool | 0 | $C \cdot H \cdot W$ | $C \times 1 \times 1$ |
| BatchNorm | $2C$ | $\sim 5C \cdot H \cdot W$ | Same as input |
| Skip Connection | 0 | $C \cdot H \cdot W$ (element-wise add) | Same as input |

Memory Formulas

| Component | Size (float32 bytes) |
|---|---|
| Parameters | $4P$ |
| Gradients | $4P$ |
| Optimizer state (Adam) | $8P$ ($m$ + $v$) |
| Activations (per layer, training) | $4 \times B \times C_o \times H_o \times W_o$ |
| im2col buffer | $4 \times B \times C_i k^2 \times H_o \times W_o$ |
| Max pool indices | $4 \times B \times C \times H_o \times W_o$ |

§ Complete References

Textbooks

[1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Ch. 9: Convolutional Networks. MIT Press.

Foundational CNN Papers

[2] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proc. IEEE, 86(11), 2278–2324.

[15] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012.

[13] Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.

[16] Szegedy, C. et al. (2015). Going Deeper with Convolutions. CVPR 2015.

[17] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016.

Efficient Architectures

[6] Howard, A. G. et al. (2017). MobileNets: Efficient CNNs for Mobile Vision Applications. arXiv:1704.04861.

[7] Chollet, F. (2017). Xception: Deep Learning with Depthwise Separable Convolutions. CVPR 2017.

[18] Sandler, M. et al. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. CVPR 2018.

[19] Zhang, X. et al. (2018). ShuffleNet. CVPR 2018.

[20] Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling. ICML 2019.

Convolution Algorithms & Efficiency

[3] Chellapilla, K. et al. (2006). High Performance CNNs for Document Processing. IWFHR 2006.

[4] Mathieu, M. et al. (2014). Fast Training of CNNs through FFTs. ICLR 2014.

[5] Lavin, A. & Gray, S. (2016). Fast Algorithms for Convolutional Neural Networks. CVPR 2016.

[12] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12), 2295–2329.

[21] Canziani, A. et al. (2017). An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678.

Other Referenced Works

[8] Lin, M., Chen, Q., & Yan, S. (2014). Network in Network. ICLR 2014.

[9] Springenberg, J. T. et al. (2015). Striving for Simplicity: The All Convolutional Net. ICLR Workshop 2015.

[10] Yu, F. & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. ICLR 2016.

[11] Dumoulin, V. & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285.

[14] Luo, W. et al. (2016). Understanding the Effective Receptive Field in Deep CNNs. NeurIPS 2016.
