The convolution operation is the defining computational primitive of CNNs. It replaces the general matrix multiplication of fully-connected layers with a structured, sparse computation that exploits the spatial structure of visual data. Understanding exactly how convolution works — and how it maps to hardware — is essential for computational efficiency analysis.
Given an input feature map $\bm{I} \in \R^{H_{\text{in}} \times W_{\text{in}}}$ and a kernel $\bm{K} \in \R^{k_h \times k_w}$:
$$(\bm{I} * \bm{K})(i, j) = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} \bm{I}(i + m,\; j + n) \cdot \bm{K}(m, n)$$

Technically this is cross-correlation, not convolution (true convolution flips the kernel: $\bm{K}(k_h - 1 - m,\; k_w - 1 - n)$). Deep learning uses cross-correlation but universally calls it "convolution." Since kernels are learned, the flip is irrelevant — the network simply learns the flipped version.
In practice, inputs have $C_{\text{in}}$ channels and we produce $C_{\text{out}}$ output channels. The full convolution for one output pixel at position $(i,j)$ of output channel $c_{\text{out}}$:
$$\bm{O}(c_{\text{out}}, i, j) = \sum_{c=0}^{C_{\text{in}}-1} \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} \bm{I}(c, i \cdot s + m, j \cdot s + n) \cdot \bm{K}(c_{\text{out}}, c, m, n) + b_{c_{\text{out}}}$$

where $s$ is the stride and $\bm{K} \in \R^{C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w}$ is the 4D kernel tensor.
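As a concrete reference, here is a minimal NumPy sketch of this definition (the function name `conv2d_direct` is illustrative; padding is omitted for brevity):

```python
import numpy as np

def conv2d_direct(x, w, b, stride=1):
    """Direct sliding-window convolution per the formula above.

    x: input,  shape (C_in, H_in, W_in)  -- single sample, no padding
    w: kernel, shape (C_out, C_in, k_h, k_w)
    b: bias,   shape (C_out,)
    """
    C_out, C_in, k_h, k_w = w.shape
    _, H_in, W_in = x.shape
    H_out = (H_in - k_h) // stride + 1
    W_out = (W_in - k_w) // stride + 1
    out = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):                  # each output channel
        for i in range(H_out):               # each output row
            for j in range(W_out):           # each output column
                patch = x[:, i*stride:i*stride+k_h, j*stride:j*stride+k_w]
                out[co, i, j] = np.sum(patch * w[co]) + b[co]
    return out
```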
Three properties (sparse connectivity, weight sharing, and translation equivariance) explain why CNNs are dramatically more parameter-efficient and computationally efficient than fully-connected networks for spatial data.
Sparse connectivity: each output neuron connects to only a $k \times k$ local patch of the input, not the entire input. For an input of size $H \times W$:
FC layer connections per output neuron: $H \times W \times C_{\text{in}}$
Conv layer connections per output neuron: $k^2 \times C_{\text{in}}$
Ratio: $\frac{k^2}{H \times W}$. For a 3×3 kernel on 224×224 input: $\frac{9}{50{,}176} = 0.018\%$
This alone reduces parameters by a factor of ~5,500 for this example.
Weight sharing: the same kernel is applied at every spatial position. One filter has $k^2 \times C_{\text{in}}$ parameters regardless of input resolution.
FC layer to produce one output channel from 224×224×3 input: $224 \times 224 \times 3 = 150{,}528$ parameters per output neuron
Conv 3×3 filter for same: $3 \times 3 \times 3 = 27$ parameters per output channel
Weight sharing also makes CNNs naturally resolution-independent — the same learned filters work on any input size.
Translation equivariance: if the input shifts by $(dx, dy)$, the output shifts by the same amount (adjusted for stride). Formally:
$$f(\text{shift}(\bm{x})) = \text{shift}(f(\bm{x}))$$

This means the network doesn't need to learn separate detectors for a feature at every location — one kernel detects the feature everywhere. This property comes directly from weight sharing and has no additional computational cost.
Processing a 224×224×3 image to produce 64 feature maps:
| Approach | Parameters | Ratio |
|---|---|---|
| Fully connected (flatten → 64) | $224 \times 224 \times 3 \times 64 = 9{,}633{,}792$ | 1× |
| Conv 3×3 (3 → 64 channels) | $3 \times 3 \times 3 \times 64 + 64 = 1{,}792$ | 5,376× fewer |
| Conv 7×7 (3 → 64 channels) | $7 \times 7 \times 3 \times 64 + 64 = 9{,}472$ | 1,017× fewer |
Although convolution is defined as a sliding-window dot product, in practice it is rarely computed that way on GPUs. Instead, the input is rearranged into a matrix using the im2col (image-to-column) transformation, converting convolution into a single large matrix multiplication. This is the standard approach in cuDNN, MKL-DNN, and the major deep learning frameworks.
For input $\bm{I} \in \R^{C_{\text{in}} \times H_{\text{in}} \times W_{\text{in}}}$ and kernel size $k \times k$, stride $s$:
Step 1 — im2col: Extract every $k \times k \times C_{\text{in}}$ patch from $\bm{I}$ and flatten each into a column vector. This produces a matrix:
$$\bm{I}_{\text{col}} \in \R^{(C_{\text{in}} \cdot k^2) \times (H_{\text{out}} \cdot W_{\text{out}})}$$

Step 2 — Reshape kernel: Reshape $\bm{K} \in \R^{C_{\text{out}} \times C_{\text{in}} \times k \times k}$ into:

$$\bm{K}_{\text{row}} \in \R^{C_{\text{out}} \times (C_{\text{in}} \cdot k^2)}$$

Step 3 — Matrix multiply:

$$\bm{O}_{\text{col}} = \bm{K}_{\text{row}} \cdot \bm{I}_{\text{col}} \in \R^{C_{\text{out}} \times (H_{\text{out}} \cdot W_{\text{out}})}$$

Step 4 — Reshape output: Reshape $\bm{O}_{\text{col}}$ back to $\R^{C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}}$.
Original input memory: $C_{\text{in}} \times H_{\text{in}} \times W_{\text{in}}$ elements
im2col matrix memory: $C_{\text{in}} \times k^2 \times H_{\text{out}} \times W_{\text{out}}$ elements
For stride-1 convolution ($H_{\text{out}} \approx H_{\text{in}}$), the overhead is a factor of approximately $k^2$. For a 3×3 kernel, im2col uses ~9× more memory than the original input. For 7×7: ~49×. This is the fundamental memory-for-speed trade-off in CNN computation.
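A minimal NumPy sketch of Steps 1–4, assuming no padding and a square kernel (`conv2d_im2col` is an illustrative name); it produces the same result as the direct loop above, minus the bias:

```python
import numpy as np

def conv2d_im2col(x, w, stride=1):
    """Convolution via im2col + GEMM.

    x: (C_in, H_in, W_in), w: (C_out, C_in, k, k), no padding.
    """
    C_out, C_in, k, _ = w.shape
    _, H_in, W_in = x.shape
    H_out = (H_in - k) // stride + 1
    W_out = (W_in - k) // stride + 1

    # Step 1 -- im2col: one flattened patch per output position.
    cols = np.empty((C_in * k * k, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
            cols[:, i * W_out + j] = patch.ravel()

    # Step 2 -- reshape kernel to (C_out, C_in * k^2).
    w_row = w.reshape(C_out, -1)

    # Step 3 -- one large matrix multiply (the GEMM).
    out = w_row @ cols                    # (C_out, H_out * W_out)

    # Step 4 -- reshape back into a feature map.
    return out.reshape(C_out, H_out, W_out)
```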
| Algorithm | FLOPs | Extra Memory | Best For |
|---|---|---|---|
| Direct (nested loops) | $2 C_{\text{out}} C_{\text{in}} k^2 H_{\text{out}} W_{\text{out}}$ | None | Small layers, memory-limited |
| im2col + GEMM | Same | $O(k^2 \times \text{input})$ | Most cases (standard approach) |
| FFT-based | $O(C_{\text{out}} C_{\text{in}} H W \log(HW))$ | $O(HW)$ per channel | Large kernels ($k \geq 7$) |
| Winograd | Reduces mults by up to 2.25× for 3×3 | Moderate transform overhead | 3×3 kernels specifically |
The Winograd algorithm (Lavin & Gray, 2016) is particularly important for modern CNNs that use almost exclusively 3×3 kernels. For $F(2 \times 2, 3 \times 3)$ (computing a 2×2 output tile with a 3×3 kernel), it reduces the multiplications from 36 to 16 per tile — a 2.25× reduction. cuDNN automatically selects the best algorithm for each layer.
[1] Goodfellow et al. (2016). Deep Learning, Ch. 9: Convolutional Networks.
[2] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proc. IEEE, 86(11), 2278–2324.
[3] Chellapilla, K., Puri, S., & Simard, P. (2006). High Performance Convolutional Neural Networks for Document Processing. IWFHR 2006. — First use of im2col for CNNs.
[4] Mathieu, M., Henaff, M., & LeCun, Y. (2014). Fast Training of Convolutional Networks through FFTs. ICLR 2014.
[5] Lavin, A. & Gray, S. (2016). Fast Algorithms for Convolutional Neural Networks. CVPR 2016. — Winograd for CNNs.
Input: $\bm{X} \in \R^{B \times C_{\text{in}} \times H_{\text{in}} \times W_{\text{in}}}$
Kernel: $\bm{W} \in \R^{C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w}$
Bias: $\bm{b} \in \R^{C_{\text{out}}}$
Hyperparameters: stride $s$, padding $p$, dilation $d$
For the common case of $d=1$ and $k_h = k_w = k$:
$$H_{\text{out}} = \left\lfloor\frac{H_{\text{in}} + 2p - k}{s}\right\rfloor + 1$$

| Metric | Formula |
|---|---|
| Parameters | $C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w + C_{\text{out}}$ |
| Parameter memory (float32) | $\text{params} \times 4$ bytes |
| FLOPs per output element | $2 \times C_{\text{in}} \times k_h \times k_w$ (one MAC per kernel element) |
| Total output elements | $B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}$ |
| Total forward FLOPs | $\boxed{2 \times B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}} \times C_{\text{in}} \times k_h \times k_w}$ |
| Output activation memory | $B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}} \times 4$ bytes |
| Input cache for backprop | $B \times C_{\text{in}} \times H_{\text{in}} \times W_{\text{in}} \times 4$ bytes |
| Backward FLOPs | $\approx 2 \times$ forward FLOPs |
Consider a conv layer with 64 input channels, 128 output channels, a 3×3 kernel, stride 1, padding 1, a 56×56 input, and batch size 32.
| Metric | Computation | Value |
|---|---|---|
| Output size | $\lfloor(56 + 2 - 3)/1\rfloor + 1$ | 56 × 56 |
| Parameters | $128 \times 64 \times 3 \times 3 + 128$ | 73,856 |
| Param memory | $73{,}856 \times 4$ | 295 KB |
| FLOPs (forward) | $2 \times 32 \times 128 \times 56 \times 56 \times 64 \times 9$ | 14.8 GFLOPs |
| Output memory | $32 \times 128 \times 56 \times 56 \times 4$ | 51.4 MB |
| Input cache | $32 \times 64 \times 56 \times 56 \times 4$ | 25.7 MB |
Note: 73K parameters produce 14.8 billion FLOPs at batch 32, roughly 200,000 FLOPs per parameter (about 6,300 per parameter per sample, vs. ~2 per sample for an FC layer). This extreme ratio is due to weight sharing: each kernel weight contributes to $H_{\text{out}} \times W_{\text{out}} = 3{,}136$ output positions.
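The worked example can be reproduced with a small bookkeeping helper (a sketch; `conv2d_stats` is a hypothetical name, built directly from the formulas in the table above):

```python
def conv2d_stats(B, C_in, C_out, H_in, W_in, k, stride=1, pad=0, dilation=1):
    """Parameter / FLOP / memory bookkeeping for a Conv2D layer (float32)."""
    k_eff = k + (k - 1) * (dilation - 1)                   # effective kernel size
    H_out = (H_in + 2 * pad - k_eff) // stride + 1
    W_out = (W_in + 2 * pad - k_eff) // stride + 1
    params = C_out * C_in * k * k + C_out                  # weights + biases
    flops  = 2 * B * C_out * H_out * W_out * C_in * k * k  # 2 FLOPs per MAC
    return {
        "output":      (B, C_out, H_out, W_out),
        "params":      params,
        "param_MB":    params * 4 / 1e6,
        "fwd_GFLOPs":  flops / 1e9,
        "out_act_MB":  B * C_out * H_out * W_out * 4 / 1e6,
        "in_cache_MB": B * C_in * H_in * W_in * 4 / 1e6,
    }

# Reproduces the worked example above:
print(conv2d_stats(B=32, C_in=64, C_out=128, H_in=56, W_in=56, k=3, pad=1))
# -> params=73,856, fwd_GFLOPs~14.8, out_act_MB~51.4, in_cache_MB~25.7
```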
| Type | Value of $p$ | Effect on Output Size | Use Case |
|---|---|---|---|
| Valid (no padding) | $p = 0$ | Output shrinks by $k-1$ in each dimension | When output size reduction is acceptable |
| Same padding | $p = \lfloor k/2 \rfloor$ (odd $k$) | $H_{\text{out}} = H_{\text{in}}$ (for $s=1$) | Most common — preserves spatial dimensions |
| Full padding | $p = k - 1$ | Output grows | Transposed convolution |
Zero-padding has negligible computational cost: implementations typically handle it with index arithmetic (implicit padding) or a single buffer copy, neither of which adds meaningful FLOPs.
Stride $s > 1$ reduces the output spatial dimension by a factor of $s$ in each dimension, reducing the total FLOPs by a factor of $s^2$:
$$\frac{\text{FLOPs}(s>1)}{\text{FLOPs}(s=1)} \approx \frac{1}{s^2}$$

Stride-2 convolution is often used as an alternative to pooling for downsampling (Springenberg et al., 2015), with the advantage that it learns the downsampling rather than using a fixed rule.
Dilation $d$ inserts $d-1$ zeros between kernel elements, creating an effective kernel size of $k_{\text{eff}} = k + (k-1)(d-1)$ without adding any parameters. For a 3×3 kernel with $d=2$, the effective receptive field is 5×5 using only 9 parameters (vs. 25 for an actual 5×5 kernel).
Parameters: unchanged — $C_{\text{out}} \times C_{\text{in}} \times k^2$
FLOPs per output element: unchanged — $2 \times C_{\text{in}} \times k^2$ (same number of multiplications)
Effective receptive field: $k_{\text{eff}} = k + (k-1)(d-1)$
Dilation is computationally "free" in terms of FLOPs — it expands the receptive field without increasing computation.
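Using the `conv2d_stats` helper sketched earlier, one can confirm that dilation changes the geometry but not the cost (setting $p = d$ keeps a 56×56 output for a 3×3 kernel):

```python
# Same layer, increasing dilation: params and FLOPs are unchanged.
for d in (1, 2, 4):
    s = conv2d_stats(B=1, C_in=64, C_out=64, H_in=56, W_in=56,
                     k=3, pad=d, dilation=d)
    print(d, s["output"], round(s["fwd_GFLOPs"], 3))
# d = 1, 2, 4 all print (1, 64, 56, 56) and the same GFLOPs
```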
Pooling layers have no learnable parameters. They reduce spatial dimensions by applying a fixed function (max or average) over local windows. For max pooling:
| Metric | Formula |
|---|---|
| Parameters | 0 |
| FLOPs (forward) | $(k^2 - 1)$ comparisons per output element × $B \times C \times H_{\text{out}} \times W_{\text{out}}$ |
| Backward | Gradient routed only to the max element; need to store index of max for each window |
| Index memory | $B \times C \times H_{\text{out}} \times W_{\text{out}}$ integers (typically 4 bytes each) |
For the standard 2×2 max pool with stride 2: 3 comparisons per output element. With $s = 2$, the output has $1/4$ the spatial elements, so pooling reduces subsequent layer FLOPs by 4×.
For average pooling, FLOPs: $(k^2 - 1)$ additions + 1 division per output element. Backward: gradient distributed equally ($1/k^2$) to all input elements in the window, so no index storage is needed.
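Both backward behaviors are easy to observe with PyTorch autograd; a small sketch:

```python
import torch
import torch.nn.functional as F

# Max pool: gradient is routed only to each window's argmax.
x = torch.arange(16.).reshape(1, 1, 4, 4).requires_grad_()
F.max_pool2d(x, kernel_size=2, stride=2).sum().backward()
print(x.grad.reshape(4, 4))   # 1 at each window's max element, 0 elsewhere

# Average pool: gradient is spread evenly, 1/k^2 per input element.
y = torch.ones(1, 1, 4, 4).requires_grad_()
F.avg_pool2d(y, kernel_size=2, stride=2).sum().backward()
print(y.grad.reshape(4, 4))   # 0.25 everywhere; no indices needed
```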
Global average pooling (GAP) averages each channel over the entire spatial extent:
$$\text{GAP}(c) = \frac{1}{H \times W}\sum_{i=0}^{H-1}\sum_{j=0}^{W-1}\bm{X}(c, i, j)$$

Output: $B \times C \times 1 \times 1$ — one value per channel per sample
FLOPs: $B \times C \times (HW - 1)$ additions ≈ $B \times C \times HW$
Key insight: GAP eliminates the need for large FC layers at the end of the network, saving enormous parameter counts. For example, replacing VGG-16's FC layers with GAP saves ~123M parameters.
Lin et al. (2014), "Network in Network"
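A short sketch of both the operation and the parameter saving (the head sizes assume VGG-16's 512×7×7 final feature map and 1000 classes; biases ignored):

```python
import torch

x = torch.randn(32, 512, 7, 7)      # batch of final feature maps
gap = x.mean(dim=(2, 3))            # (32, 512): one value per channel

# Classifier-head parameter comparison:
fc_head  = 25088*4096 + 4096*4096 + 4096*1000  # VGG-16's FC stack
gap_head = 512 * 1000                          # GAP followed by one FC
print(f"{fc_head/1e6:.1f}M vs {gap_head/1e6:.2f}M parameters")
# ~123.6M vs ~0.51M
```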
A 1×1 (pointwise) convolution is a convolution with $k = 1$. Kernel: $\bm{W} \in \R^{C_{\text{out}} \times C_{\text{in}} \times 1 \times 1}$. This applies a fully-connected layer independently at each spatial position — it mixes channels without any spatial interaction.
| Metric | Formula |
|---|---|
| Parameters | $C_{\text{out}} \times C_{\text{in}} + C_{\text{out}}$ |
| FLOPs | $2 \times B \times C_{\text{out}} \times H \times W \times C_{\text{in}}$ |
| Output size | $B \times C_{\text{out}} \times H \times W$ (spatial dims unchanged) |
1×1 convolutions are used for channel dimension reduction before expensive 3×3 or 5×5 convolutions. This is the key idea behind the Inception module and ResNet bottleneck blocks.
Reducing from 256 to 64 channels before a 3×3 conv, feature map 56×56:
Without 1×1: 3×3 conv from 256→256: $2 \times 256 \times 56 \times 56 \times 256 \times 9 = 3.70$G FLOPs
With 1×1 bottleneck:
1×1 conv 256→64: $2 \times 64 \times 56 \times 56 \times 256 = 102.8$M FLOPs
3×3 conv 64→64: $2 \times 64 \times 56 \times 56 \times 64 \times 9 = 231.2$M FLOPs
1×1 conv 64→256: $2 \times 256 \times 56 \times 56 \times 64 = 102.8$M FLOPs
Total with bottleneck: 436.8M FLOPs — an 8.5× reduction!
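The arithmetic is easy to verify (a sketch; `conv_flops` is an illustrative per-sample helper):

```python
def conv_flops(C_in, C_out, H, W, k):
    """Per-sample FLOPs of a conv layer: 2 * C_out * H * W * C_in * k^2."""
    return 2 * C_out * H * W * C_in * k * k

plain = conv_flops(256, 256, 56, 56, 3)
bottleneck = (conv_flops(256, 64, 56, 56, 1)    # 1x1 reduce
              + conv_flops(64, 64, 56, 56, 3)   # 3x3 spatial
              + conv_flops(64, 256, 56, 56, 1)) # 1x1 expand
print(f"{plain/1e9:.2f}G vs {bottleneck/1e6:.1f}M, "
      f"{plain/bottleneck:.1f}x reduction")
# 3.70G vs 436.7M, 8.5x reduction
```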
The depthwise separable convolution is one of the most important efficiency innovations in CNN design. It factorizes a standard convolution into two steps: a depthwise convolution (spatial filtering per channel) and a pointwise convolution (channel mixing via 1×1 conv).
Step 1 — Depthwise Conv: Apply one $k \times k$ filter per input channel independently. Each channel is convolved separately.
Kernel: $\bm{W}_{\text{dw}} \in \R^{C_{\text{in}} \times 1 \times k \times k}$ — one filter per channel, not $C_{\text{out}}$ filters across all channels.
Output: $B \times C_{\text{in}} \times H_{\text{out}} \times W_{\text{out}}$ (same number of channels as input)
Step 2 — Pointwise Conv (1×1): Mix channels to produce $C_{\text{out}}$ output channels.
Kernel: $\bm{W}_{\text{pw}} \in \R^{C_{\text{out}} \times C_{\text{in}} \times 1 \times 1}$
| | Standard Conv | Depthwise | Pointwise | Separable Total |
|---|---|---|---|---|
| Parameters | $C_o C_i k^2$ | $C_i k^2$ | $C_o C_i$ | $C_i k^2 + C_o C_i$ |
| FLOPs | $2 C_o H_o W_o C_i k^2$ | $2 C_i H_o W_o k^2$ | $2 C_o H_o W_o C_i$ | $2 H_o W_o C_i(k^2 + C_o)$ |
Reduction ratio (separable ÷ standard):
$$\frac{C_i(k^2 + C_o)}{C_o \cdot C_i \cdot k^2} = \frac{1}{C_o} + \frac{1}{k^2}$$

For $k = 3$, $C_o = 128$: $\frac{1}{128} + \frac{1}{9} \approx 0.119$ — an 8.4× reduction in both params and FLOPs.
For $k = 3$, $C_o = 256$: $\frac{1}{256} + \frac{1}{9} \approx 0.115$ — an 8.7× reduction.
Example with $k = 3$, $C_{\text{in}} = C_{\text{out}} = 128$, on a 56×56 feature map ($H_o W_o = 3136$):

| | Standard | Depthwise Separable | Ratio |
|---|---|---|---|
| Parameters | $128 \times 128 \times 9 = 147{,}456$ | $128 \times 9 + 128 \times 128 = 17{,}536$ | 8.4× fewer |
| FLOPs | $2 \times 128 \times 3136 \times 128 \times 9 = 924$M | $2 \times 3136 \times 128 \times (9 + 128) = 110$M | 8.4× fewer |
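A minimal PyTorch sketch of the factorization (MobileNet-style, omitting the BatchNorm/ReLU that real blocks insert between the two steps); the printed parameter count matches the table:

```python
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    """3x3 depthwise conv followed by 1x1 pointwise conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # groups=c_in gives exactly one k x k filter per input channel
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.pw(self.dw(x))

m = DepthwiseSeparable(128, 128)
print(sum(p.numel() for p in m.parameters()))   # 128*9 + 128*128 = 17,536
```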
[6] Howard, A. G. et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861.
[7] Chollet, F. (2017). Xception: Deep Learning with Depthwise Separable Convolutions. CVPR 2017.
Grouped convolution divides $C_{\text{in}}$ into $G$ groups of $C_{\text{in}}/G$ channels each. Each group is convolved independently to produce $C_{\text{out}}/G$ output channels; the results are concatenated.
Parameters: $\frac{C_{\text{out}} \times C_{\text{in}}}{G} \times k^2$ — reduced by factor $G$
FLOPs: reduced by factor $G$
Special cases: $G = 1$ → standard conv; $G = C_{\text{in}}$ → depthwise conv
Originally used by Krizhevsky et al. (2012) with $G=2$ to split computation across two GPUs.
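In PyTorch this is the `groups` argument of `nn.Conv2d`; a quick check of the parameter reduction (bias omitted):

```python
import torch.nn as nn

for g in (1, 4, 128):   # 1 = standard conv, 128 = depthwise (C_in = 128)
    conv = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=g, bias=False)
    print(g, conv.weight.numel())   # 147456, 36864, 1152: params scale as 1/G
```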
Transposed convolution (often called deconvolution) is used for upsampling (decoder networks, GANs, segmentation). Conceptually, it inserts $s-1$ zeros between input elements, then performs a standard convolution:

$$H_{\text{out}} = (H_{\text{in}} - 1) \times s - 2p + k + p_{\text{out}}$$

Parameters: same as a standard conv of the same kernel size: $C_{\text{out}} \times C_{\text{in}} \times k^2$
FLOPs: similar order to standard conv on the (larger) output size
Dumoulin & Visin (2016), "A guide to convolution arithmetic for deep learning"
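A quick shape check of the formula with PyTorch's `ConvTranspose2d` (layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2,
                        padding=1, output_padding=1)
x = torch.randn(1, 64, 56, 56)
print(up(x).shape)   # (56-1)*2 - 2*1 + 3 + 1 = 112 -> [1, 32, 112, 112]
```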
For the Conv2D backward pass, given the upstream gradient $\pd{L}{\bm{O}} \in \R^{B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}}$, three gradients must be computed:
(a) Gradient w.r.t. kernel weights:
$$\pd{L}{\bm{W}} = \text{conv2d}\!\left(\bm{X},\; \pd{L}{\bm{O}}\right)$$

A correlation of the input $\bm{X}$ with the upstream gradient (for stride 1; larger strides make this a dilated correlation). It uses the input cached during the forward pass.
FLOPs: $2 \times B \times C_{\text{out}} \times C_{\text{in}} \times k^2 \times H_{\text{out}} \times W_{\text{out}}$ — same as forward pass.
(b) Gradient w.r.t. biases:
$$\pd{L}{b_{c}} = \sum_{i,j} \pd{L}{O_{c,i,j}} \quad \text{(sum over spatial dims and batch)}$$

FLOPs: $B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}}$ — negligible.
(c) Gradient w.r.t. input (for propagation to previous layer):
$$\pd{L}{\bm{X}} = \text{full\_conv2d}\!\left(\pd{L}{\bm{O}},\; \text{flip}(\bm{W})\right)$$

A "full" convolution (with padding $k-1$) of the upstream gradient with the rotated (180°) kernel. This is mathematically equivalent to the transposed convolution.
FLOPs: $2 \times B \times C_{\text{in}} \times H_{\text{in}} \times W_{\text{in}} \times C_{\text{out}} \times k^2$ — approximately same as forward.
Total backward ≈ 2× forward FLOPs, consistent with the general rule for neural networks.
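The input-gradient identity in (c) can be checked directly against autograd; a small PyTorch sketch for stride 1, padding 1:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8, 8, requires_grad=True)
w = torch.randn(5, 3, 3, 3)
out = F.conv2d(x, w, padding=1)

g = torch.randn_like(out)            # some upstream gradient dL/dO
out.backward(g)

# dL/dX is the transposed convolution of dL/dO with the same kernel,
# i.e. the "full" convolution with the 180-degree-rotated kernel.
manual_dx = F.conv_transpose2d(g, w, padding=1)
print(torch.allclose(x.grad, manual_dx, atol=1e-5))   # True
```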
[1] Goodfellow et al. (2016). Deep Learning, Ch. 9.1–9.5.
[8] Lin, M., Chen, Q., & Yan, S. (2014). Network in Network. ICLR 2014. — 1×1 convolutions and GAP.
[9] Springenberg, J. T. et al. (2015). Striving for Simplicity: The All Convolutional Net. ICLR Workshop 2015.
[10] Yu, F. & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. ICLR 2016.
[11] Dumoulin, V. & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285. — Essential reference for all convolution types.
[12] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12).
The receptive field of a neuron in layer $l$ is the region of the original input that can influence that neuron's value. It grows with depth as each layer aggregates information from a local neighborhood.
For a stack of convolutional layers $l = 1, 2, \ldots, L$, each with kernel size $k_l$ and stride $s_l$:
$$\text{RF}_l = \text{RF}_{l-1} + (k_l - 1) \times \prod_{i=1}^{l-1} s_i$$

with $\text{RF}_0 = 1$. The product term $j_l = \prod_{i=1}^{l-1} s_i$ is called the jump — it represents how many input pixels correspond to one step in layer $l$'s output.
For layers with identical $k$ and $s=1$:
$$\text{RF}_L = 1 + L(k - 1)$$

For all 3×3 layers with stride 1: $\text{RF}_L = 2L + 1$.
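The recurrence is a few lines of code (a sketch; `receptive_field` is an illustrative name), and it reproduces the table below:

```python
def receptive_field(layers):
    """layers: (kernel, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer adds (k-1) * current jump
        jump *= s              # stride compounds the jump for later layers
    return rf

print(receptive_field([(3, 1)] * 3))               # 7: three 3x3, stride 1
print(receptive_field([(7, 2), (3, 2), (3, 1)]))   # 19: strides compound
```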
| Layers | RF | Parameters (per C→C) |
|---|---|---|
| 1 × (3×3) | 3 × 3 | $9C^2$ |
| 2 × (3×3) | 5 × 5 | $18C^2$ |
| 3 × (3×3) | 7 × 7 | $27C^2$ |
| 5 × (3×3) | 11 × 11 | $45C^2$ |
Compare: one 7×7 kernel has RF = 7×7 but uses $49C^2$ parameters, while three 3×3 kernels achieve the same RF with only $27C^2$ parameters — 45% fewer parameters, plus 3 non-linearities instead of 1. This is the core insight of VGGNet (Simonyan & Zisserman, 2015).
The theoretical receptive field assumes that every pixel within the RF contributes equally. In practice, Luo et al. (2016) showed that the effective receptive field is much smaller — it follows a Gaussian distribution, with the center pixels contributing far more than the periphery. Typically, the effective RF is only about 30–50% of the theoretical RF in each dimension. This has important implications for architecture design: simply making the network deeper doesn't linearly increase the effective RF.
| Configuration | RF | Params | FLOPs (per position) | Non-linearities |
|---|---|---|---|---|
| 1 × (5×5) | 5 | $25C^2$ | $50C^2$ | 1 |
| 2 × (3×3) | 5 | $18C^2$ | $36C^2$ | 2 |
| 1 × (7×7) | 7 | $49C^2$ | $98C^2$ | 1 |
| 3 × (3×3) | 7 | $27C^2$ | $54C^2$ | 3 |
| 1 × (11×11) | 11 | $121C^2$ | $242C^2$ | 1 |
| 5 × (3×3) | 11 | $45C^2$ | $90C^2$ | 5 |
In every case, the stack of 3×3 convolutions uses fewer parameters, fewer FLOPs, and adds more non-linearity. This is why modern CNN architectures overwhelmingly favor 3×3 kernels.
[13] Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015. (VGGNet)
[14] Luo, W. et al. (2016). Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. NeurIPS 2016.
We now trace the evolution of CNN architectures, analyzing each for parameter count, FLOPs, and the key computational innovations they introduced. All FLOP counts are for a single forward pass on one image at the standard input resolution.
Input: 32×32×1 (grayscale)
Layers: Conv(6, 5×5, s=1) → AvgPool(2×2, s=2) → Conv(16, 5×5, s=1) → AvgPool(2×2, s=2) → FC(120) → FC(84) → FC(10)
| Layer | Output Size | Params | FLOPs |
|---|---|---|---|
| Conv1 (1→6, 5×5) | 28×28×6 | 156 | 245K |
| AvgPool (2×2) | 14×14×6 | 0 | 3.5K |
| Conv2 (6→16, 5×5) | 10×10×16 | 2,416 | 483K |
| AvgPool (2×2) | 5×5×16 | 0 | 1.2K |
| FC1 (400→120) | 120 | 48,120 | 96K |
| FC2 (120→84) | 84 | 10,164 | 20K |
| FC3 (84→10) | 10 | 850 | 1.7K |
| Total | | ~61K | ~850K |
Historical significance: First successful CNN application (handwritten digit recognition). Established the Conv→Pool→Conv→Pool→FC pattern.
Input: 227×227×3
Key innovations: ReLU activation, dropout, GPU training, data augmentation
| Layer | Output Size | Params | FLOPs |
|---|---|---|---|
| Conv1 (3→96, 11×11, s=4) | 55×55×96 | 34,944 | 211M |
| MaxPool (3×3, s=2) | 27×27×96 | 0 | — |
| Conv2 (96→256, 5×5, p=2) | 27×27×256 | 614,656 | 896M |
| MaxPool (3×3, s=2) | 13×13×256 | 0 | — |
| Conv3 (256→384, 3×3, p=1) | 13×13×384 | 885,120 | 299M |
| Conv4 (384→384, 3×3, p=1) | 13×13×384 | 1,327,488 | 449M |
| Conv5 (384→256, 3×3, p=1) | 13×13×256 | 884,992 | 299M |
| MaxPool (3×3, s=2) | 6×6×256 | 0 | — |
| FC6 (9216→4096) | 4096 | 37,752,832 | 75.5M |
| FC7 (4096→4096) | 4096 | 16,781,312 | 33.6M |
| FC8 (4096→1000) | 1000 | 4,097,000 | 8.2M |
| Total | | ~62.4M | ~2.27G (≈1.1G MACs) |
Key observation: 94% of parameters are in FC layers (FC6 alone: 37.7M), but ~95% of FLOPs are in conv layers. This mismatch — FC layers dominating parameters while conv layers dominate compute — is the central tension of classical CNN design.
Input: 224×224×3
Key innovation: Exclusively 3×3 convolutions, demonstrating that depth with small kernels outperforms shallow networks with large kernels.
| Block | Layers | Output Size | Params | FLOPs |
|---|---|---|---|---|
| Block 1 | 2 × Conv(64, 3×3) + MaxPool | 112×112×64 | 38.7K | 3.9G |
| Block 2 | 2 × Conv(128, 3×3) + MaxPool | 56×56×128 | 221.4K | 5.5G |
| Block 3 | 3 × Conv(256, 3×3) + MaxPool | 28×28×256 | 1.48M | 9.3G |
| Block 4 | 3 × Conv(512, 3×3) + MaxPool | 14×14×512 | 5.90M | 9.3G |
| Block 5 | 3 × Conv(512, 3×3) + MaxPool | 7×7×512 | 7.08M | 2.8G |
| FC layers | FC(25088→4096→4096→1000) | 1000 | 123.6M | 0.25G |
| Total | | | ~138.4M | ~31G (≈15.5G MACs, the commonly quoted figure) |

Key observation: FC layers account for 89% of parameters but under 1% of FLOPs. VGG showed that going deeper works, but at enormous computational cost.
The Inception module runs multiple convolution branches in parallel (1×1, 3×3, 5×5, and pooling) and concatenates results. Critically, 1×1 convolutions are used before the 3×3 and 5×5 convolutions to reduce channel dimensionality.
Without 1×1 reduction (naïve): Very high FLOPs from 5×5 convolutions on many channels.
With 1×1 reduction: Reduces channels before expensive operations, dramatically cutting FLOPs.
| Metric | GoogLeNet | vs. VGG-16 |
|---|---|---|
| Parameters | ~6.8M | 20× fewer than VGG |
| FLOPs | ~1.5G | 10× fewer than VGG |
| Top-5 error (ILSVRC 2014) | 6.67% | Similar accuracy with 10–20× less compute |
This was the first architecture to demonstrate that clever design can dramatically reduce compute while maintaining accuracy.
Instead of learning a target mapping $\bm{y} = H(\bm{x})$ directly, learn the residual $F(\bm{x}) = H(\bm{x}) - \bm{x}$, so the block computes $\bm{y} = F(\bm{x}) + \bm{x}$. The skip connection $+\bm{x}$ costs:
Extra parameters: 0 (just an addition)
Extra FLOPs: 1 addition per element — negligible
Extra memory: 0 (input is already stored)
The skip connection provides a direct gradient path that bypasses the nonlinear layers, enabling training of networks with 100+ layers where previously ~20 layers was the practical limit.
The bottleneck block uses 1×1 → 3×3 → 1×1 convolutions with channel reduction:
Input: $H \times W \times 256$
1×1 conv: 256 → 64 channels (reduce)
3×3 conv: 64 → 64 channels (spatial processing at reduced dims)
1×1 conv: 64 → 256 channels (expand back)
Skip connection: add original input
| Sub-layer | Params | FLOPs (at 56×56) |
|---|---|---|
| 1×1 (256→64) | 16,448 | 103M |
| 3×3 (64→64) | 36,928 | 231M |
| 1×1 (64→256) | 16,640 | 104M |
| Skip addition | 0 | 0.8M |
| Bottleneck total | 70,016 | 439M |
| Equivalent plain (two 3×3, 256→256) | 1,180,160 | 7.4G |
The bottleneck block uses 16.8× fewer parameters and 16.9× fewer FLOPs than the equivalent non-bottleneck design, while achieving better accuracy due to greater depth.
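A minimal PyTorch sketch of the block (BatchNorm and the downsampling variant omitted for brevity); the parameter count matches the table:

```python
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus skip."""
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, reduced, 1)
        self.conv2 = nn.Conv2d(reduced, reduced, 3, padding=1)
        self.conv3 = nn.Conv2d(reduced, channels, 1)

    def forward(self, x):
        y = F.relu(self.conv1(x))
        y = F.relu(self.conv2(y))
        y = self.conv3(y)
        return F.relu(y + x)        # skip connection: zero extra parameters

blk = Bottleneck()
print(sum(p.numel() for p in blk.parameters()))   # 70,016 as in the table
```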
| Model | Layers | Params | FLOPs | Top-1 Acc |
|---|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 1.8G | 69.8% |
| ResNet-34 | 34 | 21.8M | 3.7G | 73.3% |
| ResNet-50 | 50 | 25.6M | 3.8G | 76.1% |
| ResNet-101 | 101 | 44.5M | 7.6G | 77.4% |
| ResNet-152 | 152 | 60.2M | 11.5G | 78.3% |
Note: ResNet-50 (bottleneck) has roughly the same FLOPs as ResNet-34 (plain) despite having many more layers, because bottleneck blocks are more FLOP-efficient.
MobileNet replaces all standard convolutions with depthwise separable convolutions. It introduces two scaling hyperparameters:
Width multiplier $\alpha$: scales channel count by $\alpha$ at every layer
Resolution multiplier $\rho$: scales input resolution by $\rho$
FLOP scaling: FLOPs $\propto \alpha^2 \rho^2$ — quadratic in both!
| Config | Params | FLOPs | Top-1 |
|---|---|---|---|
| MobileNet 1.0 (224) | 4.2M | 569M | 70.6% |
| MobileNet 0.75 (224) | 2.6M | 325M | 68.4% |
| MobileNet 0.5 (160) | 1.3M | 76M | 60.2% |
MobileNetV2's key innovation: inverted residuals with linear bottlenecks.
Standard bottleneck: wide → narrow → wide (reduce, process, expand)
Inverted: narrow → wide → narrow (expand, depthwise, project)
The expansion factor $t$ (typically 6) expands channels before the depthwise conv, then projects back down. The skip connection connects the narrow representations.
| Sub-layer | Operation |
|---|---|
| 1×1 expand | $C \to tC$ (pointwise, e.g., 24→144) |
| 3×3 depthwise | $tC \to tC$ (spatial filtering, no channel mixing) |
| 1×1 project | $tC \to C'$ (pointwise, linear — no activation!) |
| Skip | Add input (when $C = C'$ and stride=1) |
MobileNetV2 (1.0, 224): 3.4M params, 300M FLOPs, 72.0% top-1
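A minimal PyTorch sketch of the block for the stride-1, $C = C'$ case (real blocks add BatchNorm after every conv; omitted here for brevity):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style: 1x1 expand -> 3x3 depthwise -> 1x1 linear project."""
    def __init__(self, c, t=6):
        super().__init__()
        h = c * t   # expanded ("hidden") width
        self.block = nn.Sequential(
            nn.Conv2d(c, h, 1, bias=False), nn.ReLU6(inplace=True),
            nn.Conv2d(h, h, 3, padding=1, groups=h, bias=False),
            nn.ReLU6(inplace=True),
            nn.Conv2d(h, c, 1, bias=False),   # linear: no activation here
        )

    def forward(self, x):
        return x + self.block(x)   # skip joins the two narrow ends
```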
EfficientNet's key insight: depth, width, and resolution should be scaled together, not independently. The compound scaling formula:
$$\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi$$

$$\text{subject to: } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$$

The constraint ensures that when $\phi$ increases by 1, total FLOPs roughly double. The base network (EfficientNet-B0) was found via neural architecture search (NAS).
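With the base coefficients reported in the paper ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$), the constraint can be checked in a few lines:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15        # EfficientNet's grid-searched values
print(alpha * beta**2 * gamma**2)          # ~1.92: FLOPs ~double per +1 phi

phi = 3                                    # e.g. scaling toward B3
print(alpha**phi, beta**phi, gamma**phi)   # depth, width, resolution factors
```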
| Model | Params | FLOPs | Top-1 | vs. ResNet-50 |
|---|---|---|---|---|
| EfficientNet-B0 | 5.3M | 390M | 77.1% | +1.0% acc, 10× fewer FLOPs |
| EfficientNet-B3 | 12M | 1.8G | 81.6% | +5.5% acc, 2× fewer FLOPs |
| EfficientNet-B7 | 66M | 37G | 84.3% | +8.2% acc, ~10× more FLOPs |
[2] LeCun et al. (1998). Gradient-Based Learning Applied to Document Recognition.
[15] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012. (AlexNet)
[13] Simonyan & Zisserman (2015). Very Deep Convolutional Networks. ICLR 2015. (VGGNet)
[16] Szegedy, C. et al. (2015). Going Deeper with Convolutions. CVPR 2015. (GoogLeNet/Inception)
[17] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. (ResNet)
[6] Howard et al. (2017). MobileNets.
[18] Sandler, M. et al. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. CVPR 2018.
[19] Zhang, X. et al. (2018). ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CVPR 2018.
[20] Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019.
[21] Canziani, A., Paszke, A., & Culurciello, E. (2017). An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678. — Comprehensive comparison.
| Architecture | Year | Params | FLOPs | Top-1 (%) | FLOPs/Param | Key Innovation |
|---|---|---|---|---|---|---|
| LeNet-5 | 1998 | 61K | 0.85M | — | 14 | First CNN |
| AlexNet | 2012 | 62.4M | 2.27G | 63.3 | 36.4 | ReLU, GPU, dropout |
| VGG-16 | 2014 | 138.4M | 15.5G | 74.4 | 112 | 3×3 only, depth |
| GoogLeNet | 2014 | 6.8M | 1.5G | 74.8 | 221 | Inception, 1×1 reduction |
| ResNet-50 | 2015 | 25.6M | 3.8G | 76.1 | 148 | Skip connections, bottleneck |
| MobileNetV2 | 2018 | 3.4M | 300M | 72.0 | 88 | Inv. residual, depthwise sep. |
| EfficientNet-B0 | 2019 | 5.3M | 390M | 77.1 | 74 | Compound scaling, NAS |
| EfficientNet-B7 | 2019 | 66M | 37G | 84.3 | 561 | Compound scaling (max) |
The evolution of CNN architectures is fundamentally a story of extracting more accuracy from fewer FLOPs. From AlexNet to EfficientNet-B0, the community achieved +14% accuracy improvement while using 6× fewer FLOPs. The main techniques driving this:
(1) 1×1 convolutions for channel reduction (Inception, 2014)
(2) Residual connections for depth without degradation (ResNet, 2015)
(3) Depthwise separable convolutions for 8–9× FLOP reduction per layer (MobileNet, 2017)
(4) Compound scaling for optimal width/depth/resolution balance (EfficientNet, 2019)
| Layer Type | Parameters | FLOPs (forward, per sample) | Output Size |
|---|---|---|---|
| Conv2D | $C_o C_i k^2 + C_o$ | $2 C_o H_o W_o C_i k^2$ | $C_o \times H_o \times W_o$ |
| Depthwise Conv | $C_i k^2 + C_i$ | $2 C_i H_o W_o k^2$ | $C_i \times H_o \times W_o$ |
| Pointwise (1×1) | $C_o C_i + C_o$ | $2 C_o H W C_i$ | $C_o \times H \times W$ |
| DW Separable | $C_i(k^2 + C_o)$ | $2 H_o W_o C_i(k^2 + C_o)$ | $C_o \times H_o \times W_o$ |
| Grouped Conv (G groups) | $C_o C_i k^2 / G$ | $2 C_o H_o W_o C_i k^2 / G$ | $C_o \times H_o \times W_o$ |
| Max Pool | 0 | $(k^2-1) C H_o W_o$ | $C \times H_o \times W_o$ |
| Avg Pool | 0 | $k^2 C H_o W_o$ | $C \times H_o \times W_o$ |
| Global Avg Pool | 0 | $C \cdot H \cdot W$ | $C \times 1 \times 1$ |
| BatchNorm | $2C$ | $\sim 5C \cdot H \cdot W$ | Same as input |
| Skip Connection | 0 | $C \cdot H \cdot W$ (element-wise add) | Same as input |
| Component | Size (float32 bytes) |
|---|---|
| Parameters | $4P$ |
| Gradients | $4P$ |
| Optimizer (Adam) | $8P$ (m + v) |
| Activation (per layer, training) | $4 \times B \times C_o \times H_o \times W_o$ |
| im2col buffer | $4 \times B \times C_i k^2 \times H_o \times W_o$ |
| Max pool indices | $4 \times B \times C \times H_o \times W_o$ |
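Putting the table together, a rough training-memory estimator (a sketch; `training_memory_MB` is a hypothetical helper, float32 throughout):

```python
def training_memory_MB(P, act_elems, B, optimizer="adam"):
    """Rough float32 training-memory estimate from the table above.

    P: parameter count; act_elems: per-sample activation elements cached
    for backprop; B: batch size.
    """
    bytes_ = 4 * P              # parameters
    bytes_ += 4 * P             # gradients
    bytes_ += 8 * P if optimizer == "adam" else 4 * P   # Adam m and v
    bytes_ += 4 * B * act_elems # cached activations
    return bytes_ / 1e6

# e.g. the 56x56 worked-example layer (73,856 params, 64*56*56 cached input):
print(training_memory_MB(73_856, 64 * 56 * 56, B=32))   # ~27 MB
```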
[1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Ch. 9: Convolutional Networks. MIT Press.
[2] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proc. IEEE, 86(11), 2278–2324.
[15] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012.
[13] Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
[16] Szegedy, C. et al. (2015). Going Deeper with Convolutions. CVPR 2015.
[17] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016.
[6] Howard, A. G. et al. (2017). MobileNets: Efficient CNNs for Mobile Vision Applications. arXiv:1704.04861.
[7] Chollet, F. (2017). Xception: Deep Learning with Depthwise Separable Convolutions. CVPR 2017.
[18] Sandler, M. et al. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. CVPR 2018.
[19] Zhang, X. et al. (2018). ShuffleNet. CVPR 2018.
[20] Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling. ICML 2019.
[3] Chellapilla, K. et al. (2006). High Performance CNNs for Document Processing. IWFHR 2006.
[4] Mathieu, M. et al. (2014). Fast Training of CNNs through FFTs. ICLR 2014.
[5] Lavin, A. & Gray, S. (2016). Fast Algorithms for Convolutional Neural Networks. CVPR 2016.
[12] Sze, V. et al. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE, 105(12), 2295–2329.
[21] Canziani, A. et al. (2017). An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678.
[8] Lin, M., Chen, Q., & Yan, S. (2014). Network in Network. ICLR 2014.
[9] Springenberg, J. T. et al. (2015). Striving for Simplicity: The All Convolutional Net. ICLR Workshop 2015.
[10] Yu, F. & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. ICLR 2016.
[11] Dumoulin, V. & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285.
[14] Luo, W. et al. (2016). Understanding the Effective Receptive Field in Deep CNNs. NeurIPS 2016.