Part 0 — Prologue

The Intellectual Genealogy of Deep Neural Networks

Where the ideas came from, how they connect, and a precise taxonomy of AI, Machine Learning, Deep Learning, and their sub-disciplines — the conceptual map that should exist before any equation is written

§0 Why This Document Exists

Deep neural networks did not appear from nowhere. They sit at the intersection of at least four major intellectual traditions — neuroscience, mathematical optimization, statistical learning theory, and computer hardware engineering — each of which developed over decades or centuries before converging into what we now call "deep learning." Without understanding these roots, the field appears as a disconnected collection of clever tricks rather than what it actually is: a deeply coherent system of ideas with clear intellectual ancestry.

There is a second problem. The terms "artificial intelligence," "machine learning," "deep learning," "neural networks," and many others are used loosely, inconsistently, and often interchangeably in popular discourse, in journalism, in marketing, and even in some academic papers. A student encountering this field for the first time faces a tangle of overlapping vocabulary with no authoritative map. This document provides that map.

This is Part 0 — it should be read before the six technical parts that follow. It contains no equations that aren't also in the later parts. Its purpose is entirely conceptual: to establish where the ideas came from, how they relate to each other, and what the words mean.


§1 The Conceptual Hierarchy — Precise Definitions

The relationship between AI, ML, DL, and DNNs is one of strict containment: each is a subset of the one above it. This hierarchy is not a matter of opinion — it follows from the definitions. Understanding it eliminates the most common source of confusion in the field.

1.1 Artificial Intelligence (AI)

AI is the broadest category. It encompasses any technique that enables a machine to mimic, simulate, or exhibit behavior that would be considered "intelligent" if performed by a human. This includes rule-based expert systems, search algorithms, logic programming, planning systems, robotics controllers, and machine learning — among many other approaches.

The key characteristic of AI: It is defined by the goal (intelligent behavior), not by the method. A hand-coded chess engine with millions of if-else rules is AI. A decision tree trained on data is AI. A deep neural network is AI. They are all AI because they all produce intelligent behavior — but they use completely different methods.

The term was coined by John McCarthy in 1956 at the Dartmouth Conference, where it was defined as "the science and engineering of making intelligent machines." This definition is deliberately broad — it includes approaches that have nothing to do with learning from data, such as symbolic reasoning, search algorithms (A*, minimax), expert systems (MYCIN, DENDRAL), and formal logic systems. These "classical AI" or "Good Old-Fashioned AI" (GOFAI) methods dominated the field from the 1950s through the 1980s and remain important in specific domains (planning, scheduling, formal verification).

1.2 Machine Learning (ML)

ML is a subset of AI. It comprises methods where the system learns from data rather than being explicitly programmed with rules. Tom Mitchell's (1997) definition is the most precise: "A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$."

The key characteristic of ML: It is defined by the method — learning from data. The system improves its performance by being exposed to more data, without a human explicitly coding the rules for each new situation.

Machine learning encompasses a vast landscape of algorithms that have nothing to do with neural networks. The major families include:

ML Family | Core Idea | Key Algorithms | Relation to DNNs
Linear Models | Output is a linear function of inputs | Linear regression, logistic regression, perceptron | A single DNN neuron is a linear model + activation
Tree-Based | Partition input space via hierarchical decisions | Decision trees, random forests, gradient boosting (XGBoost) | Completely different paradigm — no gradient-based learning
Kernel Methods | Implicit mapping to high-dimensional space | SVM, Gaussian processes | Competed with DNNs in 1995–2012; now largely superseded for perception tasks
Instance-Based | No explicit model; compare to stored examples | k-NN, locally weighted regression | No overlap with DNNs
Probabilistic Graphical Models | Explicit probabilistic dependencies between variables | Bayesian networks, HMMs, CRFs | DNNs can be combined with graphical models (e.g., VAEs)
Ensemble Methods | Combine many weak learners | Bagging, boosting, stacking | Still dominant for tabular data; DNNs dominate for perception
Neural Networks | Learn hierarchical representations via differentiable functions | MLPs, CNNs, RNNs, Transformers | This is the family we study in Parts I–VI
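
The first row's claim can be made concrete: a single artificial neuron is exactly a linear model (a weighted sum plus bias) followed by a non-linear activation. A minimal NumPy sketch, with weights chosen arbitrarily for illustration:

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: weighted sum (the linear-model part) + activation."""
    z = np.dot(w, x) + b             # identical to logistic regression's linear part
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])           # arbitrary illustrative weights
b = 0.0
y = neuron(x, w, b)                  # z = 0.5*1 - 0.25*2 = 0, sigmoid(0) = 0.5
```

With the sigmoid activation this is precisely logistic regression; stacking many such units in layers is what turns the linear model into a neural network.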
Critical Clarification

Machine learning is NOT synonymous with deep learning. Most ML algorithms have nothing to do with neural networks. Random forests, SVMs, k-NN, and gradient boosting are all ML — and they remain the best choice for many problems (particularly tabular/structured data). Deep learning is one branch of ML that happens to have achieved spectacular success on perception tasks (vision, language, speech) — but it is a branch, not the trunk.

1.3 Deep Learning (DL)

DL is a subset of ML. It refers specifically to machine learning using neural networks with multiple layers (typically more than 2 hidden layers). The "deep" refers to the depth of the network — the number of sequential processing layers between input and output.

The key characteristic of DL: It is defined by both the method (learning from data) and the tool (deep neural networks). The crucial innovation is representation learning — the network automatically discovers the relevant features from raw data, rather than requiring human-engineered features.

The distinction between "shallow" ML and "deep" learning is fundamentally about feature engineering. In traditional ML, a human expert must design features (e.g., SIFT descriptors for images, TF-IDF for text, spectral features for audio). The ML algorithm then learns a mapping from these hand-crafted features to the output. In deep learning, the network learns both the features and the mapping — the early layers learn to extract low-level features, the middle layers combine them into higher-level concepts, and the final layers map these to the output. This end-to-end learning is what makes deep learning qualitatively different from previous ML approaches.

1.4 Neural Networks vs. Deep Neural Networks

The Depth Boundary

A neural network is any parameterized function composed of layers of artificial neurons (linear transformations followed by non-linear activations), trained by gradient-based optimization.

A deep neural network (DNN) is a neural network with "many" layers. There is no universally agreed-upon threshold for "deep," but in practice:

Shallow: 0–2 hidden layers (the original perceptron has none; classic MLPs have one or two)

Deep: 3+ hidden layers (the universal approximation theorem guarantees that even a shallow network can approximate any continuous function, but deep networks can represent certain families of functions with exponentially fewer parameters)

The practical motivation for depth is not theoretical expressiveness but efficiency: deep networks learn hierarchical representations that reuse features across the hierarchy, requiring far fewer parameters than shallow networks for the same task.
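
The parameter-count side of that argument is easy to quantify. A small helper (the layer sizes below are illustrative only, not a claim that the two networks are equally expressive):

```python
def mlp_param_count(layer_sizes):
    """Total weights + biases of a fully connected network.

    layer_sizes = [input_dim, hidden_1, ..., output_dim].
    Each layer contributes in*out weights and out biases.
    """
    return sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))

shallow = mlp_param_count([784, 4096, 10])           # one wide hidden layer
deep = mlp_param_count([784, 256, 256, 256, 10])     # three narrow hidden layers
# shallow ≈ 3.26M parameters; deep ≈ 0.34M, roughly 10x fewer
```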

1.5 The Complete Taxonomy Map

[Figure 1.1: three nested rings. Outer ring, Artificial Intelligence (any technique producing intelligent behavior), with non-ML AI methods: expert systems (rules), search (A*, minimax), logic programming, planning & scheduling, robotics controllers. Middle ring, Machine Learning (systems that learn from data), with non-DL ML methods: linear/logistic regression; decision trees, random forests; SVM, kernel methods; k-NN, Naïve Bayes; gradient boosting (XGBoost); Bayesian networks, HMMs; PCA, k-means clustering. Inner ring, Deep Learning (neural networks with multiple layers): CNNs (vision), RNNs/LSTMs (sequences), Transformers (attention), GANs (generative), autoencoders/VAEs, GNNs, SSMs/Mamba. DL ⊂ ML ⊂ AI: every DL system is also an ML system, which is also an AI system. Key insight: not all AI uses ML; not all ML uses neural nets; not all neural nets are "deep."]
Figure 1.1. The strict containment hierarchy: Deep Learning ⊂ Machine Learning ⊂ Artificial Intelligence. Each ring is a proper subset of the one enclosing it. The dashed boxes show methods that belong to the enclosing ring but not to the inner ring — these are the non-ML AI methods and the non-DL ML methods that are often forgotten in popular accounts of the field.

§2 The Intellectual Pre-History — Where the Ideas Came From

Deep neural networks draw on at least four distinct intellectual traditions that developed largely independently before converging. Understanding these roots reveals that DNNs are not a single invention but a synthesis — and this synthesis is what makes the field both powerful and confusing.

2.1 Neuroscience and the Biological Neuron (1890s–1940s)

The Biological Root

Santiago Ramón y Cajal (1890s) established the "neuron doctrine" — that the nervous system is composed of discrete cells (neurons) that communicate via electrochemical signals across synapses. This was the first scientific understanding that the brain is a network of computing units.

Donald Hebb (1949) proposed Hebb's rule: "Neurons that fire together wire together." This was the first learning rule for neural connections — the idea that the strength of a synapse should increase when both the pre-synaptic and post-synaptic neurons are active simultaneously. Hebb's rule is the conceptual ancestor of weight updates in modern neural networks.
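
In symbols, Hebb's rule is Δw_ij = η · post_i · pre_j: a connection grows only when both of its endpoints are active. A toy sketch (learning rate and activity values are arbitrary):

```python
import numpy as np

def hebbian_update(w, pre, post, lr=0.1):
    """Hebb's rule: strengthen w[i, j] when pre-synaptic unit j and
    post-synaptic unit i fire together (outer product of activities)."""
    return w + lr * np.outer(post, pre)

w = np.zeros((2, 3))                 # 2 post-synaptic, 3 pre-synaptic units
pre = np.array([1.0, 0.0, 1.0])      # active, silent, active
post = np.array([1.0, 0.0])          # active, silent
w = hebbian_update(w, pre, post)
# only the connections between co-active units grew: w[0, 0] and w[0, 2]
```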

What it contributed to DNNs: The fundamental metaphor — computation as a network of simple processing units connected by weighted links, where learning changes the weights. This metaphor shaped the entire field's vocabulary (neurons, layers, weights, activation, etc.).

2.2 Mathematical Logic and Computation Theory (1930s–1950s)

The Computational Root

Alan Turing (1936) formalized the concept of computation itself with the Turing machine. His work showed that any computable function can be realized by a sufficiently complex machine — establishing the theoretical possibility that machines can perform any intellectual task that humans can describe algorithmically.

Warren McCulloch and Walter Pitts (1943) combined Turing's formalism with Cajal's neuron doctrine. They showed that networks of simplified neurons (binary threshold units) could compute any logical function computable by a Turing machine. This was the birth of artificial neural networks as a concept — the first demonstration that brain-like computing elements could, in principle, perform arbitrary computation.

What it contributed to DNNs: The theoretical foundation — neural networks as universal computers. Also the crucial simplification: real biological neurons are enormously complex, but their essential computational behavior can be captured by a simple mathematical model (weighted sum + threshold/activation function).

2.3 Statistical Learning and Optimization (1800s–1960s)

The Mathematical Root

Adrien-Marie Legendre (1805) and Carl Friedrich Gauss (1809) developed the method of least squares — fitting a model to data by minimizing the sum of squared errors. This is the direct ancestor of the loss function in neural network training.

Augustin-Louis Cauchy (1847) described gradient descent — the iterative algorithm for finding the minimum of a function by taking steps proportional to the negative gradient. This is still the core algorithm used to train every neural network today.

Ronald Fisher (1922, 1936) formalized maximum likelihood estimation and developed the framework of statistical inference — the idea that you can learn the parameters of a model by maximizing how well it explains observed data. The connection between maximum likelihood and neural network loss functions (cross-entropy loss = negative log-likelihood) is direct.

What it contributed to DNNs: The entire training procedure. Loss functions are from statistics (Gauss/Fisher). The optimizer is from mathematics (Cauchy). Backpropagation is an application of the chain rule (Leibniz, 1676). The fundamental components of training a neural network are all centuries old — what was new was applying them to networks of artificial neurons.
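
These three ingredients, a squared-error loss (Gauss), gradient descent (Cauchy), and the chain rule (Leibniz), already suffice to fit a model. A minimal sketch on synthetic data whose true slope is 3 (learning rate and step count are arbitrary choices):

```python
import numpy as np

# synthetic data generated by y = 3x, so the "correct" answer is w = 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w, lr = 0.0, 0.01
for _ in range(200):
    pred = w * x
    grad = 2.0 * np.mean((pred - y) * x)  # d/dw of mean squared error, via the chain rule
    w -= lr * grad                        # Cauchy's gradient-descent step
# w has converged to (essentially) 3.0
```

Every neural network training loop, however large, is an elaboration of exactly these four lines: predict, measure error, differentiate, step.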

2.4 Information Theory (1948)

The Information-Theoretic Root

Claude Shannon (1948) founded information theory with his landmark paper "A Mathematical Theory of Communication." He introduced entropy, mutual information, and the concept of channel capacity — a precise mathematical framework for quantifying information.

What it contributed to DNNs: Cross-entropy loss (the standard loss function for classification) is directly derived from Shannon's entropy. The concept of "bits" of information in a representation, the information bottleneck theory of deep learning (Tishby & Zaslavsky, 2015), and the connections between compression and prediction all flow from Shannon's work. Additionally, modern efficiency analysis (quantization, minimal description length) is grounded in information theory.
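
The derivation is short enough to execute: Shannon's entropy and the classification loss are the same sum with different distributions plugged in. A sketch (the example distributions are arbitrary):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum p log p, in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 log 0 is taken as 0
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum p log q. With p a one-hot label and q a softmax
    output, this is exactly the negative log-likelihood loss."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

label = [0.0, 1.0, 0.0]              # one-hot "true" distribution
pred = [0.1, 0.8, 0.1]               # model's softmax output
loss = cross_entropy(label, pred)    # = -log(0.8), the negative log-likelihood
```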

[Figure 2.1: four intellectual traditions converging into deep neural networks. Neuroscience (Cajal, Hebb): neurons, synapses, learning as weight change. Computation theory (Turing, McCulloch-Pitts): universal computation, formal neuron model. Statistics & optimization (Gauss, Cauchy, Fisher): least squares, gradient descent, maximum likelihood. Information theory (Shannon): entropy, cross-entropy, compression.]
Figure 2.1. Deep neural networks are not a single invention. They are the convergence of four intellectual traditions developed across two centuries. The neuron model comes from neuroscience. The formal computation framework comes from mathematical logic. The training algorithm comes from statistics and optimization. The loss function comes from information theory.

§3 The Six Historical Eras of Neural Networks

The history of neural networks is not a steady upward climb. It is a story of dramatic cycles — breakthrough followed by disappointment, winter followed by explosive revival. Understanding these cycles explains why certain ideas were "discovered" multiple times, why the field has recurring tensions between different schools of thought, and why progress was not smooth.

3.1 Era I — The McCulloch-Pitts Neuron and the Perceptron (1943–1969)

1943 McCulloch & Pitts — "A Logical Calculus of Ideas Immanent in Nervous Activity"
First mathematical model of an artificial neuron. Binary threshold units that can compute any logical function. Established that neural networks are universal computers.
1949 Donald Hebb — The Organization of Behavior
Proposed the first biologically-motivated learning rule: strengthen connections between co-active neurons. Conceptual ancestor of weight updates.
1957 Frank Rosenblatt — The Perceptron
First learning algorithm for a neural network. The perceptron could learn to classify linearly separable patterns from data — no explicit programming needed. Built as hardware (the Mark I Perceptron). Generated enormous excitement.
1960 Widrow & Hoff — ADALINE
Adaptive Linear Neuron. Used least mean squares (LMS) for training — essentially gradient descent. More mathematically principled than Rosenblatt's rule.
1969 Minsky & Papert — Perceptrons
Published a rigorous analysis proving that single-layer perceptrons cannot learn XOR or any function that is not linearly separable. This was mathematically correct, but the field interpreted it as proving that neural networks in general were a dead end — triggering the first AI winter for neural networks.
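
Minsky and Papert's result is easy to reproduce. The sketch below (initialization and epoch count are arbitrary choices) trains a single-layer perceptron with Rosenblatt's error-correcting rule: it masters AND, which is linearly separable, and necessarily fails on XOR, which is not:

```python
import numpy as np

def train_perceptron(X, y, epochs=50, lr=1.0):
    """Rosenblatt's rule: on each mistake, nudge the weights toward
    the correct side of the decision boundary."""
    w = np.zeros(X.shape[1] + 1)                  # weights + bias
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append bias input
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            pred = 1 if xi @ w > 0 else 0
            w += lr * (yi - pred) * xi
    return (Xb @ w > 0).astype(int)               # final predictions

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
and_ok = np.array_equal(train_perceptron(X, np.array([0, 0, 0, 1])),
                        np.array([0, 0, 0, 1]))   # separable: learned
xor_ok = np.array_equal(train_perceptron(X, np.array([0, 1, 1, 0])),
                        np.array([0, 1, 1, 0]))   # not separable: never learned
```

No choice of epochs or learning rate can make `xor_ok` true: a single linear threshold simply cannot draw XOR's decision boundary, which is exactly Minsky and Papert's point.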

3.2 Era II — The First AI Winter (1969–1986)

Funding for neural network research dried up dramatically after Minsky and Papert's critique. The field shifted to symbolic AI (expert systems, logic programming). However, several crucial developments happened quietly during this "winter":

1974 Paul Werbos — Backpropagation (PhD thesis)
Derived the backpropagation algorithm for training multi-layer networks. This directly addressed Minsky & Papert's critique — multi-layer networks can learn XOR, and backprop is how to train them. But the work went largely unnoticed for over a decade.
1980 Kunihiko Fukushima — Neocognitron
First neural network with convolutional structure and pooling, inspired by Hubel & Wiesel's work on the visual cortex. Direct ancestor of CNNs. Not trained by backprop.
1982 John Hopfield — Hopfield Networks
Introduced recurrent neural networks with an energy function, connecting neural networks to statistical physics. Helped restore academic credibility.

3.3 Era III — Backpropagation and the Connectionist Revival (1986–1995)

1986 Rumelhart, Hinton & Williams — "Learning Representations by Back-Propagating Errors"
JUNCTION MOMENT. Published backpropagation in Nature, making it accessible to the broad scientific community. Demonstrated that multi-layer networks could learn useful internal representations. Triggered a renaissance in neural network research.
1989 Yann LeCun — Convolutional Neural Networks (LeNet)
Combined Fukushima's convolutional architecture with Rumelhart's backpropagation algorithm. Trained on handwritten digits. First practical CNN. Used commercially by US Postal Service.
1989 Cybenko, Hornik — Universal Approximation Theorem
Proved that a single hidden layer neural network with sufficient width can approximate any continuous function. Powerful theoretical result, but said nothing about how to learn such a network — and in practice, shallow wide networks were hard to train.
1991 Sepp Hochreiter — Vanishing Gradient Problem
Identified the fundamental obstacle to training deep networks: gradients shrink exponentially with depth, preventing learning in early layers. This would not be solved for another 19 years (ReLU, 2010).
1997 Hochreiter & Schmidhuber — LSTM
Long Short-Term Memory networks: a recurrent architecture with gating mechanisms that solved the vanishing gradient problem for sequences. Enabled learning long-range dependencies. Dominant sequence model for the next 20 years.
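
Hochreiter's observation reduces to arithmetic: a backpropagated gradient is a product of per-layer derivatives, so any factor below 1 shrinks it exponentially with depth. A sketch comparing the sigmoid's maximum derivative (0.25) with ReLU's derivative on its active side (1.0), idealizing the weights to 1 so only the activation matters:

```python
def grad_through_depth(depth, activation_grad):
    """Product of layer-local derivatives along one path through `depth`
    layers (weights idealized to 1.0, so only the activation matters)."""
    g = 1.0
    for _ in range(depth):
        g *= activation_grad
    return g

sigmoid_max_grad = 0.25   # the sigmoid's derivative peaks at 0.25
relu_grad = 1.0           # ReLU's derivative is exactly 1 where active

g_sigmoid = grad_through_depth(20, sigmoid_max_grad)  # 0.25**20 ≈ 9e-13
g_relu = grad_through_depth(20, relu_grad)            # still 1.0
```

Twenty sigmoid layers attenuate the gradient by twelve orders of magnitude even in this best case, which is why ReLU (Era V) was transformative despite being almost trivially simple.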

3.4 Era IV — The Kernel Machines Interlude (1995–2006)

Despite the backpropagation revival, neural networks gradually fell out of favor in the late 1990s. The reasons were practical: training was unstable, results were not reproducible, and there was little theoretical understanding of when or why they worked. Meanwhile, Support Vector Machines (SVMs) offered strong theoretical guarantees, well-defined optimization (convex problems with unique global minima), and competitive empirical results. For about a decade, SVMs and kernel methods dominated machine learning research.

Why Neural Networks Lost (Temporarily)

The issues were all practical, not theoretical. In order:

(1) Vanishing gradients prevented training deep networks (only 2–3 layers were practical).

(2) No good initialization — random initialization often led to convergence failures.

(3) Insufficient compute — GPUs were not yet used for neural network training.

(4) Insufficient data — large labeled datasets like ImageNet did not yet exist.

(5) No batch normalization — training was highly sensitive to hyperparameters.

Every one of these problems would be solved between 2006 and 2015.

3.5 Era V — The Deep Learning Revolution (2006–2017)

2006 Hinton, Osindero & Teh — Deep Belief Networks
Showed that deep networks could be pre-trained layer by layer using unsupervised learning (Restricted Boltzmann Machines), then fine-tuned with backprop. First convincing demonstration that networks deeper than 3 layers could be trained effectively. Popularized the term "deep learning."
2009 Deng et al. — ImageNet
Created a dataset of 14 million labeled images in 21,000+ categories. Provided the data fuel that deep learning needed. The associated ILSVRC competition became the benchmark that drove progress.
2010 Nair & Hinton — Rectified Linear Units (ReLU)
Showed that ReLU activation ($\max(0, x)$) dramatically improved training of deep networks by avoiding the vanishing gradient problem. Simple, computationally cheap, and transformative. Made deep unsupervised pre-training unnecessary.
2012 Krizhevsky, Sutskever & Hinton — AlexNet
JUNCTION MOMENT. Won the ImageNet competition by a massive margin (top-5 error of 15.3%, against 26.2% for the runner-up). Combined: deep CNN architecture + ReLU + dropout + GPU training. Proved that deep learning outperforms all prior approaches on large-scale vision. Triggered the current deep learning era.
2014 GoogLeNet / Inception & VGGNet
Showed that going deeper (16–22 layers) continued to improve accuracy. Popularized 1×1 convolutions for efficiency (Inception) and stacks of exclusively 3×3 kernels (VGG).
2014 Goodfellow et al. — GANs
Generative Adversarial Networks: two networks trained against each other. Opened the door to generative AI.
2015 He et al. — ResNet
Residual connections enabled training networks with 152+ layers. Solved the degradation problem (deeper networks performing worse, even on training data). Skip connections became a universal component.
2015 Ioffe & Szegedy — Batch Normalization
Normalized intermediate activations during training. Stabilized training, enabled larger learning rates, reduced sensitivity to initialization. Another universally adopted component.

3.6 Era VI — The Transformer Era and Foundation Models (2017–Present)

2017 Vaswani et al. — "Attention Is All You Need"
JUNCTION MOMENT. Replaced recurrent computation entirely with self-attention, enabling full parallelization. The Transformer architecture would become the foundation of modern NLP, and increasingly vision and other domains.
2018 Devlin et al. — BERT; Radford et al. — GPT
BERT: bidirectional pre-training via masked language modeling. GPT: unidirectional pre-training via next-token prediction. Both demonstrated that pre-training on massive unlabeled text, then fine-tuning, dominates task-specific training. Birth of the "foundation model" paradigm.
2020 Kaplan et al. — Scaling Laws
Showed that Transformer loss follows predictable power laws in model size, dataset size, and compute. Converted model design from art to engineering — compute-budget planning became systematic.
2020 Brown et al. — GPT-3 (175B parameters)
Demonstrated that scale alone produces qualitatively new capabilities: in-context learning, few-shot reasoning, emergent abilities. Shifted the field toward scale as the primary research variable.
2020 Dosovitskiy et al. — Vision Transformer (ViT)
Applied the Transformer architecture to images (splitting images into patches treated as tokens). Showed that Transformers can match or beat CNNs for vision, given sufficient data.
2022 Hoffmann et al. — Chinchilla Scaling Laws
Showed that prior large models were undertrained: compute-optimal training requires ~20 tokens per parameter.
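
Combining the Chinchilla rule of thumb with the standard ≈6·N·D estimate of training FLOPs from the scaling-laws literature turns compute budgeting into arithmetic. A sketch, using 70B parameters (the size of the Chinchilla model itself):

```python
def chinchilla_budget(params):
    """Compute-optimal token count (~20 tokens per parameter, Hoffmann et
    al. 2022) and total training FLOPs (~6 * N * D, Kaplan et al. 2020)."""
    tokens = 20 * params
    flops = 6 * params * tokens
    return tokens, flops

tokens, flops = chinchilla_budget(70e9)   # a 70B-parameter model
# ~1.4 trillion training tokens, ~5.9e23 training FLOPs
```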

§4 The Key Junctions — Where Separate Ideas Converged

The history of deep learning is not just a list of inventions. It is a story of convergences — moments when ideas from different fields came together to produce something greater than the sum of its parts. These "junctions" are the most intellectually interesting points in the timeline.

4.1 Junction 1: Neuroscience × Mathematical Optimization = Perceptron (1957)

Junction 1

From neuroscience: The idea that the brain computes via networks of neurons that have weighted connections and threshold-based activation.

From optimization: The idea that you can minimize an error function by iteratively adjusting parameters (gradient descent / error-correcting rules).

The synthesis: Rosenblatt's perceptron — a network of artificial neurons that learns its weights from data by minimizing classification errors. For the first time, a machine could learn to recognize patterns without being explicitly programmed.

4.2 Junction 2: Chain Rule × Automatic Differentiation × Multi-Layer Networks = Backpropagation (1986)

Junction 2

From calculus: The chain rule (Leibniz, 1676) — derivatives of composed functions can be computed by multiplying the derivatives of each component.

From control theory: Automatic differentiation / dynamic programming applied to multi-stage systems (Werbos, 1974).

From connectionism: Multi-layer neural networks that can represent non-linear functions (addressing Minsky & Papert's critique).

The synthesis: Rumelhart, Hinton & Williams (1986) showed that the chain rule, applied systematically through a multi-layer network, gives an efficient algorithm (backpropagation) for computing the gradient of the loss with respect to every weight — making it possible to train deep networks by gradient descent. This is arguably the single most important synthesis in the history of the field.
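
The mechanics fit in a few lines: multiply local derivatives backward along the composition, then check the result against a finite difference. A scalar toy example (the particular functions are chosen only for easy hand-checking):

```python
import numpy as np

# Forward pass: x -> h = w1*x -> y = tanh(h) -> L = (y - t)^2
x, t = 2.0, 0.0
w1 = 0.5
h = w1 * x
y = np.tanh(h)
L = (y - t) ** 2

# Backward pass: the chain rule, one local derivative per step
dL_dy = 2 * (y - t)
dy_dh = 1 - np.tanh(h) ** 2        # derivative of tanh
dL_dw1 = dL_dy * dy_dh * x         # multiply local derivatives along the path

# Numerical check: a finite difference agrees with the backprop gradient
eps = 1e-6
L_plus = (np.tanh((w1 + eps) * x) - t) ** 2
numeric = (L_plus - L) / eps
```

Backpropagation is this same bookkeeping carried out for every weight at once, sharing the intermediate products so the whole gradient costs only about as much as one extra forward pass.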

4.3 Junction 3: Statistical Learning Theory × Convolutions × GPUs = Deep CNNs (2012)

Junction 3

From signal processing: Convolution operations for shift-equivariant feature extraction (used in image processing since the 1960s).

From statistical learning: Large labeled datasets (ImageNet) enabling generalization of complex models.

From hardware engineering: GPUs (originally for graphics rendering) repurposed for massively parallel matrix multiplication.

From prior NN research: ReLU activation (solving vanishing gradients), dropout (regularization), batch normalization (training stability).

The synthesis: AlexNet (2012) combined all of these. The "deep learning revolution" was not a single breakthrough — it was the simultaneous maturation of architecture (CNNs + ReLU), data (ImageNet), hardware (GPUs), and training techniques (dropout). Remove any one of these, and AlexNet would not have worked.

4.4 Junction 4: Attention × Parallel Computation × Scale = Transformers (2017)

Junction 4

From neural machine translation: The attention mechanism (Bahdanau et al., 2015) — allowing the decoder to "look at" any part of the input, weighted by learned relevance scores.

From the parallelism problem: RNNs are inherently sequential, wasting GPU parallelism. The desire for fully parallel sequence processing.

From representation learning: The idea that the same input should be represented differently depending on context (contextual embeddings).

The synthesis: Vaswani et al. (2017) eliminated recurrence entirely, replacing it with self-attention — every position attends to every other position in parallel. The Transformer architecture enabled: (1) full parallelization during training, (2) direct modeling of long-range dependencies, and (3) easy scaling to billions of parameters. When combined with massive data and compute (GPT-3), this produced qualitatively new capabilities.
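
The core mechanism is small enough to write out. The sketch below uses identity Q/K/V projections (a simplifying assumption to keep the mechanism visible; real Transformers learn those projection matrices), and all positions are processed in a single matrix product, which is exactly the parallelization point above:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections:
    out[i] = sum_j softmax_j(x_i . x_j / sqrt(d)) * x_j, all rows at once."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                  # every pair of positions, in parallel
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows sum to 1 (softmax)
    return weights @ X                             # context-weighted mixture

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))       # 5 tokens, embedding dimension 8
out = self_attention(X)           # same shape: one contextual vector per token
```

Contrast this with an RNN, which must process the 5 positions one after another: here the entire sequence is handled by two matrix multiplications that a GPU executes in parallel.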


§5 Commonly Confused Concepts — Precise Distinctions

5.1 AI vs. ML vs. DL

Aspect | AI | ML | DL
Defining characteristic | Goal: intelligent behavior | Method: learns from data | Tool: deep neural networks
Requires data? | Not necessarily (rules can be hand-coded) | Yes (by definition) | Yes (typically very large amounts)
Requires neural networks? | No | No | Yes (by definition)
Example without the subset | Chess engine (minimax search) is AI without ML | Random forest, SVM are ML without DL | — (none; DL is the innermost subset)
Feature engineering? | Varies | Usually manual | Automatic (learned end-to-end)
Temporal scope | Since 1956 | Since 1959 | Since 2006 (term); mainstream practice since ~2012

5.2 Learning Paradigms

Paradigm | Data | Signal | Example | DL Example
Supervised | $(x, y)$ pairs | Correct label for each input | Email spam detection | ImageNet classification (ResNet)
Unsupervised | $x$ only (no labels) | Structure in data itself | Customer segmentation | Autoencoders, GANs
Self-Supervised | $x$ only, but labels derived from $x$ | Predict part of input from another part | — | BERT (mask prediction), GPT (next-token prediction)
Reinforcement | State-action-reward sequences | Delayed reward signal | Game playing | AlphaGo, RLHF for LLMs
Self-Supervised Learning — The Key Modern Paradigm

Self-supervised learning (SSL) is the paradigm behind the most impactful recent models (GPT, BERT, CLIP, DINO). It is sometimes called "unsupervised" in older literature, but it is more precise to call it self-supervised because there are labels — they're just derived automatically from the data itself (e.g., the next word in a sentence, a masked word, or a rotated image). SSL enables training on virtually unlimited data without human annotation, which is what enables the scale of modern foundation models.
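
"Labels derived automatically from the data" is concrete: for next-token prediction, every position in a text yields a free (context, target) pair. A sketch of the GPT-style pairing (whitespace tokenization is a toy simplification; real models use subword tokenizers):

```python
def next_token_pairs(text):
    """Turn raw text into supervised (context, target) pairs with no human
    labeling: the target at each position is simply the following token."""
    tokens = text.split()   # toy whitespace tokenizer
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("the cat sat on the mat")
# e.g. (["the", "cat", "sat"], "on") -- the data labels itself
```

Six words yield five training examples for free; scale this to the text of the web and the result is the effectively unlimited training signal behind modern foundation models.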

5.3 Models vs. Algorithms vs. Architectures vs. Frameworks

Four Levels of Abstraction

Architecture: The structural design — how layers are arranged, what operations each layer performs, how information flows. Examples: ResNet-50, Transformer, U-Net. An architecture is a template — it has no learned parameters yet.

Model: A specific architecture with specific learned parameter values. A model is the result of training. "GPT-4" is a model — it is a specific Transformer architecture with specific trained weights. Two different training runs of the same architecture produce two different models.

Algorithm: A procedure for accomplishing a task. "Backpropagation" is an algorithm (for computing gradients). "Adam" is an algorithm (for updating parameters). "Gradient descent" is an algorithm. Algorithms are not learned; they are designed.

Framework: A software library that implements architectures, algorithms, and training infrastructure. Examples: PyTorch, TensorFlow, JAX. A framework is an engineering tool, not a mathematical concept.

5.4 Training vs. Inference vs. Fine-Tuning

Phase | What Happens | Parameters Change? | Data Needed | Cost (typical)
Training | Learn parameters from scratch | Yes (all) | Large labeled dataset | Very high (days–months on GPUs)
Inference | Use trained model on new inputs | No | One input at a time | Low (milliseconds per input)
Fine-Tuning | Adapt pre-trained model to new task | Yes (some or all) | Small task-specific dataset | Moderate (hours–days)
Pre-Training | Train on large general corpus | Yes (all) | Massive unlabeled data | Extremely high

5.5 Parameters vs. Hyperparameters vs. Architecture Choices

Concept | Set By | When | Examples
Parameters ($\bm{\theta}$) | Learning algorithm (gradient descent) | During training | Weights, biases in all layers
Hyperparameters | Human (or automated search) | Before training | Learning rate, batch size, dropout rate, weight decay
Architecture choices | Human (or NAS) | Before training | Number of layers, layer widths, kernel sizes, activation function
The Distinction Matters for Efficiency

Parameters are what the network learns. Their count ($P$) determines model memory ($4P$ bytes at FP32) and most of the FLOPs. Hyperparameters determine how well the parameters are learned — a bad learning rate can waste 10× the training FLOPs by requiring 10× more epochs. Architecture choices determine the structure of computation — the same parameter budget can be 10× more efficient as a MobileNet than as a VGG. All three levels independently affect total training cost.
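The 4P-bytes rule can be checked directly (FP32 uses 4 bytes per parameter; FP16 uses 2 and INT8 uses 1, which is where quantization savings come from). A sketch using the commonly cited ~25.6M parameter count for ResNet-50:

```python
def model_memory_bytes(num_params, bytes_per_param=4):
    """Parameter memory: P parameters times 4 bytes each at FP32
    (2 at FP16, 1 at INT8 -- the basis of quantization savings)."""
    return num_params * bytes_per_param

resnet50_params = 25_600_000                      # ~25.6M, commonly cited
fp32 = model_memory_bytes(resnet50_params)        # ≈ 102.4 MB at FP32
int8 = model_memory_bytes(resnet50_params, 1)     # ≈ 25.6 MB after INT8 quantization
```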


§6 The Prerequisite Knowledge Map

Understanding deep neural networks at the level of detail in Parts I–VI of this series requires knowledge from several areas of mathematics and computer science. Here is a precise map of what is needed and where it connects.

[Figure 6.1: prerequisite knowledge map. Foundations: linear algebra (vectors, matrices, norms), calculus (derivatives, chain rule), probability & statistics (distributions, MLE, Bayes), programming (Python, NumPy), CS basics (complexity, memory). These feed into Part I (mathematical foundations: gradient descent, AD, softmax, cross-entropy), then Part II (MLPs: neurons, layers, backprop, activations), Part III (CNNs: convolution, pooling, architectures), Part IV (training: loss, optimizers, regularization, BN), Part V (efficiency: FLOPs, memory, hardware, scaling), and Part VI (Transformers, RNNs, frontiers).]
Figure 6.1. Prerequisite map for the six-part series. Linear algebra, calculus, and probability flow into Part I. Parts II through IV build sequentially. Part V draws on CS fundamentals (complexity, memory models). Part VI synthesizes everything.

§ References

Foundational Works

[1] McCulloch, W. S. & Pitts, W. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5(4), 115–133.

[2] Hebb, D. O. (1949). The Organization of Behavior. Wiley.

[3] Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386–408.

[4] Minsky, M. & Papert, S. (1969). Perceptrons. MIT Press.

Backpropagation and Revival

[5] Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University.

[6] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323, 533–536.

Deep Learning Revolution

[7] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18(7), 1527–1554.

[8] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012.

[9] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature, 521, 436–444.

Transformer and Foundation Models

[10] Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.

[11] Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL 2019.

[12] Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020.

Other

[13] Mitchell, T. (1997). Machine Learning. McGraw-Hill. — The standard definition of ML.

[14] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

[15] Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423.

[16] Turing, A. (1936). On Computable Numbers. Proc. London Mathematical Society, 42(1), 230–265.

[17] Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich. — First identification of vanishing gradients.

[18] Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.


Reading Order

Part 0 (this document) → Part I (Mathematical Foundations) → Part II (MLPs) → Part III (CNNs) → Part IV (Training) → Part V (Efficiency) → Part VI (Advanced Topics)
