Deep neural networks did not appear from nowhere. They sit at the intersection of at least four major intellectual traditions — neuroscience, mathematical optimization, statistical learning theory, and computer hardware engineering — each of which developed over decades or centuries before converging into what we now call "deep learning." Without understanding these roots, the field appears as a disconnected collection of clever tricks rather than what it actually is: a deeply coherent system of ideas with clear intellectual ancestry.
There is a second problem. The terms "artificial intelligence," "machine learning," "deep learning," "neural networks," and many others are used loosely, inconsistently, and often interchangeably in popular discourse, in journalism, in marketing, and even in some academic papers. A student encountering this field for the first time faces a tangle of overlapping vocabulary with no authoritative map. This document provides that map.
This is Part 0 — it should be read before the six technical parts that follow. It contains no equations that aren't also in the later parts. Its purpose is entirely conceptual: to establish where the ideas came from, how they relate to each other, and what the words mean.
The relationship between AI, ML, DL, and DNNs is one of strict containment: each is a subset of the one above it. This hierarchy is not a matter of opinion — it follows from the definitions. Understanding it eliminates the most common source of confusion in the field.
AI is the broadest category. It encompasses any technique that enables a machine to mimic, simulate, or exhibit behavior that would be considered "intelligent" if performed by a human. This includes rule-based expert systems, search algorithms, logic programming, planning systems, robotics controllers, and machine learning — among many other approaches.
The key characteristic of AI: It is defined by the goal (intelligent behavior), not by the method. A hand-coded chess engine with millions of if-else rules is AI. A decision tree trained on data is AI. A deep neural network is AI. They are all AI because they all produce intelligent behavior — but they use completely different methods.
The term was coined by John McCarthy in 1956 at the Dartmouth Conference, where it was defined as "the science and engineering of making intelligent machines." This definition is deliberately broad — it includes approaches that have nothing to do with learning from data, such as symbolic reasoning, search algorithms (A*, minimax), expert systems (MYCIN, DENDRAL), and formal logic systems. These "classical AI" or "Good Old-Fashioned AI" (GOFAI) methods dominated the field from the 1950s through the 1980s and remain important in specific domains (planning, scheduling, formal verification).
ML is a subset of AI. It comprises methods where the system learns from data rather than being explicitly programmed with rules. Tom Mitchell's (1997) definition is the most precise: "A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$."
The key characteristic of ML: It is defined by the method — learning from data. The system improves its performance by being exposed to more data, without a human explicitly coding the rules for each new situation.
Machine learning encompasses a vast landscape of algorithms that have nothing to do with neural networks. The major families include:
| ML Family | Core Idea | Key Algorithms | Relation to DNN |
|---|---|---|---|
| Linear Models | Output is a linear function of inputs | Linear regression, logistic regression, perceptron | A single DNN neuron is a linear model + activation |
| Tree-Based | Partition input space via hierarchical decisions | Decision trees, random forests, gradient boosting (XGBoost) | Completely different paradigm — no gradient-based learning |
| Kernel Methods | Implicit mapping to high-dimensional space | SVM, Gaussian processes | Competed with DNNs in 1995–2012; now largely superseded for perception tasks |
| Instance-Based | No explicit model; compare to stored examples | k-NN, locally weighted regression | No overlap with DNNs |
| Probabilistic Graphical Models | Explicit probabilistic dependencies between variables | Bayesian networks, HMMs, CRFs | DNNs can be combined with graphical models (e.g., VAEs) |
| Ensemble Methods | Combine many weak learners | Bagging, boosting, stacking | Still dominant for tabular data; DNNs dominate for perception |
| Neural Networks | Learn hierarchical representations via differentiable functions | MLPs, CNNs, RNNs, Transformers | This is the family we study in Parts I–VI |
Machine learning is NOT synonymous with deep learning. Most ML algorithms have nothing to do with neural networks. Random forests, SVMs, k-NN, and gradient boosting are all ML — and they remain the best choice for many problems (particularly tabular/structured data). Deep learning is one branch of ML that happens to have achieved spectacular success on perception tasks (vision, language, speech) — but it is a branch, not the trunk.
DL is a subset of ML. It refers specifically to machine learning using neural networks with multiple layers (typically more than 2 hidden layers). The "deep" refers to the depth of the network — the number of sequential processing layers between input and output.
The key characteristic of DL: It is defined by both the method (learning from data) and the tool (deep neural networks). The crucial innovation is representation learning — the network automatically discovers the relevant features from raw data, rather than requiring human-engineered features.
The distinction between "shallow" ML and "deep" learning is fundamentally about feature engineering. In traditional ML, a human expert must design features (e.g., SIFT descriptors for images, TF-IDF for text, spectral features for audio). The ML algorithm then learns a mapping from these hand-crafted features to the output. In deep learning, the network learns both the features and the mapping — the early layers learn to extract low-level features, the middle layers combine them into higher-level concepts, and the final layers map these to the output. This end-to-end learning is what makes deep learning qualitatively different from previous ML approaches.
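To make the contrast concrete, here is a minimal NumPy sketch. The data is random, the weights are untrained, and the two-statistic "feature extractor" is deliberately crude; the point is only where the features come from: a human's formula in the first case, learned weight matrices in the second.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((4, 8, 8))      # a tiny batch of fake 8x8 "images"

# Traditional ML: a human designs the features (here, two toy statistics per
# image); a separate model is then trained to map these features to labels.
hand_crafted = np.stack([images.mean(axis=(1, 2)),
                         images.std(axis=(1, 2))], axis=1)   # shape (4, 2)

# Deep learning: the network consumes raw pixels. After training, the early
# weights play the role of the feature extractor; here they are randomly
# initialized stand-ins, shown only to make the structure visible.
W1 = rng.standard_normal((64, 16)) * 0.1     # early layer: learned features
W2 = rng.standard_normal((16, 10)) * 0.1     # final layer: learned mapping
hidden = np.maximum(images.reshape(4, -1) @ W1, 0.0)
logits = hidden @ W2
print(hand_crafted.shape, logits.shape)      # (4, 2) (4, 10)
```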
A neural network is any parameterized function composed of layers of artificial neurons (linear transformations followed by non-linear activations), trained by gradient-based optimization.
A deep neural network (DNN) is a neural network with "many" layers. There is no universally agreed-upon threshold for "deep," but in practice:
Shallow: 0–2 hidden layers (single-layer perceptrons have none; classic MLPs have one or two)
Deep: 3+ hidden layers (the universal approximation theorem guarantees that even a single-hidden-layer network can approximate any continuous function to arbitrary accuracy, but for many function families deep networks achieve the same accuracy with exponentially fewer parameters)
The practical motivation for depth is not theoretical expressiveness but efficiency: deep networks learn hierarchical representations that reuse features across the hierarchy, requiring far fewer parameters than shallow networks for the same task.
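A quick parameter count illustrates the point. The sketch below uses made-up layer sizes to compare one wide hidden layer against three narrow ones for the same input and output dimensions; whether the two reach the same accuracy is task-dependent, but the bookkeeping shows how depth spreads a fixed budget over reusable intermediate features.

```python
# Parameter bookkeeping for fully connected layers: a layer with n_in inputs
# and n_out outputs holds n_in * n_out weights plus n_out biases.
# The layer sizes below are illustrative, not taken from any specific model.

def count_params(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

shallow_wide = [784, 2048, 10]            # one wide hidden layer
deep_narrow  = [784, 256, 256, 256, 10]   # three narrow hidden layers

print(count_params(shallow_wide))   # 1628170
print(count_params(deep_narrow))    # 335114
```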
Deep neural networks draw on at least four distinct intellectual traditions that developed largely independently before converging. Understanding these roots reveals that DNNs are not a single invention but a synthesis — and this synthesis is what makes the field both powerful and confusing.
Santiago Ramón y Cajal (1890s) established the "neuron doctrine" — that the nervous system is composed of discrete cells (neurons) that communicate via electrochemical signals across synapses. This was the first scientific understanding that the brain is a network of computing units.
Donald Hebb (1949) proposed Hebb's rule: "Neurons that fire together wire together." This was the first learning rule for neural connections — the idea that the strength of a synapse should increase when both the pre-synaptic and post-synaptic neurons are active simultaneously. Hebb's rule is the conceptual ancestor of weight updates in modern neural networks.
What it contributed to DNNs: The fundamental metaphor — computation as a network of simple processing units connected by weighted links, where learning changes the weights. This metaphor shaped the entire field's vocabulary (neurons, layers, weights, activation, etc.).
Alan Turing (1936) formalized the concept of computation itself with the Turing machine. His work showed that any computable function can be realized by a sufficiently complex machine — establishing the theoretical possibility that machines can perform any intellectual task that humans can describe algorithmically.
Warren McCulloch and Walter Pitts (1943) combined Turing's formalism with Cajal's neuron doctrine. They showed that networks of simplified neurons (binary threshold units) could compute any Boolean logical function, and argued that such networks, given unbounded memory, could in principle match the power of a Turing machine. This was the birth of artificial neural networks as a concept — the first demonstration that brain-like computing elements could, in principle, perform arbitrary computation.
What it contributed to DNNs: The theoretical foundation — neural networks as universal computers. Also the crucial simplification: real biological neurons are enormously complex, but their essential computational behavior can be captured by a simple mathematical model (weighted sum + threshold/activation function).
Adrien-Marie Legendre (1805) and Carl Friedrich Gauss (1809) developed the method of least squares — fitting a model to data by minimizing the sum of squared errors. This is the direct ancestor of the loss function in neural network training.
Augustin-Louis Cauchy (1847) described gradient descent — the iterative algorithm for finding the minimum of a function by taking steps proportional to the negative gradient. This is still the core algorithm used to train every neural network today.
Ronald Fisher (1922, 1936) formalized maximum likelihood estimation and developed the framework of statistical inference — the idea that you can learn the parameters of a model by maximizing how well it explains observed data. The connection between maximum likelihood and neural network loss functions (cross-entropy loss = negative log-likelihood) is direct.
What it contributed to DNNs: The entire training procedure. Loss functions are from statistics (Gauss/Fisher). The optimizer is from mathematics (Cauchy). Backpropagation is an application of the chain rule (Leibniz, 1676). The fundamental components of training a neural network are all centuries old — what was new was applying them to networks of artificial neurons.
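These pieces already suffice for a complete, if tiny, training loop. The sketch below, with made-up data and an arbitrary learning rate, fits a two-parameter linear model by combining the least-squares loss (Gauss, Legendre), the chain rule (Leibniz), and gradient descent (Cauchy).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + 0.1 * rng.standard_normal(100)   # "true" w = 3.0, b = 0.5

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    y_hat = w * x + b                      # model prediction
    loss = np.mean((y_hat - y) ** 2)       # least-squares loss (1805/1809)
    grad_w = np.mean(2 * (y_hat - y) * x)  # chain rule (1676)
    grad_b = np.mean(2 * (y_hat - y))
    w -= lr * grad_w                       # gradient descent step (1847)
    b -= lr * grad_b

print(round(w, 2), round(b, 2))            # roughly 3.0 and 0.5
```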
Claude Shannon (1948) founded information theory with his landmark paper "A Mathematical Theory of Communication." He introduced entropy, mutual information, and the concept of channel capacity — a precise mathematical framework for quantifying information.
What it contributed to DNNs: Cross-entropy loss (the standard loss function for classification) is directly derived from Shannon's entropy. The concept of "bits" of information in a representation, the information bottleneck theory of deep learning (Tishby & Zaslavsky, 2015), and the connections between compression and prediction all flow from Shannon's work. Additionally, modern efficiency analysis (quantization, minimal description length) is grounded in information theory.
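The link is easy to verify by hand: for a one-hot target, Shannon's cross-entropy between the target distribution and the model's predicted distribution reduces to the negative log-likelihood of the correct class. The probabilities below are made up for illustration.

```python
import numpy as np

q = np.array([0.7, 0.2, 0.1])        # model's predicted class probabilities
p = np.array([1.0, 0.0, 0.0])        # one-hot target: the true class is 0

cross_entropy = -np.sum(p * np.log(q))   # H(p, q) = -sum_k p_k * log(q_k)
neg_log_lik   = -np.log(q[0])            # NLL of the correct class
print(cross_entropy, neg_log_lik)        # both roughly 0.357
```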
The history of neural networks is not a steady upward climb. It is a story of dramatic cycles — breakthrough followed by disappointment, winter followed by explosive revival. Understanding these cycles explains why certain ideas were "discovered" multiple times, why the field has recurring tensions between different schools of thought, and why progress was not smooth.
Funding for neural network research dried up dramatically after Minsky and Papert's critique. The field shifted to symbolic AI (expert systems, logic programming). However, several crucial developments happened quietly during this "winter."
Despite the backpropagation revival, neural networks gradually fell out of favor in the late 1990s. The reasons were practical: training was unstable, results were not reproducible, and there was little theoretical understanding of when or why they worked. Meanwhile, Support Vector Machines (SVMs) offered strong theoretical guarantees, well-defined optimization (convex problems with unique global minima), and competitive empirical results. For about a decade, SVMs and kernel methods dominated machine learning research.
The issues were all practical, not theoretical. In order:
(1) Vanishing gradients prevented training deep networks (only 2–3 layers were practical; see the sketch below this list).
(2) No good initialization — random initialization often led to convergence failures.
(3) Insufficient compute — GPUs were not yet used for neural network training.
(4) Insufficient data — large labeled datasets like ImageNet did not yet exist.
(5) No batch normalization — training was highly sensitive to hyperparameters.
Every one of these problems would be solved between 2006 and 2015.
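A back-of-envelope calculation shows why item (1) was so crippling. The sketch below ignores the weights entirely and simply multiplies one sigmoid derivative per layer, which is what the backward pass does along a single path through the network.

```python
import numpy as np

# Each sigmoid layer contributes a derivative of at most 0.25 to the backward
# pass, so the gradient reaching early layers shrinks geometrically with depth.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.5                                   # a typical pre-activation value
deriv = sigmoid(z) * (1.0 - sigmoid(z))   # about 0.235 at z = 0.5
for depth in (2, 5, 10, 20):
    print(depth, f"{deriv ** depth:.1e}") # 5.5e-02, 7.2e-04, 5.1e-07, 2.6e-13
```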
The history of deep learning is not just a list of inventions. It is a story of convergences — moments when ideas from different fields came together to produce something greater than the sum of its parts. These "junctions" are the most intellectually interesting points in the timeline.
From neuroscience: The idea that the brain computes via networks of neurons that have weighted connections and threshold-based activation.
From optimization: The idea that you can minimize an error function by iteratively adjusting parameters (gradient descent / error-correcting rules).
The synthesis: Rosenblatt's perceptron — a network of artificial neurons that learns its weights from data by minimizing classification errors. For the first time, a machine could learn to recognize patterns without being explicitly programmed.
From calculus: The chain rule (Leibniz, 1676) — derivatives of composed functions can be computed by multiplying the derivatives of each component.
From control theory: Automatic differentiation / dynamic programming applied to multi-stage systems (Werbos, 1974).
From connectionism: Multi-layer neural networks that can represent non-linear functions (addressing Minsky & Papert's critique).
The synthesis: Rumelhart, Hinton & Williams (1986) showed that the chain rule, applied systematically through a multi-layer network, gives an efficient algorithm (backpropagation) for computing the gradient of the loss with respect to every weight — making it possible to train deep networks by gradient descent. This is arguably the single most important synthesis in the history of the field.
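In the smallest non-trivial case the bookkeeping looks like this (notation chosen here for illustration): for a two-layer network $h = f_1(x; W_1)$, $\hat{y} = f_2(h; W_2)$ with loss $L(\hat{y}, y)$, the chain rule gives

$$
\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial W_2},
\qquad
\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial h} \, \frac{\partial h}{\partial W_1}.
$$

Backpropagation's efficiency comes from the shared factors: $\partial L / \partial \hat{y}$ (and each subsequent factor deeper in the chain) is computed once and reused for every earlier layer, rather than being recomputed separately for each weight.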
From signal processing: Convolution operations for shift-equivariant feature extraction (used in image processing since the 1960s).
From statistical learning: Large labeled datasets (ImageNet) enabling generalization of complex models.
From hardware engineering: GPUs (originally for graphics rendering) repurposed for massively parallel matrix multiplication.
From prior NN research: ReLU activation (mitigating vanishing gradients), dropout (regularization), batch normalization (training stability).
The synthesis: AlexNet (2012) combined all of these. The "deep learning revolution" was not a single breakthrough — it was the simultaneous maturation of architecture (CNNs + ReLU), data (ImageNet), hardware (GPUs), and training techniques (dropout). Remove any one of these, and AlexNet would not have worked.
From neural machine translation: The attention mechanism (Bahdanau et al., 2015) — allowing the decoder to "look at" any part of the input, weighted by learned relevance scores.
From the parallelism problem: RNNs are inherently sequential, wasting GPU parallelism. The desire for fully parallel sequence processing.
From representation learning: The idea that the same input should be represented differently depending on context (contextual embeddings).
The synthesis: Vaswani et al. (2017) eliminated recurrence entirely, replacing it with self-attention — every position attends to every other position in parallel. The Transformer architecture enabled: (1) full parallelization during training, (2) direct modeling of long-range dependencies, and (3) easy scaling to billions of parameters. When combined with massive data and compute (GPT-3), this produced qualitatively new capabilities.
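A minimal single-head version of that core operation fits in a few lines of NumPy. The dimensions and random inputs are illustrative only; real Transformers add multiple heads, masking, residual connections, and much more.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.standard_normal((seq_len, d_model))      # one sequence of embeddings

W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)              # every position vs. every other
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
output = weights @ V                             # contextual representations

print(weights.shape, output.shape)               # (5, 5) (5, 16)
```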
| Aspect | AI | ML | DL |
|---|---|---|---|
| Defining characteristic | Goal: intelligent behavior | Method: learns from data | Tool: deep neural networks |
| Requires data? | Not necessarily (rules can be hand-coded) | Yes (by definition) | Yes (typically very large amounts) |
| Requires neural networks? | No | No | Yes (by definition) |
| Feature engineering? | Varies | Usually manual | Automatic (learned end-to-end) |
| Example outside the narrower subset | Chess engine (minimax search) | Random forest, SVM | — |
| Temporal scope | Since 1956 | Since 1959 | Since 2006 (term), practice since ~2012 |
| Paradigm | Data | Signal | Classical Example | DL Example |
|---|---|---|---|---|
| Supervised | $(x, y)$ pairs | Correct label for each input | Email spam detection | ImageNet classification (ResNet) |
| Unsupervised | $x$ only (no labels) | Structure in data itself | Customer segmentation | Autoencoders, GANs |
| Self-Supervised | $x$ only, but labels derived from $x$ | Predict part of input from another part | — | BERT (mask prediction), GPT (next-token prediction) |
| Reinforcement | State-action-reward sequences | Delayed reward signal | Game playing | AlphaGo, RLHF for LLMs |
Self-supervised learning (SSL) is the paradigm behind the most impactful recent models (GPT, BERT, CLIP, DINO). It is sometimes called "unsupervised" in older literature, but it is more precise to call it self-supervised because there are labels — they're just derived automatically from the data itself (e.g., the next word in a sentence, a masked word, or a rotated image). SSL enables training on virtually unlimited data without human annotation, which is what enables the scale of modern foundation models.
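A toy example makes the "labels derived from the data" point concrete. The whitespace tokenizer below is a deliberate oversimplification of real subword tokenization.

```python
# Self-supervision in miniature: labels for next-token prediction are derived
# mechanically from raw text, with no human annotation.
text = "the cat sat on the mat"
tokens = text.split()

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... and so on: every position in the corpus yields a (context, label) pair.
```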
Architecture: The structural design — how layers are arranged, what operations each layer performs, how information flows. Examples: ResNet-50, Transformer, U-Net. An architecture is a template — it has no learned parameters yet.
Model: A specific architecture with specific learned parameter values. A model is the result of training. "GPT-4" is a model — it is a specific Transformer architecture with specific trained weights. Two different training runs of the same architecture produce two different models.
Algorithm: A procedure for accomplishing a task. "Backpropagation" is an algorithm (for computing gradients). "Adam" is an algorithm (for updating parameters). "Gradient descent" is an algorithm. Algorithms are not learned; they are designed.
Framework: A software library that implements architectures, algorithms, and training infrastructure. Examples: PyTorch, TensorFlow, JAX. A framework is an engineering tool, not a mathematical concept.
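A short sketch ties the four terms together, using PyTorch as the framework; the tiny layer sizes are invented for illustration.

```python
import torch
import torch.nn as nn

def make_architecture():
    # Architecture: the structural template (layer types, sizes, wiring).
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

model_a = make_architecture()   # a model: the architecture plus one concrete
model_b = make_architecture()   # set of parameter values; two instantiations
                                # (or training runs) are two different models

optimizer = torch.optim.Adam(model_a.parameters(), lr=1e-3)  # Adam: an algorithm
# PyTorch itself is the framework: the library providing nn, optim, and autograd.
```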
| Phase | What Happens | Parameters Change? | Data Needed | Cost (typical) |
|---|---|---|---|---|
| Training | Learn parameters from scratch | Yes (all) | Large labeled dataset | Very high (days–months on GPUs) |
| Inference | Use trained model on new inputs | No | One input at a time | Low (milliseconds per input) |
| Fine-Tuning | Adapt pre-trained model to new task | Yes (some or all) | Small task-specific dataset | Moderate (hours–days) |
| Pre-Training | Train on large general corpus | Yes (all) | Massive unlabeled data | Extremely high |
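The "Parameters change?" column is easy to see in code. The sketch below uses a made-up two-layer stand-in for a pre-trained model and PyTorch's `requires_grad` flag to freeze everything except the task head, which is the typical lightweight form of fine-tuning.

```python
import torch.nn as nn

# A toy stand-in for a pre-trained network: a "backbone" layer and a task head.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

for param in model[0].parameters():   # freeze the backbone layer
    param.requires_grad = False       # -> unchanged during fine-tuning

for param in model[2].parameters():   # leave the task head trainable
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)                      # only the head's 650 parameters train
```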
| Concept | Set By | When | Examples |
|---|---|---|---|
| Parameters ($\bm{\theta}$) | Learning algorithm (gradient descent) | During training | Weights, biases in all layers |
| Hyperparameters | Human (or automated search) | Before training | Learning rate, batch size, dropout rate, weight decay |
| Architecture choices | Human (or NAS) | Before training | Number of layers, layer widths, kernel sizes, activation function |
Parameters are what the network learns. Their count ($P$) determines model memory ($4P$ bytes at FP32) and most of the FLOPs. Hyperparameters determine how well the parameters are learned — a bad learning rate can waste 10× the training FLOPs by requiring 10× more epochs. Architecture choices determine the structure of computation — the same parameter budget can be 10× more efficient as a MobileNet than as a VGG. All three levels independently affect total training cost.
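The memory arithmetic is worth doing once by hand. The parameter counts below are illustrative round numbers, not figures for any specific published model.

```python
# At FP32, each parameter occupies 4 bytes, so model memory is roughly 4 * P bytes.
for P in (5e6, 100e6, 1e9):
    fp32_bytes = 4 * P
    print(f"P = {P:.0e}: ~{fp32_bytes / 2**30:.2f} GiB at FP32")
# P = 5e+06: ~0.02 GiB at FP32
# P = 1e+08: ~0.37 GiB at FP32
# P = 1e+09: ~3.73 GiB at FP32
```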
Understanding deep neural networks at the level of detail in Parts I–VI of this series requires knowledge from several areas of mathematics and computer science. Here is a precise map of what is needed and where it connects.
[1] McCulloch, W. S. & Pitts, W. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5(4), 115–133.
[2] Hebb, D. O. (1949). The Organization of Behavior. Wiley.
[3] Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386–408.
[4] Minsky, M. & Papert, S. (1969). Perceptrons. MIT Press.
[5] Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University.
[6] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323, 533–536.
[7] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18(7), 1527–1554.
[8] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012.
[9] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature, 521, 436–444.
[10] Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.
[11] Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL 2019.
[12] Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020.
[13] Mitchell, T. (1997). Machine Learning. McGraw-Hill. — The standard definition of ML.
[14] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
[15] Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423.
[16] Turing, A. (1936). On Computable Numbers. Proc. London Mathematical Society, 42(1), 230–265.
[17] Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich. — First identification of vanishing gradients.
[18] Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
Reading Order
Part 0 (this document) → Part I (Mathematical Foundations) → Part II (MLPs) → Part III (CNNs) → Part IV (Training) → Part V (Efficiency) → Part VI (Advanced Topics)