The 4 page paper that powers every AI you use

Reading the 1986 backpropagation paper as a complete beginner

Coming up on Day 3: I will add a loss curve, visualize the gradients in each layer, and push the full code to GitHub. You will see why gradients in early layers are often much smaller than gradients in late layers. This matters for training any deep model.

I am reading the classic research papers that built modern AI. One paper per week. I implement everything by hand in Python. No shortcuts.

This is Day 1.

The paper is called "Learning Representations by Back-propagating Errors." It was written in 1986 by David Rumelhart, Geoffrey Hinton, and Ronald Williams. It is 4 pages long.

ChatGPT uses this. Gemini uses this. DeepSeek uses this. Nearly every neural network trained today relies on the same core idea from this paper.

Here is what it says, in simple words.


What is a neural network?

Think of it like a chain of small decisions.

You give the network some numbers as input. Maybe the pixels of an image. The network passes those numbers through layers of small units called neurons. Each neuron does simple math: it multiplies its inputs by some weights, adds them up, and produces one output number. At the end, the network produces a prediction.
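
Here is a minimal sketch of one neuron in Python, to make the "multiply, add up, output one number" step concrete. The input and weight values are made up for illustration. The bias term and the sigmoid squashing at the end are assumptions I added so the output stays between 0 and 1; the description above only covers the weighted sum.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, squashed into the range (0, 1)."""
    # Multiply each input by its weight and add everything up.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid squashing (an assumption in this sketch, not part of the text above).
    return 1.0 / (1.0 + math.exp(-total))

# Hypothetical numbers, chosen only for illustration.
output = neuron(inputs=[0.5, 0.2, 0.9], weights=[0.4, -0.6, 0.1])
print(output)
```

A layer is just many of these neurons looking at the same inputs, each with its own weights.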

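The problem in 1986 was this. There was no widely accepted way to teach the neurons in the middle layers. The output layer was manageable, because you can compare its output directly with the correct answer. The hidden layers in the middle have no correct answer to compare against, so they were a mystery.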

This paper solved that mystery.


The one idea you need to understand

Imagine you take a test and get a score. You want to know: which specific answers caused me to lose marks?

You go through the test backwards. Question 10 cost me 2 marks. Question 7 cost me 5 marks. Now I know where to study.

Backpropagation does the same thing for a neural network. After the network makes a prediction, you measure the error. Then you go backwards through the network and ask: how much did each weight contribute to this error?

The math that makes this possible is called the chain rule.
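
To see what "how much did each weight contribute" means in numbers, here is a small sketch that answers the question the slow way: nudge one weight at a time and watch how the error moves. This is not backpropagation itself. It is a brute-force check, assuming a single neuron with no squashing and a squared error. All names and values are hypothetical.

```python
def predict(weights, inputs):
    # One neuron, plain weighted sum (no squashing, to keep the numbers easy).
    return sum(w * x for w, x in zip(weights, inputs))

def error(weights, inputs, target):
    # Squared error, halved (an assumed choice of error measure).
    return 0.5 * (predict(weights, inputs) - target) ** 2

def blame_by_nudging(weights, inputs, target, eps=1e-6):
    """Estimate how much the error changes per unit change in each weight."""
    base = error(weights, inputs, target)
    blame = []
    for i in range(len(weights)):
        nudged = list(weights)
        nudged[i] += eps  # change exactly one weight by a tiny amount
        blame.append((error(nudged, inputs, target) - base) / eps)
    return blame

weights = [0.5, -0.3]
inputs = [1.0, 2.0]
target = 1.0
blame = blame_by_nudging(weights, inputs, target)
print(blame)  # roughly [-1.1, -2.2]
```

The second weight gets about twice the blame because its input is twice as large. Nudging every weight separately is far too slow for a real network, which is why a single backward pass that produces the same numbers is so valuable.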


The chain rule in one sentence

If A affects B, and B affects C, then the effect of A on C is just the two effects multiplied together.

That is it. The whole algorithm is built on this one idea, applied many times across many layers.
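
A quick numeric check of that sentence, as a sketch. The two functions are made up for illustration: A affects B by tripling it, and B affects C by squaring it. The chain rule answer is compared against a direct measurement made by nudging A.

```python
def b_of_a(a):
    return 3.0 * a      # A affects B

def c_of_b(b):
    return b ** 2       # B affects C

a = 2.0
b = b_of_a(a)           # b is 6.0

effect_a_on_b = 3.0                         # how fast B changes when A changes
effect_b_on_c = 2.0 * b                     # how fast C changes when B changes
chain_rule = effect_a_on_b * effect_b_on_c  # multiply the two effects

# Direct measurement: nudge A a tiny bit and see how much C moves.
eps = 1e-6
measured = (c_of_b(b_of_a(a + eps)) - c_of_b(b_of_a(a))) / eps

print(chain_rule)  # 36.0
print(measured)    # very close to 36
```

In a network, the chain is longer (weight, neuron, next neuron, error), but the move is the same: multiply the effects along the path.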


Why XOR was the proof

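XOR is the classic problem used to show that the algorithm works. The same authors worked through it in a longer 1986 book chapter on backpropagation; the 4 page paper itself demonstrates the method on related tasks, such as detecting symmetry in a pattern.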

XOR is simple. You give it two inputs. If they are the same (both 0 or both 1), the answer is 0. If they are different, the answer is 1.
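
The whole problem fits in four rows. A small sketch that builds the table with Python's built-in `^` operator, which computes XOR on integers:

```python
# All four input pairs and the XOR answer for each one.
xor_table = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)]

for (a, b), answer in xor_table:
    print(a, b, "->", answer)
```

These four rows are the entire training set for the XOR problem.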

A single layer network cannot solve this. No single straight line can separate the four points correctly. You need a hidden layer to bend the space.
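
You can get a feel for this with a brute-force search. The sketch below assumes a single neuron with a hard threshold (output 1 if the weighted sum is above zero) and tries every combination of two weights and a bias on a coarse grid. A grid search is not a proof, only an illustration, and the grid range is an arbitrary choice.

```python
from itertools import product

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def count_correct(w1, w2, bias):
    """How many of the four XOR rows one threshold neuron gets right."""
    correct = 0
    for (x1, x2), target in XOR:
        prediction = 1 if w1 * x1 + w2 * x2 + bias > 0 else 0
        correct += prediction == target
    return correct

# Try every weight and bias from -2.0 to 2.0 in steps of 0.25.
grid = [i * 0.25 for i in range(-8, 9)]
best = max(count_correct(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
print(best)  # 3
```

The best any setting manages is 3 out of 4. Whichever line you draw, one of the four points ends up on the wrong side.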

Before backpropagation became widely known, there was no standard way to train that hidden layer. The authors showed that backpropagation could do it: the network learned to solve XOR on its own, just from examples.


The full training loop in plain words

Step 1. Set all weights to small random numbers.

Step 2. Feed one training example through the network. Get a prediction.

Step 3. Measure how wrong the prediction was. This is the error.

Step 4. Go backwards through every layer. Use the chain rule to compute how much each weight contributed to the error.

Step 5. Nudge each weight a tiny amount in the direction that reduces the error. The size of the nudge is controlled by a number called the learning rate.

Step 6. Repeat steps 2 to 5 thousands of times until the error is small enough.
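
The six steps above can be sketched in plain Python. This is a minimal illustration, not the paper's exact setup: the network size (2 inputs, 3 hidden neurons, 1 output), the sigmoid squashing, the squared error, the learning rate, the seed, and the number of repeats are all assumptions I picked for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

XOR = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
HIDDEN = 3            # number of hidden neurons (assumed)
LEARNING_RATE = 0.5   # size of each nudge (assumed)

def init_weights(rng):
    # Step 1: small random weights. The last entry in each list is a bias.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(HIDDEN)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN + 1)]
    return w_hidden, w_out

def forward(x, w_hidden, w_out):
    # Step 2: feed one example through the hidden layer, then the output neuron.
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    y = sigmoid(sum(w_out[j] * h[j] for j in range(HIDDEN)) + w_out[HIDDEN])
    return h, y

def total_error(w_hidden, w_out):
    # Step 3 for all four examples: half the squared difference, summed.
    return sum(0.5 * (forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in XOR)

def train(epochs=10000, seed=0):
    rng = random.Random(seed)
    w_hidden, w_out = init_weights(rng)
    start_error = total_error(w_hidden, w_out)
    for _ in range(epochs):                      # Step 6: repeat
        for x, t in XOR:
            h, y = forward(x, w_hidden, w_out)
            # Step 4: chain rule, going backwards from the output.
            delta_out = (y - t) * y * (1.0 - y)
            delta_hidden = [delta_out * w_out[j] * h[j] * (1.0 - h[j])
                            for j in range(HIDDEN)]
            # Step 5: nudge every weight against its share of the error.
            for j in range(HIDDEN):
                w_out[j] -= LEARNING_RATE * delta_out * h[j]
                w_hidden[j][0] -= LEARNING_RATE * delta_hidden[j] * x[0]
                w_hidden[j][1] -= LEARNING_RATE * delta_hidden[j] * x[1]
                w_hidden[j][2] -= LEARNING_RATE * delta_hidden[j]
            w_out[HIDDEN] -= LEARNING_RATE * delta_out
    return w_hidden, w_out, start_error, total_error(w_hidden, w_out)

w_hidden, w_out, start_error, end_error = train()
print("error before:", start_error, "error after:", end_error)
for x, t in XOR:
    print(x, "target", t, "prediction", round(forward(x, w_hidden, w_out)[1], 3))
```

How close the predictions get to 0 and 1 depends on the seed, the learning rate, and how long you train, so treat the printed numbers as something to experiment with rather than a fixed result.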

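That is the full training recipe. Strictly speaking, backpropagation is step 4, the part that works out each weight's share of the error, and step 5 is gradient descent. Together they train nearly every AI model in use today.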


What shocked me about this paper

The idea is simple. The math is not hard. The paper is only 4 pages.

And yet before 1986, many researchers doubted that networks with hidden layers could be trained in practice. This paper showed they could, and it started a chain of events that led to ChatGPT, image generation, voice assistants, and much more.

The lesson: sometimes the most important ideas are simple ones that nobody had bothered to write down clearly.