The loss curve never lies. Here is what backpropagation looks like when it learns.

Day 3: I will add a loss curve, visualize the gradients in each layer, and push the full code to GitLab. You will see why gradients in early layers are much smaller than gradients in late layers. This matters for every deep model that exists.

Day 3: visualizing the 1986 paper and pushing to GitLab

On Day 2 I wrote the code. Today I ran it, plotted everything, and pushed it to GitLab.

Two charts. One big insight. One honest reflection.

If you missed Day 1 and Day 2


What the loss curve actually looks like

This is the most important chart in machine learning. Every training run produces one. Now you know how to read it.

The x axis is the training step. The y axis is the error. Lower is better.

Starting loss was 0.318. That is what you get with random weights. The network is essentially guessing.

For the first 2,000 steps almost nothing happens. The loss barely moves. This surprised me when I first saw it. The network looks stuck.

Then around step 2,500 something changes. The loss starts dropping. By step 5,000 it is at 0.010. By step 10,000 it is at 0.0023.

That is a 99.3 percent reduction from start to finish.
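The percentage is easy to check by hand. A minimal sketch of the arithmetic, using only the two loss values quoted above (the variable names are mine, not from the repo):

```python
start_loss = 0.318    # loss at step 0, as reported above
final_loss = 0.0023   # loss at step 10,000, as reported above

# Relative reduction: how much of the starting error is gone.
reduction = (start_loss - final_loss) / start_loss

print(f"{reduction:.1%}")
```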

Why does it stay flat for so long? The network is in a flat region of the error surface. The gradients are tiny, so the weight updates are tiny, so the loss barely moves. Then the weights reach a position where the gradients are larger, and the network finds the right direction. This is why training can feel like nothing is working and then suddenly it works.
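The plateau-then-drop shape can be reproduced on a much simpler problem. The sketch below runs gradient descent on a one-dimensional toy surface that I picked only for illustration (`toy_loss`, the starting point, and the learning rate are all assumptions; this is not the network's real error surface). It starts on the flat part, where the gradient is tiny, and the loss stays near 1 for hundreds of steps before falling.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def toy_loss(w):
    # Hypothetical 1-D surface: flat and high for very negative w, low for positive w.
    return 1.0 - sigmoid(w)

def toy_grad(w):
    # Derivative of toy_loss with respect to w.
    s = sigmoid(w)
    return -s * (1.0 - s)

w = -6.0   # start deep in the flat region (assumed starting point)
lr = 1.0   # assumed learning rate
history = []
for step in range(1000):
    history.append(toy_loss(w))
    w -= lr * toy_grad(w)   # tiny gradient means tiny update

# Print a few checkpoints to see the plateau followed by the drop.
for step in (0, 200, 400, 600, 999):
    print(step, round(history[step], 4))
```

Nothing about the update rule changes between the flat phase and the fast phase. Only the size of the gradient does.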


The gradient sizes reveal a real problem

This is the chart that taught me the most.

The purple line is the gradient at W2, the output layer. The orange dashed line is the gradient at W1, the first layer.

The W1 gradient is always smaller than the W2 gradient. At the end of training W2 has a gradient of 0.001217 while W1 has a gradient of 0.000590. W1 is receiving about half the signal that W2 gets (0.000590 / 0.001217 ≈ 0.48).

This is the vanishing gradient problem. It is not a bug in our code. It is a fundamental property of backpropagation through sigmoid activations.
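To see where the shrinkage comes from, it helps to write out the chain rule for a two-layer sigmoid network. The notation here is my assumption, not taken from the repo: input x, hidden activation h, prediction ŷ, squared error E, and biases left out for brevity.

```latex
\begin{aligned}
z_1 &= W_1 x, & h &= \sigma(z_1) \\
z_2 &= W_2 h, & \hat{y} &= \sigma(z_2), \qquad E = \tfrac{1}{2}\lVert \hat{y} - y \rVert^2 \\
\delta_2 &= (\hat{y} - y) \odot \sigma'(z_2), & \frac{\partial E}{\partial W_2} &= \delta_2\, h^\top \\
\delta_1 &= (W_2^\top \delta_2) \odot \sigma'(z_1), & \frac{\partial E}{\partial W_1} &= \delta_1\, x^\top \\
\sigma'(z) &= \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}
\end{aligned}
```

The W1 gradient carries one more factor of σ′ than the W2 gradient, and that factor is never larger than 1/4. Each extra sigmoid layer adds another such factor.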

With two layers, half the signal is fine. The network still learns.

With 50 layers the first layer might receive a gradient of 0.000000001. It learns almost nothing. The deeper the network, the worse this gets.
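A rough way to get a feel for how fast this compounds is the sketch below. It assumes a chain with one unit per layer, every weight equal to 1.0, and every pre-activation at 0, which is the best case for the sigmoid derivative. `backward_scale` is a hypothetical helper, and real networks have weight matrices that can soften or worsen the effect.

```python
import math

def sigmoid_prime(z):
    # Derivative of the sigmoid; its maximum is 0.25, reached at z = 0.
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def backward_scale(depth, weight=1.0, z=0.0):
    # Factor applied to a gradient after it passes back through
    # `depth` sigmoid layers in a one-unit-per-layer chain (assumed setup).
    return (weight * sigmoid_prime(z)) ** depth

for depth in (1, 2, 10, 50):
    print(depth, f"{backward_scale(depth):.3e}")
```

Even in this best case the factor for 50 layers is far below the 0.000000001 mentioned above.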

This is why researchers spent nearly 30 years after this 1986 paper trying to fix it. The solutions that eventually worked were ReLU activations, skip connections in ResNets, and batch normalization. Every one of those inventions exists because of what you are seeing in this chart right now.


What the code looks like after 3 days

The full repo is live on GitLab: paperbyhand/backpropagation-numpy-from-scratch

What is in it:

backprop.py: the clean 58-line implementation
visualize.py: the code that generates the two charts above
README.md: a short explanation of what the paper is and how to run the code

Open backprop.py first. Run it. You will see the same output I got. Then look at the two charts.


What I understand now that I did not three days ago

Before reading this paper I knew that neural networks learn from data. I did not understand how.

Now I know. You make a prediction. You measure the error. You use the chain rule to compute how much each weight contributed to that error. You nudge each weight a tiny amount in the direction that reduces the error. You repeat.
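Here is that loop as a minimal sketch in plain Python. It is not the backprop.py from the repo: it assumes one unit per layer, a single training example, no biases, and hand-picked starting weights, so that every step of the chain rule fits on one line.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    h = sigmoid(w1 * x)        # hidden activation
    y_hat = sigmoid(w2 * h)    # prediction
    return h, y_hat

def loss(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return 0.5 * (y_hat - y) ** 2   # squared error

def gradients(w1, w2, x, y):
    h, y_hat = forward(w1, w2, x)
    # Chain rule, output layer first.
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
    grad_w2 = delta2 * h
    # Push the error signal back through the hidden sigmoid.
    delta1 = delta2 * w2 * h * (1.0 - h)
    grad_w1 = delta1 * x
    return grad_w1, grad_w2

w1, w2 = 0.5, -0.3   # assumed starting weights
x, y = 1.0, 1.0      # one made-up training example
lr = 0.5             # assumed learning rate

losses = []
for step in range(1000):
    losses.append(loss(w1, w2, x, y))     # measure the error
    g1, g2 = gradients(w1, w2, x, y)      # who contributed how much
    w1 -= lr * g1                         # nudge each weight downhill
    w2 -= lr * g2

print(losses[0], losses[-1])
```

The minus sign in the update is the whole point: each weight moves against its gradient, which is the direction that lowers the error.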

That is it. That is the entire foundation of modern AI.

ChatGPT, Gemini, DeepSeek, Stable Diffusion. All of them start with this exact loop, described in 1986 in a paper that is four pages long.

Papers by Hand

Part 2 of 2

Reading the classic ML research papers that built modern AI. Each paper gets 3 days: read and understand, implement from scratch, visualize and publish to GitLab.
