A pure NumPy implementation of a two-layer neural network trained on MNIST handwritten digit classification. No frameworks, no autograd -- just matrix math and manually derived gradients.
This project implements a complete neural network training pipeline using only NumPy: forward propagation, backward propagation, and gradient descent. Every gradient is derived by hand and implemented as explicit matrix operations. The goal is to understand exactly what happens inside a neural network, without any abstraction layers.
Results: ~87.7% training accuracy, ~85.8% test accuracy on MNIST after 750 iterations.
Input (784 pixels) -> Linear -> ReLU (10 units) -> Linear -> Softmax (10 classes) -> Prediction
| Parameter | Shape | Count |
|---|---|---|
| W1 | (10, 784) | 7,840 |
| b1 | (10, 1) | 10 |
| W2 | (10, 10) | 100 |
| b2 | (10, 1) | 10 |
| Total |  | 7,960 |
The hidden layer has only 10 neurons -- deliberately minimal to force the network to learn compact representations and to keep the math tractable for manual derivation.
Z1 = W1 * X + b1 # Linear transform (10, 784) x (784, m) = (10, m)
A1 = ReLU(Z1) # Activation: max(0, z) element-wise
Z2 = W2 * A1 + b2 # Linear transform (10, 10) x (10, m) = (10, m)
A2 = Softmax(Z2) # Probability distribution over 10 classes
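The forward pass above can be sketched in NumPy roughly as follows. This is a minimal illustration, not the repo's exact code: the helper names are mine, and the per-column max subtraction in the softmax is a standard numerical-stability trick that I've added (the repo normalizes over columns via the built-in `sum()`, as discussed in the implementation notes).

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z)
    return np.maximum(0, z)

def softmax(z):
    # Column-wise softmax: each column is one example.
    # Subtracting the per-column max avoids overflow in exp().
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward(W1, b1, W2, b2, X):
    Z1 = W1 @ X + b1   # (10, 784) @ (784, m) -> (10, m)
    A1 = relu(Z1)
    Z2 = W2 @ A1 + b2  # (10, 10) @ (10, m) -> (10, m)
    A2 = softmax(Z2)   # each column sums to 1
    return Z1, A1, Z2, A2
```

Because examples live in columns, every shape flows through as `(units, m)` with no reshaping.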
dZ2 = A2 - Y_onehot # Output error (softmax + cross-entropy derivative)
dW2 = (1/m) * dZ2 * A1.T # Weight gradient for layer 2
db2 = (1/m) * sum(dZ2) # Bias gradient for layer 2
dZ1 = (W2.T * dZ2) * ReLU'(Z1) # Matrix product W2.T * dZ2, then element-wise ReLU mask
dW1 = (1/m) * dZ1 * X.T # Weight gradient for layer 1
db1 = (1/m) * sum(dZ1) # Bias gradient for layer 1
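The backward pass maps onto NumPy almost line for line. A sketch (function name is mine; `Z1 > 0` implements the ReLU derivative as a binary mask):

```python
import numpy as np

def backward(W2, Z1, A1, A2, X, Y_onehot):
    m = X.shape[1]                   # number of examples (columns)
    dZ2 = A2 - Y_onehot              # softmax + cross-entropy gradient
    dW2 = (1 / m) * dZ2 @ A1.T
    db2 = (1 / m) * dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)   # matrix product, then element-wise ReLU mask
    dW1 = (1 / m) * dZ1 @ X.T
    db1 = (1 / m) * dZ1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```

Note that each gradient has exactly the same shape as the parameter it updates, which is a useful sanity check when deriving these by hand.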
W = W - alpha * dW # Gradient descent (alpha = learning rate)
b = b - alpha * db
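The update rule is identical for every parameter. A toy scalar example (not from the repo) makes the behavior concrete: minimizing f(w) = (w - 3)^2, whose gradient is 2(w - 3), pulls w toward 3.

```python
# Vanilla gradient descent on a toy objective f(w) = (w - 3)**2.
# The same rule "param -= alpha * grad" updates W1, b1, W2, b2.
w, alpha = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - alpha * grad
```

Each step shrinks the distance to the minimum by a constant factor (1 - 2 * alpha), so w converges geometrically to 3.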
Three critical details that affect whether training succeeds at all:

- Transposition: Data is transposed so each column is one example (`x_train.shape = (784, 41000)`). This makes the matrix math cleaner: `W * X` naturally produces activations in column-per-example format.
- Normalization: Pixel values are divided by 255 to map from [0, 255] to [0, 1]. Without this, weight updates explode and produce NaN values within a few iterations.
- One-hot encoding: Labels are converted from scalars (e.g., `7`) to one-hot vectors (e.g., `[0,0,0,0,0,0,0,1,0,0]`) for computing the output error.
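One-hot encoding in the column-per-example layout can be done with a single fancy-indexing assignment. A sketch (function name is mine):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # labels: 1-D array of class indices, e.g. [5, 0, 7].
    # Returns a (num_classes, m) matrix with a single 1 per column,
    # matching the column-per-example layout used throughout.
    m = labels.size
    Y = np.zeros((num_classes, m))
    Y[labels, np.arange(m)] = 1
    return Y
```

With this shape, `A2 - Y_onehot` lines up element for element with the network's output.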
conda create --name NN-FS python=3.8
conda activate NN-FS
pip install numpy pandas matplotlib

Download train.csv from Kaggle MNIST and place it in a mnist_data/ folder:
├── Neural_Network_from_Scratch.py
├── Neural_Network_from_Scratch.ipynb
└── mnist_data/
└── train.csv
python Neural_Network_from_Scratch.py

The script will:
- Load and preprocess MNIST data (41,000 train / 1,000 test split)
- Train for 750 iterations with learning rate 0.1
- Plot loss and accuracy curves
- Display predictions on individual test images
- Print final test set accuracy
- `sum()` vs `np.sum()` in Softmax: The implementation uses Python's built-in `sum()` rather than `np.sum()` so that the sum is taken over individual columns (examples) rather than the entire matrix. This is a subtle but critical distinction.
- Weight initialization: Weights are initialized from `np.random.rand() - 0.5`, giving values in [-0.5, 0.5). This keeps initial activations small enough to avoid saturation.
- No batch processing: The full training set is used in every gradient computation (batch gradient descent, not mini-batch or stochastic).
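The `sum()` vs `np.sum()` distinction is easy to verify on a small matrix. Calling the built-in `sum()` on a 2-D array iterates over rows and adds them, yielding per-column totals; `np.sum()` with no `axis` argument collapses the whole matrix to one scalar:

```python
import numpy as np

Z = np.array([[1.0, 2.0],
              [3.0, 4.0]])
E = np.exp(Z)

# Built-in sum() adds the rows together -> one total per column,
# equivalent to np.sum(E, axis=0), which is what column-wise softmax needs.
col_sums = sum(E)       # shape (2,)

# np.sum() without axis= reduces everything to a single scalar, which
# would normalize all examples jointly -- a bug in this context.
grand_total = np.sum(E)
```

So the built-in happens to do the right thing here, though `np.sum(E, axis=0, keepdims=True)` states the intent more explicitly.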
- 10-neuron bottleneck: The hidden layer with only 10 neurons limits representational capacity. A 128- or 256-neuron hidden layer would improve accuracy significantly.
- No mini-batching: Uses the full dataset for every gradient update, which is slower to converge than mini-batch SGD.
- No regularization: No dropout, weight decay, or other regularization techniques.
- Fixed learning rate: No learning rate scheduling or adaptive optimizers (Adam, etc.).
- Single hidden layer: A deeper network would capture more complex features.
The repository includes a companion paper, "Neural Networks From Scratch: The Core Mathematics," which provides formal derivations of every gradient used in the backward propagation step. The paper covers:
- The chain rule applied to matrix operations
- Derivation of the softmax + cross-entropy gradient
- Why ReLU's derivative is simply a binary mask
- How the (1/m) normalization factor ensures stable gradient magnitudes
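As a taste of the first two items, here is the standard softmax + cross-entropy result in compact form (a sketch, not a substitute for the paper's full derivation):

```latex
% Cross-entropy loss for one example with one-hot target y,
% where a_k is the softmax output:
%   L = -\sum_k y_k \log a_k, \qquad a_k = \frac{e^{z_k}}{\sum_j e^{z_j}}
% Differentiating L through the softmax collapses to the simple form
% implemented as dZ2 in the backward pass:
\frac{\partial L}{\partial z_i}
  = \sum_k \frac{\partial L}{\partial a_k}\,\frac{\partial a_k}{\partial z_i}
  = a_i - y_i
```

This is why the backward pass starts from `dZ2 = A2 - Y_onehot` with no explicit softmax Jacobian.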
Built as an educational project to understand the core mathematics of neural networks without relying on autograd or framework abstractions. Pairs with sys_micrograd, which takes the complementary approach of building an autograd engine that handles the gradient computation automatically.
MIT
