A pure NumPy implementation of a two-layer neural network trained on MNIST handwritten digit classification. No frameworks, no autograd -- just matrix math and manually derived gradients.
This project implements a complete neural network training pipeline using only NumPy: forward propagation, backward propagation, and gradient descent. Every gradient is derived by hand and implemented as explicit matrix operations. The goal is to understand exactly what happens inside a neural network, without any abstraction layers.
Results: ~87.7% training accuracy, ~85.8% test accuracy on MNIST after 750 iterations.
Input (784 pixels) -> Linear -> ReLU (10 units) -> Linear -> Softmax (10 classes) -> Prediction
| Parameter | Shape | Count |
|---|---|---|
| W1 | (10, 784) | 7,840 |
| b1 | (10, 1) | 10 |
| W2 | (10, 10) | 100 |
| b2 | (10, 1) | 10 |
| Total |  | 7,960 |
The hidden layer has only 10 neurons -- deliberately minimal to force the network to learn compact representations and to keep the math tractable for manual derivation.
Z1 = W1 * X + b1 # Linear transform (10, 784) x (784, m) = (10, m)
A1 = ReLU(Z1) # Activation: max(0, z) element-wise
Z2 = W2 * A1 + b2 # Linear transform (10, 10) x (10, m) = (10, m)
A2 = Softmax(Z2) # Probability distribution over 10 classes
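The forward pass above can be sketched in NumPy roughly as follows. This is a minimal illustration, not the repo's exact code: the helper names are mine, and the per-column max subtraction in the softmax is a standard numerical-stability trick that I've added (the repo normalizes over columns via the built-in `sum()`, as discussed in the implementation notes).

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z)
    return np.maximum(0, z)

def softmax(z):
    # Column-wise softmax: each column is one example.
    # Subtracting the per-column max avoids overflow in exp().
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward(W1, b1, W2, b2, X):
    Z1 = W1 @ X + b1   # (10, 784) @ (784, m) -> (10, m)
    A1 = relu(Z1)
    Z2 = W2 @ A1 + b2  # (10, 10) @ (10, m) -> (10, m)
    A2 = softmax(Z2)   # each column sums to 1
    return Z1, A1, Z2, A2
```

Because examples live in columns, every shape flows through as `(units, m)` with no reshaping.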
dZ2 = A2 - Y_onehot # Output error (softmax + cross-entropy derivative)
dW2 = (1/m) * dZ2 * A1.T # Weight gradient for layer 2
db2 = (1/m) * sum(dZ2) # Bias gradient for layer 2
dZ1 = (W2.T * dZ2) * ReLU'(Z1) # Matrix product W2.T * dZ2, then element-wise ReLU mask
dW1 = (1/m) * dZ1 * X.T # Weight gradient for layer 1
db1 = (1/m) * sum(dZ1) # Bias gradient for layer 1
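The backward pass maps onto NumPy almost line for line. A sketch (function name is mine; `Z1 > 0` implements the ReLU derivative as a binary mask):

```python
import numpy as np

def backward(W2, Z1, A1, A2, X, Y_onehot):
    m = X.shape[1]                   # number of examples (columns)
    dZ2 = A2 - Y_onehot              # softmax + cross-entropy gradient
    dW2 = (1 / m) * dZ2 @ A1.T
    db2 = (1 / m) * dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)   # matrix product, then element-wise ReLU mask
    dW1 = (1 / m) * dZ1 @ X.T
    db1 = (1 / m) * dZ1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```

Note that each gradient has exactly the same shape as the parameter it updates, which is a useful sanity check when deriving these by hand.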
W = W - alpha * dW # Gradient descent (alpha = learning rate)
b = b - alpha * db
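The update rule is identical for every parameter. A toy scalar example (not from the repo) makes the behavior concrete: minimizing f(w) = (w - 3)^2, whose gradient is 2(w - 3), pulls w toward 3.

```python
# Vanilla gradient descent on a toy objective f(w) = (w - 3)**2.
# The same rule "param -= alpha * grad" updates W1, b1, W2, b2.
w, alpha = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - alpha * grad
```

Each step shrinks the distance to the minimum by a constant factor (1 - 2 * alpha), so w converges geometrically to 3.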
Three critical details that affect whether training succeeds at all:

- Transposition: Data is transposed so each column is one example (`x_train.shape = (784, 41000)`). This makes the matrix math cleaner: `W * X` naturally produces activations in column-per-example format.
- Normalization: Pixel values are divided by 255 to map from [0, 255] to [0, 1]. Without this, weight updates explode and produce NaN values within a few iterations.
- One-hot encoding: Labels are converted from scalars (e.g., `7`) to one-hot vectors (e.g., `[0,0,0,0,0,0,0,1,0,0]`) for computing the output error.
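One-hot encoding in the column-per-example layout can be done with a single fancy-indexing assignment. A sketch (function name is mine):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # labels: 1-D array of class indices, e.g. [5, 0, 7].
    # Returns a (num_classes, m) matrix with a single 1 per column,
    # matching the column-per-example layout used throughout.
    m = labels.size
    Y = np.zeros((num_classes, m))
    Y[labels, np.arange(m)] = 1
    return Y
```

With this shape, `A2 - Y_onehot` lines up element for element with the network's output.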
conda create --name NN-FS python=3.8
conda activate NN-FS
pip install numpy pandas matplotlib

Download train.csv from Kaggle MNIST and place it in a mnist_data/ folder:
├── Neural_Network_from_Scratch.py
├── Neural_Network_from_Scratch.ipynb
└── mnist_data/
└── train.csv
python Neural_Network_from_Scratch.py

The script will:
- Load and preprocess MNIST data (41,000 train / 1,000 test split)
- Train for 750 iterations with learning rate 0.1
- Plot loss and accuracy curves
- Display predictions on individual test images
- Print final test set accuracy
- `sum()` vs `np.sum()` in Softmax: The implementation uses Python's built-in `sum()` rather than `np.sum()` so that the sum is taken over individual columns (examples) rather than the entire matrix. This is a subtle but critical distinction.
- Weight initialization: Weights are initialized from `np.random.rand() - 0.5`, giving values in [-0.5, 0.5). This keeps initial activations small enough to avoid saturation.
- No batch processing: The full training set is used in every gradient computation (batch gradient descent, not mini-batch or stochastic).
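The `sum()` vs `np.sum()` distinction is easy to verify on a small matrix. Calling the built-in `sum()` on a 2-D array iterates over rows and adds them, yielding per-column totals; `np.sum()` with no `axis` argument collapses the whole matrix to one scalar:

```python
import numpy as np

Z = np.array([[1.0, 2.0],
              [3.0, 4.0]])
E = np.exp(Z)

# Built-in sum() adds the rows together -> one total per column,
# equivalent to np.sum(E, axis=0), which is what column-wise softmax needs.
col_sums = sum(E)       # shape (2,)

# np.sum() without axis= reduces everything to a single scalar, which
# would normalize all examples jointly -- a bug in this context.
grand_total = np.sum(E)
```

So the built-in happens to do the right thing here, though `np.sum(E, axis=0, keepdims=True)` states the intent more explicitly.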
- 10-neuron bottleneck: The hidden layer with only 10 neurons limits representational capacity. A 128- or 256-neuron hidden layer would improve accuracy significantly.
- No mini-batching: Uses the full dataset for every gradient update, which is slower to converge than mini-batch SGD.
- No regularization: No dropout, weight decay, or other regularization techniques.
- Fixed learning rate: No learning rate scheduling or adaptive optimizers (Adam, etc.).
- Single hidden layer: A deeper network would capture more complex features.
The repository includes a companion paper, "Neural Networks From Scratch: The Core Mathematics," which provides formal derivations of every gradient used in the backward propagation step. The paper covers:
- The chain rule applied to matrix operations
- Derivation of the softmax + cross-entropy gradient
- Why ReLU's derivative is simply a binary mask
- How the (1/m) normalization factor ensures stable gradient magnitudes
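As a taste of the first two items, here is the standard softmax + cross-entropy result in compact form (a sketch, not a substitute for the paper's full derivation):

```latex
% Cross-entropy loss for one example with one-hot target y,
% where a_k is the softmax output:
%   L = -\sum_k y_k \log a_k, \qquad a_k = \frac{e^{z_k}}{\sum_j e^{z_j}}
% Differentiating L through the softmax collapses to the simple form
% implemented as dZ2 in the backward pass:
\frac{\partial L}{\partial z_i}
  = \sum_k \frac{\partial L}{\partial a_k}\,\frac{\partial a_k}{\partial z_i}
  = a_i - y_i
```

This is why the backward pass starts from `dZ2 = A2 - Y_onehot` with no explicit softmax Jacobian.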
Built as an educational project to understand the core mathematics of neural networks without relying on autograd or framework abstractions. Pairs with sys_micrograd, which takes the complementary approach of building an autograd engine that handles the gradient computation automatically.
MIT
