ShlomoStept/Neural_Networks_from_Scratch

Neural Network from Scratch

A pure NumPy implementation of a two-layer neural network trained on MNIST handwritten digit classification. No frameworks, no autograd -- just matrix math and manually derived gradients.

Overview

This project implements a complete neural network training pipeline using only NumPy: forward propagation, backward propagation, and gradient descent. Every gradient is derived by hand and implemented as explicit matrix operations. The goal is to understand exactly what happens inside a neural network, without any abstraction layers.

Results: ~87.7% training accuracy, ~85.8% test accuracy on MNIST after 750 iterations.

Architecture

Input (784 pixels) -> Linear -> ReLU (10 units) -> Linear -> Softmax (10 classes) -> Prediction
| Parameter | Shape | Count |
|-----------|-----------|-------|
| W1 | (10, 784) | 7,840 |
| b1 | (10, 1) | 10 |
| W2 | (10, 10) | 100 |
| b2 | (10, 1) | 10 |
| **Total** | | 7,960 |

The hidden layer has only 10 neurons -- deliberately minimal to force the network to learn compact representations and to keep the math tractable for manual derivation.
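The shapes in the table can be produced with a few lines of NumPy. A minimal sketch (the script's actual helper name may differ; this assumes the `np.random.rand() - 0.5` scheme described under Key Implementation Details, with a seeded generator added here for reproducibility):

```python
import numpy as np

def init_params(seed=0):
    # Initialize all weights and biases uniformly in [-0.5, 0.5),
    # matching the shapes in the table above (7,960 parameters total).
    rng = np.random.default_rng(seed)  # seeded for reproducibility; not in the original
    W1 = rng.random((10, 784)) - 0.5
    b1 = rng.random((10, 1)) - 0.5
    W2 = rng.random((10, 10)) - 0.5
    b2 = rng.random((10, 1)) - 0.5
    return W1, b1, W2, b2
```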

How It Works

Forward Propagation

Z1 = W1 * X + b1          # Linear transform (10, 784) x (784, m) = (10, m)
A1 = ReLU(Z1)             # Activation: max(0, z) element-wise
Z2 = W2 * A1 + b2         # Linear transform (10, 10) x (10, m) = (10, m)
A2 = Softmax(Z2)          # Probability distribution over 10 classes
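
The four lines above translate almost directly into NumPy, where `*` becomes the matrix product `@`. A sketch (function names are illustrative; the max-subtraction in `softmax` is a standard numerical-stability trick and may not match the original line for line):

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)                  # element-wise max(0, z)

def softmax(Z):
    # Subtract each column's max for numerical stability, then normalize
    # every column (one example per column) into a probability distribution.
    e = np.exp(Z - Z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward(W1, b1, W2, b2, X):
    Z1 = W1 @ X + b1                         # (10, 784) @ (784, m) -> (10, m)
    A1 = relu(Z1)
    Z2 = W2 @ A1 + b2                        # (10, 10) @ (10, m) -> (10, m)
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2
```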

Backward Propagation

dZ2 = A2 - Y_onehot       # Output error (softmax + cross-entropy derivative)
dW2 = (1/m) * dZ2 * A1.T  # Weight gradient for layer 2
db2 = (1/m) * sum(dZ2)    # Bias gradient for layer 2
dZ1 = (W2.T * dZ2) * ReLU'(Z1)  # Backpropagated error; second * is element-wise
dW1 = (1/m) * dZ1 * X.T   # Weight gradient for layer 1
db1 = (1/m) * sum(dZ1)    # Bias gradient for layer 1
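
In NumPy, the bias sums run across the example axis (`axis=1`) and the ReLU derivative is the boolean mask `Z1 > 0`. A sketch under the same naming assumptions as above:

```python
import numpy as np

def backward(Z1, A1, A2, W2, X, Y_onehot):
    m = X.shape[1]                             # number of examples (columns)
    dZ2 = A2 - Y_onehot                        # softmax + cross-entropy derivative
    dW2 = (1 / m) * dZ2 @ A1.T                 # (10, m) @ (m, 10) -> (10, 10)
    db2 = (1 / m) * dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)              # ReLU derivative is a binary mask
    dW1 = (1 / m) * dZ1 @ X.T                  # (10, m) @ (m, 784) -> (10, 784)
    db1 = (1 / m) * dZ1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```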

Parameter Update

W = W - alpha * dW         # Gradient descent (alpha = learning rate)
b = b - alpha * db
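
The update rule is a one-liner per parameter. A sketch applying it uniformly (the helper name is illustrative; `alpha=0.1` matches the learning rate stated in the Run section):

```python
def update(params, grads, alpha=0.1):
    # One gradient-descent step: p <- p - alpha * dp for every parameter.
    return [p - alpha * dp for p, dp in zip(params, grads)]
```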

Data Preprocessing

Three critical details that affect whether training succeeds at all:

  1. Transposition: Data is transposed so each column is one example (x_train.shape = (784, 41000)). This makes the matrix math cleaner: W * X naturally produces activations in column-per-example format.

  2. Normalization: Pixel values are divided by 255 to map from [0, 255] to [0, 1]. Without this, weight updates explode and produce NaN values within a few iterations.

  3. One-hot encoding: Labels are converted from scalars (e.g., 7) to one-hot vectors (e.g., [0,0,0,0,0,0,0,1,0,0]) for computing the output error.
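
All three steps fit in a few lines. A sketch (function and argument names are illustrative, not the script's own):

```python
import numpy as np

def preprocess(pixels, labels):
    # pixels: (m, 784) array of raw [0, 255] values; labels: (m,) digit classes.
    X = pixels.T / 255.0                   # transpose to one example per column, scale to [0, 1]
    Y = np.zeros((10, labels.size))
    Y[labels, np.arange(labels.size)] = 1  # one-hot: a 1 at each label's row
    return X, Y
```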

Usage

Setup

conda create --name NN-FS python=3.8
conda activate NN-FS
pip install numpy pandas matplotlib

Data

Download train.csv from Kaggle MNIST and place it in a mnist_data/ folder:

├── Neural_Network_from_Scratch.py
├── Neural_Network_from_Scratch.ipynb
└── mnist_data/
    └── train.csv

Run

python Neural_Network_from_Scratch.py

The script will:

  1. Load and preprocess MNIST data (41,000 train / 1,000 test split)
  2. Train for 750 iterations with learning rate 0.1
  3. Plot loss and accuracy curves
  4. Display predictions on individual test images
  5. Print final test set accuracy

Key Implementation Details

  • sum() vs np.sum() in Softmax: The implementation uses Python's built-in sum() rather than np.sum() so that the sum is taken over individual columns (examples) rather than the entire matrix. This is a subtle but critical distinction.

  • Weight initialization: Weights are initialized from np.random.rand() - 0.5, giving values in [-0.5, 0.5). This keeps initial activations small enough to avoid saturation.

  • No batch processing: The full training set is used in every gradient computation (batch gradient descent, not mini-batch or stochastic).
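
The `sum()` vs `np.sum()` distinction is easy to demonstrate: Python's built-in `sum()` iterates over the first axis, so on a 2-D array it adds the rows together and returns per-column sums, while `np.sum()` with no `axis` argument collapses the whole matrix to a scalar:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

col_sums = sum(A)              # adds rows element-wise: per-column sums [4., 6.]
total = np.sum(A)              # no axis given: a single scalar, 10.0
explicit = np.sum(A, axis=0)   # equivalent to the built-in sum() here
```

Using `np.sum()` without `axis=0` inside Softmax would divide every entry by the grand total instead of its own column's total, silently breaking the per-example probability distributions.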

Known Limitations

  1. 10-neuron bottleneck: The hidden layer with only 10 neurons limits representational capacity. A 128 or 256 neuron hidden layer would improve accuracy significantly.
  2. No mini-batching: Uses the full dataset for every gradient update, which is slower to converge than mini-batch SGD.
  3. No regularization: No dropout, weight decay, or other regularization techniques.
  4. Fixed learning rate: No learning rate scheduling or adaptive optimizers (Adam, etc.).
  5. Single hidden layer: A deeper network would capture more complex features.

Companion Paper

The repository includes a companion paper, "Neural Networks From Scratch: The Core Mathematics," which provides formal derivations of every gradient used in the backward propagation step. The paper covers:

  • The chain rule applied to matrix operations
  • Derivation of the softmax + cross-entropy gradient
  • Why ReLU's derivative is simply a binary mask
  • How the (1/m) normalization factor ensures stable gradient magnitudes
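
As a one-line summary of the softmax + cross-entropy result (standard notation, not quoted from the paper): with cross-entropy loss $L = -\sum_k y_k \log a_k$ and $a = \mathrm{softmax}(z)$, the two derivatives combine and collapse to a single subtraction,

```latex
\frac{\partial L}{\partial z_j}
  = \sum_k \frac{\partial L}{\partial a_k}\,\frac{\partial a_k}{\partial z_j}
  = a_j - y_j
\qquad\Longrightarrow\qquad
dZ_2 = A_2 - Y_{\text{onehot}}
```

which is exactly the first line of the backward propagation above.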

Project Context

Built as an educational project to understand the core mathematics of neural networks without relying on autograd or framework abstractions. Pairs with sys_micrograd, which takes the complementary approach of building an autograd engine that handles the gradient computation automatically.

License

MIT
