A from-scratch C++ deep learning library implementing reverse-mode automatic differentiation with a dynamic computation graph. Inspired by micrograd and PyTorch.
Built to understand the internals of deep learning frameworks by implementing everything from first principles:
- No dependencies — Pure C++17, header-only library
- Educational — Clear, readable implementations over heavy optimization
- Complete — From scalar autograd to CNNs, RNNs, and optimizers
- Automatic Differentiation — Reverse-mode autodiff with a dynamic tape
- Tensor Operations — N-dimensional arrays with broadcasting (see the sketch after this list)
- Computation Graph — Visualization in DOT, ASCII, and Mermaid formats
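A minimal sketch of how the dynamic tape and broadcasting fit together. Only randn and backward() appear verbatim in the quick-start snippet further down; the arithmetic operator overload and the sum() reduction are assumptions about the API in tensor.hpp.

```cpp
#include "tensor.hpp"

// Shapes {3, 1} and {1, 4} broadcast to {3, 4}; each op is recorded
// on the dynamic tape as it executes.
auto a = randn({3, 1});
auto b = randn({1, 4});
auto c = a + b;            // assumed operator overload on tensor handles

// Reduce to a scalar and run reverse-mode autodiff; gradients for a and b
// are accumulated by walking the recorded graph backwards.
auto s = sum(c);           // assumed reduction helper
s->backward();
```

The available layer modules are listed below.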
| Layer | Description |
|---|---|
| Linear | Fully connected layer |
| Conv2D | 2D convolution with padding/stride |
| MaxPool2D / AvgPool2D | Pooling layers |
| Dropout | Regularization |
| BatchNorm1d | Batch normalization |
| RNNCell / LSTMCell / GRUCell | Recurrent layers |
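Layers slot into nn::Sequential the same way Linear does in the quick start. A rough sketch of a small convolutional stack follows; the Conv2D and MaxPool2D constructor arguments are assumptions, so check conv.hpp for the real signatures.

```cpp
#include <memory>
#include "nn.hpp"
#include "conv.hpp"

// Assumed constructor signatures: Conv2D(inChannels, outChannels, kernelSize)
// and MaxPool2D(kernelSize) -- see conv.hpp for the actual parameters.
nn::Sequential features;
features.add(std::make_shared<nn::Conv2D>(1, 8, 3));      // 1 -> 8 channels, 3x3 kernel
features.add(std::make_shared<nn::ActivationLayer>(nn::Activation::ReLU));
features.add(std::make_shared<nn::MaxPool2D>(2));         // halve spatial resolution
features.add(std::make_shared<nn::Conv2D>(8, 16, 3));
features.add(std::make_shared<nn::ActivationLayer>(nn::Activation::ReLU));
```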
- Activations — ReLU · LeakyReLU · GELU · Swish · Tanh · Sigmoid · Softplus
- Optimizers — SGD (with momentum) · Adam · RMSprop · AdaGrad
- Losses — MSE · MAE · Huber · CrossEntropy · BCE · Hinge · KLDiv
- LR Schedulers — Step, Cosine, Warmup, Exponential, ReduceOnPlateau
- Gradient Clipping — By value or norm
- Weight Init — Xavier, He/Kaiming (uniform & normal)
- Serialization — Save/load models to binary files
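Roughly how the optimizers, schedulers, gradient clipping, and serialization listed above combine in a training loop. Apart from Adam, mse, step(), zeroGrad(), and parameters(), the names here (the Adam constructor arguments, StepLR, clipGradNorm, saveModel) are guesses at the API in optimizer.hpp and serialize.hpp; model, x, and target are the ones from the quick-start snippet further down.

```cpp
#include "nn.hpp"
#include "optimizer.hpp"
#include "serialize.hpp"

// Assumed names: StepLR, clipGradNorm, saveModel -- consult the headers.
Adam optimizer(model.parameters(), 1e-3);
StepLR scheduler(optimizer, /*stepSize=*/10, /*gamma=*/0.5);

for (int epoch = 0; epoch < 50; ++epoch) {
    auto loss = mse(model.forward(x), target);
    optimizer.zeroGrad();
    loss->backward();
    clipGradNorm(model.parameters(), 1.0);   // gradient clipping by norm
    optimizer.step();
    scheduler.step();                        // apply the LR schedule
}

saveModel(model, "model.bin");               // save weights to a binary file
```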
- Padding and packing for variable-length sequences
- Attention masks
- One-hot encoding and embedding lookup
- Sliding windows
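A hedged sketch of the sequence helpers; padSequences and oneHot are assumed names for what sequence.hpp exposes, and the real signatures may differ.

```cpp
#include <vector>
#include "sequence.hpp"

// Assumed helper names -- see sequence.hpp for the real API.
std::vector<std::vector<int>> batch = {{5, 2, 9}, {7, 1}, {4}};
auto padded  = padSequences(batch, /*padValue=*/0);  // pad every sequence to the longest length
auto encoded = oneHot(padded, /*numClasses=*/10);    // one-hot encode the padded token ids
```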
- SIMD — SSE/AVX vectorization for tensor math
- Thread Pool — Async operations and parallel for loops
- OpenMP — Parallel matrix multiply, convolutions, reductions
- Memory Pool — Cache-aligned allocations, arena allocator
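The SIMD, OpenMP, and memory-pool paths are used internally by the tensor ops, but the thread pool can also be driven directly. The ThreadPool constructor and parallelFor helper below are assumptions about threadpool.hpp, not documented API.

```cpp
#include <cstddef>
#include <vector>
#include "threadpool.hpp"

// Assumed API: a ThreadPool with a parallel-for helper over an index range.
ThreadPool pool(4);
std::vector<float> data(1'000'000, 1.0f);
pool.parallelFor(0, data.size(), [&](std::size_t i) {
    data[i] *= 2.0f;   // chunks of the index range run on worker threads
});
```

A minimal end-to-end example: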
#include "tensor.hpp"
#include "nn.hpp"
#include "optimizer.hpp"
// Create a simple network
nn::Sequential model;
model.add(std::make_shared<nn::Linear>(784, 128));
model.add(std::make_shared<nn::ActivationLayer>(nn::Activation::ReLU));
model.add(std::make_shared<nn::Linear>(128, 10));
// Forward pass
auto x = randn({784, 32}); // [features, batch]
auto y = model.forward(x);
// Compute loss and backprop
auto target = randn({10, 32});  // placeholder targets, [classes, batch]
auto loss = mse(y, target);
loss->backward();
// Update weights
SGD optimizer(model.parameters(), 0.01);
optimizer.step();
optimizer.zeroGrad();
```
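The quick start shows a single update step; in practice the same calls run inside a loop over epochs (and mini-batches):

```cpp
for (int epoch = 0; epoch < 10; ++epoch) {
    auto y = model.forward(x);      // forward pass
    auto loss = mse(y, target);     // compute loss
    optimizer.zeroGrad();           // clear gradients from the previous step
    loss->backward();               // backpropagate through the dynamic graph
    optimizer.step();               // apply the SGD update
}
```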
```sh
# Header-only, just include
clang++ -std=c++17 -O2 -I include your_code.cpp -o your_program
# With OpenMP (optional)
clang++ -std=c++17 -O2 -Xpreprocessor -fopenmp -I include your_code.cpp -lomp
# With threading
clang++ -std=c++17 -O2 -pthread -I include your_code.cpp
```

```
include/
├── value.hpp # Scalar autograd engine
├── tensor.hpp # N-dimensional tensor with autograd
├── nn.hpp # Neural network modules
├── conv.hpp # CNN layers (Conv2D, pooling)
├── rnn.hpp # RNN cells (LSTM, GRU)
├── optimizer.hpp # SGD, Adam, schedulers
├── loss.hpp # Loss functions
├── serialize.hpp # Model save/load
├── visualize.hpp # Graph visualization
├── sequence.hpp # Sequence utilities
├── simd.hpp # SIMD operations
├── threadpool.hpp # Thread pool
├── parallel.hpp # OpenMP parallelization
├── mempool.hpp # Memory management
└── benchmark.hpp # Profiling tools
src/
├── tensor_demo.cpp
├── nn_demo.cpp
├── conv_demo.cpp
├── rnn_demo.cpp
├── optimizer_demo.cpp
└── ...
```
On Apple M2 (single-threaded unless noted):
| Operation | Performance |
|---|---|
| SIMD matmul 128×128 | 14 GFLOPS, 489× vs naive |
| SIMD elementwise (100K elements) | 14× vs naive |
| Parallel sum (1M elements, 12 threads) | 4× vs sequential |
| Arena allocator | 70× vs malloc |
Licensed under the MIT License.