Language model training from scratch on an Android phone.
GPT architecture, BPE tokenizer, AdamW/Lion optimizers, manual backward pass — all in ~2,500 lines of Python+numpy. Runs on Termux, no PyTorch, no GPU.
Based on the autoresearch approach: build → measure → improve → repeat.
# Termux on Android (or any ARM Linux)
pkg install python numpy
git clone https://github.com/Hybirdss/phi-auto.git && cd phi-auto
python src/engine/train.py # Train (~30 min)
python src/agent/run_experiments.py # Automated experiment sweep

Requires: Python 3.10+, numpy, ~3GB RAM, ~5GB storage
Standard GPT with RMSNorm, RoPE, SwiGLU, causal attention. Forward and backward pass both implemented manually — no autograd. Weight tying optional.
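To give a flavor of what "no autograd" means in practice, here is a minimal sketch of one layer, RMSNorm, with a hand-derived backward pass and a finite-difference check. The function names and the `eps` default are illustrative assumptions, not taken from `model.py`.

```python
import numpy as np

def rmsnorm_forward(x, g, eps=1e-6):
    # x: (..., d) activations, g: (d,) learned gain. eps is an assumed default.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    x_hat = x / rms
    return x_hat * g, (x_hat, rms, g)

def rmsnorm_backward(dy, cache):
    # dy: gradient of the loss w.r.t. the layer output, same shape as x.
    x_hat, rms, g = cache
    dg = np.sum(dy * x_hat, axis=tuple(range(dy.ndim - 1)))
    dx_hat = dy * g
    # Chain rule through x / rms(x): subtract the component along x_hat.
    dx = (dx_hat - x_hat * np.mean(dx_hat * x_hat, axis=-1, keepdims=True)) / rms
    return dx, dg

def numerical_grad(f, x, h=1e-5):
    # Central finite differences, used only to sanity-check the manual backward.
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad
```

Comparing `rmsnorm_backward` against `numerical_grad` on small float64 inputs is a cheap way to catch sign and broadcasting mistakes before they show up as a stalled loss curve.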
Byte-pair encoding. Two optimizations made during development:
- Heap-based encoding: O(N log N) instead of O(M×N). 13 min → 37 sec for 19K texts.
- Incremental pair counting during training: O(affected) instead of O(N) per merge. 24 min → 5 min.
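The heap-based encoding idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `tokenizer.py`: the `merges` layout (pair mapped to `(rank, new_id)`) and the function name are hypothetical. It keeps candidate pairs in a priority queue keyed by merge rank and uses a linked list over positions, so each merge touches only its neighbors instead of rescanning the whole sequence.

```python
import heapq

def bpe_encode_heap(tokens, merges):
    """Encode a list of token ids.

    merges: dict mapping (left_id, right_id) -> (rank, new_id); lower rank
    means the merge was learned earlier. This layout is an assumption.
    """
    n = len(tokens)
    tok = list(tokens)
    prev = list(range(-1, n - 1))   # index of the previous live position
    nxt = list(range(1, n + 1))     # index of the next live position
    if n:
        nxt[-1] = -1
    alive = [True] * n
    heap = []

    def push(i):
        # Queue the pair starting at position i if it is mergeable.
        j = nxt[i]
        if j != -1:
            entry = merges.get((tok[i], tok[j]))
            if entry is not None:
                heapq.heappush(heap, (entry[0], i, tok[i], tok[j]))

    for i in range(n):
        push(i)

    while heap:
        _, i, a, b = heapq.heappop(heap)
        j = nxt[i] if alive[i] else -1
        # Skip stale entries whose tokens changed since they were queued.
        if j == -1 or tok[i] != a or tok[j] != b:
            continue
        tok[i] = merges[(a, b)][1]
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] != -1:
            prev[nxt[j]] = i
        # Only the two neighboring pairs can have changed.
        if prev[i] != -1:
            push(prev[i])
        push(i)

    return [t for t, ok in zip(tok, alive) if ok]
```

Each position is pushed a bounded number of times per merge, which is where the O(N log N) behavior comes from. The sketch assumes a merge table produced by ordinary BPE training, where pairs involving a merged token always rank after the merge that created it.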
AdamW, Lion, Schedule-Free AdamW. All from scratch.
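For reference, the update rules for two of these optimizers fit in a few lines of numpy. This is a sketch of the standard published update rules, not a copy of `optim.py`; the function names and default hyperparameters are assumptions (Schedule-Free AdamW is omitted).

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=3e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    # p, g, m, v are float arrays of the same shape; t is the 1-based step count.
    m[:] = b1 * m + (1 - b1) * g
    v[:] = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: applied to p directly, not folded into g.
    p -= lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)

def lion_step(p, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.01):
    # Lion keeps only the sign of an interpolated momentum, so every
    # parameter moves by exactly lr (plus weight decay) each step.
    update = np.sign(b1 * m + (1 - b1) * g)
    p -= lr * (update + wd * p)
    m[:] = b2 * m + (1 - b2) * g
```

Both functions update `p` and the state arrays in place, which avoids allocating new parameter buffers on a memory-constrained device.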
TinyStories dataset. Pre-tokenized to .npy, loaded via memory-mapped I/O (zero-copy).
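A minimal sketch of how memory-mapped batching can work with numpy, assuming the tokens are stored as one flat integer array. The file name in the usage note and the `sample_batch` helper are hypothetical, not the project's loader.

```python
import numpy as np

def sample_batch(tokens, batch_size, seq_len, rng):
    """Draw random (input, target) windows from a flat token array.

    tokens may be an np.memmap; only the pages that are actually indexed
    get read from storage.
    """
    starts = rng.integers(0, len(tokens) - seq_len, size=batch_size)
    idx = starts[:, None] + np.arange(seq_len + 1)[None, :]
    chunk = np.asarray(tokens[idx])      # copies just this batch into RAM
    return chunk[:, :-1], chunk[:, 1:]   # targets are inputs shifted by one
```

Usage would look something like `tokens = np.load("train_tokens.npy", mmap_mode="r")` followed by `x, y = sample_batch(tokens, 8, 64, np.random.default_rng(0))`. With `mmap_mode="r"` the file is mapped rather than read, so startup cost and resident memory stay small even when the token file is larger than free RAM.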
Autoresearch-style loop: mutate hyperparameters → train → evaluate → keep or discard. 6 mutation strategies. Logs everything.
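The keep-or-discard loop can be sketched as a simple hill climb. Everything here is illustrative: the single mutation strategy (halve or double one value), the `evaluate` callback, and the config keys are assumptions, and the real runner has six strategies that are not reproduced here.

```python
import random

def mutate(config, rng):
    # Hypothetical strategy: scale one randomly chosen hyperparameter by 0.5x or 2x.
    new = dict(config)
    key = rng.choice(sorted(new))
    new[key] = new[key] * rng.choice([0.5, 2.0])
    return new

def search(base_config, evaluate, n_trials, seed=0):
    """evaluate(config) -> val_loss; in practice this would run a short training job."""
    rng = random.Random(seed)
    best_cfg = dict(base_config)
    best_loss = evaluate(best_cfg)
    history = [{"config": best_cfg, "val_loss": best_loss, "kept": True}]
    for _ in range(n_trials):
        cand = mutate(best_cfg, rng)
        loss = evaluate(cand)
        kept = loss < best_loss          # keep only strict improvements
        if kept:
            best_cfg, best_loss = cand, loss
        history.append({"config": cand, "val_loss": loss, "kept": kept})
    return best_cfg, best_loss, history
```

Recording discarded candidates alongside kept ones in `history` matters as much as the search itself, since the failed runs are what explain why a setting was rejected.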
Self-improvement modules (STaR, SPIN) are implemented but waiting for a model that can generate coherent text first.
Step time 233 ms
Throughput 2,196 tokens/sec
These numbers are after optimization; the starting point was 326 ms per step and 1,572 tok/s. The main gain came from fusing MLP projections (two matmuls → one).
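One way such a fusion can look, assuming the two fused matmuls are the gate and up projections of the SwiGLU block (the source does not say which pair was fused): concatenate the two weight matrices once, do a single larger matmul, and split the result. Names below are illustrative.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu_unfused(x, w_gate, w_up):
    # Two separate BLAS calls over the same input.
    return silu(x @ w_gate) * (x @ w_up)

def fuse_weights(w_gate, w_up):
    # Done once at init time (or whenever weights are loaded), not per step.
    return np.concatenate([w_gate, w_up], axis=1)

def swiglu_fused(x, w_gate_up):
    # One BLAS call with a wider output, then a cheap split into two halves.
    gate, up = np.split(x @ w_gate_up, 2, axis=-1)
    return silu(gate) * up
```

The arithmetic is identical; the saving comes from fewer Python-to-BLAS round trips and better reuse of `x` in cache. The same trick applies in the backward pass, where the weight gradient for the fused matrix is a single matmul.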
Forward 45% (OpenBLAS matmuls)
Backward 55% (weight + input gradients per layer)
| Config | val_loss | Notes |
|---|---|---|
| 128d/4L Lion 2K steps | 6.17 | Plateau at unigram level |
| 128d/4L AdamW 300 steps | 6.37 | AdamW converges ~3x faster than Lion here |
| 128d/4L 5-story overfit | 5.38 | Below unigram — context learning works |
| 192d/6L AdamW lr=6e-4 | diverged | lr too high |
The most useful thing that came out of this so far.
N-gram cross-entropies on TinyStories validation set:
Random (uniform) 6.93 nats
Unigram 6.25 nats
Our model 6.17 nats ← basically unigram
Bigram 3.36 nats
Trigram 1.88 nats
The 1.3M-parameter model after 2,000 steps learns token frequencies and little else: its val_loss (6.17) sits just under the unigram cross-entropy (6.25) and far above the bigram level (3.36). No bigram patterns, no word order, no grammar.
But: the 5-story overfit test gets loss to 5.38 (below unigram). So the architecture and gradients are correct. The problem is capacity — 1.3M params can't compress 20K diverse stories beyond marginal statistics.
Next step is scaling to 3.5M+ params to see if the wall breaks.
Full analysis in DEVLOG.md.
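The n-gram baselines above can be computed along these lines. This is a sketch under assumptions, not the script behind the reported numbers: add-alpha smoothing is a choice made here to avoid infinite loss on unseen tokens, and the function names are hypothetical. One observation: the uniform baseline of 6.93 nats equals ln(1024), which would be consistent with a vocabulary of about 1,024 tokens, though the source does not state the vocabulary size.

```python
import numpy as np

def uniform_ce(vocab_size):
    # Cross-entropy (nats) of guessing every token with equal probability.
    return float(np.log(vocab_size))

def unigram_ce(train, val, vocab_size, alpha=1.0):
    # Token frequencies from train, scored on val. alpha is add-alpha smoothing.
    counts = np.bincount(train, minlength=vocab_size) + alpha
    logp = np.log(counts / counts.sum())
    return float(-logp[val].mean())

def bigram_ce(train, val, vocab_size, alpha=1.0):
    # P(next | previous) from train, scored on consecutive pairs in val.
    counts = np.full((vocab_size, vocab_size), alpha)
    np.add.at(counts, (train[:-1], train[1:]), 1.0)
    logp = np.log(counts / counts.sum(axis=1, keepdims=True))
    return float(-logp[val[:-1], val[1:]].mean())
```

Running these once on the pre-tokenized arrays gives fixed reference lines to plot against `val_loss`, which makes it obvious whether a run has learned anything beyond marginal statistics.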
Documenting these because they cost time and the reasons aren't obvious:
- Custom NEON SIMD matmul in C — wrote a 4×4 micro-kernel. OpenBLAS was 2.5-4.5× faster. Their ARM kernels are assembly-optimized with cache blocking. Not worth competing.
- RWKV time-mixing for training — O(T) complexity but needs a Python for-loop. Attention's O(T²) runs as a single BLAS call, which is faster in practice. (RWKV does win at inference.)
- Lion optimizer at small scale — sign-based updates throw away gradient magnitude. At 1.3M params that information matters. AdamW works better here.
- lr=6e-4 for 3.5M model — diverges immediately. 3e-4 is the safe zone.
src/
├── engine/ # Model, training, tokenizer, optimizers
│ ├── model.py # GPT forward + backward
│ ├── train.py # Training loop
│ ├── tokenizer.py # BPE (heap-optimized)
│ ├── optim.py # AdamW, Lion, Schedule-Free AdamW
│ └── rwkv_tmix.py # RWKV (experimental)
├── data/ # Dataset preparation, mmap loader
├── agent/ # Experiment runner, self-improvement
└── tools/ # Monitoring, checkpoints, config
- Scale model to 3.5M+ params, train 2+ epochs
- Break below bigram-level loss (prove context learning at scale)
- Run self-improvement loop (STaR/SPIN) once generation works
- Possibly port to C for 5-10× throughput