phi-auto Research Log

On-device language model training from scratch. Snapdragon ARM, Python+numpy, no PyTorch.

Setup


Hardware	Snapdragon AArch64, 8 cores, 10GB RAM (3GB usable)
Software	Python 3.13, numpy 2.4.3 (OpenBLAS), Termux on Android
Dataset	TinyStories — 20K stories, 4.3M train tokens, 224K val tokens
Tokenizer	Byte-pair encoding, vocab=1024
Architecture	GPT (RMSNorm, RoPE, SwiGLU, causal attention), manual backward
Throughput	2,196 tok/s at 1.3M params (post-optimization)

Research Question

At what model size and data volume does a transformer transition from memorizing token frequencies to learning actual language patterns?

Small model scaling laws — empirically measuring the critical threshold in the 1K–10M parameter range where a model first breaks past unigram-level performance (token frequency) and begins learning contextual patterns (bigram+).

Reference Baselines (TinyStories val set)

Random (uniform)     6.93 nats
Unigram              6.25 nats
Bigram               3.36 nats     ← below this = contextual learning
Trigram              1.88 nats

Core Hypothesis

The 1.3M model's plateau at unigram level is not a training bug — it's a capacity-data scaling boundary. Increasing parameters, reducing data diversity, or adjusting their ratio should break through the wall.

Experiment Table

All experiment results in one place. New experiments appended below.

ID	Date	Params	Config	Steps	val_loss	vs Unigram	Status	Notes
B1	03-24	1.3M	128d/4L Lion lr=3e-5	2000	6.17	-1.3%	Baseline	Plateau at unigram level
μ1	03-24	1.3M	128d/4L AdamW lr=3e-4	300	6.37	+1.9%	Micro	AdamW converges 3x faster
μ2	03-24	1.3M	128d/4L Lion lr=3e-5	300	6.56	+5.0%	Micro	Lion converges slower
E2	03-24	3.7M	192d/6L AdamW lr=6e-4	-	div	-	Failed	lr too high, diverged
E3	03-24	1.3M	128d/4L AdamW lr=3e-4 clip=5	400	~6.5	+4%	Killed	Still above unigram
μ3	03-24	1.3M	128d/4L 5-story overfit	~500	5.38	-13.9%	Micro	Below unigram. Architecture works.

vs Unigram: position relative to unigram CE (6.25). Negative = better than unigram (learning language patterns).

Log

Reverse chronological (newest first). Each entry should be self-contained.

2026-03-24: The Unigram Wall

Observation: Baseline (1.3M, 2K steps) val_loss=6.17 ≈ unigram CE=6.25. Model learned token frequencies only.

Controlled experiments:

5-story overfit → loss 5.38 (below unigram). Architecture confirmed correct.
100-story overfit → loss 6.98 (random level). Data diversity exceeds capacity.
Lion vs AdamW both plateau at bpb ~7.5. Not an optimizer issue → capacity limit.

Diagnosis: vocab=1024 → 1M possible bigrams. Model has 1.3M total params. After embedding, attention, and MLP structure, insufficient free capacity to encode bigram statistics over a large dataset. With 5 stories (~200 unique bigrams) it works. With 20K stories, it doesn't.

Conclusion: Not a software bug. A capacity-data scaling boundary. Need to increase parameters or reduce data diversity.

2026-03-24: EXP-002, EXP-003 Failed

E2 (192d/6L, 3.7M, lr=6e-4): Immediate divergence. Loss 6.35 → 7.16. lr=6e-4 too high for 3.7M model.

E3 (128d/4L, 1.3M, AdamW lr=3e-4, clip=5, wd=0.01): Loss ~6.5 at step 400. Noisy, never dropped below unigram. Same capacity = same result.

Takeaway:

3.5M+ models need lr ≤ 3e-4
1.3M model cannot break unigram regardless of hyperparameter tuning (capacity ceiling)

2026-03-24: Optimizer Comparison (Lion vs AdamW)

Setup: Same model (128d/4L), same data, 300-step micro-experiments.

	val_loss (300 steps)	Convergence
AdamW lr=3e-4	6.37	Fast
Lion lr=3e-5	6.56	3x slower

Analysis: Lion's sign-based updates discard gradient magnitude. At large scale this doesn't matter. At 1.3M params, every bit of gradient information counts. AdamW is better at this scale.

2026-03-24: Baseline Training Complete

Config: 128d/4L (1.3M params), Lion, 2000 steps, batch=4, seq=128 Result: val_loss=6.17, val_bpb=7.41, 1,350 tok/s Generated text: Complete gibberish Time: ~26 minutes

Optimization Log

Performance improvements. Each entry includes measurements.

ID	What	Before	After	Gain	Status
OPT-001	Fused MLP gate+up	326ms/step	233ms/step	+40% tok/s	Kept
OPT-002	Heap-based BPE encode	13+ min / 19K texts	37 sec	21x	Kept
OPT-003	Incremental BPE training	24+ min	290 sec	~50x	Kept
OPT-004	Mmap data loading	~5ms/batch	~0ms/batch	eliminated	Kept
OPT-005	RoPE backward (exact)	biased gradients	exact gradients	correctness	Kept
OPT-006	Grad accum buffer fix	crash	works	correctness	Kept
OPT-007	NEON C matmul	OpenBLAS baseline	2.5-4.5x slower	negative	Abandoned
OPT-008	RWKV time-mixing	attn 6.9ms/block	RWKV 9.8ms/block	0.7x (slower)	Kept for inference

Details (click to expand)

OPT-001: Fused MLP. Merged gate and up projections into single Linear. Two BLAS dispatches → one. At small matrix sizes, dispatch overhead dominates actual compute.

OPT-002: Heap BPE. O(M×N) scan → O(N log N) heap. Doubly-linked list + min-heap for merge candidates. Each merge application is O(log N) amortized.

OPT-003: Incremental pair count. Full recount per merge → update only affected neighbors. Python loop iterations: 768M → ~tens of thousands.

OPT-004: Mmap. Pre-tokenize entire dataset to uint16 .npy. Load via np.load(mmap_mode='r'), let OS handle paging.

OPT-005: RoPE backward. Inverse of rotation matrix = negate sin component. Orthogonal matrices satisfy R⁻¹ = Rᵀ.

OPT-006: Buffer fix. Grad accumulation buffers allocated at step 0 before any backward → gradients are None → crash. Fixed with lazy allocation after first backward.

OPT-007: NEON C. Wrote 4×4 micro-kernel sgemm. OpenBLAS's assembly-level ARM kernels with cache blocking are 2.5-4.5x faster. Matmul is not the bottleneck — Python overhead is.

OPT-008: RWKV. Theoretical O(T) < attention O(T²). In practice, Python for-loop is slower than numpy's single BLAS call for attention. Wins at inference (O(1) per token) but loses at training.

Profiling (1.3M params, batch=4, seq=128)

Forward     45%     Backward    55%
────────────────────────────────────
Embedding    0.1%   lm_head      22% of bwd
Attention   16%     Attention    ~16% of bwd
MLP         20%     MLP          ~20% of bwd
Loss         2%
────────────────────────────────────
Total: 233ms/step, 2,196 tok/s
OpenBLAS: 20-22 GFLOPS on this Snapdragon

Confirmed Findings

Only things verified across multiple experiments.

The unigram wall is a capacity limit. The 1.3M model plateaus near unigram CE (6.25) regardless of optimizer or hyperparameters. Only the 5-story overfit broke through. (B1, μ1, μ2, E3, μ3)
AdamW > Lion at small scale. Sign-based updates discard gradient magnitude, which hurts when model capacity is tight. (μ1 vs μ2)
BLAS dispatch overhead > FLOPs at small sizes. Fusing operations beats algorithmic improvements. (OPT-001)
Python overhead is the actual bottleneck. Custom C SIMD is slower than OpenBLAS. Theoretical O(T) RWKV is slower than O(T²) attention. Same root cause: interpreter overhead between compute kernels. (OPT-007, OPT-008)

Architecture Decisions

Updated with rationale when changed.

Decision	Choice	Rationale	Alternatives Considered
Framework	numpy only	No PyTorch ARM overhead; educational purpose	PyTorch Mobile, tinygrad
Normalization	RMSNorm	30% cheaper than LayerNorm	LayerNorm
Positions	RoPE	Zero learnable params, extrapolation	Learned, ALiBi
Activation	SwiGLU (fused)	+1-3% over GELU at same param count	GELU, ReLU
Optimizer	AdamW (default)	Better than Lion at this scale	Lion, SF-AdamW
Data loading	mmap	Eliminates runtime tokenization	Streaming JSON
Attention	Standard causal	Faster than RWKV in numpy training	RWKV (kept for inference)

Status

Current phase: Preparing scaling experiments

Working:

GPT forward + backward (manual gradients, verified)
Three optimizers (AdamW, Lion, Schedule-Free AdamW)
BPE tokenizer (heap encode, incremental training)
Mmap data pipeline
Autonomous experiment runner
Self-improvement modules (STaR, SPIN — waiting on model quality)

Blocked:

Breaking the unigram wall (need 3.5M+ model)
Model that generates recognizable English
Self-improvement loop (needs generation quality)

Next:

Model size × data size grid experiment (scaling law measurement)
3.5M model with conservative training (lr=1e-4, 2+ epochs)
C port evaluation (Python overhead is the bottleneck)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

phi-auto Research Log

Setup

Research Question

Reference Baselines (TinyStories val set)

Core Hypothesis

Experiment Table

Log

2026-03-24: The Unigram Wall

2026-03-24: EXP-002, EXP-003 Failed

2026-03-24: Optimizer Comparison (Lion vs AdamW)

2026-03-24: Baseline Training Complete

Optimization Log

Profiling (1.3M params, batch=4, seq=128)

Confirmed Findings

Architecture Decisions

Status

FilesExpand file tree

DEVLOG.md

Latest commit

History

DEVLOG.md

File metadata and controls

phi-auto Research Log

Setup

Research Question

Reference Baselines (TinyStories val set)

Core Hypothesis

Experiment Table

Log

2026-03-24: The Unigram Wall

2026-03-24: EXP-002, EXP-003 Failed

2026-03-24: Optimizer Comparison (Lion vs AdamW)

2026-03-24: Baseline Training Complete

Optimization Log

Profiling (1.3M params, batch=4, seq=128)

Confirmed Findings

Architecture Decisions

Status