Releases: oisee/z80-optimizer
v1.3.0 — Paper v2.2, IX/IY halves, Multi-Platform DSL, Image Search
v1.3.0 Release
Paper: "Register Allocation as a Solved Game" v2.2
- regalloc_paper.pdf (A4) — 1600+ lines, 15 chapters, 12 Mermaid diagrams
- regalloc_paper_a5.pdf (A5) — for tablets/phones
- regalloc_paper.epub — with rendered diagrams (700KB)
- NEW: Clear data breakdown table (83.6M → 37.6M → 78MB)
- NEW: Multi-platform GPU section (ISA DSL → CUDA/Vulkan/Metal/OpenCL)
- NEW: Corpus validation (820 functions, 246 signatures)
IX/IY Half Registers Declared Production-Safe
IXH, IXL, IYH, IYL work on all known Z80 silicon. 11 registers for allocation.
H↔IXH transfer via EX DE,HL trick (16T). IX/IY = bridges between main/shadow banks.
Corpus Bias Discovery
- VIR corpus (820 funcs): 79% ≤6v — biased toward small functions
- FatFS analysis: 16 functions, ALL >10v (est. 10-35 vregs)
- Greedy partition for v10-v32 in <30ms each
pRNG Image Search Gallery
- CNN-guided search: 7 targets × top-10 seeds (160 images)
- Dithered target with VGG perceptual loss (Introspec's BB approach)
- 6 synthetic targets: skull, heart, star, smiley, tree, fish
- ISA DSL: single source → 4 GPU backends, 3 vendors
Also included
- chronicle.epub (862 lines, 11 chapters, days 1-3)
- z80_compiler_never_guesses.epub/pdf (original article)
Full changelog: v1.2.0...v1.3.0
v1.2.0 — Day 3: Enriched Tables, O(1) Regalloc, Branchless Library
Day 3 Birthday Marathon Release
New: Enriched Register Allocation Tables (37.6M shapes)
- data/enriched_{4v,5v,6v_dense}.enr.zst — 78MB compressed
- 15 operation-aware cost metrics per shape
- 43% of "feasible" shapes lack accumulator (hidden ALU infeasibility!)
- Smart CALL save: 17T avg (was 34T) = 50% reduction
- Width-aware: u8/u16/u32 scoring
New: O(1) Regalloc Architecture
- Signature: (interference_shape, operation_bag) → hash lookup
- 91% of functions resolve in O(1), 8% GPU partition, 1% Z3
- Validated on 820-function production compiler corpus
- 246 unique signatures (prediction <500 confirmed)
New: GPU Partition Optimizer
- Brute-force optimal graph splits for 7-20v functions
- All 172 corpus functions (7v+) partitioned optimally
- 14v in 0.7s, 19v in 7min, 20v in 30min
New: Paper v2 — "Register Allocation as a Solved Game"
- regalloc_paper.pdf — 1574 lines, 15 chapters, 12 Mermaid diagrams
- Updated with corpus validation and partition results
- regalloc_paper.epub — mobile-friendly version
New: Branchless Library (exhaustive verified)
- ABS (6i,24T), MIN/MAX (8i,32T), CMOV (6i,24T)
- div3 EXACT: A×171>>9 (zero error, no lookup table!)
- Z flag WRITE-ONLY proof (CY = only viable branchless bool)
- SBC A,A mask trick = foundation of all branchless Z80
New: Focused Search + Vulkan
- gray_decode EXACT (13 ops, <1s on Vulkan RX 580)
- sqr_hi ±29, cbrt ±16, sin_q1 ±68
- Per-target minimal op pools: fewer ops → deeper search
Updated: Chronicle
- chronicle.epub — 862 lines, 11 chapters, days 1-3
Also included (from v1.0.0/v1.1.0)
- z80_compiler_never_guesses.epub/pdf/html — original article
Full changelog: v1.1.0...v1.2.0
v1.1.0 — Day 2: FP16, BCD, Multi-Target GPU Search & Chronicle
Day 2 Birthday Marathon Release
New: Chronicle (EPUB)
- chronicle.epub — "Книга о том, как мы за два дня доказали всю арифметику Z80" (ru)
- 770 lines, 10 chapters — narrative of the entire birthday marathon
New: FP16/BCD Library
pkg/fp16/— FP16/FP24 floating point + BCD arithmetic for Z80- BCD H-flag fix: unlocked BCD×2 =
ADD A,A; DAA(2 ops, 8T)
New: Multi-Target GPU Search
- One GPU pass tests 15 non-linear functions simultaneously
- 13-op trimmed pool (12× faster than 16-op)
- 3-way GPU split across cluster
- Targets: sin, cos, sqrt, log2, gamma, smoothstep, triangle...
New: 21-Op Universal Kernel
- Minimum instruction set for ALL optimal Z80 arithmetic
- Only 2.7% of ISA generates every known optimal sequence
New: Analytics
- Cross-table analysis of 755 optimal sequences
- "What 755 Optimal Sequences Teach Us About Computation"
Data Enrichment
- mul16 table: clobber_mask added (all 254 preserve A, all DE-safe)
- 16-bit arithmetic idioms + BCD idioms (GPU-proven)
Also included
- z80_compiler_never_guesses.epub/pdf/html — original article from v1.0.0
Full changelog: v1.0.0...v1.1.0
v1.0.0 — Birthday Release 🎂
z80-optimizer v1.0.0 — Birthday Release 🎂
GPU Brute-Force Superoptimizer for Z80
The first Z80 compiler toolchain that never guesses — every optimization is provably optimal, verified by exhaustive GPU search.
Highlights
372 Provably Optimal Arithmetic Sequences
- 254/254 constant multiplies (u8, avg 8× faster than shift-and-add)
- 254/254 16-bit multiplies (u8×K→u16, complete in 30 seconds)
- 118/120 constant divisions (u8/K, avg 2.5× faster than general loop)
- 15 branchless idioms (abs, bool, sign, not, lsb, ...)
83.6 Million Register Allocations
- Complete exhaustive table for ≤6 virtual registers (32MB compressed)
- Feasibility phase transition: 96% feasible at 2v → 0.9% at 6v
- Five-level pipeline: O(1) table → composition → GPU → backtracking → Z3
739K Peephole Rules
- Every length-2 instruction pair tested against all possible inputs
SLA A : RR A → OR A(save 12T, 3 bytes) and 738,999 more
4-GPU, 3-Vendor Cluster
- NVIDIA RTX 4060 Ti × 2 (CUDA)
- NVIDIA RTX 2070 (CUDA)
- AMD Radeon RX 580 (OpenCL + Vulkan via Mesa)
- Apple M2 (Metal)
- All results cross-verified across platforms
Multi-Backend GPU DSL
pkg/gpugen/: one ISA definition → CUDA / Metal / OpenCL / Vulkan- Z80 and 6502 ISA definitions included
--mask/--disablefor op pool pruning
Key Discoveries
| Discovery | Impact |
|---|---|
| Pool reduction 21→3 ops | 13,600× speedup for u16 multiply |
| Guided brute-force | Abstract chains → 6-op GPU = div3 in 11 sec |
| Prefix sharing | 51-86% code compression for multiply library |
| Branchless ABS(A) | 6 insts, 24T — no branch penalty |
| Sign-extend in 3 insts | ADC overflow → SBC mask (12T) |
| NEG HL via EX DE,HL | 3 insts, 23T (improved from 5/28T) |
| Feasibility cliff at 6v | 99.1% of constraint shapes are impossible |
| Treewidth analysis | Random≠real: compiler graphs are denser |
Data Files (data/)
| File | Entries | Size |
|---|---|---|
| exhaustive_4v.bin.zst | 156K regalloc shapes | 64KB |
| exhaustive_5v.bin.zst | 17.4M regalloc shapes | 8.5MB |
| exhaustive_6v_dense.bin.zst | 66.1M regalloc shapes | 22.5MB |
| mulopt8_clobber.json | 164 multiply sequences | with B-clobber info |
| mulopt16_complete.json | 254 u16 multiply sequences | 3-op basis |
| div8_optimal.json | 118 division sequences | guided brute-force |
| peephole_top500.json | top 500 peephole rules | by cycle savings |
| chains_mul8_mod256.json | 254 abstract multiply chains | ISA-independent |
| chains_div8.json | 86 abstract division chains | compose from mul |
| mul8_library.asm | Z80 assembly library | prefix-shared |
Go Packages
import "github.com/oisee/z80-optimizer/pkg/mulopt" // Emit8, Emit16, Lookup
import "github.com/oisee/z80-optimizer/pkg/regalloc" // LoadBinary, IndexOf
import "github.com/oisee/z80-optimizer/pkg/peephole" // Lookup, LoadRules
import "github.com/oisee/z80-optimizer/pkg/gpugen" // ISA DSL → 4 backendsDocumentation
docs/glossary.md— complete term referencedocs/insights.md— experiment log with discoveriesdocs/paper_seed_superopt.md— paper/book seed (8 sections)docs/sprint_129.md— sprint plandocs/book_outline.md— book outline (19 chapters)docs/gpugen.md— multi-backend DSL guidedocs/m2_onboarding.md— new machine setupdata/README.md— binary format spec + Python/Go readers
The Birthday Team 🎂
Built in one marathon session across 5 machines, 4 APIs, 3 GPU vendors, and multiple Claude sessions collaborating via ddll.
Packed Multiply/Division Library
The 254 multiply and 118 division sequences share instruction prefixes extensively. The packed library uses multiple entry points via labels — each constant starts at its label and falls through shared code to a common RET:
mul104: ; entry for ×104
mul52: ADD A,A ; entry for ×52 (×104 = ×52 × 2)
mul26: ADD A,A ; entry for ×26
mul24: ADD A,B ; entry for ×24
mul12: ADD A,A ; entry for ×12
mul6: LD B,A ; entry for ×6
ADD A,B
ADD A,B
mul2: RLA ; entry for ×2
RET ; shared return — 7 constants, 9 instructions!mul8 library: 594 bytes for 164 constants (51% compression)
mul16 library: ~500 bytes for 254 constants (86% compression)
Combined: ~1.1KB for ALL optimal multiplies = 2.2% of ZX Spectrum 48KB
For division: same principle — reciprocal multiply phases overlap between divisors sharing the same magic constant M.
Runtime dispatch: page-aligned jump table (256 bytes) + packed code.
LD H, page / LD L, K / JP (HL) = single-instruction dispatch to optimal sequence.
v0.1.0 — 602,008 Z80 optimizations found
What is this?
A brute-force superoptimizer for the Zilog Z80 processor.
Given any sequence of Z80 instructions, it exhaustively searches for a shorter equivalent — one that produces identical register and flag state for every possible input. No heuristics, no pattern databases, just brute force with smart pruning.
First run results
602,008 provably correct optimizations found from 8.4M length-2 target sequences.
- 34.7 billion comparisons checked
- 3 hours 16 minutes on Apple M2
- 83 unique transformation patterns discovered
- Results:
rules.json.gz(3.4MB compressed, 102MB uncompressed)
Highlights
| Original | Replacement | Savings | Insight |
|---|---|---|---|
SLA A : RR A |
OR A |
-3B, -12T | Shift left then rotate right = identity + flag set |
SRL A : RL A |
OR A |
-3B, -12T | Shift right then rotate left = identity + flag set |
AND 00h : NEG |
SUB A |
-3B, -11T | Zero then negate = zero with subtract flags |
LD A, 00h : SLA A |
XOR A |
-3B, -11T | Load zero then shift = zero with logic flags |
ADD A, A : RR A |
OR A |
-2B, -8T | Double then halve = identity + flag set |
SRL A : SLL A |
OR 01h |
-2B, -9T | Shift right then undoc-shift-left = set bit 0 |
XOR 0FFh : NEG |
SUB 0FFh |
-2B, -8T | Complement then negate |
CPL : NEG |
SUB 0FFh |
-1B, -5T | Complement then negate = increment with sub flags |
XOR 0FFh : SBC A, 0FFh |
NEG |
-2B, -6T | Complement + subtract-with-borrow = negate |
SCF : RL A |
SLL A |
-1B, -4T | Set carry then rotate-through-carry = undoc shift |
SCF : ADC A, 00h |
ADD A, 01h |
-1B, -4T | Set carry then add-with-carry zero = add 1 |
SET 0, L : DEC HL |
RES 0, L |
-1B, -6T | Set bit then decrement = just clear the bit |
RES 0, L : INC HL |
SET 0, L |
-1B, -6T | Clear bit then increment = just set the bit |
ADD A, 80h : OR A |
XOR 80h |
-1B, -4T | Flip sign bit with correct flag behavior |
ADD A, 00h : RL A |
SLA A |
-2B, -7T | Clear carry then rotate = shift left |
AND 0FFh : RR A |
SRL A |
-2B, -7T | Clear carry via AND then rotate = shift right |
Breakdown by savings
| Bytes saved | Count |
|---|---|
| 3 bytes | 1,212 |
| 2 bytes | 580,937 |
| 1 byte | 19,859 |
Note: it correctly rejects false optimizations like LD A, 0 → XOR A because XOR A modifies flags while LD A, 0 does not. Every rule is verified against all possible inputs.
What's included
406 Z80 opcodes covering all register-only operations:
- 8-bit loads, arithmetic, logic (ADD, ADC, SUB, SBC, AND, OR, XOR, CP)
- All CB-prefix shifts and rotates (RLC, RRC, RL, RR, SLA, SRA, SRL, SLL)
- BIT/RES/SET n, r — all 168 register variants
- 16-bit pair ops (INC/DEC rr, ADD HL, EX DE,HL, LD SP,HL)
- 16-bit immediate loads (LD BC/DE/HL/SP, nn)
- ED-prefix arithmetic (ADC HL, SBC HL)
- Specials: DAA, CPL, NEG, SCF, CCF, NOP
Usage
go build -o z80opt ./cmd/z80opt
# Find optimal replacement for a specific sequence
./z80opt target "AND 0xFF"
# → Replacement: AND A (-1 bytes, -3 cycles)
./z80opt target "SLA A : RR A"
# → Replacement: OR A (-3 bytes, -12 cycles)
# Run full enumeration
./z80opt enumerate --max-target 2 --output rules.json -vRoadmap
See docs/GenPlan.md for the full implementation plan.
Next phases:
- STOKE stochastic search — MCMC random mutations for length 5-10+ sequences
- CUDA brute force — GPU-accelerated length-3 complete search (~20 min on 2× RTX 4060 Ti)
- Reordering optimizer — dependency DAG analysis to apply rules to real code with interleaved instructions
- More opcodes — (HL) memory ops, PUSH/POP, IX/IY indexed addressing