A GPU-accelerated superoptimizer for the Zilog Z80 processor. The compiler that never guesses — every optimization is provably optimal.
- div8 254/254 COMPLETE — 6 methods, avg 79T (beats SDCC's 80–200T). GPU-discovered `carry_compare` trick: 5 ops, 26T for K≥128
- 1500+ optimal arithmetic sequences: 254 mul8 + 254 mul16 + 254 div8 + 254 mod8 + 13 u32 ops + 25 branchless idioms
- sat_add8 = 16T — `ADD A,B; LD C,A; SBC A,A; OR C` — 4-instruction branchless saturating add
- SHA-256 on Z80: 58ms/block @3.5MHz from proven u32 primitives (13 operations, SHL32/SHR32 optimal)
- 3 CUDA image generators for ZX Spectrum demoscene: cat 4.9%, Che Guevara 15%, Einstein 15.1% — gallery
- O(1) regalloc integrated into MinZ-VIR compiler — 156K enriched table entries, zero Z3
- 4× cross-verified: z80-optimizer + MinZ + MinZ-VIR + MinZ-ABAP
| Day | Date | Theme | Key Achievement |
|---|---|---|---|
| 1 | Mar 26 | Release | v1.0.0, article, 500+ sequences |
| 2 | Mar 27 | Expansion | FP16/BCD, 21-op universal pool, multi-target search — log |
| 3 | Mar 28 | Proofs | 37.6M enriched shapes, Z flag proof, SBC A,A library — log |
| 4 | Mar 28 | Regalloc | Five-level pipeline, backtracking solver, phase cliff — log |
| 5 | Mar 29 | Images+u32 | 3 CUDA generators, u32 library, Introspec BB port — log |
| 6 | Mar 29 | Division | div8 v3 (−49%), carry_compare, sign/sat/arith16 — log |
GPU-accelerated search for recognizable images from minimal data generators (128–1194 bytes). Three methods, all CUDA. See the full gallery (29 experiments).
| Cat (4.9%, 128B) | Che Guevara (15%, 1194B) | Einstein (15.1%, 128B) |
|---|---|---|
| ![]() | ![]() | ![]() |
- Dual-layer evolutionary: 5 layers (3 additive OR + 2 subtractive AND NOT), island model, 557K img/s
- Segmented hierarchical LFSR: brute-force 65536 seeds per segment, guaranteed convergence
- Introspec BB port: 24-bit Galois LFSR, 66 layers, CUDA port runs 4 min vs days on CPU
- 83.6M exhaustive register allocations for ≤6 virtual registers (78MB enriched, 15 metrics per shape)
- 97.7% infeasibility at 7-15 vregs — proven that Z80 MUST decompose most real functions
- Multi-backend DSL: one ISA definition → CUDA / Metal / OpenCL / Vulkan kernels
- Cross-verified on 6 platforms, 4 APIs, 3 GPU vendors (NVIDIA CUDA + AMD Vulkan/OpenCL + Apple Metal 3)
- 739K peephole rules (len-2 complete, len-3 37M partial)
📄 Article: The Z80 Compiler That Never Guesses (PDF / EPUB / HTML)
📖 Book outline: 19 chapters — from peephole rules to universal computation chains
📊 Week 1 report: Birthday marathon results — 1500+ sequences, 12 data files, 4× cross-verified
📋 Roadmap: TODO.md — full task list with effort estimates and priority matrix
📓 Daily logs: day 3 · day 4 · day 5 · day 6 · day 7 seed
Given a sequence of Z80 instructions, it exhaustively searches for a shorter equivalent sequence that produces identical register, flag, and memory state for all possible inputs. No heuristics, no pattern databases — provably correct by construction.
Compilers and hand-written Z80 code contain suboptimal patterns that are hard to spot manually. Some are trivial (LD A, 0 is 2 bytes, XOR A is 1), but many are deeply non-obvious:
| Original | Replacement | Savings | Why it works |
|---|---|---|---|
| `SLA A : RR A` | `OR A` | -3B, -12T | Shift left then rotate right = identity + flag set |
| `AND 00h : NEG` | `SUB A` | -3B, -11T | Zero then negate = zero with subtract flags |
| `ADD A, A : RR A` | `OR A` | -2B, -8T | Double then halve = identity + flag set |
| `XOR 0FFh : SBC A, 0FFh` | `NEG` | -2B, -6T | Complement + subtract-with-borrow = negate |
| `SCF : RL A` | `SLL A` | -1B, -4T | Set carry then rotate-through-carry = undoc shift |
| `SET 0, L : DEC HL` | `RES 0, L` | -1B, -6T | Set bit then decrement pair = just clear the bit |
| `ADD A, 80h : OR A` | `XOR 80h` | -1B, -4T | Flip sign bit with correct flag behavior |
| `AND 0FFh : RR A` | `SRL A` | -2B, -7T | Clear carry via AND then rotate = shift right |
These are not things a human would think to try. The superoptimizer finds them because it tries everything.
The search has three tiers, each filtering more aggressively:
```
Enumerate all target sequences (e.g., 17.8M for length-2)
        |
        v
QuickCheck (GPU): 8 test vectors, 80-byte fingerprint
        | rejects 99.99% of candidates instantly
        v
MidCheck (GPU): 24 additional test vectors (32 total)
        | catches ~23% of QuickCheck false positives
        v
ExhaustiveCheck (GPU): sweep all input states (up to 2^24)
        | 256 threads/block, batched per pipeline flush
        | proves exact equivalence for ALL inputs
        v
Confirmed optimization --> JSONL output
```
Run each candidate on 8 carefully chosen test vectors. If the output differs from the target on any vector, the candidate is definitely not equivalent. This eliminates 99.99%+ of the search space in microseconds.
The test vectors are fixed Z80 states covering edge cases: all zeros, all 0xFF, alternating bits, prime-derived values. Eight vectors is enough to reject nearly everything while keeping the fingerprint small (80 bytes).
On the GPU, targets are batched in groups of 512. Each batch tests all ~4,215 candidates simultaneously (512 × 4,215 = 2.1M CUDA threads), comparing against all target fingerprints in a single kernel launch via a bitmap output.
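The fingerprint filter can be sketched in a few lines. This is a purely illustrative toy (two-field state, made-up ops, four vectors); the real fingerprints cover the full Z80 state across the 8 fixed vectors:

```python
# Sketch of the QuickCheck idea: run target and candidate on a few
# fixed test vectors and compare concatenated output states
# ("fingerprints"). A mismatch on any vector proves non-equivalence;
# a match only means "survives to the next tier".

# Toy 2-field state (A, F) and toy "instructions" for illustration.
def run(program, state):
    a, f = state
    for op, arg in program:
        if op == "LD_A":         # A <- imm, flags untouched
            a = arg & 0xFF
        elif op == "XOR_A":      # A <- 0, toy flags cleared
            a, f = 0, 0
        elif op == "ADD_A":      # A <- A + imm, carry-out into F
            r = a + arg
            a, f = r & 0xFF, (r >> 8) & 1
    return (a, f)

# Fixed vectors covering edge cases (toy analogue of the 8 real ones).
TEST_VECTORS = [(0x00, 0), (0xFF, 1), (0xAA, 0), (0x55, 1)]

def fingerprint(program):
    out = bytearray()
    for v in TEST_VECTORS:
        out.extend(run(program, v))
    return bytes(out)

def quickcheck(target, candidate):
    return fingerprint(candidate) == fingerprint(target)
```

Even this toy rejects `LD A, 0` → `XOR A` in full-equivalence mode: on any vector with the flag set, `LD_A` leaves it set while `XOR_A` clears it, so the fingerprints differ.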
QuickCheck's 8 vectors let through ~27% false positives, especially for BIT/RES/SET instructions whose effects are bit-position-specific. MidCheck runs 24 additional test vectors on QuickCheck survivors on the GPU:
- Single-bit A values (0x01, 0x02, 0x04, ...) to distinguish BIT operations
- Per-register bit patterns covering unique bit positions in B, C, D, E, H, L
- Boundary values (0xBF/0xC0, 0x7F/0x80) to catch edge cases
MidCheck runs as a separate GPU kernel on QuickCheck survivors, eliminating most false positives before the expensive ExhaustiveCheck. The 24 extra vectors are defined alongside the original 8 in z80_common.h and verifier.go.
For the ~0-5 candidates that survive both QuickCheck and MidCheck, we prove equivalence by sweeping all possible input combinations on the GPU. Each (target, candidate) pair gets one thread block of 256 threads — thread i handles A=i and loops over carry and extra registers:
- 0 extra registers: 2 iterations per thread (carry only)
- 1 extra register: 512 iterations per thread (256 values x 2 carry)
- 2 extra registers: 131K iterations per thread (256 x 256 x 2)
- 3+ registers or SP: reduced sweep with 32 representative values per register
Pairs are batched (4096 at a time) and dispatched as a single kernel launch. Early termination uses shared memory atomicOr — once any thread finds a mismatch, all threads in the block stop.
If the target and candidate produce identical output (all 11 bytes: A, F, B, C, D, E, H, L, SP, M) for every input state, the optimization is provably correct.
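The sweep-and-compare loop can be sketched on a toy state — a minimal illustration assuming a two-field (A, carry) state and simplified ADD/INC semantics, not the real 11-byte model:

```python
from itertools import product

# Toy analogue of a GPU ExhaustiveCheck block: thread i fixes A=i and
# loops over the carry bit; here we just sweep both in Python.
def run(program, state):
    a, carry = state
    for op, arg in program:
        if op == "ADD":      # A <- A + imm; carry <- carry-out
            r = a + arg
            a, carry = r & 0xFF, (r >> 8) & 1
        elif op == "INC":    # A <- A + 1; carry preserved (like Z80 INC)
            a = (a + 1) & 0xFF
    return (a, carry)

def exhaustive_equal(target, candidate):
    # One mismatch disproves equivalence; surviving the full sweep
    # proves it for every input state.
    return all(run(target, s) == run(candidate, s)
               for s in product(range(256), range(2)))
```

In the toy, `ADD 0` followed by `ADD 2` is provably equivalent to a single `ADD 2` (the first add never produces a carry the second reads), while `ADD 1` is not equivalent to `INC` because INC preserves the carry flag.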
Before testing, we eliminate obviously redundant candidates:
- NOP elimination: sequences containing NOP (unless targeting NOP)
- Self-load detection: `LD A, A` etc.
- Dead write analysis: an instruction's output is overwritten before being read
- Canonical ordering: for independent instructions, enforce a canonical order to avoid duplicate sequences
Pruning uses register dependency bitmasks (opReads/opWrites) for all 394 opcodes.
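The dead-write rule can be sketched with bitmasks. The four-op table below is a hypothetical stand-in for the project's opReads/opWrites data over all opcodes:

```python
# One bit per register/flag group (toy subset).
A, F, B = 1, 2, 4

OPS = {                     # op -> (reads_mask, writes_mask)
    "LD A,0":  (0, A),
    "LD A,B":  (B, A),
    "XOR A":   (0, A | F),
    "ADD A,B": (A | B, A | F),
}

def has_dead_write(seq):
    # A write is dead if a later instruction overwrites the same
    # register before anything reads it; such sequences are pruned
    # before any testing.
    for i, op in enumerate(seq):
        _, written = OPS[op]
        live = written
        for later in seq[i + 1:]:
            reads, writes = OPS[later]
            live &= ~reads           # consumed: no longer a dead write
            if live & writes:
                return True          # overwritten before being read
    return False
```

For example, `["LD A,0", "LD A,B"]` is prunable (the first load is dead), while `["LD A,B", "ADD A,B"]` is not; `["XOR A", "ADD A,B"]` is prunable only through its flag bits, which is exactly the kind of case the bitmasks catch.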
The Z80 executor is an ideal GPU workload: fixed-size 11-byte state (10 registers + 1 virtual memory byte), pure ALU computation, embarrassingly parallel. Both QuickCheck filtering and ExhaustiveCheck verification run on GPU.
```
CPU: enumerate targets        batch 512 targets
     in groups of 512               |
        |                           v
CPU: compute fingerprints ──> GPU Kernel 1: QuickCheck (batched)
     (8 QC + 24 Mid vectors)      512 targets × 4215 candidates = 2.1M threads
                                  bitmap output: one bit per candidate per target
                                  atomicOr for lock-free writes
                                    |
                                    v
                              CPU: collect QC hits, prune, build MidCheck pairs
                                    |
                                    v
                              GPU Kernel 2: MidCheck
                                  one thread per (target, candidate) pair
                                  24 additional test vectors
                                  rejects ~23% of QC survivors
                                    |
                                    v
                              GPU Kernel 3: ExhaustiveCheck
                                  256 threads/block, one block per pair
                                  thread i: A=i, loop carry+regs
                                  shared memory early termination
                                    |
                                    v
                              CPU: output confirmed results as JSONL
```
| Approach | Length-2 time | Results | Speedup |
|---|---|---|---|
| CPU brute force (Apple M2, 24GB) | 3h 16m | 602,008 | 1x |
| CPU brute force (i7, projected) | ~6h | 602,008 | 0.5x |
| Metal compute (Apple M2, 10-core GPU) | cross-verification | — | — |
| CUDA v2 (RTX 4060 Ti) | ~6.5 min (95.5% of search) | 743,309 | ~30x |
| CUDA v2 (2x RTX 4060 Ti) | ~90 min (100% incl. BIT/SET/RES) | est. ~950K | — |
The CUDA search is a complete standalone binary with GPU QuickCheck, GPU MidCheck, GPU ExhaustiveCheck, pruning, disassembly, and JSONL output. All results verified 100% correct against Go CPU ExhaustiveCheck.
Length-2 search breakdown (RTX 4060 Ti, single GPU):
```
Opcodes 0–4027 (95.5%): 386s — ALU, LD, rotate/shift, INC/DEC, most BIT/RES/SET
    11.1M targets tested, 739,249 optimizations found
    QC: 1,009,124 → Mid: 774,347 → Exhaustive: 734,905 → confirmed: 734,905
Opcodes 4028+ (4.5%): heavy BIT/RES/SET area — ~60s per opcode (single GPU)
    Each opcode generates ~3K QC hits → ~2.3K MidCheck survivors → ~2.2K confirmed
    Bottleneck: nextra=2–4 registers requiring 131K–2M iterations/thread in GPU ExhaustiveCheck
```
The first 95% completes in ~6.5 minutes. The BIT/SET/RES tail (ops 4028-4215) is the remaining bottleneck — these opcodes test individual bit positions, creating many near-equivalent candidates that require expensive exhaustive verification with 2-4 extra register sweeps. Dual-GPU parallelism halves this tail.
| Stage | Test vectors | Survivors | Rejection rate |
|---|---|---|---|
| QuickCheck (GPU) | 8 | 1,015,036 of 11.1M targets | 99.99% of candidates rejected |
| MidCheck (GPU) | +24 (32 total) | 778,892 | 23% of QC hits rejected |
| ExhaustiveCheck (GPU) | full sweep (up to 2^24) | 739,249 confirmed | 5% of Mid hits rejected |
The 24 MidCheck vectors include single-bit A values, per-register bit patterns, and boundary values, specifically targeting BIT/RES/SET false positives that share identical 8-vector fingerprints. Overall false positive rate from QuickCheck to confirmed: 27%.
```shell
# Build CUDA search v2 (3-stage batched pipeline)
nvcc -O2 -o cuda/z80search_v2 cuda/z80_search_v2.cu

# Run length-2 search (output JSONL to stdout, progress to stderr)
cuda/z80search_v2 --max-target 2 > results.jsonl 2>progress.log

# Dual GPU: split the outer loop across GPUs
cuda/z80search_v2 --max-target 2 --gpu-id 0 --first-op-end 2107 > r0.jsonl 2>log0.txt &
cuda/z80search_v2 --max-target 2 --gpu-id 1 --first-op-start 2107 > r1.jsonl 2>log1.txt &
wait
cat r0.jsonl r1.jsonl > results-all.jsonl

# Dead-flags relaxation
cuda/z80search_v2 --max-target 2 --dead-flags 0x28 > results-deadflags.jsonl 2>log.txt

# Verify CUDA results against CPU reference implementation
z80opt verify-jsonl results.jsonl
```

All CUDA results are verified bit-exact against the Go CPU implementation:
- 743,309 length-2 results produced by v2 GPU pipeline (96% of search complete)
- 5,060 results verified against CPU ExhaustiveCheck — 100% pass rate (1,000 random sample + 4,060 from GPU 1 partition)
- 23,772 early results (v1 run) cross-verified — 100% agreement
The CUDA kernel and Go executor were developed independently and tested against each other across all 394 opcodes. The flag tables, ALU operations, and DAA implementation match exactly.
For sequences beyond length 3, brute force is infeasible (74.8 billion targets at length 3, 315 trillion at length 4). We implement STOKE-style MCMC stochastic search:
- Start from the target sequence (or random)
- Randomly mutate: replace instruction, swap, insert, delete, change immediate
- Evaluate cost: correctness (test vector mismatches) + size (byte count)
- Accept/reject via Metropolis-Hastings (always accept improvements, sometimes accept regressions)
- Repeat until a shorter, fully-correct candidate is found
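The accept/reject step can be sketched as follows. This is a minimal illustration: `BETA` and the shape of the cost function are assumptions, not the project's tuned values — in the real search, cost combines test-vector mismatches (weighted heavily) with encoded byte count, so a fully-correct shorter candidate has strictly lower cost:

```python
import math
import random

BETA = 1.0  # hypothetical inverse temperature

def accept(old_cost, new_cost, rng=random.random):
    if new_cost <= old_cost:
        return True                  # always accept improvements
    # Sometimes accept regressions, with probability exp(-BETA * delta);
    # this is what lets the chain escape local minima.
    return rng() < math.exp(-BETA * (new_cost - old_cost))

def search(initial, mutate, cost, iterations=100_000):
    cur, cur_cost = initial, cost(initial)
    best, best_cost = cur, cur_cost
    for _ in range(iterations):
        cand = mutate(cur)           # replace / swap / insert / delete
        cand_cost = cost(cand)
        if accept(cur_cost, cand_cost):
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
    return best
```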
```shell
z80opt stoke --target "LD A, 0" --dead-flags all -v
z80opt stoke --target "AND 0xFF : ADD A, B" --chains 8 --iterations 1000000
```

The default full-equivalence mode correctly rejects LD A, 0 -> XOR A because XOR A modifies flags while LD A, 0 does not. The --dead-flags option adds a second tier of rules tagged with which flag bits must be "dead" (not read before being overwritten) for the rule to be valid.
```shell
# Registers-only equivalence (ignore all flags)
z80opt stoke --target "LD A, 0" --dead-flags all -v
# Result: LD A, 0 -> XOR A (-1 byte, dead flags: all)

# Ignore only undocumented flag bits 3 and 5
z80opt enumerate --max-target 2 --dead-flags undoc -v
```

The consumer (peephole optimizer) uses liveness analysis to determine which flags are dead at each program point before applying these rules.
From 11.1M length-2 target sequences tested (96% of search space). 743,309 provably correct optimizations found.
| Bytes saved | Count |
|---|---|
| 3 bytes | 1,212 |
| 2 bytes | 580,937+ |
| 1 byte | 19,859+ |
More highlights:
| Original | Replacement | Savings | Insight |
|---|---|---|---|
| `SRL A : RL A` | `OR A` | -3B, -12T | Shift right then rotate left = identity + flag set |
| `LD A, 00h : SLA A` | `XOR A` | -3B, -11T | Load zero then shift = zero with logic flags |
| `SRL A : SLL A` | `OR 01h` | -2B, -9T | Shift right then undoc-shift-left = set bit 0 |
| `CPL : NEG` | `SUB 0FFh` | -1B, -5T | Complement then negate = increment with sub flags |
| `SCF : ADC A, 00h` | `ADD A, 01h` | -1B, -4T | Set carry then add-with-carry zero = add 1 |
| `RES 0, L : INC HL` | `SET 0, L` | -1B, -6T | Clear bit then increment = just set the bit |
| `RES 7, A : AND A` | `AND 7Fh` | -1B, -5T | Clear bit 7 then flag-set = mask to 7 bits |
| `SET 0, A : OR A` | `OR 01h` | -1B, -5T | Set bit 0 then flag-set = OR with 1 |
| `LD A, 7Fh : SLL A` | `OR 0FFh` | -2B, -8T | Load 0x7F then undoc-shift = set all bits |
| `ADD A, 00h : RL A` | `SLA A` | -2B, -7T | Clear carry then rotate = shift left |
The full results are in rules.json (743K+ rules, search 96% complete).
455 opcodes across 5 implementation waves:
| Wave | Opcodes | What |
|---|---|---|
| V1 | 206 | 8-bit loads, ALU, rotates/shifts, specials (NOP, DAA, CPL, NEG, SCF, CCF) |
| Wave 1 | +174 | BIT/RES/SET n,r — all CB-prefix bit manipulation |
| Wave 2 | +14 | 16-bit pair ops (INC/DEC rr, ADD HL,rr, EX DE,HL) |
| Wave 4 | +12 | LD rr,nn, ADC/SBC HL,rr (ED prefix) |
| Wave 5 | +61 | Memory ops: LD r,(HL), LD (HL),r, ALU (HL), INC/DEC (HL), BIT/RES/SET (HL), rotates (HL), LD A,(BC/DE), LD (BC/DE),A |
Target search space: 4,215 instructions per position (8-bit register ops with immediates). Candidate search space: 266,359 instructions per position (including 16-bit immediates).
The Z80 accesses memory through register-pair pointers: (HL), (BC), (DE). Rather than emulating a full 64KB address space, we model memory as a single virtual byte M in the state. All indirect memory instructions read/write this same M register:
State: A, F, B, C, D, E, H, L, SP, M (11 bytes)
| Instruction | Effect |
|---|---|
| `LD A, (HL)` | A = M |
| `LD (HL), A` | M = A |
| `ADD A, (HL)` | A += M (with flags) |
| `INC (HL)` | M++ (with flags) |
| `BIT 3, (HL)` | test bit 3 of M |
| `LD A, (BC)` | A = M (same address assumption) |
| `LD (DE), A` | M = A (same address assumption) |
Prerequisite: all memory-accessing instructions in a sequence must target the same physical address. The user applying an optimization is responsible for verifying this — the superoptimizer proves correctness under the same-address assumption.
This unlocks optimizations involving memory, e.g., LD A,(HL) : INC A : LD (HL),A patterns and memory-register interactions that pure register search cannot find.
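The M model can be sketched with a toy interpreter — an illustration only, ignoring flags, with the shorter form below chosen to demonstrate the model rather than quoted from the rules database:

```python
# Every (HL) access reads or writes the same virtual byte M,
# under the same-address assumption.
def run(program, a, m):
    for op in program:
        if op == "LD A,(HL)":
            a = m
        elif op == "LD (HL),A":
            m = a
        elif op == "INC A":
            a = (a + 1) & 0xFF
        elif op == "INC (HL)":
            m = (m + 1) & 0xFF
    return a, m

# Read-modify-write pattern vs. a shorter form (illustrative):
long_form  = ["LD A,(HL)", "INC A", "LD (HL),A"]   # 3 instructions
short_form = ["INC (HL)", "LD A,(HL)"]             # 2 instructions
```

Both leave A = M+1 and M = M+1 for every input, which a sweep over all (A, M) pairs confirms — exactly the kind of memory-register equivalence pure register search cannot express.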
- MH (high memory byte): for 16-bit memory ops (PUSH/POP, LD (nn),HL, EX (SP),HL) — M becomes ML+MH
- S0, S1 (stack bytes): for PUSH/POP — tracks 2 bytes of stack
- Masking: unused memory fields (MH, S0, S1) are masked out in comparisons, like dead-flags masking. Smaller state = faster search.
Different subsets of the instruction set enable focused search with dramatically smaller search spaces:
| Core | Registers | Instructions | Search space (len-3) |
|---|---|---|---|
| Full | A,F,B,C,D,E,H,L,SP,M | ~4,500 | 91 billion |
| Register-only | A,F,B,C,D,E,H,L,SP | ~4,215 | 74.8 billion |
| AF (accumulator) | A,F | ~80 | 512K |
| HL+M (pointer) | H,L,M + A,F | ~150 | 3.4 million |
The AF core is particularly promising — it covers all accumulator ALU ops, rotates, DAA, NEG, SCF, CCF. With only ~80 instructions, length-5 brute force becomes feasible (~3.3 billion targets), enabling discovery of 5→1 and 5→2 accumulator optimizations that are completely out of reach for full search.
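A quick sanity check of the counts quoted in this section:

```python
# Back-of-envelope check of the search-space figures.
full_len2 = 4215 ** 2      # length-2, full instruction set
full_len3 = 4215 ** 3      # length-3, full instruction set
af_len5 = 80 ** 5          # length-5, AF-only core

assert full_len2 == 17_766_225         # ~17.8M
assert full_len3 == 74_884_638_375     # ~74.8 billion
assert af_len5 == 3_276_800_000        # ~3.3 billion
```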
Every optimization is provably correct — verified against all possible input states, not just test cases. This is qualitatively different from hand-written peephole rules.
Flag behavior is ported from remogatto/z80, including:
- Undocumented flags (bits 3 and 5)
- Half-carry lookup tables for 8-bit and 16-bit arithmetic
- Overflow detection via lookup tables
- Correct DAA implementation
- Undocumented SLL instruction (CB 30-37)
The CUDA and Go implementations are verified bit-exact against each other across all 394 opcodes and all 8 test vectors.
```shell
# Build the Go CLI
go build -o z80opt ./cmd/z80opt

# Find optimal replacement for a specific sequence
z80opt target "AND 0xFF"
z80opt target "SUB A : LD A, 0"

# CPU brute-force enumeration
z80opt enumerate --max-target 2 --output rules.json -v

# With dead-flags relaxation
z80opt enumerate --max-target 2 --dead-flags all --output rules-deadflags.json -v

# STOKE stochastic search for longer sequences
z80opt stoke --target "LD A, 0" --dead-flags all -v
z80opt stoke --target "AND 0xFF : ADD A, B" --chains 8 --iterations 1000000

# Verify rules
z80opt verify rules.json

# Verify CUDA JSONL results against CPU
z80opt verify-jsonl results.jsonl

# CUDA GPU search (requires nvcc)
nvcc -O2 -o cuda/z80search cuda/z80_search.cu
cuda/z80search --max-target 2 > results.jsonl
```

```
cmd/z80opt/          CLI (enumerate, target, verify, verify-jsonl, export, stoke)
pkg/cpu/             Z80 state (11 bytes: regs + M) + executor (2.7ns/op, 0 alloc)
pkg/inst/            Instruction catalog (455 opcodes, encoding, timing)
pkg/search/          Verifier, enumerator, pruner, fingerprint map, workers
pkg/stoke/           STOKE stochastic superoptimizer (MCMC search)
pkg/gpu/             GPU integration layer (CUDA process, search orchestration)
pkg/gpu/shader/      WGSL compute shader (1171 lines, full Z80 executor)
pkg/result/          Rule storage, checkpoint, JSON output
cuda/                CUDA kernels and standalone search binaries
  z80_common.h       Shared Z80 executor, flag tables, test vectors (8 QC + 24 MidCheck)
  z80_quickcheck.cu  GPU QuickCheck kernel (pipe mode for Go interop)
  z80_search.cu      v1 standalone search (per-target dispatch, CPU ExhaustiveCheck)
  z80_search_v2.cu   v2 batched pipeline (512-target batches, GPU ExhaustiveCheck)
docs/                Research roadmap, ADRs, implementation plan
```
```
Length 1:  4,215 targets             → trivial
Length 2:  4,215^2 = 17.8M targets   → 3h CPU, ~6.5min GPU v2   96% DONE
Length 3:  4,215^3 = 74.8B targets   → months CPU, hours GPU v2  next
Length 4:  4,215^4 = 315T targets    → STOKE only
Length 5+: combinatorial explosion   → STOKE only
```
- Length-3 GPU search — 74.8 billion targets. `--no-exhaust` mode outputs MidCheck survivors in ~5 hours (dual GPU), with ExhaustiveCheck verification distributed separately. Estimated ~60M candidates, from which ~30M confirmed optimizations.
- Memory-aware search — Wave 5 memory ops integrated into CPU executor and verifier. CUDA GPU search and STOKE next.
- Focused search cores — AF-only core enables length-5 brute force (~3.3B targets). HL+M core enables memory optimization search at manageable scale.
- PUSH/POP + 16-bit memory — Wave 6: add MH, S0, S1 to state with masking. Covers PUSH/POP, EX (SP),HL, LD (nn),HL.
- STOKE + memory — stochastic search for longer sequences (5→3, 8→4) with memory instructions. Feed real Z80 code snippets as targets.
- Distributed verification — split MidCheck candidate JSONL files for parallel CPU verification across machines. Community-distributable.
- Reordering optimizer — apply discovered rules to real Z80 code via dependency DAG analysis. Handles interleaved unrelated instructions.
We're expanding from instruction-level optimization to function-level and algorithmic-level bruteforce. See BRUTEFORCE_ROADMAP.md for the full plan.
Shipped — complete:
- Peephole rules — 739K len-2→len-1 (complete), 37M len-3 (partial)
- Register allocation — 83.6M shapes (37.6M feasible), enriched with 15 op-aware metrics. O(1) lookup integrated into MinZ-VIR compiler
- Constant multiply ×2..×255 — 254/254 ALL DIRECT, 8-bit and 16-bit tables
- Constant division ÷2..÷255 — 254/254, 6 methods, avg 79T. GPU-discovered `carry_compare` for K≥128
- Modulo ÷2..÷255 — 254/254 mod8, 254/254 divmod8
- Branchless library — sign8 (43T), sat_add8 (16T!), sat_sub8 (20T), ABS/MIN/MAX/CMOV
- 16-bit arithmetic — abs16 (44T), neg16 (27T), min16/max16 (41T), sign16 (20–34T)
- 32-bit arithmetic — 13 u32 operations (DEHL convention), SHL32/SHR32 proven optimal
- SHA-256 decomposition — 58ms/block @3.5MHz from u32 primitives
- Image search — 3 CUDA generators, 29-experiment gallery
Planned targets:
- ZX Spectrum screen address — Y→VRAM address, the holy grail of Spectrum programming
- Bit reverse, popcount, BCD conversion — bit manipulation functions
- Approximate sin/cos — polynomial approximation in minimal Z80 instructions (allow ±2 error)
- HL×K constant multiply — 16-bit × constant, full 16-bit result
Philosophy: "Let the GPU try everything. Ship the winners."
See TODO.md for the full roadmap with effort estimates, priority matrix, and milestone history. See docs/NEXT.md for search architecture diagrams and academic references.
This project builds on five generations of superoptimizer research:
| Year | Work | Approach |
|---|---|---|
| 1987 | Massalin | Exhaustive enumeration (the original superoptimizer) |
| 2002 | Denali | Goal-directed search, equality saturation |
| 2006 | Bansal & Aiken | Automatic peephole rule generation for GCC |
| 2013 | STOKE | MCMC stochastic search, outperforms gcc -O3 |
| 2016 | Lens/Souper | Decomposition + SMT solving |
Most prior superoptimizer work targets x86/x86-64 or custom IR. This project combines several ideas that, to our knowledge, haven't been applied together to a retro ISA:
- GPU-accelerated search pipeline — porting the full Z80 executor to CUDA for both QuickCheck (fingerprint filtering) and ExhaustiveCheck (exhaustive verification). Prior work (STOKE, Lens) runs on CPU only. The Z80's small fixed-size state (11 bytes including virtual memory) makes it unusually GPU-friendly.
- Three-tier verification — QuickCheck (8 vectors, GPU) → MidCheck (32 vectors, GPU) → ExhaustiveCheck (full sweep, GPU) — all three stages run on GPU, eliminating false positives progressively. Combined with STOKE for sequences beyond brute-force range.
- Dead-flags optimization tier — discovering optimizations that are only valid when certain flags are dead, tagged with the exact flag mask. This bridges the gap between the superoptimizer (which proves equivalence) and the peephole optimizer (which needs liveness information to apply flag-clobbering rules safely).
- Complete Z80 flag accuracy — including undocumented bits 3 and 5, half-carry lookup tables, and the undocumented SLL instruction. Most Z80 tools skip these; we need them because the superoptimizer must prove exact equivalence.
- Memory as virtual register — modeling indirect memory access `(HL)` as a virtual register M, with a same-address assumption. This extends superoptimization to memory-touching instructions without requiring a full memory model, adding 61 opcodes with minimal state growth.
- Scale — 761,621+ provably correct optimizations from a single ISA, with GPU search enabling complete coverage of length-3 sequences (74.8 billion targets). This is significantly more rules than prior peephole superoptimizer work has produced.
Whether this constitutes a publishable contribution depends on the venue — a workshop paper at CGO, CC, or a retro-computing venue like VCFW could work. The GPU-accelerated QuickCheck technique generalizes to any small-state ISA (6502, 8080, ARM Thumb subset, RISC-V compressed).
Superoptimization itself is well-established (1987+), and GPU compute is standard. The specific combination — GPU-parallel QuickCheck with dead-flags tagging for retro ISAs — is likely too incremental for a utility patent. More importantly, keeping it open source under MIT maximizes impact. Anyone optimizing Z80/6502/8080 code can use the technique and the 743K+ rules directly.
The project has expanded from instruction-level peephole optimization to three additional exhaustive search domains:
GPU brute-force over all possible register allocation constraint shapes. For each combination of virtual register count, widths, location sets, and interference graph, find the provably optimal physical register assignment — or prove that no valid assignment exists.
| Level | Shapes | GPU time | Feasible |
|---|---|---|---|
| ≤4 vregs | 156,506 | 40 sec | 78.9% |
| ≤5 vregs | 17,366,874 | 20 min | 67.7% |
| 6 vregs (dense, tw≥4) | 66,118,738 | ~6 hours | TBD |
Key findings:
- Feasibility phase transition: 95.9% feasible at 2 vregs → 0.9% at 6 vregs. The Z80 register file "fills up" — 99.1% of all possible 6-vreg constraint shapes have no valid assignment.
- Treewidth analysis: 99.5% of random interference graphs have treewidth ≤3 (classically tractable). But compiler-generated graphs are denser — 53.7% of real-world dense functions have treewidth ≥4.
- Composition verification: Splitting 5-vreg shapes at cut vertices and composing from the 4-vreg table works with max 12 T-state overhead. Verified on 13.2M shapes with zero misses.
- Five-level pipeline: Table lookup (O(1)) → graph decomposition → GPU brute-force → CPU backtracking (1000-4000x pruning) → island decomposition + Z3. Every function hits one of these levels.
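The fallback chain can be sketched with hypothetical interfaces — the real levels are the table lookup, graph decomposition, GPU brute-force, CPU backtracking, and island + Z3 solvers named above:

```python
# Try the cheapest solver first; fall through to the next on a miss.
# Each level returns an assignment (or infeasibility proof) or None.
def allocate(shape, levels):
    for solve in levels:
        result = solve(shape)
        if result is not None:
            return result
    raise RuntimeError("no level handled the shape")

# Toy levels: an O(1) table lookup backed by an always-answering solver.
table = {("v0", "v1"): "A,B"}
levels = [
    table.get,                                    # level 1: table lookup
    lambda shape: "fallback:" + "+".join(shape),  # levels 2-5 stand-in
]
```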
Tables are in data/ as compact binary + zstd compression (8.5MB for ≤5v). See data/README.md for the binary format spec and reader examples.
```shell
# Generate all possible constraint shapes for ≤5 virtual registers
./regalloc-enum --max-vregs 5 > shapes.jsonl

# GPU-solve each shape (finds optimal assignment or proves infeasible)
cat shapes.jsonl | cuda/z80_regalloc --server > results.jsonl

# For 6v: only enumerate dense shapes (treewidth ≥ 4) — 1.7% of 6v space
./regalloc-enum --max-vregs 6 --only-nv 6 --dense-masks dense_6v_masks.txt > shapes_6v.jsonl
```

GPU search for the shortest Z80 instruction sequence that computes A × K for each constant K (2–255).
| Search depth | Constants solved | Pool | GPU time |
|---|---|---|---|
| ≤8 instructions | 103 / 254 | 21 ops | ~30 sec |
| ≤9 instructions | 164 / 254 | 14 ops (reduced) | ~3 min |
| ≤10 instructions | 249 / 254 | 14 ops | ~15 hours |
| ≤11 instructions | 254 / 254 | 14 ops | ~48 hours |
All 254 constants solved by direct GPU search — no composition needed. Last 5 constants (170, 171, 173, 179, 181) found at length 11. Every multiply sequence is provably optimal.
Instruction pool reduction: Analysis of which instructions appear in optimal solutions revealed that 7 of 21 candidate ops are never used. Removing them shrinks the search space by 38x at length 9 (21^9 → 14^9). The removed ops are strictly dominated (e.g., SLA A = ADD A,A but costs 8T instead of 4T) or never contribute to optimal sequences (OR A, SCF, EX AF,AF').
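The 38× figure follows directly from the pool sizes:

```python
# Shrink factor at length 9 from dropping 7 of 21 candidate ops.
full, reduced = 21 ** 9, 14 ** 9

assert full == 794_280_046_581
assert reduced == 20_661_046_784
assert full // reduced == 38       # ~38.4x fewer sequences to test
```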
Notable results:
- ×255 = `NEG` (1 instruction, 8T) — negate is multiply by −1 (mod 256)
- ×252 = `RLA : NEG : ADD A,A` (3 instructions, 16T)
- ×127 = `RRCA : LD B,A : ADC A,B : NEG : RRA : SUB B` (6 instructions, 28T)
```shell
# Find optimal multiply for all constants (14-op reduced pool)
cuda/z80_mulopt_fast --max-len 9 --json > mulopt_results.json

# Search for a specific constant
cuda/z80_mulopt_fast --k 42 --max-len 10
```

Analytical multiply-and-shift: A÷K = (A × M) >> S, composed from the mul16 table.
- 254/254 divisors solved via 6 methods (v3): shift(5), mul_shift(30), preshift_mul(36), mul_add256_shift(41), double_mul_shift(15), carry_compare(127)
- T-states: 8–188 (avg 79T). Beats SDCC generic __div8 (80–200T). All exhaustively verified, 4× cross-verified.
- carry_compare (K≥128): `OR A; LD B,(256-K); ADC A,B; SBC A,A; AND 1` = 5 ops, 26T — GPU-discovered, not in any Z80 reference
- PRESHIFT (K<128): `(A>>P)×M>>S` — exploit K's power-of-2 factor. div86 = 60T (was 170T)
- mod8 and divmod8 tables also complete (254/254 each).
- GPU brute-force confirmed: div3/div10 not findable at max-len=8 (need 9+ instructions = beyond brute-force window).
Division by powers of 2 is trivial (SRL A repeated). Modulo by powers of 2 is trivial (AND mask).
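The div3 and div10 identities quoted above can be re-verified exhaustively in a couple of lines, in the same prove-by-sweep style the GPU searches use:

```python
# Exhaustive verification over the full 8-bit input range.
assert all((a * 171) >> 9 == a // 3 for a in range(256))          # div3
assert all(((a >> 1) * 103) >> 9 == a // 10 for a in range(256))  # div10
```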
```shell
# Search for division by constant
cuda/z80_divmod_fast --div 10 --max-len 10 --json

# Search for modulo by constant
cuda/z80_divmod_fast --mod 7 --max-len 10 --json

# Search for both quotient and remainder
cuda/z80_divmod_fast --divmod 10 --max-len 10 --json
```

Multiple GPUs across four machines, running searches in parallel:
| Machine | GPU | VRAM | API | Use |
|---|---|---|---|---|
| main | 2× RTX 4060 Ti | 16GB each | CUDA 12.0 | Regalloc tables, peephole search |
| secondary | RTX 2070 | 8GB | CUDA 12.0 | Mulopt, divmod search |
| AMD node | Radeon RX 580 | 8GB | OpenCL + Vulkan | Cross-vendor verification |
| Apple M2 | M2 (10-core GPU) | 24GB unified | Metal 3 | CPU baseline, Metal cross-verification |
- `docs/glossary.md` — Complete glossary of all terms and abbreviations used in the project
- `docs/paper_seed_superopt.md` — Paper/book seed: instruction pool reduction, register expansion analysis, meet-in-the-middle strategies, cross-architecture transfer, phase transitions
- `docs/research_statement.md` — Research framing with phase diagram data
- `data/README.md` — Binary table format spec with Python/Go reader examples
- Massalin 1987 — Superoptimizer: A Look at the Smallest Program. ASPLOS '87.
- Bansal & Aiken 2006 — Automatic Generation of Peephole Superoptimizers. ASPLOS '06.
- STOKE 2013 — Stochastic Superoptimization. ASPLOS '13.
- Lens 2016 — Scaling up Superoptimization. ASPLOS '16.
- remogatto/z80 — Z80 emulator (flag behavior reference).
MIT


