A GPU-accelerated superoptimizer for the Zilog Z80 processor. The compiler that never guesses — every optimization is provably optimal.
- div8 254/254 COMPLETE — 6 methods, avg 79T (beats SDCC's 80–200T). GPU-discovered `carry_compare` trick: 5 ops, 26T for K≥128
- 1500+ optimal arithmetic sequences: 254 mul8 + 254 mul16 + 254 div8 + 254 mod8 + 13 u32 ops + 25 branchless idioms
- sat_add8 = 16T — `ADD A,B; LD C,A; SBC A,A; OR C` — 4-instruction branchless saturating add
- SHA-256 on Z80: 58ms/block @3.5MHz from proven u32 primitives (13 operations, SHL32/SHR32 optimal)
- 3 CUDA image generators for ZX Spectrum demoscene: cat 4.9%, Che Guevara 15%, Einstein 15.1% — gallery
- O(1) regalloc integrated into MinZ-VIR compiler — 156K enriched table entries, zero Z3
- 4× cross-verified: z80-optimizer + MinZ + MinZ-VIR + MinZ-ABAP
| Day | Date | Theme | Key Achievement |
|---|---|---|---|
| 1 | Mar 26 | Release | v1.0.0, article, 500+ sequences |
| 2 | Mar 27 | Expansion | FP16/BCD, 21-op universal pool, multi-target search — log |
| 3 | Mar 28 | Proofs | 37.6M enriched shapes, Z flag proof, SBC A,A library — log |
| 4 | Mar 28 | Regalloc | Five-level pipeline, backtracking solver, phase cliff — log |
| 5 | Mar 29 | Images+u32 | 3 CUDA generators, u32 library, Introspec BB port — log |
| 6 | Mar 29 | Division | div8 v3 (−49%), carry_compare, sign/sat/arith16 — log |
GPU-accelerated search for recognizable images from minimal data generators (128–1194 bytes). Three methods, all CUDA. See the full gallery (29 experiments).
| Cat (4.9%, 128B) | Che Guevara (15%, 1194B) | Einstein (15.1%, 128B) |
|---|---|---|
| ![]() | ![]() | ![]() |
- Dual-layer evolutionary: 5 layers (3 additive OR + 2 subtractive AND NOT), island model, 557K img/s
- Segmented hierarchical LFSR: brute-force 65536 seeds per segment, guaranteed convergence
- Introspec BB port: 24-bit Galois LFSR, 66 layers, CUDA port runs 4 min vs days on CPU
- 83.6M exhaustive register allocations for ≤6 virtual registers (78MB enriched, 15 metrics per shape)
- 97.7% infeasibility at 7-15 vregs — proven that Z80 MUST decompose most real functions
- Multi-backend DSL: one ISA definition → CUDA / Metal / OpenCL / Vulkan kernels
- Cross-verified on 6 platforms, 4 APIs, 3 GPU vendors (NVIDIA CUDA + AMD Vulkan/OpenCL + Apple Metal 3)
- 739K peephole rules (len-2 complete, len-3 37M partial)
📄 Article: The Z80 Compiler That Never Guesses (PDF / EPUB / HTML)
📖 Book outline: 19 chapters — from peephole rules to universal computation chains
📊 Week 1 report: Birthday marathon results — 1500+ sequences, 12 data files, 4× cross-verified
📋 Roadmap: TODO.md — full task list with effort estimates and priority matrix
📓 Daily logs: day 3 · day 4 · day 5 · day 6 · day 7 seed
Given a sequence of Z80 instructions, it exhaustively searches for a shorter equivalent sequence that produces identical register, flag, and memory state for all possible inputs. No heuristics, no pattern databases — provably correct by construction.
Compilers and hand-written Z80 code contain suboptimal patterns that are hard to spot manually. Some are trivial (LD A, 0 is 2 bytes, XOR A is 1), but many are deeply non-obvious:
| Original | Replacement | Savings | Why it works |
|---|---|---|---|
| `SLA A : RR A` | `OR A` | -3B, -12T | Shift left then rotate right = identity + flag set |
| `AND 00h : NEG` | `SUB A` | -3B, -11T | Zero then negate = zero with subtract flags |
| `ADD A, A : RR A` | `OR A` | -2B, -8T | Double then halve = identity + flag set |
| `XOR 0FFh : SBC A, 0FFh` | `NEG` | -2B, -6T | Complement + subtract-with-borrow = negate |
| `SCF : RL A` | `SLL A` | -1B, -4T | Set carry then rotate-through-carry = undoc shift |
| `SET 0, L : DEC HL` | `RES 0, L` | -1B, -6T | Set bit then decrement pair = just clear the bit |
| `ADD A, 80h : OR A` | `XOR 80h` | -1B, -4T | Flip sign bit with correct flag behavior |
| `AND 0FFh : RR A` | `SRL A` | -2B, -7T | Clear carry via AND then rotate = shift right |
These are not things a human would think to try. The superoptimizer finds them because it tries everything.
The search has three tiers, each filtering more aggressively:
```
Enumerate all target sequences (e.g., 17.8M for length-2)
        |
        v
QuickCheck (GPU): 8 test vectors, 80-byte fingerprint
        | rejects 99.99% of candidates instantly
        v
MidCheck (GPU): 24 additional test vectors (32 total)
        | catches ~23% of QuickCheck false positives
        v
ExhaustiveCheck (GPU): sweep all input states (up to 2^24)
        | 256 threads/block, batched per pipeline flush
        | proves exact equivalence for ALL inputs
        v
Confirmed optimization --> JSONL output
```
Run each candidate on 8 carefully chosen test vectors. If the output differs from the target on any vector, the candidate is definitely not equivalent. This eliminates 99.99%+ of the search space in microseconds.
The test vectors are fixed Z80 states covering edge cases: all zeros, all 0xFF, alternating bits, prime-derived values. Eight vectors is enough to reject nearly everything while keeping the fingerprint small (80 bytes).
On the GPU, targets are batched in groups of 512. Each batch tests all ~4,215 candidates simultaneously (512 × 4,215 = 2.1M CUDA threads), comparing against all target fingerprints in a single kernel launch via a bitmap output.
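The fingerprint filter can be sketched in a few lines. This is a purely illustrative toy (two-field state, made-up ops, four vectors); the real fingerprints cover the full Z80 state across the 8 fixed vectors:

```python
# Sketch of the QuickCheck idea: run target and candidate on a few
# fixed test vectors and compare concatenated output states
# ("fingerprints"). A mismatch on any vector proves non-equivalence;
# a match only means "survives to the next tier".

# Toy 2-field state (A, F) and toy "instructions" for illustration.
def run(program, state):
    a, f = state
    for op, arg in program:
        if op == "LD_A":         # A <- imm, flags untouched
            a = arg & 0xFF
        elif op == "XOR_A":      # A <- 0, toy flags cleared
            a, f = 0, 0
        elif op == "ADD_A":      # A <- A + imm, carry-out into F
            r = a + arg
            a, f = r & 0xFF, (r >> 8) & 1
    return (a, f)

# Fixed vectors covering edge cases (toy analogue of the 8 real ones).
TEST_VECTORS = [(0x00, 0), (0xFF, 1), (0xAA, 0), (0x55, 1)]

def fingerprint(program):
    out = bytearray()
    for v in TEST_VECTORS:
        out.extend(run(program, v))
    return bytes(out)

def quickcheck(target, candidate):
    return fingerprint(candidate) == fingerprint(target)
```

Even this toy rejects `LD A, 0` → `XOR A` in full-equivalence mode: on any vector with the flag set, `LD_A` leaves it set while `XOR_A` clears it, so the fingerprints differ.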
QuickCheck's 8 vectors let through ~27% false positives, especially for BIT/RES/SET instructions whose effects are bit-position-specific. MidCheck runs 24 additional test vectors on QuickCheck survivors on the GPU:
- Single-bit A values (0x01, 0x02, 0x04, ...) to distinguish BIT operations
- Per-register bit patterns covering unique bit positions in B, C, D, E, H, L
- Boundary values (0xBF/0xC0, 0x7F/0x80) to catch edge cases
MidCheck runs as a separate GPU kernel on QuickCheck survivors, eliminating most false positives before the expensive ExhaustiveCheck. The 24 extra vectors are defined alongside the original 8 in z80_common.h and verifier.go.
For the ~0-5 candidates that survive both QuickCheck and MidCheck, we prove equivalence by sweeping all possible input combinations on the GPU. Each (target, candidate) pair gets one thread block of 256 threads — thread i handles A=i and loops over carry and extra registers:
- 0 extra registers: 2 iterations per thread (carry only)
- 1 extra register: 512 iterations per thread (256 values x 2 carry)
- 2 extra registers: 131K iterations per thread (256 x 256 x 2)
- 3+ registers or SP: reduced sweep with 32 representative values per register
Pairs are batched (4096 at a time) and dispatched as a single kernel launch. Early termination uses shared memory atomicOr — once any thread finds a mismatch, all threads in the block stop.
If the target and candidate produce identical output (all 11 bytes: A, F, B, C, D, E, H, L, SP, M) for every input state, the optimization is provably correct.
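The sweep-and-compare loop can be sketched on a toy state — a minimal illustration assuming a two-field (A, carry) state and simplified ADD/INC semantics, not the real 11-byte model:

```python
from itertools import product

# Toy analogue of a GPU ExhaustiveCheck block: thread i fixes A=i and
# loops over the carry bit; here we just sweep both in Python.
def run(program, state):
    a, carry = state
    for op, arg in program:
        if op == "ADD":      # A <- A + imm; carry <- carry-out
            r = a + arg
            a, carry = r & 0xFF, (r >> 8) & 1
        elif op == "INC":    # A <- A + 1; carry preserved (like Z80 INC)
            a = (a + 1) & 0xFF
    return (a, carry)

def exhaustive_equal(target, candidate):
    # One mismatch disproves equivalence; surviving the full sweep
    # proves it for every input state.
    return all(run(target, s) == run(candidate, s)
               for s in product(range(256), range(2)))
```

In the toy, `ADD 0` followed by `ADD 2` is provably equivalent to a single `ADD 2` (the first add never produces a carry the second reads), while `ADD 1` is not equivalent to `INC` because INC preserves the carry flag.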
Before testing, we eliminate obviously redundant candidates:
- NOP elimination: sequences containing NOP (unless targeting NOP)
- Self-load detection: `LD A, A` etc.
- Dead write analysis: an instruction's output is overwritten before being read
- Canonical ordering: for independent instructions, enforce a canonical order to avoid duplicate sequences
Pruning uses register dependency bitmasks (opReads/opWrites) for all 394 opcodes.
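The dead-write rule can be sketched with bitmasks. The four-op table below is a hypothetical stand-in for the project's opReads/opWrites data over all opcodes:

```python
# One bit per register/flag group (toy subset).
A, F, B = 1, 2, 4

OPS = {                     # op -> (reads_mask, writes_mask)
    "LD A,0":  (0, A),
    "LD A,B":  (B, A),
    "XOR A":   (0, A | F),
    "ADD A,B": (A | B, A | F),
}

def has_dead_write(seq):
    # A write is dead if a later instruction overwrites the same
    # register before anything reads it; such sequences are pruned
    # before any testing.
    for i, op in enumerate(seq):
        _, written = OPS[op]
        live = written
        for later in seq[i + 1:]:
            reads, writes = OPS[later]
            live &= ~reads           # consumed: no longer a dead write
            if live & writes:
                return True          # overwritten before being read
    return False
```

For example, `["LD A,0", "LD A,B"]` is prunable (the first load is dead), while `["LD A,B", "ADD A,B"]` is not; `["XOR A", "ADD A,B"]` is prunable only through its flag bits, which is exactly the kind of case the bitmasks catch.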
The Z80 executor is an ideal GPU workload: fixed-size 11-byte state (10 registers + 1 virtual memory byte), pure ALU computation, embarrassingly parallel. Both QuickCheck filtering and ExhaustiveCheck verification run on GPU.
```
CPU: enumerate targets        batch 512 targets
     in groups of 512               |
        |                           v
CPU: compute fingerprints ──> GPU Kernel 1: QuickCheck (batched)
     (8 QC + 24 Mid vectors)      512 targets × 4215 candidates = 2.1M threads
                                  bitmap output: one bit per candidate per target
                                  atomicOr for lock-free writes
                                    |
                                    v
                              CPU: collect QC hits, prune, build MidCheck pairs
                                    |
                                    v
                              GPU Kernel 2: MidCheck
                                  one thread per (target, candidate) pair
                                  24 additional test vectors
                                  rejects ~23% of QC survivors
                                    |
                                    v
                              GPU Kernel 3: ExhaustiveCheck
                                  256 threads/block, one block per pair
                                  thread i: A=i, loop carry+regs
                                  shared memory early termination
                                    |
                                    v
                              CPU: output confirmed results as JSONL
```
| Approach | Length-2 time | Results | Speedup |
|---|---|---|---|
| CPU brute force (Apple M2, 24GB) | 3h 16m | 602,008 | 1x |
| CPU brute force (i7, projected) | ~6h | 602,008 | 0.5x |
| Metal compute (Apple M2, 10-core GPU) | cross-verification | — | — |
| CUDA v2 (RTX 4060 Ti) | ~6.5 min (95.5% of search) | 743,309 | ~30x |
| CUDA v2 (2x RTX 4060 Ti) | ~90 min (100% incl. BIT/SET/RES) | est. ~950K | — |
The CUDA search is a complete standalone binary with GPU QuickCheck, GPU MidCheck, GPU ExhaustiveCheck, pruning, disassembly, and JSONL output. All results verified 100% correct against Go CPU ExhaustiveCheck.
Length-2 search breakdown (RTX 4060 Ti, single GPU):
```
Opcodes 0–4027 (95.5%): 386s — ALU, LD, rotate/shift, INC/DEC, most BIT/RES/SET
    11.1M targets tested, 739,249 optimizations found
    QC: 1,009,124 → Mid: 774,347 → Exhaustive: 734,905 → confirmed: 734,905
Opcodes 4028+ (4.5%): heavy BIT/RES/SET area — ~60s per opcode (single GPU)
    Each opcode generates ~3K QC hits → ~2.3K MidCheck survivors → ~2.2K confirmed
    Bottleneck: nextra=2–4 registers requiring 131K–2M iterations/thread in GPU ExhaustiveCheck
```
The first 95% completes in ~6.5 minutes. The BIT/SET/RES tail (ops 4028-4215) is the remaining bottleneck — these opcodes test individual bit positions, creating many near-equivalent candidates that require expensive exhaustive verification with 2-4 extra register sweeps. Dual-GPU parallelism halves this tail.
| Stage | Test vectors | Survivors | Rejection rate |
|---|---|---|---|
| QuickCheck (GPU) | 8 | 1,015,036 of 11.1M targets | 99.99% of candidates rejected |
| MidCheck (GPU) | +24 (32 total) | 778,892 | 23% of QC hits rejected |
| ExhaustiveCheck (GPU) | full sweep (up to 2^24) | 739,249 confirmed | 5% of Mid hits rejected |
The 24 MidCheck vectors include single-bit A values, per-register bit patterns, and boundary values, specifically targeting BIT/RES/SET false positives that share identical 8-vector fingerprints. Overall false positive rate from QuickCheck to confirmed: 27%.
```shell
# Build CUDA search v2 (3-stage batched pipeline)
nvcc -O2 -o cuda/z80search_v2 cuda/z80_search_v2.cu

# Run length-2 search (output JSONL to stdout, progress to stderr)
cuda/z80search_v2 --max-target 2 > results.jsonl 2>progress.log

# Dual GPU: split the outer loop across GPUs
cuda/z80search_v2 --max-target 2 --gpu-id 0 --first-op-end 2107 > r0.jsonl 2>log0.txt &
cuda/z80search_v2 --max-target 2 --gpu-id 1 --first-op-start 2107 > r1.jsonl 2>log1.txt &
wait
cat r0.jsonl r1.jsonl > results-all.jsonl

# Dead-flags relaxation
cuda/z80search_v2 --max-target 2 --dead-flags 0x28 > results-deadflags.jsonl 2>log.txt

# Verify CUDA results against CPU reference implementation
z80opt verify-jsonl results.jsonl
```

All CUDA results are verified bit-exact against the Go CPU implementation:
- 743,309 length-2 results produced by v2 GPU pipeline (96% of search complete)
- 5,060 results verified against CPU ExhaustiveCheck — 100% pass rate (1,000 random sample + 4,060 from GPU 1 partition)
- 23,772 early results (v1 run) cross-verified — 100% agreement
The CUDA kernel and Go executor were developed independently and tested against each other across all 394 opcodes. The flag tables, ALU operations, and DAA implementation match exactly.
For sequences beyond length 3, brute force is infeasible (74.8 billion targets at length 3, 315 trillion at length 4). We implement STOKE-style MCMC stochastic search:
- Start from the target sequence (or random)
- Randomly mutate: replace instruction, swap, insert, delete, change immediate
- Evaluate cost: correctness (test vector mismatches) + size (byte count)
- Accept/reject via Metropolis-Hastings (always accept improvements, sometimes accept regressions)
- Repeat until a shorter, fully-correct candidate is found
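The accept/reject step can be sketched as follows. This is a minimal illustration: `BETA` and the shape of the cost function are assumptions, not the project's tuned values — in the real search, cost combines test-vector mismatches (weighted heavily) with encoded byte count, so a fully-correct shorter candidate has strictly lower cost:

```python
import math
import random

BETA = 1.0  # hypothetical inverse temperature

def accept(old_cost, new_cost, rng=random.random):
    if new_cost <= old_cost:
        return True                  # always accept improvements
    # Sometimes accept regressions, with probability exp(-BETA * delta);
    # this is what lets the chain escape local minima.
    return rng() < math.exp(-BETA * (new_cost - old_cost))

def search(initial, mutate, cost, iterations=100_000):
    cur, cur_cost = initial, cost(initial)
    best, best_cost = cur, cur_cost
    for _ in range(iterations):
        cand = mutate(cur)           # replace / swap / insert / delete
        cand_cost = cost(cand)
        if accept(cur_cost, cand_cost):
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
    return best
```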
```shell
z80opt stoke --target "LD A, 0" --dead-flags all -v
z80opt stoke --target "AND 0xFF : ADD A, B" --chains 8 --iterations 1000000
```

The default full-equivalence mode correctly rejects LD A, 0 -> XOR A because XOR A modifies flags while LD A, 0 does not. The --dead-flags option adds a second tier of rules tagged with which flag bits must be "dead" (not read before being overwritten) for the rule to be valid.
```shell
# Registers-only equivalence (ignore all flags)
z80opt stoke --target "LD A, 0" --dead-flags all -v
# Result: LD A, 0 -> XOR A (-1 byte, dead flags: all)

# Ignore only undocumented flag bits 3 and 5
z80opt enumerate --max-target 2 --dead-flags undoc -v
```

The consumer (peephole optimizer) uses liveness analysis to determine which flags are dead at each program point before applying these rules.
From 11.1M length-2 target sequences tested (96% of search space). 743,309 provably correct optimizations found.
| Bytes saved | Count |
|---|---|
| 3 bytes | 1,212 |
| 2 bytes | 580,937+ |
| 1 byte | 19,859+ |
More highlights:
| Original | Replacement | Savings | Insight |
|---|---|---|---|
| `SRL A : RL A` | `OR A` | -3B, -12T | Shift right then rotate left = identity + flag set |
| `LD A, 00h : SLA A` | `XOR A` | -3B, -11T | Load zero then shift = zero with logic flags |
| `SRL A : SLL A` | `OR 01h` | -2B, -9T | Shift right then undoc-shift-left = set bit 0 |
| `CPL : NEG` | `SUB 0FFh` | -1B, -5T | Complement then negate = increment with sub flags |
| `SCF : ADC A, 00h` | `ADD A, 01h` | -1B, -4T | Set carry then add-with-carry zero = add 1 |
| `RES 0, L : INC HL` | `SET 0, L` | -1B, -6T | Clear bit then increment = just set the bit |
| `RES 7, A : AND A` | `AND 7Fh` | -1B, -5T | Clear bit 7 then flag-set = mask to 7 bits |
| `SET 0, A : OR A` | `OR 01h` | -1B, -5T | Set bit 0 then flag-set = OR with 1 |
| `LD A, 7Fh : SLL A` | `OR 0FFh` | -2B, -8T | Load 0x7F then undoc-shift = set all bits |
| `ADD A, 00h : RL A` | `SLA A` | -2B, -7T | Clear carry then rotate = shift left |
The full results are in rules.json (743K+ rules, search 96% complete).
455 opcodes across 5 implementation waves:
| Wave | Opcodes | What |
|---|---|---|
| V1 | 206 | 8-bit loads, ALU, rotates/shifts, specials (NOP, DAA, CPL, NEG, SCF, CCF) |
| Wave 1 | +174 | BIT/RES/SET n,r — all CB-prefix bit manipulation |
| Wave 2 | +14 | 16-bit pair ops (INC/DEC rr, ADD HL,rr, EX DE,HL) |
| Wave 4 | +12 | LD rr,nn, ADC/SBC HL,rr (ED prefix) |
| Wave 5 | +61 | Memory ops: LD r,(HL), LD (HL),r, ALU (HL), INC/DEC (HL), BIT/RES/SET (HL), rotates (HL), LD A,(BC/DE), LD (BC/DE),A |
Target search space: 4,215 instructions per position (8-bit register ops with immediates). Candidate search space: 266,359 instructions per position (including 16-bit immediates).
The Z80 accesses memory through register-pair pointers: (HL), (BC), (DE). Rather than emulating a full 64KB address space, we model memory as a single virtual byte M in the state. All indirect memory instructions read/write this same M register:
State: A, F, B, C, D, E, H, L, SP, M (11 bytes)
| Instruction | Effect |
|---|---|
| `LD A, (HL)` | A = M |
| `LD (HL), A` | M = A |
| `ADD A, (HL)` | A += M (with flags) |
| `INC (HL)` | M++ (with flags) |
| `BIT 3, (HL)` | test bit 3 of M |
| `LD A, (BC)` | A = M (same address assumption) |
| `LD (DE), A` | M = A (same address assumption) |
Prerequisite: all memory-accessing instructions in a sequence must target the same physical address. The user applying an optimization is responsible for verifying this — the superoptimizer proves correctness under the same-address assumption.
This unlocks optimizations involving memory, e.g., LD A,(HL) : INC A : LD (HL),A patterns and memory-register interactions that pure register search cannot find.
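The M model can be sketched with a toy interpreter — an illustration only, ignoring flags, with the shorter form below chosen to demonstrate the model rather than quoted from the rules database:

```python
# Every (HL) access reads or writes the same virtual byte M,
# under the same-address assumption.
def run(program, a, m):
    for op in program:
        if op == "LD A,(HL)":
            a = m
        elif op == "LD (HL),A":
            m = a
        elif op == "INC A":
            a = (a + 1) & 0xFF
        elif op == "INC (HL)":
            m = (m + 1) & 0xFF
    return a, m

# Read-modify-write pattern vs. a shorter form (illustrative):
long_form  = ["LD A,(HL)", "INC A", "LD (HL),A"]   # 3 instructions
short_form = ["INC (HL)", "LD A,(HL)"]             # 2 instructions
```

Both leave A = M+1 and M = M+1 for every input, which a sweep over all (A, M) pairs confirms — exactly the kind of memory-register equivalence pure register search cannot express.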
- MH (high memory byte): for 16-bit memory ops (PUSH/POP, LD (nn),HL, EX (SP),HL) — M becomes ML+MH
- S0, S1 (stack bytes): for PUSH/POP — tracks 2 bytes of stack
- Masking: unused memory fields (MH, S0, S1) are masked out in comparisons, like dead-flags masking. Smaller state = faster search.
Different subsets of the instruction set enable focused search with dramatically smaller search spaces:
| Core | Registers | Instructions | Search space (len-3) |
|---|---|---|---|
| Full | A,F,B,C,D,E,H,L,SP,M | ~4,500 | 91 billion |
| Register-only | A,F,B,C,D,E,H,L,SP | ~4,215 | 74.8 billion |
| AF (accumulator) | A,F | ~80 | 512K |
| HL+M (pointer) | H,L,M + A,F | ~150 | 3.4 million |
The AF core is particularly promising — it covers all accumulator ALU ops, rotates, DAA, NEG, SCF, CCF. With only ~80 instructions, length-5 brute force becomes feasible (~3.3 billion targets), enabling discovery of 5→1 and 5→2 accumulator optimizations that are completely out of reach for full search.
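A quick sanity check of the counts quoted in this section:

```python
# Back-of-envelope check of the search-space figures.
full_len2 = 4215 ** 2      # length-2, full instruction set
full_len3 = 4215 ** 3      # length-3, full instruction set
af_len5 = 80 ** 5          # length-5, AF-only core

assert full_len2 == 17_766_225         # ~17.8M
assert full_len3 == 74_884_638_375     # ~74.8 billion
assert af_len5 == 3_276_800_000        # ~3.3 billion
```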
Every optimization is provably correct — verified against all possible input states, not just test cases. This is qualitatively different from hand-written peephole rules.
Flag behavior is ported from remogatto/z80, including:
- Undocumented flags (bits 3 and 5)
- Half-carry lookup tables for 8-bit and 16-bit arithmetic
- Overflow detection via lookup tables
- Correct DAA implementation
- Undocumented SLL instruction (CB 30-37)
The CUDA and Go implementations are verified bit-exact against each other across all 394 opcodes and all 8 test vectors.
```shell
# Build the Go CLI
go build -o z80opt ./cmd/z80opt

# Find optimal replacement for a specific sequence
z80opt target "AND 0xFF"
z80opt target "SUB A : LD A, 0"

# CPU brute-force enumeration
z80opt enumerate --max-target 2 --output rules.json -v

# With dead-flags relaxation
z80opt enumerate --max-target 2 --dead-flags all --output rules-deadflags.json -v

# STOKE stochastic search for longer sequences
z80opt stoke --target "LD A, 0" --dead-flags all -v
z80opt stoke --target "AND 0xFF : ADD A, B" --chains 8 --iterations 1000000

# Verify rules
z80opt verify rules.json

# Verify CUDA JSONL results against CPU
z80opt verify-jsonl results.jsonl

# CUDA GPU search (requires nvcc)
nvcc -O2 -o cuda/z80search cuda/z80_search.cu
cuda/z80search --max-target 2 > results.jsonl
```

```
cmd/z80opt/          CLI (enumerate, target, verify, verify-jsonl, export, stoke)
pkg/cpu/             Z80 state (11 bytes: regs + M) + executor (2.7ns/op, 0 alloc)
pkg/inst/            Instruction catalog (455 opcodes, encoding, timing)
pkg/search/          Verifier, enumerator, pruner, fingerprint map, workers
pkg/stoke/           STOKE stochastic superoptimizer (MCMC search)
pkg/gpu/             GPU integration layer (CUDA process, search orchestration)
pkg/gpu/shader/      WGSL compute shader (1171 lines, full Z80 executor)
pkg/result/          Rule storage, checkpoint, JSON output
cuda/                CUDA kernels and standalone search binaries
  z80_common.h       Shared Z80 executor, flag tables, test vectors (8 QC + 24 MidCheck)
  z80_quickcheck.cu  GPU QuickCheck kernel (pipe mode for Go interop)
  z80_search.cu      v1 standalone search (per-target dispatch, CPU ExhaustiveCheck)
  z80_search_v2.cu   v2 batched pipeline (512-target batches, GPU ExhaustiveCheck)
docs/                Research roadmap, ADRs, implementation plan
```
```
Length 1:  4,215 targets             → trivial
Length 2:  4,215^2 = 17.8M targets   → 3h CPU, ~6.5min GPU v2   96% DONE
Length 3:  4,215^3 = 74.8B targets   → months CPU, hours GPU v2  next
Length 4:  4,215^4 = 315T targets    → STOKE only
Length 5+: combinatorial explosion   → STOKE only
```
- Length-3 GPU search — 74.8 billion targets. `--no-exhaust` mode outputs MidCheck survivors in ~5 hours (dual GPU), with ExhaustiveCheck verification distributed separately. Estimated ~60M candidates, from which ~30M confirmed optimizations.
- Memory-aware search — Wave 5 memory ops integrated into CPU executor and verifier. CUDA GPU search and STOKE next.
- Focused search cores — AF-only core enables length-5 brute force (~3.3B targets). HL+M core enables memory optimization search at manageable scale.
- PUSH/POP + 16-bit memory — Wave 6: add MH, S0, S1 to state with masking. Covers PUSH/POP, EX (SP),HL, LD (nn),HL.
- STOKE + memory — stochastic search for longer sequences (5→3, 8→4) with memory instructions. Feed real Z80 code snippets as targets.
- Distributed verification — split MidCheck candidate JSONL files for parallel CPU verification across machines. Community-distributable.
- Reordering optimizer — apply discovered rules to real Z80 code via dependency DAG analysis. Handles interleaved unrelated instructions.
We're expanding from instruction-level optimization to function-level and algorithmic-level bruteforce. See BRUTEFORCE_ROADMAP.md for the full plan.
Shipped — complete:
- Peephole rules — 739K len-2→len-1 (complete), 37M len-3 (partial)
- Register allocation — 83.6M shapes (37.6M feasible), enriched with 15 op-aware metrics. O(1) lookup integrated into MinZ-VIR compiler
- Constant multiply ×2..×255 — 254/254 ALL DIRECT, 8-bit and 16-bit tables
- Constant division ÷2..÷255 — 254/254, 6 methods, avg 79T. GPU-discovered `carry_compare` for K≥128
- Modulo ÷2..÷255 — 254/254 mod8, 254/254 divmod8
- Branchless library — sign8 (43T), sat_add8 (16T!), sat_sub8 (20T), ABS/MIN/MAX/CMOV
- 16-bit arithmetic — abs16 (44T), neg16 (27T), min16/max16 (41T), sign16 (20–34T)
- 32-bit arithmetic — 13 u32 operations (DEHL convention), SHL32/SHR32 proven optimal
- SHA-256 decomposition — 58ms/block @3.5MHz from u32 primitives
- Image search — 3 CUDA generators, 29-experiment gallery
Planned targets:
- ZX Spectrum screen address — Y→VRAM address, the holy grail of Spectrum programming
- Bit reverse, popcount, BCD conversion — bit manipulation functions
- Approximate sin/cos — polynomial approximation in minimal Z80 instructions (allow ±2 error)
- HL×K constant multiply — 16-bit × constant, full 16-bit result
Philosophy: "Let the GPU try everything. Ship the winners."
See TODO.md for the full roadmap with effort estimates, priority matrix, and milestone history. See docs/NEXT.md for search architecture diagrams and academic references.
This project builds on five generations of superoptimizer research:
| Year | Work | Approach |
|---|---|---|
| 1987 | Massalin | Exhaustive enumeration (the original superoptimizer) |
| 2002 | Denali | Goal-directed search, equality saturation |
| 2006 | Bansal & Aiken | Automatic peephole rule generation for GCC |
| 2013 | STOKE | MCMC stochastic search, outperforms gcc -O3 |
| 2016 | Lens/Souper | Decomposition + SMT solving |
Most prior superoptimizer work targets x86/x86-64 or custom IR. This project combines several ideas that, to our knowledge, haven't been applied together to a retro ISA:
- GPU-accelerated search pipeline — porting the full Z80 executor to CUDA for both QuickCheck (fingerprint filtering) and ExhaustiveCheck (exhaustive verification). Prior work (STOKE, Lens) runs on CPU only. The Z80's small fixed-size state (11 bytes including virtual memory) makes it unusually GPU-friendly.
- Three-tier verification — QuickCheck (8 vectors, GPU) → MidCheck (32 vectors, GPU) → ExhaustiveCheck (full sweep, GPU) — all three stages run on GPU, eliminating false positives progressively. Combined with STOKE for sequences beyond brute-force range.
- Dead-flags optimization tier — discovering optimizations that are only valid when certain flags are dead, tagged with the exact flag mask. This bridges the gap between the superoptimizer (which proves equivalence) and the peephole optimizer (which needs liveness information to apply flag-clobbering rules safely).
- Complete Z80 flag accuracy — including undocumented bits 3 and 5, half-carry lookup tables, and the undocumented SLL instruction. Most Z80 tools skip these; we need them because the superoptimizer must prove exact equivalence.
- Memory as virtual register — modeling indirect memory access `(HL)` as a virtual register M, with a same-address assumption. This extends superoptimization to memory-touching instructions without requiring a full memory model, adding 61 opcodes with minimal state growth.
- Scale — 761,621+ provably correct optimizations from a single ISA, with GPU search enabling complete coverage of length-3 sequences (74.8 billion targets). This is significantly more rules than prior peephole superoptimizer work has produced.
Whether this constitutes a publishable contribution depends on the venue — a workshop paper at CGO, CC, or a retro-computing venue like VCFW could work. The GPU-accelerated QuickCheck technique generalizes to any small-state ISA (6502, 8080, ARM Thumb subset, RISC-V compressed).
Superoptimization itself is well-established (1987+), and GPU compute is standard. The specific combination — GPU-parallel QuickCheck with dead-flags tagging for retro ISAs — is likely too incremental for a utility patent. More importantly, keeping it open source under MIT maximizes impact. Anyone optimizing Z80/6502/8080 code can use the technique and the 743K+ rules directly.
The project has expanded from instruction-level peephole optimization to three additional exhaustive search domains:
GPU brute-force over all possible register allocation constraint shapes. For each combination of virtual register count, widths, location sets, and interference graph, find the provably optimal physical register assignment — or prove that no valid assignment exists.
| Level | Shapes | GPU time | Feasible |
|---|---|---|---|
| ≤4 vregs | 156,506 | 40 sec | 78.9% |
| ≤5 vregs | 17,366,874 | 20 min | 67.7% |
| 6 vregs (dense, tw≥4) | 66,118,738 | ~6 hours | TBD |
Key findings:
- Feasibility phase transition: 95.9% feasible at 2 vregs → 0.9% at 6 vregs. The Z80 register file "fills up" — 99.1% of all possible 6-vreg constraint shapes have no valid assignment.
- Treewidth analysis: 99.5% of random interference graphs have treewidth ≤3 (classically tractable). But compiler-generated graphs are denser — 53.7% of real-world dense functions have treewidth ≥4.
- Composition verification: Splitting 5-vreg shapes at cut vertices and composing from the 4-vreg table works with max 12 T-state overhead. Verified on 13.2M shapes with zero misses.
- Five-level pipeline: Table lookup (O(1)) → graph decomposition → GPU brute-force → CPU backtracking (1000-4000x pruning) → island decomposition + Z3. Every function hits one of these levels.
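The fallback chain can be sketched with hypothetical interfaces — the real levels are the table lookup, graph decomposition, GPU brute-force, CPU backtracking, and island + Z3 solvers named above:

```python
# Try the cheapest solver first; fall through to the next on a miss.
# Each level returns an assignment (or infeasibility proof) or None.
def allocate(shape, levels):
    for solve in levels:
        result = solve(shape)
        if result is not None:
            return result
    raise RuntimeError("no level handled the shape")

# Toy levels: an O(1) table lookup backed by an always-answering solver.
table = {("v0", "v1"): "A,B"}
levels = [
    table.get,                                    # level 1: table lookup
    lambda shape: "fallback:" + "+".join(shape),  # levels 2-5 stand-in
]
```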
Tables are in data/ as compact binary + zstd compression (8.5MB for ≤5v). See data/README.md for the binary format spec and reader examples.
```shell
# Generate all possible constraint shapes for ≤5 virtual registers
./regalloc-enum --max-vregs 5 > shapes.jsonl

# GPU-solve each shape (finds optimal assignment or proves infeasible)
cat shapes.jsonl | cuda/z80_regalloc --server > results.jsonl

# For 6v: only enumerate dense shapes (treewidth ≥ 4) — 1.7% of 6v space
./regalloc-enum --max-vregs 6 --only-nv 6 --dense-masks dense_6v_masks.txt > shapes_6v.jsonl
```

GPU search for the shortest Z80 instruction sequence that computes A × K for each constant K (2–255).
| Search depth | Constants solved | Pool | GPU time |
|---|---|---|---|
| ≤8 instructions | 103 / 254 | 21 ops | ~30 sec |
| ≤9 instructions | 164 / 254 | 14 ops (reduced) | ~3 min |
| ≤10 instructions | 249 / 254 | 14 ops | ~15 hours |
| ≤11 instructions | 254 / 254 | 14 ops | ~48 hours |
All 254 constants solved by direct GPU search — no composition needed. Last 5 constants (170, 171, 173, 179, 181) found at length 11. Every multiply sequence is provably optimal.
Instruction pool reduction: Analysis of which instructions appear in optimal solutions revealed that 7 of 21 candidate ops are never used. Removing them shrinks the search space by 38x at length 9 (21^9 → 14^9). The removed ops are strictly dominated (e.g., SLA A = ADD A,A but costs 8T instead of 4T) or never contribute to optimal sequences (OR A, SCF, EX AF,AF').
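The 38× figure follows directly from the pool sizes:

```python
# Shrink factor at length 9 from dropping 7 of 21 candidate ops.
full, reduced = 21 ** 9, 14 ** 9

assert full == 794_280_046_581
assert reduced == 20_661_046_784
assert full // reduced == 38       # ~38.4x fewer sequences to test
```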
Notable results:
- ×255 = `NEG` (1 instruction, 8T) — negate is multiply by −1 (mod 256)
- ×252 = `RLA : NEG : ADD A,A` (3 instructions, 16T)
- ×127 = `RRCA : LD B,A : ADC A,B : NEG : RRA : SUB B` (6 instructions, 28T)
```shell
# Find optimal multiply for all constants (14-op reduced pool)
cuda/z80_mulopt_fast --max-len 9 --json > mulopt_results.json

# Search for a specific constant
cuda/z80_mulopt_fast --k 42 --max-len 10
```

Analytical multiply-and-shift: A÷K = (A × M) >> S, composed from the mul16 table.
- 254/254 divisors solved via 6 methods (v3): shift(5), mul_shift(30), preshift_mul(36), mul_add256_shift(41), double_mul_shift(15), carry_compare(127)
- T-states: 8–188 (avg 79T). Beats SDCC generic __div8 (80–200T). All exhaustively verified, 4× cross-verified.
- carry_compare (K≥128): `OR A; LD B,(256-K); ADC A,B; SBC A,A; AND 1` = 5 ops, 26T — GPU-discovered, not in any Z80 reference
- PRESHIFT (K<128): `(A>>P)×M>>S` — exploit K's power-of-2 factor. div86 = 60T (was 170T)
- mod8 and divmod8 tables also complete (254/254 each).
- GPU brute-force confirmed: div3/div10 not findable at max-len=8 (need 9+ instructions = beyond brute-force window).
Division by powers of 2 is trivial (SRL A repeated). Modulo by powers of 2 is trivial (AND mask).
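The div3 and div10 identities quoted above can be re-verified exhaustively in a couple of lines, in the same prove-by-sweep style the GPU searches use:

```python
# Exhaustive verification over the full 8-bit input range.
assert all((a * 171) >> 9 == a // 3 for a in range(256))          # div3
assert all(((a >> 1) * 103) >> 9 == a // 10 for a in range(256))  # div10
```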
```shell
# Search for division by constant
cuda/z80_divmod_fast --div 10 --max-len 10 --json

# Search for modulo by constant
cuda/z80_divmod_fast --mod 7 --max-len 10 --json

# Search for both quotient and remainder
cuda/z80_divmod_fast --divmod 10 --max-len 10 --json
```

Multiple GPUs across four machines, running searches in parallel:
| Machine | GPU | VRAM | API | Use |
|---|---|---|---|---|
| main | 2× RTX 4060 Ti | 16GB each | CUDA 12.0 | Regalloc tables, peephole search |
| secondary | RTX 2070 | 8GB | CUDA 12.0 | Mulopt, divmod search |
| AMD node | Radeon RX 580 | 8GB | OpenCL + Vulkan | Cross-vendor verification |
| Apple M2 | M2 (10-core GPU) | 24GB unified | Metal 3 | CPU baseline, Metal cross-verification |
- `docs/glossary.md` — Complete glossary of all terms and abbreviations used in the project
- `docs/paper_seed_superopt.md` — Paper/book seed: instruction pool reduction, register expansion analysis, meet-in-the-middle strategies, cross-architecture transfer, phase transitions
- `docs/research_statement.md` — Research framing with phase diagram data
- `data/README.md` — Binary table format spec with Python/Go reader examples
- Massalin 1987 — Superoptimizer: A Look at the Smallest Program. ASPLOS '87.
- Bansal & Aiken 2006 — Automatic Generation of Peephole Superoptimizers. ASPLOS '06.
- STOKE 2013 — Stochastic Superoptimization. ASPLOS '13.
- Lens 2016 — Scaling up Superoptimization. ASPLOS '16.
- remogatto/z80 — Z80 emulator (flag behavior reference).
MIT


