Skip to content

Releases: oisee/z80-optimizer

v1.3.0 — Paper v2.2, IX/IY halves, Multi-Platform DSL, Image Search

29 Mar 07:27

Choose a tag to compare

v1.3.0 Release

Paper: "Register Allocation as a Solved Game" v2.2

  • regalloc_paper.pdf (A4) — 1600+ lines, 15 chapters, 12 Mermaid diagrams
  • regalloc_paper_a5.pdf (A5) — for tablets/phones
  • regalloc_paper.epub — with rendered diagrams (700KB)
  • NEW: Clear data breakdown table (83.6M → 37.6M → 78MB)
  • NEW: Multi-platform GPU section (ISA DSL → CUDA/Vulkan/Metal/OpenCL)
  • NEW: Corpus validation (820 functions, 246 signatures)

IX/IY Half Registers Declared Production-Safe

IXH, IXL, IYH, IYL work on all known Z80 silicon. 11 registers for allocation.
H↔IXH transfer via EX DE,HL trick (16T). IX/IY = bridges between main/shadow banks.

Corpus Bias Discovery

  • VIR corpus (820 funcs): 79% ≤6v — biased toward small functions
  • FatFS analysis: 16 functions, ALL >10v (est. 10-35 vregs)
  • Greedy partition for v10-v32 in <30ms each

pRNG Image Search Gallery

  • CNN-guided search: 7 targets × top-10 seeds (160 images)
  • Dithered target with VGG perceptual loss (Introspec's BB approach)
  • 6 synthetic targets: skull, heart, star, smiley, tree, fish
  • ISA DSL: single source → 4 GPU backends, 3 vendors

Also included

  • chronicle.epub (862 lines, 11 chapters, days 1-3)
  • z80_compiler_never_guesses.epub/pdf (original article)

Full changelog: v1.2.0...v1.3.0

v1.2.0 — Day 3: Enriched Tables, O(1) Regalloc, Branchless Library

28 Mar 23:08

Choose a tag to compare

Day 3 Birthday Marathon Release

New: Enriched Register Allocation Tables (37.6M shapes)

  • data/enriched_{4v,5v,6v_dense}.enr.zst — 78MB compressed
  • 15 operation-aware cost metrics per shape
  • 43% of "feasible" shapes lack accumulator (hidden ALU infeasibility!)
  • Smart CALL save: 17T avg (was 34T) = 50% reduction
  • Width-aware: u8/u16/u32 scoring

New: O(1) Regalloc Architecture

  • Signature: (interference_shape, operation_bag) → hash lookup
  • 91% of functions resolve in O(1), 8% GPU partition, 1% Z3
  • Validated on 820-function production compiler corpus
  • 246 unique signatures (prediction <500 confirmed)

New: GPU Partition Optimizer

  • Brute-force optimal graph splits for 7-20v functions
  • All 172 corpus functions (7v+) partitioned optimally
  • 14v in 0.7s, 19v in 7min, 20v in 30min

New: Paper v2 — "Register Allocation as a Solved Game"

  • regalloc_paper.pdf — 1574 lines, 15 chapters, 12 Mermaid diagrams
  • Updated with corpus validation and partition results
  • regalloc_paper.epub — mobile-friendly version

New: Branchless Library (exhaustive verified)

  • ABS (6i,24T), MIN/MAX (8i,32T), CMOV (6i,24T)
  • div3 EXACT: A×171>>9 (zero error, no lookup table!)
  • Z flag WRITE-ONLY proof (CY = only viable branchless bool)
  • SBC A,A mask trick = foundation of all branchless Z80

New: Focused Search + Vulkan

  • gray_decode EXACT (13 ops, <1s on Vulkan RX 580)
  • sqr_hi ±29, cbrt ±16, sin_q1 ±68
  • Per-target minimal op pools: fewer ops → deeper search

Updated: Chronicle

  • chronicle.epub — 862 lines, 11 chapters, days 1-3

Also included (from v1.0.0/v1.1.0)

  • z80_compiler_never_guesses.epub/pdf/html — original article

Full changelog: v1.1.0...v1.2.0

v1.1.0 — Day 2: FP16, BCD, Multi-Target GPU Search & Chronicle

27 Mar 22:53

Choose a tag to compare

Day 2 Birthday Marathon Release

New: Chronicle (EPUB)

  • chronicle.epub — "Книга о том, как мы за два дня доказали всю арифметику Z80" (ru)
  • 770 lines, 10 chapters — narrative of the entire birthday marathon

New: FP16/BCD Library

  • pkg/fp16/ — FP16/FP24 floating point + BCD arithmetic for Z80
  • BCD H-flag fix: unlocked BCD×2 = ADD A,A; DAA (2 ops, 8T)

New: Multi-Target GPU Search

  • One GPU pass tests 15 non-linear functions simultaneously
  • 13-op trimmed pool (12× faster than 16-op)
  • 3-way GPU split across cluster
  • Targets: sin, cos, sqrt, log2, gamma, smoothstep, triangle...

New: 21-Op Universal Kernel

  • Minimum instruction set for ALL optimal Z80 arithmetic
  • Only 2.7% of ISA generates every known optimal sequence

New: Analytics

  • Cross-table analysis of 755 optimal sequences
  • "What 755 Optimal Sequences Teach Us About Computation"

Data Enrichment

  • mul16 table: clobber_mask added (all 254 preserve A, all DE-safe)
  • 16-bit arithmetic idioms + BCD idioms (GPU-proven)

Also included

  • z80_compiler_never_guesses.epub/pdf/html — original article from v1.0.0

Full changelog: v1.0.0...v1.1.0

v1.0.0 — Birthday Release 🎂

26 Mar 21:24

Choose a tag to compare

z80-optimizer v1.0.0 — Birthday Release 🎂

GPU Brute-Force Superoptimizer for Z80

The first Z80 compiler toolchain that never guesses — every optimization is provably optimal, verified by exhaustive GPU search.

Highlights

372 Provably Optimal Arithmetic Sequences

  • 254/254 constant multiplies (u8, avg 8× faster than shift-and-add)
  • 254/254 16-bit multiplies (u8×K→u16, complete in 30 seconds)
  • 118/120 constant divisions (u8/K, avg 2.5× faster than general loop)
  • 15 branchless idioms (abs, bool, sign, not, lsb, ...)

83.6 Million Register Allocations

  • Complete exhaustive table for ≤6 virtual registers (32MB compressed)
  • Feasibility phase transition: 96% feasible at 2v → 0.9% at 6v
  • Five-level pipeline: O(1) table → composition → GPU → backtracking → Z3

739K Peephole Rules

  • Every length-2 instruction pair tested against all possible inputs
  • SLA A : RR A → OR A (save 12T, 3 bytes) and 738,999 more

4-GPU, 3-Vendor Cluster

  • NVIDIA RTX 4060 Ti × 2 (CUDA)
  • NVIDIA RTX 2070 (CUDA)
  • AMD Radeon RX 580 (OpenCL + Vulkan via Mesa)
  • Apple M2 (Metal)
  • All results cross-verified across platforms

Multi-Backend GPU DSL

  • pkg/gpugen/: one ISA definition → CUDA / Metal / OpenCL / Vulkan
  • Z80 and 6502 ISA definitions included
  • --mask / --disable for op pool pruning

Key Discoveries

Discovery Impact
Pool reduction 21→3 ops 13,600× speedup for u16 multiply
Guided brute-force Abstract chains → 6-op GPU = div3 in 11 sec
Prefix sharing 51-86% code compression for multiply library
Branchless ABS(A) 6 insts, 24T — no branch penalty
Sign-extend in 3 insts ADC overflow → SBC mask (12T)
NEG HL via EX DE,HL 3 insts, 23T (improved from 5/28T)
Feasibility cliff at 6v 99.1% of constraint shapes are impossible
Treewidth analysis Random≠real: compiler graphs are denser

Data Files (data/)

File Entries Size
exhaustive_4v.bin.zst 156K regalloc shapes 64KB
exhaustive_5v.bin.zst 17.4M regalloc shapes 8.5MB
exhaustive_6v_dense.bin.zst 66.1M regalloc shapes 22.5MB
mulopt8_clobber.json 164 multiply sequences with B-clobber info
mulopt16_complete.json 254 u16 multiply sequences 3-op basis
div8_optimal.json 118 division sequences guided brute-force
peephole_top500.json top 500 peephole rules by cycle savings
chains_mul8_mod256.json 254 abstract multiply chains ISA-independent
chains_div8.json 86 abstract division chains compose from mul
mul8_library.asm Z80 assembly library prefix-shared

Go Packages

import "github.com/oisee/z80-optimizer/pkg/mulopt"   // Emit8, Emit16, Lookup
import "github.com/oisee/z80-optimizer/pkg/regalloc"  // LoadBinary, IndexOf
import "github.com/oisee/z80-optimizer/pkg/peephole"  // Lookup, LoadRules
import "github.com/oisee/z80-optimizer/pkg/gpugen"    // ISA DSL → 4 backends

Documentation

  • docs/glossary.md — complete term reference
  • docs/insights.md — experiment log with discoveries
  • docs/paper_seed_superopt.md — paper/book seed (8 sections)
  • docs/sprint_129.md — sprint plan
  • docs/book_outline.md — book outline (19 chapters)
  • docs/gpugen.md — multi-backend DSL guide
  • docs/m2_onboarding.md — new machine setup
  • data/README.md — binary format spec + Python/Go readers

The Birthday Team 🎂

Built in one marathon session across 5 machines, 4 APIs, 3 GPU vendors, and multiple Claude sessions collaborating via ddll.

Packed Multiply/Division Library

The 254 multiply and 118 division sequences share instruction prefixes extensively. The packed library uses multiple entry points via labels — each constant starts at its label and falls through shared code to a common RET:

mul104:                    ; entry for ×104
mul52:  ADD A,A            ; entry for ×52 (×104 = ×52 × 2)
mul26:  ADD A,A            ; entry for ×26
mul24:  ADD A,B            ; entry for ×24
mul12:  ADD A,A            ; entry for ×12
mul6:   LD B,A             ; entry for ×6
        ADD A,B
        ADD A,B
mul2:   RLA                ; entry for ×2
        RET                ; shared return — 7 constants, 9 instructions!

mul8 library: 594 bytes for 164 constants (51% compression)
mul16 library: ~500 bytes for 254 constants (86% compression)
Combined: ~1.1KB for ALL optimal multiplies = 2.2% of ZX Spectrum 48KB

For division: same principle — reciprocal multiply phases overlap between divisors sharing the same magic constant M.

Runtime dispatch: page-aligned jump table (256 bytes) + packed code.
LD H, page / LD L, K / JP (HL) = single-instruction dispatch to optimal sequence.

v0.1.0 — 602,008 Z80 optimizations found

28 Feb 15:24

Choose a tag to compare

What is this?

A brute-force superoptimizer for the Zilog Z80 processor.

Given any sequence of Z80 instructions, it exhaustively searches for a shorter equivalent — one that produces identical register and flag state for every possible input. No heuristics, no pattern databases, just brute force with smart pruning.

First run results

602,008 provably correct optimizations found from 8.4M length-2 target sequences.

  • 34.7 billion comparisons checked
  • 3 hours 16 minutes on Apple M2
  • 83 unique transformation patterns discovered
  • Results: rules.json.gz (3.4MB compressed, 102MB uncompressed)

Highlights

Original Replacement Savings Insight
SLA A : RR A OR A -3B, -12T Shift left then rotate right = identity + flag set
SRL A : RL A OR A -3B, -12T Shift right then rotate left = identity + flag set
AND 00h : NEG SUB A -3B, -11T Zero then negate = zero with subtract flags
LD A, 00h : SLA A XOR A -3B, -11T Load zero then shift = zero with logic flags
ADD A, A : RR A OR A -2B, -8T Double then halve = identity + flag set
SRL A : SLL A OR 01h -2B, -9T Shift right then undoc-shift-left = set bit 0
XOR 0FFh : NEG SUB 0FFh -2B, -8T Complement then negate
CPL : NEG SUB 0FFh -1B, -5T Complement then negate = increment with sub flags
XOR 0FFh : SBC A, 0FFh NEG -2B, -6T Complement + subtract-with-borrow = negate
SCF : RL A SLL A -1B, -4T Set carry then rotate-through-carry = undoc shift
SCF : ADC A, 00h ADD A, 01h -1B, -4T Set carry then add-with-carry zero = add 1
SET 0, L : DEC HL RES 0, L -1B, -6T Set bit then decrement = just clear the bit
RES 0, L : INC HL SET 0, L -1B, -6T Clear bit then increment = just set the bit
ADD A, 80h : OR A XOR 80h -1B, -4T Flip sign bit with correct flag behavior
ADD A, 00h : RL A SLA A -2B, -7T Clear carry then rotate = shift left
AND 0FFh : RR A SRL A -2B, -7T Clear carry via AND then rotate = shift right

Breakdown by savings

Bytes saved Count
3 bytes 1,212
2 bytes 580,937
1 byte 19,859

Note: it correctly rejects false optimizations like LD A, 0 → XOR A because XOR A modifies flags while LD A, 0 does not. Every rule is verified against all possible inputs.

What's included

406 Z80 opcodes covering all register-only operations:

  • 8-bit loads, arithmetic, logic (ADD, ADC, SUB, SBC, AND, OR, XOR, CP)
  • All CB-prefix shifts and rotates (RLC, RRC, RL, RR, SLA, SRA, SRL, SLL)
  • BIT/RES/SET n, r — all 168 register variants
  • 16-bit pair ops (INC/DEC rr, ADD HL, EX DE,HL, LD SP,HL)
  • 16-bit immediate loads (LD BC/DE/HL/SP, nn)
  • ED-prefix arithmetic (ADC HL, SBC HL)
  • Specials: DAA, CPL, NEG, SCF, CCF, NOP

Usage

go build -o z80opt ./cmd/z80opt

# Find optimal replacement for a specific sequence
./z80opt target "AND 0xFF"
# → Replacement: AND A (-1 bytes, -3 cycles)

./z80opt target "SLA A : RR A"
# → Replacement: OR A (-3 bytes, -12 cycles)

# Run full enumeration
./z80opt enumerate --max-target 2 --output rules.json -v

Roadmap

See docs/GenPlan.md for the full implementation plan.

Next phases:

  • STOKE stochastic search — MCMC random mutations for length 5-10+ sequences
  • CUDA brute force — GPU-accelerated length-3 complete search (~20 min on 2× RTX 4060 Ti)
  • Reordering optimizer — dependency DAG analysis to apply rules to real code with interleaved instructions
  • More opcodes — (HL) memory ops, PUSH/POP, IX/IY indexed addressing