29 Mar 07:27

oisee

97f04d9

v1.3.0 — Paper v2.2, IX/IY halves, Multi-Platform DSL, Image Search Latest

Latest

v1.3.0 Release

Paper: "Register Allocation as a Solved Game" v2.2

regalloc_paper.pdf (A4) — 1600+ lines, 15 chapters, 12 Mermaid diagrams
regalloc_paper_a5.pdf (A5) — for tablets/phones
regalloc_paper.epub — with rendered diagrams (700KB)
NEW: Clear data breakdown table (83.6M → 37.6M → 78MB)
NEW: Multi-platform GPU section (ISA DSL → CUDA/Vulkan/Metal/OpenCL)
NEW: Corpus validation (820 functions, 246 signatures)

IX/IY Half Registers Declared Production-Safe

IXH, IXL, IYH, IYL work on all known Z80 silicon. 11 registers for allocation.
H↔IXH transfer via EX DE,HL trick (16T). IX/IY = bridges between main/shadow banks.

Corpus Bias Discovery

VIR corpus (820 funcs): 79% ≤6v — biased toward small functions
FatFS analysis: 16 functions, ALL >10v (est. 10-35 vregs)
Greedy partition for v10-v32 in <30ms each

pRNG Image Search Gallery

CNN-guided search: 7 targets × top-10 seeds (160 images)
Dithered target with VGG perceptual loss (Introspec's BB approach)
6 synthetic targets: skull, heart, star, smiley, tree, fish
ISA DSL: single source → 4 GPU backends, 3 vendors

Also included

chronicle.epub (862 lines, 11 chapters, days 1-3)
z80_compiler_never_guesses.epub/pdf (original article)

Full changelog: v1.2.0...v1.3.0

Assets 8

28 Mar 23:08

oisee

v1.2.0

5e1cf35

v1.2.0 — Day 3: Enriched Tables, O(1) Regalloc, Branchless Library

Day 3 Birthday Marathon Release

New: Enriched Register Allocation Tables (37.6M shapes)

data/enriched_{4v,5v,6v_dense}.enr.zst — 78MB compressed
15 operation-aware cost metrics per shape
43% of "feasible" shapes lack accumulator (hidden ALU infeasibility!)
Smart CALL save: 17T avg (was 34T) = 50% reduction
Width-aware: u8/u16/u32 scoring

New: O(1) Regalloc Architecture

Signature: (interference_shape, operation_bag) → hash lookup
91% of functions resolve in O(1), 8% GPU partition, 1% Z3
Validated on 820-function production compiler corpus
246 unique signatures (prediction <500 confirmed)

New: GPU Partition Optimizer

Brute-force optimal graph splits for 7-20v functions
All 172 corpus functions (7v+) partitioned optimally
14v in 0.7s, 19v in 7min, 20v in 30min

New: Paper v2 — "Register Allocation as a Solved Game"

regalloc_paper.pdf — 1574 lines, 15 chapters, 12 Mermaid diagrams
Updated with corpus validation and partition results
regalloc_paper.epub — mobile-friendly version

New: Branchless Library (exhaustive verified)

ABS (6i,24T), MIN/MAX (8i,32T), CMOV (6i,24T)
div3 EXACT: A×171>>9 (zero error, no lookup table!)
Z flag WRITE-ONLY proof (CY = only viable branchless bool)
SBC A,A mask trick = foundation of all branchless Z80

New: Focused Search + Vulkan

gray_decode EXACT (13 ops, <1s on Vulkan RX 580)
sqr_hi ±29, cbrt ±16, sin_q1 ±68
Per-target minimal op pools: fewer ops → deeper search

Updated: Chronicle

chronicle.epub — 862 lines, 11 chapters, days 1-3

Also included (from v1.0.0/v1.1.0)

z80_compiler_never_guesses.epub/pdf/html — original article

Full changelog: v1.1.0...v1.2.0

Assets 9

27 Mar 22:53

oisee

v1.1.0

57e738f

v1.1.0 — Day 2: FP16, BCD, Multi-Target GPU Search & Chronicle

Day 2 Birthday Marathon Release

New: Chronicle (EPUB)

chronicle.epub — "Книга о том, как мы за два дня доказали всю арифметику Z80" (ru)
770 lines, 10 chapters — narrative of the entire birthday marathon

New: FP16/BCD Library

pkg/fp16/ — FP16/FP24 floating point + BCD arithmetic for Z80
BCD H-flag fix: unlocked BCD×2 = ADD A,A; DAA (2 ops, 8T)

New: Multi-Target GPU Search

One GPU pass tests 15 non-linear functions simultaneously
13-op trimmed pool (12× faster than 16-op)
3-way GPU split across cluster
Targets: sin, cos, sqrt, log2, gamma, smoothstep, triangle...

New: 21-Op Universal Kernel

Minimum instruction set for ALL optimal Z80 arithmetic
Only 2.7% of ISA generates every known optimal sequence

New: Analytics

Cross-table analysis of 755 optimal sequences
"What 755 Optimal Sequences Teach Us About Computation"

Data Enrichment

mul16 table: clobber_mask added (all 254 preserve A, all DE-safe)
16-bit arithmetic idioms + BCD idioms (GPU-proven)

Also included

z80_compiler_never_guesses.epub/pdf/html — original article from v1.0.0

Full changelog: v1.0.0...v1.1.0

Assets 8

26 Mar 21:24

oisee

v1.0.0

3ebce66

v1.0.0 — Birthday Release 🎂

z80-optimizer v1.0.0 — Birthday Release 🎂

GPU Brute-Force Superoptimizer for Z80

The first Z80 compiler toolchain that never guesses — every optimization is provably optimal, verified by exhaustive GPU search.

Highlights

372 Provably Optimal Arithmetic Sequences

254/254 constant multiplies (u8, avg 8× faster than shift-and-add)
254/254 16-bit multiplies (u8×K→u16, complete in 30 seconds)
118/120 constant divisions (u8/K, avg 2.5× faster than general loop)
15 branchless idioms (abs, bool, sign, not, lsb, ...)

83.6 Million Register Allocations

Complete exhaustive table for ≤6 virtual registers (32MB compressed)
Feasibility phase transition: 96% feasible at 2v → 0.9% at 6v
Five-level pipeline: O(1) table → composition → GPU → backtracking → Z3

739K Peephole Rules

Every length-2 instruction pair tested against all possible inputs
SLA A : RR A → OR A (save 12T, 3 bytes) and 738,999 more

4-GPU, 3-Vendor Cluster

NVIDIA RTX 4060 Ti × 2 (CUDA)
NVIDIA RTX 2070 (CUDA)
AMD Radeon RX 580 (OpenCL + Vulkan via Mesa)
Apple M2 (Metal)
All results cross-verified across platforms

Multi-Backend GPU DSL

pkg/gpugen/: one ISA definition → CUDA / Metal / OpenCL / Vulkan
Z80 and 6502 ISA definitions included
--mask / --disable for op pool pruning

Key Discoveries

Discovery	Impact
Pool reduction 21→3 ops	13,600× speedup for u16 multiply
Guided brute-force	Abstract chains → 6-op GPU = div3 in 11 sec
Prefix sharing	51-86% code compression for multiply library
Branchless ABS(A)	6 insts, 24T — no branch penalty
Sign-extend in 3 insts	ADC overflow → SBC mask (12T)
NEG HL via EX DE,HL	3 insts, 23T (improved from 5/28T)
Feasibility cliff at 6v	99.1% of constraint shapes are impossible
Treewidth analysis	Random≠real: compiler graphs are denser

Data Files (data/)

File	Entries	Size
exhaustive_4v.bin.zst	156K regalloc shapes	64KB
exhaustive_5v.bin.zst	17.4M regalloc shapes	8.5MB
exhaustive_6v_dense.bin.zst	66.1M regalloc shapes	22.5MB
mulopt8_clobber.json	164 multiply sequences	with B-clobber info
mulopt16_complete.json	254 u16 multiply sequences	3-op basis
div8_optimal.json	118 division sequences	guided brute-force
peephole_top500.json	top 500 peephole rules	by cycle savings
chains_mul8_mod256.json	254 abstract multiply chains	ISA-independent
chains_div8.json	86 abstract division chains	compose from mul
mul8_library.asm	Z80 assembly library	prefix-shared

Go Packages

import "github.com/oisee/z80-optimizer/pkg/mulopt"   // Emit8, Emit16, Lookup
import "github.com/oisee/z80-optimizer/pkg/regalloc"  // LoadBinary, IndexOf
import "github.com/oisee/z80-optimizer/pkg/peephole"  // Lookup, LoadRules
import "github.com/oisee/z80-optimizer/pkg/gpugen"    // ISA DSL → 4 backends

Documentation

docs/glossary.md — complete term reference
docs/insights.md — experiment log with discoveries
docs/paper_seed_superopt.md — paper/book seed (8 sections)
docs/sprint_129.md — sprint plan
docs/book_outline.md — book outline (19 chapters)
docs/gpugen.md — multi-backend DSL guide
docs/m2_onboarding.md — new machine setup
data/README.md — binary format spec + Python/Go readers

The Birthday Team 🎂

Built in one marathon session across 5 machines, 4 APIs, 3 GPU vendors, and multiple Claude sessions collaborating via ddll.

Packed Multiply/Division Library

The 254 multiply and 118 division sequences share instruction prefixes extensively. The packed library uses multiple entry points via labels — each constant starts at its label and falls through shared code to a common RET:

mul104:                    ; entry for ×104
mul52:  ADD A,A            ; entry for ×52 (×104 = ×52 × 2)
mul26:  ADD A,A            ; entry for ×26
mul24:  ADD A,B            ; entry for ×24
mul12:  ADD A,A            ; entry for ×12
mul6:   LD B,A             ; entry for ×6
        ADD A,B
        ADD A,B
mul2:   RLA                ; entry for ×2
        RET                ; shared return — 7 constants, 9 instructions!

mul8 library: 594 bytes for 164 constants (51% compression)
mul16 library: ~500 bytes for 254 constants (86% compression)
Combined: ~1.1KB for ALL optimal multiplies = 2.2% of ZX Spectrum 48KB

For division: same principle — reciprocal multiply phases overlap between divisors sharing the same magic constant M.

Runtime dispatch: page-aligned jump table (256 bytes) + packed code.
LD H, page / LD L, K / JP (HL) = single-instruction dispatch to optimal sequence.

Assets 6

28 Feb 15:24

oisee

v0.1.0

0b2de25

v0.1.0 — 602,008 Z80 optimizations found

What is this?

A brute-force superoptimizer for the Zilog Z80 processor.

Given any sequence of Z80 instructions, it exhaustively searches for a shorter equivalent — one that produces identical register and flag state for every possible input. No heuristics, no pattern databases, just brute force with smart pruning.

First run results

602,008 provably correct optimizations found from 8.4M length-2 target sequences.

34.7 billion comparisons checked
3 hours 16 minutes on Apple M2
83 unique transformation patterns discovered
Results: rules.json.gz (3.4MB compressed, 102MB uncompressed)

Highlights

Original	Replacement	Savings	Insight
`SLA A : RR A`	`OR A`	-3B, -12T	Shift left then rotate right = identity + flag set
`SRL A : RL A`	`OR A`	-3B, -12T	Shift right then rotate left = identity + flag set
`AND 00h : NEG`	`SUB A`	-3B, -11T	Zero then negate = zero with subtract flags
`LD A, 00h : SLA A`	`XOR A`	-3B, -11T	Load zero then shift = zero with logic flags
`ADD A, A : RR A`	`OR A`	-2B, -8T	Double then halve = identity + flag set
`SRL A : SLL A`	`OR 01h`	-2B, -9T	Shift right then undoc-shift-left = set bit 0
`XOR 0FFh : NEG`	`SUB 0FFh`	-2B, -8T	Complement then negate
`CPL : NEG`	`SUB 0FFh`	-1B, -5T	Complement then negate = increment with sub flags
`XOR 0FFh : SBC A, 0FFh`	`NEG`	-2B, -6T	Complement + subtract-with-borrow = negate
`SCF : RL A`	`SLL A`	-1B, -4T	Set carry then rotate-through-carry = undoc shift
`SCF : ADC A, 00h`	`ADD A, 01h`	-1B, -4T	Set carry then add-with-carry zero = add 1
`SET 0, L : DEC HL`	`RES 0, L`	-1B, -6T	Set bit then decrement = just clear the bit
`RES 0, L : INC HL`	`SET 0, L`	-1B, -6T	Clear bit then increment = just set the bit
`ADD A, 80h : OR A`	`XOR 80h`	-1B, -4T	Flip sign bit with correct flag behavior
`ADD A, 00h : RL A`	`SLA A`	-2B, -7T	Clear carry then rotate = shift left
`AND 0FFh : RR A`	`SRL A`	-2B, -7T	Clear carry via AND then rotate = shift right

Breakdown by savings

Bytes saved	Count
3 bytes	1,212
2 bytes	580,937
1 byte	19,859

Note: it correctly rejects false optimizations like LD A, 0 → XOR A because XOR A modifies flags while LD A, 0 does not. Every rule is verified against all possible inputs.

What's included

406 Z80 opcodes covering all register-only operations:

8-bit loads, arithmetic, logic (ADD, ADC, SUB, SBC, AND, OR, XOR, CP)
All CB-prefix shifts and rotates (RLC, RRC, RL, RR, SLA, SRA, SRL, SLL)
BIT/RES/SET n, r — all 168 register variants
16-bit pair ops (INC/DEC rr, ADD HL, EX DE,HL, LD SP,HL)
16-bit immediate loads (LD BC/DE/HL/SP, nn)
ED-prefix arithmetic (ADC HL, SBC HL)
Specials: DAA, CPL, NEG, SCF, CCF, NOP

Usage

go build -o z80opt ./cmd/z80opt

# Find optimal replacement for a specific sequence
./z80opt target "AND 0xFF"
# → Replacement: AND A (-1 bytes, -3 cycles)

./z80opt target "SLA A : RR A"
# → Replacement: OR A (-3 bytes, -12 cycles)

# Run full enumeration
./z80opt enumerate --max-target 2 --output rules.json -v

Roadmap

See docs/GenPlan.md for the full implementation plan.

Next phases:

STOKE stochastic search — MCMC random mutations for length 5-10+ sequences
CUDA brute force — GPU-accelerated length-3 complete search (~20 min on 2× RTX 4060 Ti)
Reordering optimizer — dependency DAG analysis to apply rules to real code with interleaved instructions
More opcodes — (HL) memory ops, PUSH/POP, IX/IY indexed addressing

Assets 3

Releases: oisee/z80-optimizer

v1.3.0 — Paper v2.2, IX/IY halves, Multi-Platform DSL, Image Search

v1.3.0 Release

Paper: "Register Allocation as a Solved Game" v2.2

IX/IY Half Registers Declared Production-Safe

Corpus Bias Discovery

pRNG Image Search Gallery

Also included

Uh oh!

v1.2.0 — Day 3: Enriched Tables, O(1) Regalloc, Branchless Library

Day 3 Birthday Marathon Release

New: Enriched Register Allocation Tables (37.6M shapes)

New: O(1) Regalloc Architecture

New: GPU Partition Optimizer

New: Paper v2 — "Register Allocation as a Solved Game"

New: Branchless Library (exhaustive verified)

New: Focused Search + Vulkan

Updated: Chronicle

Also included (from v1.0.0/v1.1.0)

Uh oh!

v1.1.0 — Day 2: FP16, BCD, Multi-Target GPU Search & Chronicle

Day 2 Birthday Marathon Release

New: Chronicle (EPUB)

New: FP16/BCD Library

New: Multi-Target GPU Search

New: 21-Op Universal Kernel

New: Analytics

Data Enrichment

Also included

Uh oh!

v1.0.0 — Birthday Release 🎂

z80-optimizer v1.0.0 — Birthday Release 🎂

GPU Brute-Force Superoptimizer for Z80

Highlights

Key Discoveries

Data Files (data/)

Go Packages

Documentation

The Birthday Team 🎂

Packed Multiply/Division Library

Uh oh!

v0.1.0 — 602,008 Z80 optimizations found

What is this?

First run results

Highlights

Breakdown by savings

What's included

Usage

Roadmap

Uh oh!