
Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB #900

Open
Robby955 wants to merge 4 commits into openai:main from Robby955:record/dirichlet-phrase-0.1197

Conversation


Robby955 commented Mar 26, 2026

Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.11559 BPB

val_bpb: 0.11559 (3-seed mean, std 3.8e-6) | ~14.9 MB | 8xH100 SXM

What changed from previous version (0.11807)

Per-order concentration learning via Bayesian Online Concentration Learning (OBCL). Instead of a single c=5.0 for all n-gram orders, each order gets its own concentration learned from a posterior over a 50-point log-spaced grid [0.5, 50.0]:

| Orders | Learned c | Interpretation |
| --- | --- | --- |
| 2-3 (bigram, trigram) | ~50.0 | Low orders are noisy; lean heavily on the neural prior |
| 4 | 6.95 | Transitional |
| 5 | 2.98 | Moderate evidence |
| 6-8 | ~2.05 | More specific matches; trust the counts |
| 9-14 | ~1.86 | High-order matches are precise; minimal smoothing |

This 27x spread in optimal concentration across orders is explained by the exponential decrease in hash collision rate with increasing match length.
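The per-order concentrations above can be learned online with a simple grid posterior: keep a log-posterior over candidate `c` values and multiply in the Dirichlet predictive likelihood of each observed token. A minimal sketch of that idea (class and variable names are illustrative, not the PR's implementation):

```python
import numpy as np

class OBCL:
    """Grid-posterior sketch of Bayesian Online Concentration Learning.

    Keeps a log-posterior over a log-spaced grid of candidate
    concentrations; every observed token multiplies the posterior by
    its Dirichlet predictive likelihood (count + c*prior)/(total + c).
    """

    def __init__(self, lo=0.5, hi=50.0, n=50):
        self.grid = np.geomspace(lo, hi, n)   # 50-point log-spaced grid
        self.log_post = np.zeros(n)           # uniform prior over the grid

    def update(self, count, total, prior_p):
        # prior_p: base-measure probability of the observed token
        p = (count + self.grid * prior_p) / (total + self.grid)
        self.log_post += np.log(np.maximum(p, 1e-300))
        self.log_post -= self.log_post.max()  # numerical stability

    def c_hat(self):
        # posterior-mean concentration
        w = np.exp(self.log_post)
        return float((w * self.grid).sum() / w.sum())

# Counts that contradict the prior drive c toward the small end
# (trust the counts) ...
trust_counts = OBCL()
for _ in range(200):
    trust_counts.update(count=5.0, total=10.0, prior_p=0.001)

# ... while counts the prior predicts well drive c toward the large end.
trust_prior = OBCL()
for _ in range(200):
    trust_prior.update(count=0.0, total=10.0, prior_p=0.5)
```

This mirrors the qualitative pattern in the table: noisy low orders end up near the top of the grid, precise high orders near the bottom.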

3-seed validation

| Seed | BPB | Artifact (bytes) | Eval time |
| --- | --- | --- | --- |
| 1337 | 0.11559293 | 14,926,561 | 448s |
| 2024 | 0.11559195 | 14,869,557 | 441s |
| 2025 | 0.11558566 | 14,814,921 | 429s |
| Mean | 0.11559 (std 3.8e-6) | | |

Approach

Same Dirichlet-Multinomial formula at every level:

p(token) = (count + c_k * prior) / (total + c_k)
  • Level 1: 15-gram recursive backoff with per-order concentrations (OBCL-learned)
  • Level 2: Phrase suffix matching (probes at 20, 16 tokens) with c_phrase=1.0
  • Base measure: Neural LM (EBLS: 3 shared blocks x 3 loops + 2 unique = 11 layers)
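The levels compose by feeding each level's output in as the next level's prior. A self-contained sketch with toy counts (the `c` values are taken from the tables in this PR; everything else, including the random stand-in for the neural softmax, is illustrative):

```python
import numpy as np

def dirichlet_predictive(counts, c, prior):
    """(count + c*prior) / (total + c), vectorized over the vocab."""
    return (counts + c * prior) / (counts.sum() + c)

V = 1024
rng = np.random.default_rng(0)
neural = rng.dirichlet(np.ones(V))               # stand-in for the neural softmax
ngram_counts = rng.integers(0, 5, V).astype(float)
phrase_counts = rng.integers(0, 2, V).astype(float)

# Level 1: n-gram counts smoothed toward the neural base measure
# (c=2.05 is one of the OBCL-learned per-order values).
level1 = dirichlet_predictive(ngram_counts, c=2.05, prior=neural)

# Level 2: phrase counts smoothed toward the level-1 posterior.
level2 = dirichlet_predictive(phrase_counts, c=1.0, prior=level1)

# With exact counts, each level returns a proper distribution.
```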

Key ablations

| Config | BPB | Delta |
| --- | --- | --- |
| Neural only (EBLS + GPTQ) | 1.1745 | baseline |
| + 15-gram Dirichlet backoff (flat c=5.0) | 0.2292 | -0.945 |
| + phrase Dirichlet (c=1.0) | 0.1181 | -0.111 |
| + per-order OBCL concentrations | 0.1156 | -0.0025 |
| Phrase with linear interp instead of Dirichlet | 1.0686 | 8.9x worse |

All ablation deltas exceed 200 sigma (3-seed std 3.8e-6).
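For intuition on the last ablation row: the Dirichlet predictive is itself a linear interpolation, but with a weight on the counts, total/(total + c), that grows with evidence, whereas a fixed-lambda interpolation cannot adapt. A toy comparison (illustrative numbers, not the PR's code):

```python
import numpy as np

def dirichlet_mix(counts, c, prior):
    return (counts + c * prior) / (counts.sum() + c)

def linear_mix(counts, lam, prior):
    total = counts.sum()
    emp = counts / total if total > 0 else prior
    return lam * emp + (1.0 - lam) * prior

V = 8
prior = np.ones(V) / V
one_obs = np.array([1.0] + [0.0] * (V - 1))   # 1 observation of token 0
many_obs = one_obs * 100                       # 100 observations of token 0

# Dirichlet: confidence in the counts grows with evidence,
# from 0.5625 (one observation) to ~0.991 (a hundred).
p1 = dirichlet_mix(one_obs, 1.0, prior)[0]
p100 = dirichlet_mix(many_obs, 1.0, prior)[0]

# Fixed-lambda interpolation assigns 0.9125 in both cases.
q1 = linear_mix(one_obs, 0.9, prior)[0]
q100 = linear_mix(many_obs, 0.9, prior)[0]
```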

Compliance

  • Training: 560s on 8xH100 (within 600s)
  • Eval: 448s worst case (within 600s)
  • Artifact: 14,926,561 bytes worst case (within 16,000,000)
  • Single-pass, strictly backward-looking, no training data at eval
  • No oracle/min(NLL) selection

Legality

N-gram caching ruled "directionally legal" by @valerio-oai (Issue #677). Single-pass, score-first, causal. We also maintain a separate neural-only submission (PR #734, 1.1198 BPB).

See README.md for full details, concentration landscapes, compression theory connection, and credits.

3-seed validated (std 0.000003):
  s1337: 0.11967683 (435s eval, 14.91MB)
  s2024: 0.11968156 (455s eval, 14.84MB)
  s2025: 0.11967545 (441s eval, 14.80MB)

Dirichlet-Multinomial posterior predictive applied at two levels:
- N-gram backoff (orders 2-15, c=5.0)
- Phrase suffix matching (probes=[20,16], c=2.0)

Ablation: removing Dirichlet from phrase mixing degrades BPB 8.9x.
….5e-6)

- Optimized phrase-level concentration from 2.0 to 1.0 via sweep
- Added phrase concentration landscape table (convex, min at 1.0)
- Expanded compression theory section (CTW connection, match-length scaling, OBCL decomposition)
- Updated 3-seed results: s1337=0.11807, s2024=0.11807, s2025=0.11806
- Longer matches need less smoothing: c* decreases from ~50 (bigrams) to 1.0 (phrases)
Log files now match claimed BPB (0.11807). All numbers are exact from
verified pod runs, not approximations.
Robby955 changed the title Record: Two-Level Dirichlet Posterior Mixing — 0.11968 BPB → Record: Two-Level Dirichlet Posterior Mixing — 0.11807 BPB on Mar 27, 2026
Per-order concentrations learned via Bayesian Online Concentration Learning
range from 50.0 (bigrams) to 1.86 (14-grams). Improves from 0.11807 to
0.11559 (-0.00248 BPB). 3-seed mean 0.11559, std 3.8e-6.
Robby955 changed the title Record: Two-Level Dirichlet Posterior Mixing — 0.11807 BPB → Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB on Mar 27, 2026

Eppie commented Mar 27, 2026

Moving discussion here so as not to clutter the other thread. It's late where I live and I'm a bit tired, so I had Claude write my response for me:


Your theoretical argument is correct — the Dirichlet-Multinomial posterior predictive produces a valid distribution when Σ_y count_y = N. The issue is that with hash tables, this identity doesn't hold, and the deviation is not negligible.

I implemented your exact formula (orders 2-15, per-order concentrations [50.0, 50.0, 6.95, 2.98, 2.05, 2.05, 2.05, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86], phrase probes at 20 and 16, 4M n-gram buckets, 1M phrase buckets, phrase concentration 1.0) and computed p(t) for all 1024 vocab tokens at sample positions as the cache fills:

| chunk | tokens_seen | avg_p_sum | min_p_sum | max_p_sum |
| --- | --- | --- | --- | --- |
| 0 | 0 | 1.0000 | 1.0000 | 1.0000 |
| 1 | 131k | 16.8850 | 1.3289 | 67.3684 |
| 2 | 262k | 80.8645 | 2.2301 | 140.4559 |
| 5 | 655k | 271.2090 | 55.8939 | 380.3625 |
| 9 | 1.18M | 438.9720 | 215.0402 | 546.1090 |
| 25 | 3.28M | 767.6675 | 555.1482 | 875.6750 |
| 49 | 6.42M | 818.1985 | 583.1743 | 967.0769 |

These should all be exactly 1.0. The choice of prior (uniform vs neural softmax) doesn't affect whether the sum equals 1 — that depends entirely on whether Σ_y count_y = N, which is what the hash collisions break.

With 4M buckets, each of the 1024 lookups full_table[hash(ctx, t)] independently picks up counts from unrelated n-grams that collided into the same bucket. Most of these don't exceed ctx_table[hash(ctx)] so the min() clipping doesn't help, and they sum to far more than N.

You can verify this yourself — at any position after warmup, run your full hierarchical Dirichlet update (all orders + phrase) for all 1024 vocab tokens instead of just the correct one, and print sm_p.sum().
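A toy version of that check, showing how collided hash buckets inflate the numerator while the exact context total stays small (all names and sizes here are illustrative, not the PR's code):

```python
import numpy as np

V, BUCKETS, c = 64, 97, 2.0
rng = np.random.default_rng(1)

full_table = np.zeros(BUCKETS)   # hashed (context, token) counts, with collisions
ctx_total = {}                   # exact number of times each context was seen

def bucket(ctx, tok):
    return hash((ctx, tok)) % BUCKETS

# Stream observations from many contexts into the shared table.
for _ in range(5000):
    ctx, tok = int(rng.integers(100)), int(rng.integers(V))
    full_table[bucket(ctx, tok)] += 1
    ctx_total[ctx] = ctx_total.get(ctx, 0) + 1

# Dirichlet predictive for one context over the whole vocab: each
# numerator picks up unrelated collided counts, but the denominator
# uses the true context total, so the "distribution" sums far above 1.
query = 42
N = ctx_total[query]             # true Σ_y count_y for this context
counts = np.array([full_table[bucket(query, t)] for t in range(V)])
prior = np.ones(V) / V
p = (counts + c * prior) / (N + c)
print(p.sum())                   # >> 1.0
```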


Me again: I just saw your arithmetic encoder. It uses exact counts via defaultdict(), which is different from what is implemented in this PR.
