
Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB #900

Open
Robby955 wants to merge 4 commits into openai:main from Robby955:record/dirichlet-phrase-0.1197

Conversation


Robby955 commented Mar 26, 2026

Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.11559 BPB

val_bpb: 0.11559 (3-seed mean, std 3.8e-6) | ~14.9 MB | 8xH100 SXM

What changed from previous version (0.11807)

Per-order concentration learning via Bayesian Online Concentration Learning (OBCL). Instead of a single c=5.0 for all n-gram orders, each order gets its own concentration learned from a posterior over a 50-point log-spaced grid [0.5, 50.0]:

| Orders | Learned c | Interpretation |
| --- | --- | --- |
| 2-3 (bigram, trigram) | ~50.0 | Low orders are noisy; lean heavily on the neural prior |
| 4 | 6.95 | Transitional |
| 5 | 2.98 | Moderate evidence |
| 6-8 | ~2.05 | More specific matches; trust the counts |
| 9-14 | ~1.86 | High-order matches are precise; minimal smoothing |

This 27x spread in optimal concentration across orders is explained by the exponential decrease in hash collision rate with increasing match length.
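The per-order concentrations above can be learned online with a simple grid posterior: keep a log-posterior over candidate `c` values and multiply in the Dirichlet predictive likelihood of each observed token. A minimal sketch of that idea (class and variable names are illustrative, not the PR's implementation):

```python
import numpy as np

class OBCL:
    """Grid-posterior sketch of Bayesian Online Concentration Learning.

    Keeps a log-posterior over a log-spaced grid of candidate
    concentrations; every observed token multiplies the posterior by
    its Dirichlet predictive likelihood (count + c*prior)/(total + c).
    """

    def __init__(self, lo=0.5, hi=50.0, n=50):
        self.grid = np.geomspace(lo, hi, n)   # 50-point log-spaced grid
        self.log_post = np.zeros(n)           # uniform prior over the grid

    def update(self, count, total, prior_p):
        # prior_p: base-measure probability of the observed token
        p = (count + self.grid * prior_p) / (total + self.grid)
        self.log_post += np.log(np.maximum(p, 1e-300))
        self.log_post -= self.log_post.max()  # numerical stability

    def c_hat(self):
        # posterior-mean concentration
        w = np.exp(self.log_post)
        return float((w * self.grid).sum() / w.sum())

# Counts that contradict the prior drive c toward the small end
# (trust the counts) ...
trust_counts = OBCL()
for _ in range(200):
    trust_counts.update(count=5.0, total=10.0, prior_p=0.001)

# ... while counts the prior predicts well drive c toward the large end.
trust_prior = OBCL()
for _ in range(200):
    trust_prior.update(count=0.0, total=10.0, prior_p=0.5)
```

This mirrors the qualitative pattern in the table: noisy low orders end up near the top of the grid, precise high orders near the bottom.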

3-seed validation

| Seed | BPB | Artifact (bytes) | Eval time |
| --- | --- | --- | --- |
| 1337 | 0.11559293 | 14,926,561 | 448s |
| 2024 | 0.11559195 | 14,869,557 | 441s |
| 2025 | 0.11558566 | 14,814,921 | 429s |
| Mean | 0.11559 (std 3.8e-6) | | |

Approach

Same Dirichlet-Multinomial formula at every level:

p(token) = (count + c_k * prior) / (total + c_k)
  • Level 1: 15-gram recursive backoff with per-order concentrations (OBCL-learned)
  • Level 2: Phrase suffix matching (probes at 20, 16 tokens) with c_phrase=1.0
  • Base measure: Neural LM (EBLS: 3 shared blocks x 3 loops + 2 unique = 11 layers)
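The levels compose by feeding each level's output in as the next level's prior. A self-contained sketch with toy counts (the `c` values are taken from the tables in this PR; everything else, including the random stand-in for the neural softmax, is illustrative):

```python
import numpy as np

def dirichlet_predictive(counts, c, prior):
    """(count + c*prior) / (total + c), vectorized over the vocab."""
    return (counts + c * prior) / (counts.sum() + c)

V = 1024
rng = np.random.default_rng(0)
neural = rng.dirichlet(np.ones(V))               # stand-in for the neural softmax
ngram_counts = rng.integers(0, 5, V).astype(float)
phrase_counts = rng.integers(0, 2, V).astype(float)

# Level 1: n-gram counts smoothed toward the neural base measure
# (c=2.05 is one of the OBCL-learned per-order values).
level1 = dirichlet_predictive(ngram_counts, c=2.05, prior=neural)

# Level 2: phrase counts smoothed toward the level-1 posterior.
level2 = dirichlet_predictive(phrase_counts, c=1.0, prior=level1)

# With exact counts, each level returns a proper distribution.
```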

Key ablations

| Config | BPB | Delta |
| --- | --- | --- |
| Neural only (EBLS + GPTQ) | 1.1745 | baseline |
| + 15-gram Dirichlet backoff (flat c=5.0) | 0.2292 | -0.945 |
| + phrase Dirichlet (c=1.0) | 0.1181 | -0.111 |
| + per-order OBCL concentrations | 0.1156 | -0.0025 |
| Phrase with linear interp instead of Dirichlet | 1.0686 | 8.9x worse |

All ablation deltas exceed 200 sigma (3-seed std 3.8e-6).
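For intuition on the last ablation row: the Dirichlet predictive is itself a linear interpolation, but with a weight on the counts, total/(total + c), that grows with evidence, whereas a fixed-lambda interpolation cannot adapt. A toy comparison (illustrative numbers, not the PR's code):

```python
import numpy as np

def dirichlet_mix(counts, c, prior):
    return (counts + c * prior) / (counts.sum() + c)

def linear_mix(counts, lam, prior):
    total = counts.sum()
    emp = counts / total if total > 0 else prior
    return lam * emp + (1.0 - lam) * prior

V = 8
prior = np.ones(V) / V
one_obs = np.array([1.0] + [0.0] * (V - 1))   # 1 observation of token 0
many_obs = one_obs * 100                       # 100 observations of token 0

# Dirichlet: confidence in the counts grows with evidence,
# from 0.5625 (one observation) to ~0.991 (a hundred).
p1 = dirichlet_mix(one_obs, 1.0, prior)[0]
p100 = dirichlet_mix(many_obs, 1.0, prior)[0]

# Fixed-lambda interpolation assigns 0.9125 in both cases.
q1 = linear_mix(one_obs, 0.9, prior)[0]
q100 = linear_mix(many_obs, 0.9, prior)[0]
```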

Compliance

  • Training: 560s on 8xH100 (within 600s)
  • Eval: 448s worst case (within 600s)
  • Artifact: 14,926,561 bytes worst case (within 16,000,000)
  • Single-pass, strictly backward-looking, no training data at eval
  • No oracle/min(NLL) selection

Legality

N-gram caching ruled "directionally legal" by @valerio-oai (Issue #677). Single-pass, score-first, causal. We also maintain a separate neural-only submission (PR #734, 1.1198 BPB).

See README.md for full details, concentration landscapes, compression theory connection, and credits.

3-seed validated (std 0.000003):
  s1337: 0.11967683 (435s eval, 14.91MB)
  s2024: 0.11968156 (455s eval, 14.84MB)
  s2025: 0.11967545 (441s eval, 14.80MB)

Dirichlet-Multinomial posterior predictive applied at two levels:
- N-gram backoff (orders 2-15, c=5.0)
- Phrase suffix matching (probes=[20,16], c=2.0)

Ablation: removing Dirichlet from phrase mixing degrades BPB 8.9x.
….5e-6)

- Optimized phrase-level concentration from 2.0 to 1.0 via sweep
- Added phrase concentration landscape table (convex, min at 1.0)
- Expanded compression theory section (CTW connection, match-length scaling, OBCL decomposition)
- Updated 3-seed results: s1337=0.11807, s2024=0.11807, s2025=0.11806
- Longer matches need less smoothing: c* decreases from ~50 (bigrams) to 1.0 (phrases)
Log files now match claimed BPB (0.11807). All numbers are exact from
verified pod runs, not approximations.
Robby955 changed the title Record: Two-Level Dirichlet Posterior Mixing — 0.11968 BPB → Record: Two-Level Dirichlet Posterior Mixing — 0.11807 BPB on Mar 27, 2026
Per-order concentrations learned via Bayesian Online Concentration Learning
range from 50.0 (bigrams) to 1.86 (14-grams). Improves from 0.11807 to
0.11559 (-0.00248 BPB). 3-seed mean 0.11559, std 3.8e-6.
Robby955 changed the title Record: Two-Level Dirichlet Posterior Mixing — 0.11807 BPB → Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB on Mar 27, 2026

Eppie commented Mar 27, 2026

Moving discussion here so as not to clutter the other thread. It's late where I live and I'm a bit tired, so I had Claude write my response for me:


Your theoretical argument is correct — the Dirichlet-Multinomial posterior predictive produces a valid distribution when Σ_y count_y = N. The issue is that with hash tables, this identity doesn't hold, and the deviation is not negligible.

I implemented your exact formula (orders 2-15, per-order concentrations [50.0, 50.0, 6.95, 2.98, 2.05, 2.05, 2.05, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86], phrase probes at 20 and 16, 4M n-gram buckets, 1M phrase buckets, phrase concentration 1.0) and computed p(t) for all 1024 vocab tokens at sample positions as the cache fills:

| chunk | tokens_seen | avg_p_sum | min_p_sum | max_p_sum |
| --- | --- | --- | --- | --- |
| 0 | 0 | 1.0000 | 1.0000 | 1.0000 |
| 1 | 131k | 16.8850 | 1.3289 | 67.3684 |
| 2 | 262k | 80.8645 | 2.2301 | 140.4559 |
| 5 | 655k | 271.2090 | 55.8939 | 380.3625 |
| 9 | 1.18M | 438.9720 | 215.0402 | 546.1090 |
| 25 | 3.28M | 767.6675 | 555.1482 | 875.6750 |
| 49 | 6.42M | 818.1985 | 583.1743 | 967.0769 |

These should all be exactly 1.0. The choice of prior (uniform vs neural softmax) doesn't affect whether the sum equals 1 — that depends entirely on whether Σ_y count_y = N, which is what the hash collisions break.

With 4M buckets, each of the 1024 lookups full_table[hash(ctx, t)] independently picks up counts from unrelated n-grams that collided into the same bucket. Most of these don't exceed ctx_table[hash(ctx)] so the min() clipping doesn't help, and they sum to far more than N.

You can verify this yourself — at any position after warmup, run your full hierarchical Dirichlet update (all orders + phrase) for all 1024 vocab tokens instead of just the correct one, and print sm_p.sum().
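A toy version of that check, showing how collided hash buckets inflate the numerator while the exact context total stays small (all names and sizes here are illustrative, not the PR's code):

```python
import numpy as np

V, BUCKETS, c = 64, 97, 2.0
rng = np.random.default_rng(1)

full_table = np.zeros(BUCKETS)   # hashed (context, token) counts, with collisions
ctx_total = {}                   # exact number of times each context was seen

def bucket(ctx, tok):
    return hash((ctx, tok)) % BUCKETS

# Stream observations from many contexts into the shared table.
for _ in range(5000):
    ctx, tok = int(rng.integers(100)), int(rng.integers(V))
    full_table[bucket(ctx, tok)] += 1
    ctx_total[ctx] = ctx_total.get(ctx, 0) + 1

# Dirichlet predictive for one context over the whole vocab: each
# numerator picks up unrelated collided counts, but the denominator
# uses the true context total, so the "distribution" sums far above 1.
query = 42
N = ctx_total[query]             # true Σ_y count_y for this context
counts = np.array([full_table[bucket(query, t)] for t in range(V)])
prior = np.ones(V) / V
p = (counts + c * prior) / (N + c)
print(p.sum())                   # >> 1.0
```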


Me again: I just saw your arithmetic encoder. It uses exact counts via defaultdict(), which is different from what is implemented in this PR.
