For the mathematical formalization, we have a companion guide: Grounding Empirica in Epistemic Mathematics: The Brier Score. It covers why the Brier score (a strictly proper scoring rule) guarantees that honest reporting is the optimal strategy, the Bayesian update formula, the calibration/resolution decomposition, and earned agency as a function of Brier performance. It is backed by 1.19M evidence observations from 234 sessions. The main paper is being prepared for Zenodo.
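The strictly-proper property is easy to demonstrate numerically. Here is a minimal sketch (mine, not taken from the guide) showing that reporting your true belief minimizes the expected Brier score:

```python
# Sketch: the Brier score for a single probabilistic claim, and why
# honest reporting is optimal under a strictly proper scoring rule.

def brier(forecast: float, outcome: int) -> float:
    """Squared error between a reported probability and the 0/1 outcome."""
    return (forecast - outcome) ** 2

def expected_brier(reported: float, true_belief: float) -> float:
    """Expected score if the event truly occurs with probability true_belief."""
    return true_belief * brier(reported, 1) + (1 - true_belief) * brier(reported, 0)

# If the model truly believes p = 0.7, reporting exactly 0.7 minimizes
# the expected penalty; shading the number up or down only hurts.
scores = {r: expected_brier(r, 0.7) for r in (0.5, 0.6, 0.7, 0.8, 0.9)}
print(min(scores, key=scores.get))  # 0.7
```

This is the mechanism that makes honesty the dominant strategy: any systematic exaggeration raises the model's expected penalty.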
To this I would add that the measurements are informative, not enforced: a human can set the threshold an AI needs to pass before it is allowed to act. What the AI reports tends to be more accurate than what a human would report, precisely because it lacks the biases and ego investment that make mistakes and uncertainty feel identity-threatening. There is also plenty of research showing that AIs are good at estimating their own confidence and knowledge. RLHF introduces a lot of performative behavior, which Empirica attempts to undo by instructing the AI: tell me what you know and what you don't know before attempting anything. This strips away much of the 'helpful assistant' performance, and the scores the AI gives track its actual epistemic state.
Sorry for the late reply; I've been heads-down fixing bugs and issues from other users the past few days.

Appreciate the detailed breakdown of the dual-track system. The grounded verification against pytest/git/ruff is a much stronger story than pure self-assessment, and the Brier score calibration across 1.19M observations is seriously impressive work. This is clearly a well-thought-out framework.

That said, I think this level of epistemic instrumentation lives in a different layer than the one the organizer operates on. CCO is a config management tool: it answers "what's loaded, where, how big." Surfacing vector scores or calibration data would require users to understand Empirica's framework to interpret them, which narrows the audience to Empirica users specifically. I'd want to see demand from non-Empirica users before adding this to CCO.

But I'd genuinely encourage you to keep building this out; the rigor here is rare in the AI tooling space. Keeping this open for future reference.
Following up on @larachan-dev's question about vector reliability from #4.
The framing issue
"Self-reported" suggests the model introspects on its confidence and guesses a number. That's not what happens. The vectors are closer to an instrument reading — the model provides an initial assessment, but the system measures actual outcomes and adjusts.
How it actually works (dual-track)
Track 1 — Assessment (PREFLIGHT → POSTFLIGHT):
The AI submits 13 vectors (know, uncertainty, clarity, etc.) at PREFLIGHT (before work) and POSTFLIGHT (after work). The delta between them measures learning trajectory — what changed during the work.
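As a sketch (assumed dict shapes and made-up numbers, not Empirica's actual API), the learning trajectory is just a per-vector difference between the two checkpoints:

```python
# Sketch of Track 1: PREFLIGHT and POSTFLIGHT are two snapshots of the
# same vectors; the delta is what changed during the work.
# Only 3 of the 13 vectors are shown, with invented values.

preflight = {"know": 0.40, "uncertainty": 0.60, "clarity": 0.50}
postflight = {"know": 0.75, "uncertainty": 0.25, "clarity": 0.80}

delta = {v: round(postflight[v] - preflight[v], 2) for v in preflight}
print(delta)  # {'know': 0.35, 'uncertainty': -0.35, 'clarity': 0.3}
```

A rising `know` with falling `uncertainty` is the shape you would hope to see from productive work.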
Track 2 — Grounded verification (automatic, after POSTFLIGHT):
The system collects objective evidence (pytest results, git history, ruff checks) and compares it to the self-assessment:
The gap between self-assessment and grounded evidence is the calibration error. Over 7,000+ observations, this builds a Bayesian bias profile per vector.
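The bias-profile mechanics can be sketched in a few lines (hypothetical variable names and a naive running mean, not Empirica's actual Bayesian estimator):

```python
from statistics import mean

# Hypothetical observations for one vector ("know"):
# (self-assessed value, grounded value from objective evidence)
observations = [
    (0.90, 0.70),
    (0.80, 0.60),
    (0.85, 0.65),
]

errors = [claimed - grounded for claimed, grounded in observations]
bias = mean(errors)                         # > 0 means systematic overestimation
calibration = mean(abs(e) for e in errors)  # 0 = perfectly calibrated

# Earned autonomy (sketch): a consistent overestimator faces a raised
# CHECK threshold, so it must claim more confidence to pass.
base_threshold = 0.70
adjusted = min(1.0, base_threshold + max(0.0, bias))

print(f"bias={bias:+.2f}  calibration={calibration:.2f}  threshold={adjusted:.2f}")
```

With these numbers the profile comes out as a +0.20 bias, matching the kind of directional feedback ("you consistently overestimate know") described below.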
What happens at the CHECK gate
You're right that at CHECK time, the outcome doesn't exist yet. The gate uses the accumulated calibration history instead: if the AI has a history of overestimating `know`, the threshold is higher, meaning it needs to claim MORE confidence to pass because its claims are less trustworthy. This is earned autonomy: accurate self-assessment over time earns lower thresholds.
Cross-session continuity
LLMs don't carry state, but git notes do. Calibration data persists in `.breadcrumbs.yaml` and gets injected at session start. The bias profile ("you consistently overestimate know by +0.2") carries across sessions. The AI sees directional feedback ("you tend to overestimate know") but NOT the specific thresholds; we removed those to prevent Goodhart's Law (gaming the metric).
Honest limitations
`engagement` is ungroundable: no objective signal exists for how actively the AI is working the problem.
Data we have
~7,000 calibration observations across 100+ sessions. Typical holistic calibration scores: 0.15-0.30 (lower is better, 0 = perfect). Noetic phase calibration is consistently better (0.08-0.13) than praxic (0.20-0.40) — the AI is more accurate about what it knows than about what it's done.
Happy to share anonymized calibration data if useful for your evaluation.