For the mathematical formalization, we have a companion guide: Grounding Empirica in Epistemic Mathematics: The Brier Score. It covers why the Brier score (a strictly proper scoring rule) guarantees that honest reporting is the optimal strategy, the Bayesian update formula, the calibration/resolution decomposition, and earned agency as a function of Brier performance. It is backed by 1.19M evidence observations from 234 sessions. The main paper is being prepared for Zenodo.
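The strictly-proper property is easy to demonstrate numerically. Here is a minimal sketch (mine, not taken from the guide) showing that reporting your true belief minimizes the expected Brier score:

```python
# Sketch: the Brier score for a single probabilistic claim, and why
# honest reporting is optimal under a strictly proper scoring rule.

def brier(forecast: float, outcome: int) -> float:
    """Squared error between a reported probability and the 0/1 outcome."""
    return (forecast - outcome) ** 2

def expected_brier(reported: float, true_belief: float) -> float:
    """Expected score if the event truly occurs with probability true_belief."""
    return true_belief * brier(reported, 1) + (1 - true_belief) * brier(reported, 0)

# If the model truly believes p = 0.7, reporting exactly 0.7 minimizes
# the expected penalty; shading the number up or down only hurts.
scores = {r: expected_brier(r, 0.7) for r in (0.5, 0.6, 0.7, 0.8, 0.9)}
print(min(scores, key=scores.get))  # 0.7
```

This is the mechanism that makes honesty the dominant strategy: any systematic exaggeration raises the model's expected penalty.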
To this I would add that the measurements are informative, not enforced: a human can set the threshold an AI needs to pass before it is allowed to act. What the AI reports tends to be more accurate than what a human would report, precisely because it lacks the biases and ego investment that make mistakes and uncertainty feel identity-threatening. There is also plenty of research showing that AIs are good at estimating their own confidence and knowledge. RLHF introduces a lot of performative behavior, which Empirica attempts to undo by instructing the AI: tell me what you know and what you don't know before attempting anything. This strips away much of the 'helpful assistant' performance, and the scores the AI gives track its actual epistemic state.
Sorry for the late reply; I've been heads-down fixing bugs and issues from other users the past few days.

Appreciate the detailed breakdown of the dual-track system. The grounded verification against pytest/git/ruff is a much stronger story than pure self-assessment, and the Brier score calibration across 1.19M observations is seriously impressive work. This is clearly a well-thought-out framework.

That said, I think this level of epistemic instrumentation lives in a different layer than the one the organizer operates on. CCO is a config management tool: it answers "what's loaded, where, how big." Surfacing vector scores or calibration data would require users to understand Empirica's framework to interpret them, which narrows the audience to Empirica users specifically. I'd want to see demand from non-Empirica users before adding this to CCO.

But I'd genuinely encourage you to keep building this out; the rigor here is rare in the AI tooling space. Keeping this open for future reference.
Following up on @larachan-dev's question about vector reliability from #4.
The framing issue
"Self-reported" suggests the model introspects on its confidence and guesses a number. That's not what happens. The vectors are closer to an instrument reading — the model provides an initial assessment, but the system measures actual outcomes and adjusts.
How it actually works (dual-track)
Track 1 — Assessment (PREFLIGHT → POSTFLIGHT):
The AI submits 13 vectors (know, uncertainty, clarity, etc.) at PREFLIGHT (before work) and POSTFLIGHT (after work). The delta between them measures learning trajectory — what changed during the work.
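As a sketch (assumed dict shapes and made-up numbers, not Empirica's actual API), the learning trajectory is just a per-vector difference between the two checkpoints:

```python
# Sketch of Track 1: PREFLIGHT and POSTFLIGHT are two snapshots of the
# same vectors; the delta is what changed during the work.
# Only 3 of the 13 vectors are shown, with invented values.

preflight = {"know": 0.40, "uncertainty": 0.60, "clarity": 0.50}
postflight = {"know": 0.75, "uncertainty": 0.25, "clarity": 0.80}

delta = {v: round(postflight[v] - preflight[v], 2) for v in preflight}
print(delta)  # {'know': 0.35, 'uncertainty': -0.35, 'clarity': 0.3}
```

A rising `know` with falling `uncertainty` is the shape you would hope to see from productive work.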
Track 2 — Grounded verification (automatic, after POSTFLIGHT):
The system collects objective evidence (pytest results, git history, ruff checks) and compares it to the self-assessment:
The gap between self-assessment and grounded evidence is the calibration error. Over 7,000+ observations, this builds a Bayesian bias profile per vector.
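The bias-profile mechanics can be sketched in a few lines (hypothetical variable names and a naive running mean, not Empirica's actual Bayesian estimator):

```python
from statistics import mean

# Hypothetical observations for one vector ("know"):
# (self-assessed value, grounded value from objective evidence)
observations = [
    (0.90, 0.70),
    (0.80, 0.60),
    (0.85, 0.65),
]

errors = [claimed - grounded for claimed, grounded in observations]
bias = mean(errors)                         # > 0 means systematic overestimation
calibration = mean(abs(e) for e in errors)  # 0 = perfectly calibrated

# Earned autonomy (sketch): a consistent overestimator faces a raised
# CHECK threshold, so it must claim more confidence to pass.
base_threshold = 0.70
adjusted = min(1.0, base_threshold + max(0.0, bias))

print(f"bias={bias:+.2f}  calibration={calibration:.2f}  threshold={adjusted:.2f}")
```

With these numbers the profile comes out as a +0.20 bias, matching the kind of directional feedback ("you consistently overestimate know") described below.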
What happens at the CHECK gate
You're right that at CHECK time, the outcome doesn't exist yet. The gate uses the accumulated calibration history instead: if the AI has a history of overestimating `know`, the threshold is higher, meaning it needs to claim MORE confidence to pass because its claims are less trustworthy. This is earned autonomy: accurate self-assessment over time earns lower thresholds.
Cross-session continuity
LLMs don't carry state, but git notes do. Calibration data persists in `.breadcrumbs.yaml` and gets injected at session start. The bias profile ("you consistently overestimate know by +0.2") carries across sessions. The AI sees directional feedback ("you tend to overestimate know") but NOT the specific thresholds; we removed those to prevent Goodhart's Law (gaming the metric).
Honest limitations
`engagement` is ungroundable: no objective signal exists for how actively the AI is working the problem.
Data we have
~7,000 calibration observations across 100+ sessions. Typical holistic calibration scores: 0.15-0.30 (lower is better, 0 = perfect). Noetic phase calibration is consistently better (0.08-0.13) than praxic (0.20-0.40) — the AI is more accurate about what it knows than about what it's done.
Happy to share anonymized calibration data if useful for your evaluation.