fix: temporal metadata in search results + eval flow#31

Open
7xuanlu wants to merge 13 commits into main from feature/temporal-metadata

Conversation

@7xuanlu 7xuanlu commented Apr 28, 2026

Summary

Threads benchmark session dates through Origin's eval pipeline so the LLM-judged accuracy harness can reason about temporal questions. Before this change, LongMemEval-TR scored 42.1% and LoCoMo-temporal scored 1.6%, the worst category on each benchmark.

Rebased onto main after PR #29 merged; the adversarial review specifically targeted that integration. It caught two critical bugs (C1, C2) and one lower-severity logging gap (I2), all fixed in da2f125c:

  • C1: PR #29's new run_fullpipeline_locomo_batch and run_fullpipeline_lme_batch seeded memories with chrono::Utc::now(), silently nullifying the temporal work for the fullpipeline path. Now seed via seed_last_modified(mem.session_date.as_deref(), parser).
  • C2: same paths used a date-blind system prompt. Now use build_e2e_system_prompt(question_date) — LoCoMo passes the latest session_date in the sample (questions follow the conversation); LME passes sample.question_date directly.
  • I2: seed_last_modified now log::warn!s on parse failures, so a future regression to now() is visible in eval logs instead of silent.
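
A rough sketch of the helper shape described in C1 and I2, for readers who want to see the fallback path. It assumes timestamps are Unix seconds in an i64, and it substitutes eprintln! for log::warn! and std::time for chrono so it compiles without dependencies. now_unix and parse_unix_seconds are hypothetical stand-ins, not the real code in dates.rs.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Current Unix time in seconds (stand-in for chrono::Utc::now().timestamp()).
fn now_unix() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as i64)
        .unwrap_or(0)
}

/// Assumed shape of the helper: use the parsed session date when there is one,
/// otherwise fall back to "now" and make the fallback visible in the logs.
fn seed_last_modified(session_date: Option<&str>, parser: fn(&str) -> Option<i64>) -> i64 {
    match session_date {
        Some(raw) => parser(raw).unwrap_or_else(|| {
            // The PR uses log::warn! here; eprintln! keeps the sketch dependency-free.
            eprintln!("warn: could not parse session date {raw:?}; seeding with now()");
            now_unix()
        }),
        None => now_unix(),
    }
}

/// Toy parser for this sketch only: accepts a plain integer timestamp.
fn parse_unix_seconds(raw: &str) -> Option<i64> {
    raw.trim().parse().ok()
}
```

A call site would then look like seed_last_modified(mem.session_date.as_deref(), parse_locomo_date), matching the call quoted in C1.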

What's in this PR

  • SearchResult exposes created_at: i64 populated from chunks.created_at. #[derive(Default)] added so downstream MIT consumers (origin-mcp) can spread struct literals safely.
  • New crates/origin-core/src/eval/dates.rs: parse_lme_date ("2023/04/10 (Mon) 23:07"), parse_locomo_date ("1:56 pm on 8 May, 2023"), format_ymd, and seed_last_modified helper that DRYs all the seed-site copies.
  • LongMemEvalMemory and LocomoMemory carry session_date: Option<String> populated from haystack_dates[i] and conversation.session_N_date_time respectively.
  • All real-data seed sites, including PR #29's new fullpipeline batch paths, use seed_last_modified(mem.session_date.as_deref(), parser) instead of now().
  • generate_e2e_answers_for_question and both fullpipeline batch paths build context as "On YYYY-MM-DD: <content>" lines and use build_e2e_system_prompt(question_date) for the date anchor.
  • New LongMemEval Claude CLI eval entry points (generate_e2e_context_tuples_longmemeval_api + judge_e2e_context_longmemeval_api_haiku) so the temporal lift can be measured via Max-plan CLI without API key cost.
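
To make the two dataset formats concrete, here is a dependency-free sketch of parsers for the example strings above. It is not the code in dates.rs: it assumes UTC (see the timezone note under "Out of scope"), returns Unix seconds, accepts only full English month names for LoCoMo, and hand-rolls the civil-date arithmetic that chrono would normally provide. days_from_civil, civil_from_days, and to_unix are hypothetical helper names.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date.
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12; // March = 0, so leap days fall at the end of the cycle
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Inverse of days_from_civil: (year, month, day) for a day count since the epoch.
fn civil_from_days(days: i64) -> (i64, i64, i64) {
    let z = days + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097);
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153;
    let d = doy - (153 * mp + 2) / 5 + 1;
    let m = if mp < 10 { mp + 3 } else { mp - 9 };
    (yoe + era * 400 + i64::from(m <= 2), m, d)
}

/// Range-checks the fields and combines them into Unix seconds (UTC assumed).
fn to_unix(y: i64, m: i64, d: i64, hh: i64, mm: i64) -> Option<i64> {
    let valid = (1..=12).contains(&m)
        && (1..=31).contains(&d)
        && (0..24).contains(&hh)
        && (0..60).contains(&mm);
    valid.then(|| days_from_civil(y, m, d) * 86_400 + hh * 3_600 + mm * 60)
}

/// LongMemEval format, e.g. "2023/04/10 (Mon) 23:07".
fn parse_lme_date(raw: &str) -> Option<i64> {
    let mut parts = raw.split_whitespace();
    let date = parts.next()?;
    let _weekday = parts.next()?; // "(Mon)" is redundant and ignored
    let (hh, mm) = parts.next()?.split_once(':')?;
    let mut ymd = date.split('/');
    let y = ymd.next()?.parse::<i64>().ok()?;
    let m = ymd.next()?.parse::<i64>().ok()?;
    let d = ymd.next()?.parse::<i64>().ok()?;
    to_unix(y, m, d, hh.parse::<i64>().ok()?, mm.parse::<i64>().ok()?)
}

/// LoCoMo format, e.g. "1:56 pm on 8 May, 2023".
fn parse_locomo_date(raw: &str) -> Option<i64> {
    const MONTHS: [&str; 12] = [
        "january", "february", "march", "april", "may", "june",
        "july", "august", "september", "october", "november", "december",
    ];
    let (time_part, date_part) = raw.split_once(" on ")?;
    let (clock, meridiem) = time_part.trim().split_once(' ')?;
    let (hh, mm) = clock.split_once(':')?;
    let hh12 = hh.parse::<i64>().ok()?;
    if !(1..=12).contains(&hh12) {
        return None;
    }
    let hh24 = match meridiem.trim().to_ascii_lowercase().as_str() {
        "am" => hh12 % 12,
        "pm" => hh12 % 12 + 12,
        _ => return None,
    };
    let (day_month, year) = date_part.split_once(',')?;
    let (day, month_name) = day_month.trim().split_once(' ')?;
    let name = month_name.trim().to_ascii_lowercase();
    let m = MONTHS.iter().position(|full| *full == name.as_str())? as i64 + 1;
    let y = year.trim().parse::<i64>().ok()?;
    to_unix(y, m, day.parse::<i64>().ok()?, hh24, mm.parse::<i64>().ok()?)
}

/// Renders Unix seconds as YYYY-MM-DD for the "On YYYY-MM-DD: ..." context lines.
fn format_ymd(unix_secs: i64) -> String {
    let (y, m, d) = civil_from_days(unix_secs.div_euclid(86_400));
    format!("{y:04}-{m:02}-{d:02}")
}
```

Under these assumptions, format_ymd(parse_locomo_date("1:56 pm on 8 May, 2023").unwrap()) yields "2023-05-08", and any string that does not match the expected shape returns None rather than a guessed date.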

Test plan

  • cargo test --workspace --lib — 968 lib tests pass, 21 ignored
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • Adversarial fresh-eye review against the post-PR #29 integrated state: 2 critical issues found, both fixed
  • CI lane (clippy + workspace tests + frontend tests)
  • Coverage workflow lane (informational)
  • Local-only manual: cargo test -p origin --test eval_harness generate_e2e_context_tuples_locomo_api -- --ignored then judge_e2e_context_locomo_api_sonnet — measure LoCoMo-temporal lift via Claude CLI
  • Local-only manual: same for LME via *_longmemeval_api + *_haiku judge — measure LongMemEval-TR lift

Out of scope (follow-up)

  • Wiring the dormant judge::lme_answer_prompt into generate_e2e_answers_for_question for full per-task-type branches.
  • Tightening timezone assumptions in parse_lme_date / parse_locomo_date if dataset spec says non-UTC.
  • I1 from the review: legacy run_e2e_locomo_eval (used only in retrieval/token_efficiency tests) still has no date prefix. Not a blocker for this branch's goal.

Closes/replaces #28 (auto-closed when its base feature/fullpipeline-eval was deleted on PR #29 merge).

🤖 Generated with Claude Code

7xuanlu and others added 13 commits April 27, 2026 21:14
Adds an i64 created_at field to SearchResult and populates it from
chunks.created_at in row_to_search_result. Foundation for date filtering
and date-aware eval prompts. Existing last_modified semantics unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Followup to the prior commit's struct change — origin-types' own
search_result_serializes test still constructs SearchResult literally
and needed the new field.
Carries the per-session haystack_dates through LongMemEvalMemory and
into RawDocument.last_modified during retrieve_for_accuracy_eval.
Adds parse_lme_date helper and round-trip tests.

Foundation for date-aware temporal-reasoning prompts (Task 4).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds parse_locomo_date for the dataset's '1:56 pm on 8 May, 2023' format
and threads conversation.session_N_date_time through LocomoMemory into
RawDocument.last_modified at every seed site. Mirrors Task 2's LME
treatment.
Threads session dates into the 5 remaining RawDocument seed sites in
answer_quality.rs and pipeline.rs (covers both LoCoMo and LongMemEval
E2E and pipeline runners). Adds eval::shared::format_ymd and rewrites
generate_e2e_answers_for_question's context to emit 'On YYYY-MM-DD: ...'
lines so the LLM judge can reason about temporal questions.

Targets the temporal-reasoning weakness on both benchmarks (LME-TR 42.1%,
LoCoMo-temporal 1.6% pre-change).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dified

Code-review followup:
- Moves parse_lme_date, parse_locomo_date, and format_ymd to a single
  dates.rs module instead of three different homes.
- Replaces the 13 verbatim copies of the
  '.as_deref().and_then(parser).unwrap_or_else(|| now())' chain with
  one seed_last_modified helper.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two adversarial-review followups before merge:

1. SearchResult was missing a Default derive; downstream MIT crates
   (origin-mcp) construct it field-by-field and would break when
   origin-types 0.1.5 publishes with the new created_at i64. Adding
   the derive lets them spread defaults.
2. LongMemEval per-question dates (sample.question_date) were not
   reaching the LLM. The seeded memories now carry session dates
   thanks to earlier commits, but without a 'today' anchor the LLM
   cannot ground relative phrases ('yesterday', 'a week ago') in
   the question. generate_e2e_answers_for_question now accepts
   Option<&str> and prepends 'The question was asked on X.' to the
   system prompt. LoCoMo passes None (no per-question date in dataset).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
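
For illustration, the dated context lines and the question-date anchor described in this branch could look roughly like the sketch below. Only the "On YYYY-MM-DD: ..." line format and the "The question was asked on X." sentence come from the commits; the ContextItem struct, build_dated_context, and the base prompt wording are assumptions made for the example.

```rust
/// One retrieved memory: its text plus an optional date already rendered as YYYY-MM-DD.
/// (Hypothetical type for this sketch.)
struct ContextItem<'a> {
    date_ymd: Option<&'a str>,
    content: &'a str,
}

/// Builds the context block, prefixing each dated memory with "On YYYY-MM-DD: ".
fn build_dated_context(items: &[ContextItem<'_>]) -> String {
    items
        .iter()
        .map(|item| match item.date_ymd {
            Some(date) => format!("On {date}: {}", item.content),
            None => item.content.to_string(),
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Prepends the "today" anchor when a question date is known.
fn build_e2e_system_prompt(question_date: Option<&str>) -> String {
    // Base wording is invented for this sketch; only the date sentence is from the PR.
    let base = "Answer the question using only the provided context.";
    match question_date {
        Some(date) => format!("The question was asked on {date}. {base}"),
        None => base.to_string(),
    }
}
```

With the anchor in place, a relative phrase such as "a week ago" in the question can be resolved against the dates on the context lines.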
Mirror the LoCoMo _api pattern for LongMemEval: ClaudeCliProvider::haiku()
for answers (no API key, uses Max plan via OAuth) and judge_with_claude_model
'haiku' for judging. Same answer/judge model on both sides keeps
LME and LoCoMo numbers comparable.

Both new test entry points exercise the dated-context formatter and the
question_date system-prompt anchor introduced earlier in this branch:
  - generate_e2e_context_tuples_longmemeval_api
  - judge_e2e_context_longmemeval_api_haiku
Adversarial review caught that PR #29's new run_fullpipeline_locomo_batch
and run_fullpipeline_lme_batch seeded all memories with chrono::Utc::now()
and used a date-blind system prompt — silently nullifying the temporal
metadata work for the fullpipeline path (every memory would have been
prefixed "On 2026-04-27" regardless of original session date).

Fixes:
- Extract build_e2e_system_prompt(question_date) helper; refactor
  generate_e2e_answers_for_question to use it.
- Thread mem.session_date through seed_last_modified at both batch seed
  sites (LoCoMo: parse_locomo_date, LME: parse_lme_date).
- Thread question_date into batch system prompts: LoCoMo uses the latest
  session_date in the sample (questions follow the conversation); LME
  uses sample.question_date directly.
- seed_last_modified now log::warn!s on parse failures so future silent
  regressions to now() are visible in eval logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
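
The LoCoMo anchor ("latest session_date in the sample") amounts to a max over the dates that parse. A minimal sketch, assuming session dates arrive as optional strings and the parser returns Unix seconds; latest_session_date and toy_parser are hypothetical names used only here:

```rust
/// Returns the latest timestamp among the session dates that parse,
/// skipping missing and unparseable entries.
fn latest_session_date(
    session_dates: &[Option<&str>],
    parser: fn(&str) -> Option<i64>,
) -> Option<i64> {
    session_dates
        .iter()
        .copied()
        .flatten() // drop None entries
        .filter_map(parser) // drop entries the parser rejects
        .max()
}

/// Toy parser for the sketch: treats the string as an integer timestamp.
fn toy_parser(raw: &str) -> Option<i64> {
    raw.trim().parse().ok()
}
```

If no session date parses, the result is None and the system prompt simply carries no date anchor.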
Without this, the Claude CLI eval entry points
(generate_e2e_context_tuples_locomo_api, _longmemeval_api) call a
duplicate generate_e2e_answers_for_question in token_efficiency.rs that
predates the temporal work — no date prefixes, no question_date in the
system prompt, seeds with chrono::Utc::now(). The CLI eval would have
shown zero lift on temporal categories regardless of how the rest of
the pipeline handles dates.

- Delete the duplicate at token_efficiency.rs:4472-4572.
- Make crate::eval::answer_quality::generate_e2e_answers_for_question
  pub(crate) and route both callers through it.
- run_e2e_context_eval (LoCoMo CLI path): seed via seed_last_modified
  with parse_locomo_date; compute sample_question_date as the latest
  parseable session_date in the sample (questions follow the
  conversation in LoCoMo); pass through to the answer call.
- run_e2e_context_eval_longmemeval (LME CLI path): seed via
  seed_last_modified with parse_lme_date; pass sample.question_date
  directly to the answer call.

The remaining 6 chrono::Utc::now() seed sites in this file
(run_e2e_locomo_eval, run_locomo_pipeline_eval, *_pipeline_eval,
run_context_path_eval*) are not on the CLI eval critical path; punt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClaudeCliProvider::generate now does .env_remove("ANTHROPIC_API_KEY")
before spawning `claude`. When the env var is set, the CLI routes
through the paid API instead of the user's Max-plan OAuth, so an eval
that intends to use Max would silently route through API credits and
fail with "Credit balance is too low" if the API account has none.
The CLI provider's whole purpose is Max OAuth, so always scrub.
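
A small sketch of the scrub using std::process::Command, for reference. It only builds the command and inspects it, so nothing is spawned; claude_cli_command and scrubs_api_key are hypothetical names, and the real ClaudeCliProvider::generate will differ in how it passes arguments.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// Builds a `claude` invocation that never inherits ANTHROPIC_API_KEY.
fn claude_cli_command(args: &[&str]) -> Command {
    let mut cmd = Command::new("claude");
    cmd.args(args);
    // Explicitly remove the variable so the child cannot inherit it from the parent.
    cmd.env_remove("ANTHROPIC_API_KEY");
    cmd
}

/// True if the command is configured to remove ANTHROPIC_API_KEY.
/// get_envs() reports explicit removals as (key, None).
fn scrubs_api_key(cmd: &Command) -> bool {
    cmd.get_envs()
        .any(|(key, value)| key == OsStr::new("ANTHROPIC_API_KEY") && value.is_none())
}
```

Checking get_envs() like this gives a cheap unit test for the scrub that does not need the CLI installed.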

Add temporal_smoke_locomo_5q (#[ignore]'d) — A/B compares date-aware
context (date prefix + question_date system anchor) vs date-blind
context on 5 LoCoMo temporal questions, no enrichment, search_limit
30 with diagnostic logging of top-3 retrieval hits. Designed as a
fast (~1-2 min) signal check before committing to full eval runs.

Promote to pub: parse_locomo_date, parse_lme_date, seed_last_modified
(integration tests in app/ live outside the crate and need pub).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: seeding memories with their real session/event dates made search
recency decay (exp(-decay_rate * age_days)) crush the score to ~0 for
any content older than a few months. The retrieval diagnostic showed
all 50 results returning score=0.000 for benchmark seeds dated 2023
when the test ran in 2026 — and the ranking became noise. Same bug
would hit any user who imports an old email archive, conversation
backfill, or scanned-document set.

Root cause: last_modified was carrying two different concepts —
"when did the event happen?" (for display) and "when was this row
ingested?" (for staleness). Recency decay correctly penalises stale
ingestion, but we were feeding it event time.

Fix (Option A): introduce event_date as a separate optional field on
RawDocument and SearchResult. last_modified stays anchored to ingestion
/ edit time; event_date carries the display time. Date-prefix
rendering uses event_date.unwrap_or(last_modified). Recency decay
continues to use last_modified, so old-but-just-imported content ranks
fresh.

- crates/origin-types: add event_date: Option<i64> to RawDocument and
  SearchResult. #[serde(default)] keeps origin-mcp wire-compatible.
- crates/origin-core/src/db.rs: migration 44 adds memories.event_date
  (NULL-able, no backfill — old rows stay None and fall back to
  last_modified for display). All search SELECTs include c.event_date
  at column 29; score column moves to 30. row_to_search_result reads
  Option<i64>. INSERT writes event_date.
- Inline recency-decay match in search_memory replaced with
  crate::sources::decay_rate(&tier, &ConfidenceConfig::default()) so
  the canonical decay function is the single source of truth.
- Eval seed sites (locomo, longmemeval, answer_quality, pipeline) now
  set last_modified=now() and event_date=seed_event_date(...).
- Renamed seed_last_modified -> seed_event_date and changed its
  return type to Option<i64> — None on parse failure rather than
  silently falling back to "today" (with a log::warn for visibility).
- Regression test: a chunk with 3-year-old event_date but fresh
  last_modified must score > 0.001 (pre-fix it was ~1.7e-5).
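
The split between the two timestamps can be sketched as follows. This is an illustration, not the code in db.rs or sources.rs: the Doc struct and function names are hypothetical, and the 0.01 per-day rate used in the usage note is an assumed value (the real rate comes from decay_rate(&tier, &ConfidenceConfig::default())).

```rust
/// Minimal stand-in for the two timestamps on a document (Unix seconds).
struct Doc {
    last_modified: i64,      // ingestion / edit time, drives recency decay
    event_date: Option<i64>, // when the event happened, display only
}

const SECS_PER_DAY: f64 = 86_400.0;

/// Recency factor exp(-decay_rate * age_days), computed from last_modified only.
fn recency_factor(doc: &Doc, now: i64, decay_rate: f64) -> f64 {
    let age_days = (now - doc.last_modified).max(0) as f64 / SECS_PER_DAY;
    (-decay_rate * age_days).exp()
}

/// Timestamp used for the "On YYYY-MM-DD" prefix: event time if known,
/// otherwise fall back to last_modified.
fn display_timestamp(doc: &Doc) -> i64 {
    doc.event_date.unwrap_or(doc.last_modified)
}
```

With an assumed rate of 0.01 per day, a document whose last_modified is three years old gets a factor of about 1.8e-5, the same order of magnitude as the pre-fix score quoted above. The same document seeded with last_modified = now and the old date in event_date gets a factor of 1.0 while still displaying the 2023 date.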

Verified: retrieval diagnostic on LoCoMo conv-26 (184 obs) after this
change shows real scores (0.046–0.079) and the previously-missing
target memories now rank within top-50: "support group" rank 50,
"sunrise" rank 37, "charity race" rank 33, "camping" rank 1. Pre-fix
none of these were retrievable at any K because scores were noise.

Search ranking quality at top-K is a separate concern not addressed
here — but is now actually addressable, since retrieval is no longer
blocked by recency decay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-push caught a missed RawDocument literal in origin-server's
batcher tests after the event_date field was added in 8c94e88.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>