fix: temporal metadata in search results + eval flow by 7xuanlu · Pull Request #31 · 7xuanlu/origin

7xuanlu · 2026-04-28T05:01:54Z

Summary

Threads benchmark session dates through Origin's eval pipeline so the LLM-judged accuracy harness can reason about temporal questions. Before this change, LongMemEval-TR scored 42.1% and LoCoMo-temporal scored 1.6%, the worst categories on each benchmark.

Rebased onto main after PR #29 merged, and the adversarial review specifically targeted that integration. It caught two critical bugs that are now fixed in da2f125c:

C1: PR #29 (fix: full-pipeline eval + source overlap concept gate + 3 production bugs) added run_fullpipeline_locomo_batch and run_fullpipeline_lme_batch, which seeded memories with chrono::Utc::now(), silently nullifying the temporal work for the fullpipeline path. They now seed via seed_last_modified(mem.session_date.as_deref(), parser).
C2: same paths used a date-blind system prompt. Now use build_e2e_system_prompt(question_date) — LoCoMo passes the latest session_date in the sample (questions follow the conversation); LME passes sample.question_date directly.
I2: seed_last_modified now log::warn!s on parse failures, so a future regression to now() is visible in eval logs instead of silent.
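As a rough illustration of the seeding behaviour described in C1 and I2, the helper might look something like the sketch below. It is not the code from dates.rs. The signature, the `now_unix` stand-in for chrono::Utc::now(), and the toy `parse_unix_seconds` parser are assumptions made so the example stays std-only, and eprintln! stands in for log::warn!.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Current time as Unix seconds (std-only stand-in for chrono::Utc::now()).
fn now_unix() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as i64)
        .unwrap_or(0)
}

/// Hypothetical shape of the helper: use the parsed session date when there
/// is one, otherwise fall back to "now". A parse failure is reported so the
/// fallback is visible in eval logs.
fn seed_last_modified(raw: Option<&str>, parser: fn(&str) -> Option<i64>) -> i64 {
    match raw {
        Some(s) => match parser(s) {
            Some(ts) => ts,
            None => {
                // The PR uses log::warn!; eprintln! keeps this sketch dependency-free.
                eprintln!("warn: could not parse session date {:?}; falling back to now()", s);
                now_unix()
            }
        },
        // No date in the dataset row: nothing to warn about, just use now.
        None => now_unix(),
    }
}

/// Toy parser for illustration only: accepts a bare integer of Unix seconds.
fn parse_unix_seconds(s: &str) -> Option<i64> {
    s.trim().parse().ok()
}
```

Passing the parser as a plain function pointer keeps each seed site to a single call, with the dataset-specific parser as the only thing that varies.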

What's in this PR

SearchResult exposes created_at: i64 populated from chunks.created_at. #[derive(Default)] added so downstream MIT consumers (origin-mcp) can spread struct literals safely.
New crates/origin-core/src/eval/dates.rs: parse_lme_date ("2023/04/10 (Mon) 23:07"), parse_locomo_date ("1:56 pm on 8 May, 2023"), format_ymd, and seed_last_modified helper that DRYs all the seed-site copies.
LongMemEvalMemory and LocomoMemory carry session_date: Option<String> populated from haystack_dates[i] and conversation.session_N_date_time respectively.
All real-data seed sites, including PR #29's new fullpipeline batch paths, use seed_last_modified(mem.session_date.as_deref(), parser) instead of now().
generate_e2e_answers_for_question and both fullpipeline batch paths build context as "On YYYY-MM-DD: <content>" lines and use build_e2e_system_prompt(question_date) for the date anchor.
New LongMemEval Claude CLI eval entry points (generate_e2e_context_tuples_longmemeval_api + judge_e2e_context_longmemeval_api_haiku) so the temporal lift can be measured via Max-plan CLI without API key cost.
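The "On YYYY-MM-DD: <content>" context lines mentioned in the list above could be produced along these lines. This is a std-only sketch: the real format_ymd presumably relies on chrono, and the `(timestamp, content)` tuple shape, `civil_from_days`, and `build_dated_context` are hypothetical names introduced here for illustration.

```rust
/// Convert days since 1970-01-01 into (year, month, day) in the proleptic
/// Gregorian calendar (the well-known "civil from days" algorithm).
fn civil_from_days(z: i64) -> (i64, i64, i64) {
    let z = z + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let d = doy - (153 * mp + 2) / 5 + 1;
    let m = if mp < 10 { mp + 3 } else { mp - 9 };
    let y = yoe + era * 400 + if m <= 2 { 1 } else { 0 };
    (y, m, d)
}

/// Format Unix seconds (interpreted as UTC) as YYYY-MM-DD.
fn format_ymd(ts: i64) -> String {
    let (y, m, d) = civil_from_days(ts.div_euclid(86_400));
    format!("{:04}-{:02}-{:02}", y, m, d)
}

/// Render retrieved memories as one "On YYYY-MM-DD: <content>" line each.
fn build_dated_context(hits: &[(i64, &str)]) -> String {
    hits.iter()
        .map(|(ts, content)| format!("On {}: {}", format_ymd(*ts), content))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Prefixing every line with an absolute date gives the answering model something concrete to compare against the question date, which a bare list of memories does not.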

Test plan

cargo test --workspace --lib — 968 lib tests pass, 21 ignored
cargo clippy --workspace --all-targets -- -D warnings — clean
Adversarial fresh-eye review against the post-PR #29 integrated state: 2 critical issues found, both fixed
CI lane (clippy + workspace tests + frontend tests)
Coverage workflow lane (informational)
Local-only manual: cargo test -p origin --test eval_harness generate_e2e_context_tuples_locomo_api -- --ignored then judge_e2e_context_locomo_api_sonnet — measure LoCoMo-temporal lift via Claude CLI
Local-only manual: same for LME via *_longmemeval_api + *_haiku judge — measure LongMemEval-TR lift

Out of scope (follow-up)

Wiring the dormant judge::lme_answer_prompt into generate_e2e_answers_for_question for full per-task-type branches.
Tightening timezone assumptions in parse_lme_date / parse_locomo_date if dataset spec says non-UTC.
I1 from the review: legacy run_e2e_locomo_eval (used only in retrieval/token_efficiency tests) still has no date prefix. Not a blocker for this branch's goal.

Closes/replaces #28 (auto-closed when its base feature/fullpipeline-eval was deleted on PR #29 merge).

🤖 Generated with Claude Code

Adds an i64 created_at field to SearchResult and populates it from chunks.created_at in row_to_search_result. Foundation for date filtering and date-aware eval prompts. Existing last_modified semantics unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Followup to the prior commit's struct change — origin-types' own search_result_serializes test still constructs SearchResult literally and needed the new field.

Carries the per-session haystack_dates through LongMemEvalMemory and into RawDocument.last_modified during retrieve_for_accuracy_eval. Adds parse_lme_date helper and round-trip tests. Foundation for date-aware temporal-reasoning prompts (Task 4).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
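For readers unfamiliar with the LongMemEval date format, a parser for strings like "2023/04/10 (Mon) 23:07" might be sketched as follows. This is a std-only illustration, not the parse_lme_date from the PR, which may well use chrono. It assumes UTC, ignores the weekday token, and uses a hypothetical `days_from_civil` helper.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the well-known "days from civil" algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y - era * 400; // year of era, 0..=399
    let mp = if m > 2 { m - 3 } else { m + 9 }; // month index, March = 0
    let doy = (153 * mp + 2) / 5 + d - 1; // day of year, March-based
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Parse "2023/04/10 (Mon) 23:07" into Unix seconds, assuming UTC.
fn parse_lme_date(s: &str) -> Option<i64> {
    let mut parts = s.split_whitespace();
    let date = parts.next()?;
    let _weekday = parts.next()?; // "(Mon)": skipped, not validated
    let time = parts.next()?;

    let mut ymd = date.split('/');
    let y: i64 = ymd.next()?.parse().ok()?;
    let m: i64 = ymd.next()?.parse().ok()?;
    let d: i64 = ymd.next()?.parse().ok()?;

    let (hh, mm) = time.split_once(':')?;
    let hh: i64 = hh.parse().ok()?;
    let mm: i64 = mm.parse().ok()?;

    // Loose range checks; the day is not validated against the month length.
    let in_range = (1..=12).contains(&m)
        && (1..=31).contains(&d)
        && (0..24).contains(&hh)
        && (0..60).contains(&mm);
    if !in_range {
        return None;
    }
    Some(days_from_civil(y, m, d) * 86_400 + hh * 3_600 + mm * 60)
}
```

Returning Option rather than panicking lets the caller decide what a malformed date should mean for the seeded document.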

Adds parse_locomo_date for the dataset's '1:56 pm on 8 May, 2023' format and threads conversation.session_N_date_time through LocomoMemory into RawDocument.last_modified at every seed site. Mirrors Task 2's LME treatment.
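The LoCoMo format ('1:56 pm on 8 May, 2023') needs 12-hour clock handling and a month-name lookup. The sketch below is std-only and is not the parse_locomo_date from the PR. It assumes UTC, assumes month names may be spelled out or abbreviated (so it matches on the first three letters), and uses a hypothetical `days_from_civil` helper.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date.
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y - era * 400;
    let mp = if m > 2 { m - 3 } else { m + 9 };
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parse "1:56 pm on 8 May, 2023" into Unix seconds, assuming UTC.
fn parse_locomo_date(s: &str) -> Option<i64> {
    // Expected tokens: ["1:56", "pm", "on", "8", "May,", "2023"]
    let t: Vec<&str> = s.split_whitespace().collect();
    if t.len() != 6 || t[2] != "on" {
        return None;
    }

    let (hh, mm) = t[0].split_once(':')?;
    let hh: i64 = hh.parse().ok()?;
    let mm: i64 = mm.parse().ok()?;
    if !(1..=12).contains(&hh) || !(0..60).contains(&mm) {
        return None;
    }
    // 12-hour to 24-hour: 12 am maps to 00, 12 pm stays 12.
    let hour24 = match t[1].to_ascii_lowercase().as_str() {
        "am" => hh % 12,
        "pm" => hh % 12 + 12,
        _ => return None,
    };

    let d: i64 = t[3].parse().ok()?;
    if !(1..=31).contains(&d) {
        return None;
    }
    let month_name = t[4].trim_end_matches(',').to_ascii_lowercase();
    let m: i64 = match month_name.get(..3)? {
        "jan" => 1, "feb" => 2, "mar" => 3, "apr" => 4,
        "may" => 5, "jun" => 6, "jul" => 7, "aug" => 8,
        "sep" => 9, "oct" => 10, "nov" => 11, "dec" => 12,
        _ => return None,
    };
    let y: i64 = t[5].parse().ok()?;

    Some(days_from_civil(y, m, d) * 86_400 + hour24 * 3_600 + mm * 60)
}
```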

Threads session dates into the 5 remaining RawDocument seed sites in answer_quality.rs and pipeline.rs (covers both LoCoMo and LongMemEval E2E and pipeline runners). Adds eval::shared::format_ymd and rewrites generate_e2e_answers_for_question's context to emit 'On YYYY-MM-DD: ...' lines so the LLM judge can reason about temporal questions. Targets the temporal-reasoning weakness on both benchmarks (LME-TR 42.1%, LoCoMo-temporal 1.6% pre-change).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…dified

Code-review followup:
- Moves parse_lme_date, parse_locomo_date, and format_ymd to a single dates.rs module instead of three different homes.
- Replaces the 13 verbatim copies of the '.as_deref().and_then(parser).unwrap_or_else(|| now())' chain with one seed_last_modified helper.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two adversarial-review followups before merge:
1. SearchResult was missing a Default derive; downstream MIT crates (origin-mcp) construct it field-by-field and would break when origin-types 0.1.5 publishes with the new created_at i64. Adding the derive lets them spread defaults.
2. LongMemEval per-question dates (sample.question_date) were not reaching the LLM. The seeded memories now carry session dates thanks to earlier commits, but without a 'today' anchor the LLM cannot ground relative phrases ('yesterday', 'a week ago') in the question. generate_e2e_answers_for_question now accepts Option<&str> and prepends 'The question was asked on X.' to the system prompt. LoCoMo passes None (no per-question date in dataset).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
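The question-date anchor from the second followup can be pictured with a small sketch. The base prompt wording and the `BASE_PROMPT` constant are hypothetical; only the 'The question was asked on X.' prefix and the Option<&str> parameter come from the commit message, and the function name follows the build_e2e_system_prompt helper named elsewhere in this PR.

```rust
/// Hypothetical base instructions; the real prompt text is not shown in the PR.
const BASE_PROMPT: &str = "Answer the question using only the provided context.";

/// Prepend a "today" anchor when the dataset supplies a question date, so
/// relative phrases such as "yesterday" can be resolved against it.
fn build_e2e_system_prompt(question_date: Option<&str>) -> String {
    match question_date {
        Some(date) => format!("The question was asked on {}.\n\n{}", date, BASE_PROMPT),
        // No per-question date available: keep the prompt date-blind.
        None => BASE_PROMPT.to_string(),
    }
}
```

Taking Option<&str> lets both benchmarks share one code path, with callers that have no date simply passing None.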

Mirror the LoCoMo _api pattern for LongMemEval: ClaudeCliProvider::haiku() for answers (no API key, uses Max plan via OAuth) and judge_with_claude_model 'haiku' for judging. Same answer/judge model on both sides keeps LME and LoCoMo numbers comparable.

Both new test entry points exercise the dated-context formatter and the question_date system-prompt anchor introduced earlier in this branch:
- generate_e2e_context_tuples_longmemeval_api
- judge_e2e_context_longmemeval_api_haiku

Adversarial review caught that PR #29's new run_fullpipeline_locomo_batch and run_fullpipeline_lme_batch seeded all memories with chrono::Utc::now() and used a date-blind system prompt, silently nullifying the temporal metadata work for the fullpipeline path (every memory would have been prefixed "On 2026-04-27" regardless of original session date).

Fixes:
- Extract build_e2e_system_prompt(question_date) helper; refactor generate_e2e_answers_for_question to use it.
- Thread mem.session_date through seed_last_modified at both batch seed sites (LoCoMo: parse_locomo_date, LME: parse_lme_date).
- Thread question_date into batch system prompts: LoCoMo uses the latest session_date in the sample (questions follow the conversation); LME uses sample.question_date directly.
- seed_last_modified now log::warn!s on parse failures so future silent regressions to now() are visible in eval logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Without this, the Claude CLI eval entry points (generate_e2e_context_tuples_locomo_api, _longmemeval_api) call a duplicate generate_e2e_answers_for_question in token_efficiency.rs that predates the temporal work: no date prefixes, no question_date in the system prompt, seeds with chrono::Utc::now(). The CLI eval would have shown zero lift on temporal categories regardless of how the rest of the pipeline handles dates.

- Delete the duplicate at token_efficiency.rs:4472-4572.
- Make crate::eval::answer_quality::generate_e2e_answers_for_question pub(crate) and route both callers through it.
- run_e2e_context_eval (LoCoMo CLI path): seed via seed_last_modified with parse_locomo_date; compute sample_question_date as the latest parseable session_date in the sample (questions follow the conversation in LoCoMo); pass through to the answer call.
- run_e2e_context_eval_longmemeval (LME CLI path): seed via seed_last_modified with parse_lme_date; pass sample.question_date directly to the answer call.

The remaining 6 chrono::Utc::now() seed sites in this file (run_e2e_locomo_eval, run_locomo_pipeline_eval, *_pipeline_eval, run_context_path_eval*) are not on the CLI eval critical path; punt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
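Computing a LoCoMo question date as "the latest parseable session_date in the sample" amounts to a filter-and-max over the optional date strings. A minimal sketch, assuming dates arrive as Option<String> and parse to Unix seconds; `latest_session_date` and the toy `parse_unix_seconds` parser are hypothetical names, not taken from the PR.

```rust
/// Return the newest timestamp among the session dates that parse.
/// Missing and unparseable dates are skipped rather than treated as errors.
fn latest_session_date(
    dates: &[Option<String>],
    parser: fn(&str) -> Option<i64>,
) -> Option<i64> {
    dates
        .iter()
        .filter_map(|d| d.as_deref()) // drop sessions with no date
        .filter_map(parser) // drop dates that fail to parse
        .max()
}

/// Toy parser for illustration only: accepts a bare integer of Unix seconds.
fn parse_unix_seconds(s: &str) -> Option<i64> {
    s.trim().parse().ok()
}
```

If nothing parses, the result is None, which leaves the caller free to fall back to a date-blind prompt.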

ClaudeCliProvider::generate now does .env_remove("ANTHROPIC_API_KEY") before spawning `claude`. When the env var is set, the CLI routes through the paid API instead of the user's Max-plan OAuth, so an eval that intends to use Max would silently route through API credits and fail with "Credit balance is too low" if the API account has none. The CLI provider's whole purpose is Max OAuth, so always scrub.

Add temporal_smoke_locomo_5q (#[ignore]'d): A/B compares date-aware context (date prefix + question_date system anchor) vs date-blind context on 5 LoCoMo temporal questions, no enrichment, search_limit 30 with diagnostic logging of top-3 retrieval hits. Designed as a fast (~1-2 min) signal check before committing to full eval runs.

Promote to pub: parse_locomo_date, parse_lme_date, seed_last_modified (integration tests in app/ live outside the crate and need pub).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
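The environment scrub described above relies on std::process::Command::env_remove, which affects only the child process. The sketch below builds the command without spawning it and inspects the result with Command::get_envs. The `claude_command` and `scrubs` helpers are hypothetical, and no CLI flags are assumed.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// Build (but do not spawn) a `claude` invocation with the API key removed
/// from the child's environment. The parent process environment is untouched.
fn claude_command(args: &[&str]) -> Command {
    let mut cmd = Command::new("claude");
    cmd.args(args);
    cmd.env_remove("ANTHROPIC_API_KEY");
    cmd
}

/// True if `cmd` explicitly removes `key` from the child's environment.
/// `get_envs` reports a removed variable as `(key, None)`.
fn scrubs(cmd: &Command, key: &str) -> bool {
    cmd.get_envs()
        .any(|(k, v)| k == OsStr::new(key) && v.is_none())
}
```

Scrubbing unconditionally, rather than only when the variable is set, keeps the behaviour identical across developer machines and CI.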

Bug: seeding memories with their real session/event dates made search recency decay (exp(-decay_rate * age_days)) crush the score to ~0 for any content older than a few months. The retrieval diagnostic showed all 50 results returning score=0.000 for benchmark seeds dated 2023 when the test ran in 2026, and the ranking became noise. Same bug would hit any user who imports an old email archive, conversation backfill, or scanned-document set.

Root cause: last_modified was carrying two different concepts: "when did the event happen?" (for display) and "when was this row ingested?" (for staleness). Recency decay correctly penalises stale ingestion, but we were feeding it event time.

Fix (Option A): introduce event_date as a separate optional field on RawDocument and SearchResult. last_modified stays anchored to ingestion / edit time; event_date carries the display time. Date-prefix rendering uses event_date.unwrap_or(last_modified). Recency decay continues to use last_modified, so old-but-just-imported content ranks fresh.

- crates/origin-types: add event_date: Option<i64> to RawDocument and SearchResult. #[serde(default)] keeps origin-mcp wire-compatible.
- crates/origin-core/src/db.rs: migration 44 adds memories.event_date (NULL-able, no backfill; old rows stay None and fall back to last_modified for display). All search SELECTs include c.event_date at column 29; score column moves to 30. row_to_search_result reads Option<i64>. INSERT writes event_date.
- Inline recency-decay match in search_memory replaced with crate::sources::decay_rate(&tier, &ConfidenceConfig::default()) so the canonical decay function is the single source of truth.
- Eval seed sites (locomo, longmemeval, answer_quality, pipeline) now set last_modified=now() and event_date=seed_event_date(...).
- Renamed seed_last_modified -> seed_event_date and changed its return type to Option<i64>: None on parse failure rather than silently falling back to "today" (with a log::warn for visibility).
- Regression test: a chunk with 3-year-old event_date but fresh last_modified must score > 0.001 (pre-fix it was ~1.7e-5).

Verified: retrieval diagnostic on LoCoMo conv-26 (184 obs) after this change shows real scores (0.046–0.079) and the previously-missing target memories now rank within top-50: "support group" rank 50, "sunrise" rank 37, "charity race" rank 33, "camping" rank 1. Pre-fix none of these were retrievable at any K because scores were noise. Search ranking quality at top-K is a separate concern not addressed here, but is now actually addressable, since retrieval is no longer blocked by recency decay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
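The split between event time and ingestion time can be illustrated with a small model of the decay term exp(-decay_rate * age_days). This is a sketch, not the code in search_memory: the `Doc` struct and function names are hypothetical, and the decay rate of 0.01 per day used in the usage note is an assumed value for illustration, not one read from crate::sources::decay_rate.

```rust
const SECS_PER_DAY: f64 = 86_400.0;

/// Minimal stand-in for the two timestamps carried by a document.
struct Doc {
    last_modified: i64,      // ingestion / edit time, Unix seconds
    event_date: Option<i64>, // when the event happened, if known
}

/// exp(-decay_rate * age_days), with ages in the future clamped to zero.
fn recency_factor(timestamp: i64, now: i64, decay_rate: f64) -> f64 {
    let age_days = (now - timestamp).max(0) as f64 / SECS_PER_DAY;
    (-decay_rate * age_days).exp()
}

/// Post-fix behaviour: staleness is judged by ingestion time only.
fn score_multiplier(doc: &Doc, now: i64, decay_rate: f64) -> f64 {
    recency_factor(doc.last_modified, now, decay_rate)
}

/// Date shown in the "On YYYY-MM-DD" prefix: the event date when present,
/// otherwise the ingestion time.
fn display_date(doc: &Doc) -> i64 {
    doc.event_date.unwrap_or(doc.last_modified)
}
```

With the assumed rate of 0.01 per day, a document whose three-year-old event date is fed into `recency_factor` lands far below the 0.001 threshold from the regression test, while `score_multiplier` on a freshly ingested copy of the same document stays at 1.0 and `display_date` still reports the original event date.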

Pre-push caught a missed RawDocument literal in origin-server's batcher tests after the event_date field was added in 8c94e88.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

7xuanlu and others added 13 commits April 27, 2026 21:14

fix: add created_at to SearchResult test fixture

8fb2db3


fix: propagate LoCoMo session dates into seeded chunks

97127d8


fix: add event_date to ingest_batcher test fixture

fd4c9ed


7xuanlu wants to merge 13 commits into main from feature/temporal-metadata
