fix: temporal metadata in search results + eval flow #31
Open
Adds an i64 created_at field to SearchResult and populates it from chunks.created_at in row_to_search_result. Foundation for date filtering and date-aware eval prompts. Existing last_modified semantics unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Followup to the prior commit's struct change — origin-types' own search_result_serializes test still constructs SearchResult literally and needed the new field.
Carries the per-session haystack_dates through LongMemEvalMemory and into RawDocument.last_modified during retrieve_for_accuracy_eval. Adds parse_lme_date helper and round-trip tests. Foundation for date-aware temporal-reasoning prompts (Task 4). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds parse_locomo_date for the dataset's '1:56 pm on 8 May, 2023' format and threads conversation.session_N_date_time through LocomoMemory into RawDocument.last_modified at every seed site. Mirrors Task 2's LME treatment.
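The two dataset date formats named above can be parsed with the standard library alone. The following is a minimal sketch, not the PR's implementation: it assumes both formats are UTC (the PR lists timezone handling as a follow-up), and the helpers days_from_civil, to_unix, and month_number are hypothetical names introduced for illustration.

```rust
/// Days since 1970-01-01 for a Gregorian date (Howard Hinnant's algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y - era * 400;
    let mp = (m + 9) % 12; // March = 0, so the leap day falls at year end
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Unix seconds for a UTC wall-clock time, or None if a field is out of range.
fn to_unix(y: i64, m: i64, d: i64, hh: i64, mm: i64) -> Option<i64> {
    let valid = (1..=12).contains(&m)
        && (1..=31).contains(&d)
        && (0..24).contains(&hh)
        && (0..60).contains(&mm);
    if !valid {
        return None;
    }
    Some(days_from_civil(y, m, d) * 86_400 + hh * 3_600 + mm * 60)
}

/// Maps "May" or "may," style names to 1..=12 using the first three letters.
fn month_number(name: &str) -> Option<i64> {
    const MONTHS: [&str; 12] = [
        "jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec",
    ];
    let lower = name.to_ascii_lowercase();
    let prefix = lower.get(..3)?;
    MONTHS.iter().position(|m| *m == prefix).map(|i| i as i64 + 1)
}

/// Parses the LongMemEval format, e.g. "2023/04/10 (Mon) 23:07" (UTC assumed).
pub fn parse_lme_date(s: &str) -> Option<i64> {
    let mut parts = s.split_whitespace();
    let date = parts.next()?;
    let _weekday = parts.next()?; // "(Mon)" is redundant and ignored
    let time = parts.next()?;
    let mut ymd = date.split('/').map(|p| p.parse::<i64>().ok());
    let (y, m, d) = (ymd.next()??, ymd.next()??, ymd.next()??);
    let (hh, mm) = time.split_once(':')?;
    to_unix(y, m, d, hh.parse().ok()?, mm.parse().ok()?)
}

/// Parses the LoCoMo format, e.g. "1:56 pm on 8 May, 2023" (UTC assumed).
pub fn parse_locomo_date(s: &str) -> Option<i64> {
    let t: Vec<&str> = s.split_whitespace().collect();
    if t.len() != 6 || t[2] != "on" {
        return None;
    }
    let (hh, mm) = t[0].split_once(':')?;
    let hour12: i64 = hh.parse().ok()?;
    if !(1..=12).contains(&hour12) {
        return None;
    }
    // 12 am is hour 0 and 12 pm is hour 12 on a 24-hour clock.
    let hour = match t[1].to_ascii_lowercase().as_str() {
        "am" => hour12 % 12,
        "pm" => hour12 % 12 + 12,
        _ => return None,
    };
    let day: i64 = t[3].parse().ok()?;
    let month = month_number(t[4].trim_end_matches(','))?;
    let year: i64 = t[5].parse().ok()?;
    to_unix(year, month, day, hour, mm.parse().ok()?)
}
```

Both parsers return None rather than panicking on malformed input, which leaves the caller free to decide how a missing date is handled.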
Threads session dates into the 5 remaining RawDocument seed sites in answer_quality.rs and pipeline.rs (covers both LoCoMo and LongMemEval E2E and pipeline runners). Adds eval::shared::format_ymd and rewrites generate_e2e_answers_for_question's context to emit 'On YYYY-MM-DD: ...' lines so the LLM judge can reason about temporal questions. Targets the temporal-reasoning weakness on both benchmarks (LME-TR 42.1%, LoCoMo-temporal 1.6% pre-change). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
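The 'On YYYY-MM-DD: ...' context lines described above could be produced along these lines. This is a sketch under assumptions: timestamps are Unix seconds in UTC, format_ymd takes an i64 (the real signature is not shown in this PR), and civil_from_days and build_dated_context are hypothetical helper names.

```rust
/// Converts days since 1970-01-01 to (year, month, day) using
/// Howard Hinnant's civil_from_days algorithm.
fn civil_from_days(days: i64) -> (i64, i64, i64) {
    let z = days + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097);
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153; // March = 0
    let d = doy - (153 * mp + 2) / 5 + 1;
    let m = if mp < 10 { mp + 3 } else { mp - 9 };
    let y = yoe + era * 400 + i64::from(m <= 2);
    (y, m, d)
}

/// Formats a Unix timestamp (seconds, UTC assumed) as YYYY-MM-DD.
pub fn format_ymd(ts: i64) -> String {
    let (y, m, d) = civil_from_days(ts.div_euclid(86_400));
    format!("{:04}-{:02}-{:02}", y, m, d)
}

/// Renders each (timestamp, content) pair as one "On YYYY-MM-DD: content" line.
pub fn build_dated_context(memories: &[(i64, &str)]) -> String {
    memories
        .iter()
        .map(|(ts, content)| format!("On {}: {}", format_ymd(*ts), content))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Putting the date at the start of every line keeps each memory self-describing, so the model does not need to match a memory against a separate date table.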
…dified

Code-review followup:
- Moves parse_lme_date, parse_locomo_date, and format_ymd to a single dates.rs module instead of three different homes.
- Replaces the 13 verbatim copies of the '.as_deref().and_then(parser).unwrap_or_else(|| now())' chain with one seed_last_modified helper.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
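For illustration, the helper that replaces the repeated chain might look like the sketch below. The parameter types are an assumption (the commit only names the chain), and Memory, now_unix, parse_secs, and seed_for are hypothetical stand-ins so the example is self-contained.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Current time in Unix seconds (stand-in for the crate's own now()).
fn now_unix() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as i64)
        .unwrap_or(0)
}

/// One home for the `.and_then(parser).unwrap_or_else(|| now())` chain:
/// use the parsed session date when available, otherwise fall back to now.
pub fn seed_last_modified(raw: Option<&str>, parser: fn(&str) -> Option<i64>) -> i64 {
    raw.and_then(parser).unwrap_or_else(now_unix)
}

/// Hypothetical toy parser: treats the string as Unix seconds.
fn parse_secs(s: &str) -> Option<i64> {
    s.trim().parse().ok()
}

/// Hypothetical memory record carrying an optional raw session date.
pub struct Memory {
    pub session_date: Option<String>,
}

/// A seed site then collapses to a single call.
pub fn seed_for(mem: &Memory) -> i64 {
    seed_last_modified(mem.session_date.as_deref(), parse_secs)
}
```

Passing the parser as a function pointer lets the same helper serve both datasets, with each call site choosing its own date format.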
Two adversarial-review followups before merge:
1. SearchResult was missing a Default derive; downstream MIT crates
(origin-mcp) construct it field-by-field and would break when
origin-types 0.1.5 publishes with the new created_at i64. Adding
the derive lets them spread defaults.
2. LongMemEval per-question dates (sample.question_date) were not
reaching the LLM. The seeded memories now carry session dates
thanks to earlier commits, but without a 'today' anchor the LLM
cannot ground relative phrases ('yesterday', 'a week ago') in
the question. generate_e2e_answers_for_question now accepts
Option<&str> and prepends 'The question was asked on X.' to the
system prompt. LoCoMo passes None (no per-question date in dataset).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
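The first followup relies on Rust's struct update syntax. A minimal sketch, assuming a cut-down SearchResult whose content and score fields are hypothetical (only created_at and last_modified are named in this PR):

```rust
/// Reduced stand-in for origin-types' SearchResult; `content` and `score`
/// are hypothetical fields used only to make the example concrete.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct SearchResult {
    pub content: String,
    pub score: f32,
    pub last_modified: i64,
    pub created_at: i64,
}

/// A downstream crate names only the fields it cares about and spreads
/// defaults for the rest, so a newly added field does not break the build.
pub fn downstream_result(content: &str, score: f32) -> SearchResult {
    SearchResult {
        content: content.to_string(),
        score,
        ..Default::default()
    }
}
```

Without the derive, every struct literal in a consumer has to list every field, so each new field is a compile error for them.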
Mirror the LoCoMo _api pattern for LongMemEval: ClaudeCliProvider::haiku() for answers (no API key, uses Max plan via OAuth) and judge_with_claude_model 'haiku' for judging. Same answer/judge model on both sides keeps LME and LoCoMo numbers comparable.

Both new test entry points exercise the dated-context formatter and the question_date system-prompt anchor introduced earlier in this branch:
- generate_e2e_context_tuples_longmemeval_api
- judge_e2e_context_longmemeval_api_haiku
Adversarial review caught that PR #29's new run_fullpipeline_locomo_batch and run_fullpipeline_lme_batch seeded all memories with chrono::Utc::now() and used a date-blind system prompt, silently nullifying the temporal metadata work for the fullpipeline path (every memory would have been prefixed "On 2026-04-27" regardless of original session date).

Fixes:
- Extract build_e2e_system_prompt(question_date) helper; refactor generate_e2e_answers_for_question to use it.
- Thread mem.session_date through seed_last_modified at both batch seed sites (LoCoMo: parse_locomo_date, LME: parse_lme_date).
- Thread question_date into batch system prompts: LoCoMo uses the latest session_date in the sample (questions follow the conversation); LME uses sample.question_date directly.
- seed_last_modified now log::warn!s on parse failures so future silent regressions to now() are visible in eval logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
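A sketch of the extracted prompt helper, assuming it returns a String. The 'The question was asked on X.' prefix comes from the commit messages; the base prompt wording (BASE_PROMPT) is a hypothetical placeholder, not the project's actual prompt.

```rust
/// Hypothetical base instructions; the real prompt text is not shown in this PR.
const BASE_PROMPT: &str = "Answer the question using only the dated context lines provided.";

/// Prepends a "today" anchor when the dataset supplies a per-question date,
/// so relative phrases such as "yesterday" can be resolved.
pub fn build_e2e_system_prompt(question_date: Option<&str>) -> String {
    match question_date {
        Some(date) => format!("The question was asked on {}. {}", date, BASE_PROMPT),
        None => BASE_PROMPT.to_string(),
    }
}
```

With an Option parameter, callers without a per-question date fall back to the unanchored prompt instead of inventing one.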
Without this, the Claude CLI eval entry points (generate_e2e_context_tuples_locomo_api, _longmemeval_api) call a duplicate generate_e2e_answers_for_question in token_efficiency.rs that predates the temporal work: no date prefixes, no question_date in the system prompt, seeds with chrono::Utc::now(). The CLI eval would have shown zero lift on temporal categories regardless of how the rest of the pipeline handles dates.

- Delete the duplicate at token_efficiency.rs:4472-4572.
- Make crate::eval::answer_quality::generate_e2e_answers_for_question pub(crate) and route both callers through it.
- run_e2e_context_eval (LoCoMo CLI path): seed via seed_last_modified with parse_locomo_date; compute sample_question_date as the latest parseable session_date in the sample (questions follow the conversation in LoCoMo); pass through to the answer call.
- run_e2e_context_eval_longmemeval (LME CLI path): seed via seed_last_modified with parse_lme_date; pass sample.question_date directly to the answer call.

The remaining 6 chrono::Utc::now() seed sites in this file (run_e2e_locomo_eval, run_locomo_pipeline_eval, *_pipeline_eval, run_context_path_eval*) are not on the CLI eval critical path; punt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
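The "latest parseable session_date" rule for LoCoMo can be expressed as a filter-and-max over the sample's raw date strings. A minimal sketch: latest_session_date is a hypothetical name, and parse_secs is a toy parser standing in for parse_locomo_date.

```rust
/// Returns the newest timestamp among the session dates that parse,
/// skipping unparseable entries; None if nothing parses.
pub fn latest_session_date<'a>(
    session_dates: impl IntoIterator<Item = &'a str>,
    parser: fn(&str) -> Option<i64>,
) -> Option<i64> {
    session_dates.into_iter().filter_map(parser).max()
}

/// Hypothetical toy parser: treats the string as Unix seconds.
fn parse_secs(s: &str) -> Option<i64> {
    s.trim().parse().ok()
}
```

Skipping bad entries instead of aborting means one malformed session date does not cost the whole sample its question-date anchor.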
ClaudeCliProvider::generate now does .env_remove("ANTHROPIC_API_KEY")
before spawning `claude`. When the env var is set, the CLI routes
through the paid API instead of the user's Max-plan OAuth, so an eval
that intends to use Max would silently route through API credits and
fail with "Credit balance is too low" if the API account has none.
The CLI provider's whole purpose is Max OAuth, so always scrub.
Add temporal_smoke_locomo_5q (#[ignore]'d) — A/B compares date-aware
context (date prefix + question_date system anchor) vs date-blind
context on 5 LoCoMo temporal questions, no enrichment, search_limit
30 with diagnostic logging of top-3 retrieval hits. Designed as a
fast (~1-2 min) signal check before committing to full eval runs.
Promote to pub: parse_locomo_date, parse_lme_date, seed_last_modified
(integration tests in app/ live outside the crate and need pub).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
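The environment scrub described in this commit maps onto std::process::Command::env_remove. The sketch below only builds the command. The function name claude_cli_command is hypothetical, and the CLI arguments are omitted because the provider's actual flags are not shown in this PR.

```rust
use std::process::Command;

/// Builds the `claude` invocation with ANTHROPIC_API_KEY removed from the
/// child's environment, so the CLI cannot fall back to paid API credits.
/// Arguments are intentionally left out of this sketch.
pub fn claude_cli_command() -> Command {
    let mut cmd = Command::new("claude");
    // Affects only the spawned child; the parent process keeps its variable.
    cmd.env_remove("ANTHROPIC_API_KEY");
    cmd
}
```

Removing the variable unconditionally, rather than only when it is set, keeps the spawn path identical whether or not the developer's shell exports a key.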
Bug: seeding memories with their real session/event dates made search recency decay (exp(-decay_rate * age_days)) crush the score to ~0 for any content older than a few months. The retrieval diagnostic showed all 50 results returning score=0.000 for benchmark seeds dated 2023 when the test ran in 2026, and the ranking became noise. Same bug would hit any user who imports an old email archive, conversation backfill, or scanned-document set.

Root cause: last_modified was carrying two different concepts: "when did the event happen?" (for display) and "when was this row ingested?" (for staleness). Recency decay correctly penalises stale ingestion, but we were feeding it event time.

Fix (Option A): introduce event_date as a separate optional field on RawDocument and SearchResult. last_modified stays anchored to ingestion / edit time; event_date carries the display time. Date-prefix rendering uses event_date.unwrap_or(last_modified). Recency decay continues to use last_modified, so old-but-just-imported content ranks fresh.

- crates/origin-types: add event_date: Option<i64> to RawDocument and SearchResult. #[serde(default)] keeps origin-mcp wire-compatible.
- crates/origin-core/src/db.rs: migration 44 adds memories.event_date (NULL-able, no backfill; old rows stay None and fall back to last_modified for display). All search SELECTs include c.event_date at column 29; score column moves to 30. row_to_search_result reads Option<i64>. INSERT writes event_date.
- Inline recency-decay match in search_memory replaced with crate::sources::decay_rate(&tier, &ConfidenceConfig::default()) so the canonical decay function is the single source of truth.
- Eval seed sites (locomo, longmemeval, answer_quality, pipeline) now set last_modified=now() and event_date=seed_event_date(...).
- Renamed seed_last_modified -> seed_event_date and changed its return type to Option<i64>: None on parse failure rather than silently falling back to "today" (with a log::warn for visibility).
- Regression test: a chunk with 3-year-old event_date but fresh last_modified must score > 0.001 (pre-fix it was ~1.7e-5).

Verified: retrieval diagnostic on LoCoMo conv-26 (184 obs) after this change shows real scores (0.046–0.079) and the previously-missing target memories now rank within top-50: "support group" rank 50, "sunrise" rank 37, "charity race" rank 33, "camping" rank 1. Pre-fix none of these were retrievable at any K because scores were noise.

Search ranking quality at top-K is a separate concern not addressed here, but is now actually addressable, since retrieval is no longer blocked by recency decay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
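The split between ingestion time and event time can be illustrated with the decay formula quoted in this commit, exp(-decay_rate * age_days). This is a sketch under assumptions: Doc, recency_factor, and display_date are hypothetical names, and the decay rate of 0.01 per day used in the note below is an illustrative value, not the output of the project's decay_rate function.

```rust
const SECS_PER_DAY: f64 = 86_400.0;

/// Hypothetical reduced document: ingestion time plus optional event time.
pub struct Doc {
    pub last_modified: i64,
    pub event_date: Option<i64>,
}

/// Recency multiplier exp(-decay_rate * age_days), driven by ingestion time only.
pub fn recency_factor(last_modified: i64, now: i64, decay_rate: f64) -> f64 {
    let age_days = (now - last_modified).max(0) as f64 / SECS_PER_DAY;
    (-decay_rate * age_days).exp()
}

/// Date shown to the LLM: the event time when known, else the ingestion time.
pub fn display_date(doc: &Doc) -> i64 {
    doc.event_date.unwrap_or(doc.last_modified)
}
```

With the illustrative rate of 0.01 per day, a document whose last_modified is three years old gets a multiplier below 0.001, while the same document ingested today with a three-year-old event_date keeps a multiplier of 1.0 and still displays its original date.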
Pre-push caught a missed RawDocument literal in origin-server's batcher tests after the event_date field was added in 8c94e88. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Threads benchmark session dates through Origin's eval pipeline so the LLM-judged accuracy harness can reason about temporal questions. LongMemEval-TR was 42.1%, LoCoMo-temporal was 1.6% — the worst categories on each benchmark.
Rebased onto main after PR #29 merged, and the adversarial review specifically targeted that integration. It caught two critical bugs that are now fixed in da2f125c:
- run_fullpipeline_locomo_batch and run_fullpipeline_lme_batch seeded memories with chrono::Utc::now(), silently nullifying the temporal work for the fullpipeline path. Now seed via seed_last_modified(mem.session_date.as_deref(), parser).
- The batch system prompts were date-blind. They now use build_e2e_system_prompt(question_date): LoCoMo passes the latest session_date in the sample (questions follow the conversation); LME passes sample.question_date directly.

seed_last_modified now log::warn!s on parse failures, so a future regression to now() is visible in eval logs instead of silent.

What's in this PR
- SearchResult exposes created_at: i64 populated from chunks.created_at. #[derive(Default)] added so downstream MIT consumers (origin-mcp) can spread struct literals safely.
- crates/origin-core/src/eval/dates.rs: parse_lme_date ("2023/04/10 (Mon) 23:07"), parse_locomo_date ("1:56 pm on 8 May, 2023"), format_ymd, and a seed_last_modified helper that DRYs all the seed-site copies.
- LongMemEvalMemory and LocomoMemory carry session_date: Option<String> populated from haystack_dates[i] and conversation.session_N_date_time respectively.
- Seed sites call seed_last_modified(mem.session_date.as_deref(), parser) instead of now().
- generate_e2e_answers_for_question and both fullpipeline batch paths build context as "On YYYY-MM-DD: <content>" lines and use build_e2e_system_prompt(question_date) for the date anchor.
- LongMemEval Claude CLI entry points (generate_e2e_context_tuples_longmemeval_api + judge_e2e_context_longmemeval_api_haiku) so the temporal lift can be measured via Max-plan CLI without API key cost.

Test plan
- cargo test --workspace --lib: 968 lib tests pass, 21 ignored
- cargo clippy --workspace --all-targets -- -D warnings: clean
- cargo test -p origin --test eval_harness generate_e2e_context_tuples_locomo_api -- --ignored then judge_e2e_context_locomo_api_sonnet: measure LoCoMo-temporal lift via Claude CLI
- *_longmemeval_api + *_haiku judge: measure LongMemEval-TR lift

Out of scope (follow-up)
- judge::lme_answer_prompt into generate_e2e_answers_for_question for full per-task-type branches.
- parse_lme_date/parse_locomo_date if dataset spec says non-UTC.
- run_e2e_locomo_eval (used only in retrieval/token_efficiency tests) still has no date prefix. Not a blocker for this branch's goal.

Closes/replaces #28 (auto-closed when its base feature/fullpipeline-eval was deleted on PR #29 merge).

🤖 Generated with Claude Code