Avoid blocking startup on corpus embedding rebuild #28

Open
RaviTharuma wants to merge 2 commits into AVIDS2:main from RaviTharuma:fix/nonblocking-startup-index
Conversation

@RaviTharuma
Contributor

Summary

This changes Memorix startup and hot-reload to prepare the lexical search index without blocking on a corpus-wide embedding rebuild.

Fixes #27.

Problem

createMemorixServer() used the heavy reindexObservations() path during normal startup and file-watch reloads.

On large projects with API embeddings enabled, that meant MCP availability depended on:

  • full corpus embedding generation
  • provider retry loops
  • provider throughput / rate limits

That is too expensive for startup. Clients need the MCP server and lexical search to come up first.

What changed

  • added prepareSearchIndex() in src/memory/observations.ts
  • startup and hot reload now call prepareSearchIndex() instead of the heavy rebuild path
  • prepareSearchIndex() hydrates the lexical/BM25 index and skips batch embedding generation
  • active observations are queued for the existing background vector backfill when embeddings are enabled
  • added regression coverage in tests/memory/prepare-search-index.test.ts
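
The split described above can be sketched roughly as follows. This is a minimal illustration, not the actual Memorix implementation: the index and queue shapes (lexicalIndex, vectorBackfillQueue) and the Observation fields are stand-ins, and only the name prepareSearchIndex comes from the PR.

```typescript
type Status = "active" | "resolved" | "archived";

interface Observation {
  id: string;
  text: string;
  status?: Status;
}

// Stand-ins for the real lexical/BM25 index and the background backfill queue.
const lexicalIndex = new Map<string, Observation>();
const vectorBackfillQueue: string[] = [];

function prepareSearchIndex(
  observations: Observation[],
  embeddingsEnabled: boolean,
): void {
  for (const obs of observations) {
    // Hydrate the lexical index synchronously: cheap, no provider calls,
    // so MCP and fulltext search come up immediately.
    lexicalIndex.set(obs.id, obs);

    // Defer embedding generation: queue active observations for the
    // existing async vector backfill instead of blocking startup on a
    // corpus-wide rebuild.
    if (embeddingsEnabled && (obs.status ?? "active") === "active") {
      vectorBackfillQueue.push(obs.id);
    }
  }
}
```

The key property is that nothing in the startup path awaits an embedding provider; vectors converge later via the backfill worker.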

Why this approach

This keeps the heavy rebuild path available for explicit reindexing while removing startup's dependency on embedding-provider latency.

That makes the behavior provider-agnostic:

  • startup no longer blocks on API embedding throughput
  • lexical/fulltext search becomes available immediately
  • vectors can still converge via the existing async backfill path

Verification

  • npm test -- tests/memory/prepare-search-index.test.ts tests/memory/reindex-embeddings.test.ts
  • npx vitest run tests/integration/release-blockers.test.ts
  • bunx tsc --noEmit
  • npm run build

Notes

Running the full npm test suite from a fork checkout still surfaces two pre-existing git-remote expectation failures (tests/git/extractor.test.ts, tests/git/subdir-scan.test.ts), which assert the upstream repo id instead of the current fork remote. This PR does not change that behavior.

Owner

AVIDS2 commented Mar 28, 2026

I reviewed this one manually as well. The startup direction is good and I agree with the goal: MCP availability should not be blocked on a corpus-wide embedding rebuild.

I do think there is one issue to address before merge:

prepareSearchIndex() currently relies on hydrateIndex(), and hydrateIndex() only inserts observations whose status is active. That means startup/hot-reload would bring up an index that is missing resolved / archived observations entirely, so queries that explicitly ask for status=resolved or status=all would no longer be faithful after startup.

So my current read is:

  • direction: yes
  • worth merging: yes
  • but this needs one small correction first so startup hydration preserves the expected searchable corpus shape

Once that is fixed and rebased on the latest main, I’m happy to take another pass quickly.

Owner

AVIDS2 commented Mar 29, 2026

Thanks for iterating on this — the startup direction still looks right, and I agree with the goal of getting lexical/BM25 search available before a full embedding rebuild finishes.

One blocker is still present though: prepareSearchIndex() currently hydrates through hydrateIndex(), and hydrateIndex() still only inserts observations whose status is active.

That means after startup / hot reload, resolved and archived observations are still missing from the in-memory index, so searches using status=resolved or status=all will be incomplete even though the persisted corpus contains them.

So from the maintainer side, I’m not comfortable merging this yet as-is. If you can fix that part and rebase on the latest main (which now includes #29), this should be in much better shape.

Remove the active-only filter in hydrateIndex() so resolved and
archived observations are indexed at startup. Status filtering
belongs at query time (searchObservations), not index time.

Add 4-case hydrate-index test covering:
- indexes active + resolved + archived observations
- stores status field faithfully for per-status queries
- skips malformed observations gracefully
- idempotent re-hydration is a no-op
@RaviTharuma force-pushed the fix/nonblocking-startup-index branch from 0e2cd3c to f8e7315 on March 29, 2026 at 22:06
@RaviTharuma
Contributor Author

Rebased on current main (including #29) and added the hydration fix you requested:

What changed:

  • Removed the if ((obs.status ?? 'active') !== 'active') continue; filter in hydrateIndex() so resolved/archived observations are indexed at startup. Status filtering stays where it belongs — at query time in searchObservations().
  • Added tests/store/hydrate-index.test.ts with 4 test cases:
    1. Indexes active + resolved + archived observations (all statuses)
    2. Stores status field faithfully — per-status queries return correct results
    3. Skips malformed observations gracefully (no crash)
    4. Idempotent re-hydration is a no-op (calling twice doesn't duplicate)
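
The "filter at query time, not index time" principle can be illustrated with a small sketch. All of this is simplified: hydrateIndex and searchObservations exist in the codebase, but the signatures, the Map-based index, and the substring match below are illustrative assumptions, not the real implementations.

```typescript
type Status = "active" | "resolved" | "archived";

interface Observation {
  id: string;
  text: string;
  status?: Status;
}

// Index time: insert every well-formed observation regardless of status,
// so resolved/archived memories are searchable right after startup.
function hydrateIndex(observations: Observation[]): Map<string, Observation> {
  const index = new Map<string, Observation>();
  for (const obs of observations) {
    if (!obs.id || typeof obs.text !== "string") continue; // skip malformed entries
    index.set(obs.id, obs); // note: no status filter here
  }
  return index;
}

// Query time: the status filter is applied against the full index,
// so status=resolved and status=all stay faithful to the persisted corpus.
function searchObservations(
  index: Map<string, Observation>,
  query: string,
  status: Status | "all" = "active",
): Observation[] {
  return [...index.values()].filter(
    (obs) =>
      obs.text.includes(query) &&
      (status === "all" || (obs.status ?? "active") === status),
  );
}
```

Re-hydration is naturally idempotent here because Map.set overwrites by id rather than appending duplicates.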

All 4 hydration tests pass. Full suite shows no regressions (883 pass, same baseline).

@AVIDS2
Copy link
Copy Markdown
Owner

AVIDS2 commented Mar 30, 2026

Thanks for the update — the startup direction is still the right one, and the active-only hydration issue I called out earlier is fixed.

I do still see one follow-up problem before I'd merge this:

prepareSearchIndex() now queues only active observation IDs into vectorMissingIds. Because persisted observations do not carry stored embeddings, startup hydration rebuilds the lexical index for all statuses, but only active observations are eligible for async vector backfill afterwards. That means status=resolved / status=all queries can become permanently asymmetric after restart: resolved/archived memories remain lexical-only while active memories regain hybrid/vector behavior.

So from the maintainer side my current read is:

  • startup availability: improved
  • active-only lexical index regression: fixed
  • remaining blocker: non-active observations still miss post-startup vector recovery

If you want to keep the lexical-first startup model, I think this needs one more pass so the backfill policy for non-active observations is explicit rather than an accidental regression.

Development

Successfully merging this pull request may close these issues.

Startup should not block MCP availability on full embedding rebuild