Avoid blocking startup on corpus embedding rebuild #28

Open
RaviTharuma wants to merge 2 commits into AVIDS2:main from RaviTharuma:fix/nonblocking-startup-index
Conversation

@RaviTharuma
Contributor

Summary

This changes Memorix startup and hot-reload to prepare the lexical search index without blocking on a corpus-wide embedding rebuild.

Fixes #27.

Problem

createMemorixServer() used the heavy reindexObservations() path during normal startup and file-watch reloads.

On large projects with API embeddings enabled, that meant MCP availability depended on:

  • full corpus embedding generation
  • provider retry loops
  • provider throughput / rate limits

That is too expensive for startup. Clients need the MCP server and lexical search to come up first.

What changed

  • added prepareSearchIndex() in src/memory/observations.ts
  • startup and hot reload now call prepareSearchIndex() instead of the heavy rebuild path
  • prepareSearchIndex() hydrates the lexical/BM25 index and skips batch embedding generation
  • active observations are queued for the existing background vector backfill when embeddings are enabled
  • added regression coverage in tests/memory/prepare-search-index.test.ts
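
The split described above can be sketched roughly as follows. This is a minimal illustration, not the actual Memorix implementation: the index and queue shapes (lexicalIndex, vectorBackfillQueue) and the Observation fields are stand-ins, and only the name prepareSearchIndex comes from the PR.

```typescript
type Status = "active" | "resolved" | "archived";

interface Observation {
  id: string;
  text: string;
  status?: Status;
}

// Stand-ins for the real lexical/BM25 index and the background backfill queue.
const lexicalIndex = new Map<string, Observation>();
const vectorBackfillQueue: string[] = [];

function prepareSearchIndex(
  observations: Observation[],
  embeddingsEnabled: boolean,
): void {
  for (const obs of observations) {
    // Hydrate the lexical index synchronously: cheap, no provider calls,
    // so MCP and fulltext search come up immediately.
    lexicalIndex.set(obs.id, obs);

    // Defer embedding generation: queue active observations for the
    // existing async vector backfill instead of blocking startup on a
    // corpus-wide rebuild.
    if (embeddingsEnabled && (obs.status ?? "active") === "active") {
      vectorBackfillQueue.push(obs.id);
    }
  }
}
```

The key property is that nothing in the startup path awaits an embedding provider; vectors converge later via the backfill worker.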

Why this approach

This keeps the heavy rebuild path available for explicit reindexing while removing startup's dependency on embedding-provider latency.

That makes the behavior provider-agnostic:

  • startup no longer blocks on API embedding throughput
  • lexical/fulltext search becomes available immediately
  • vectors can still converge via the existing async backfill path

Verification

  • npm test -- tests/memory/prepare-search-index.test.ts tests/memory/reindex-embeddings.test.ts
  • npx vitest run tests/integration/release-blockers.test.ts
  • bunx tsc --noEmit
  • npm run build

Notes

Running the full npm test suite from a fork checkout still surfaces two pre-existing git-remote expectation failures (tests/git/extractor.test.ts, tests/git/subdir-scan.test.ts), which assert the upstream repo id instead of the current fork remote. This PR does not change that behavior.

Owner

AVIDS2 commented Mar 28, 2026

I reviewed this one manually as well. The startup direction is good and I agree with the goal: MCP availability should not be blocked on a corpus-wide embedding rebuild.

I do think there is one issue to address before merge:

prepareSearchIndex() currently relies on hydrateIndex(), and hydrateIndex() only inserts observations whose status is active. That means startup/hot-reload would bring up an index that is missing resolved / archived observations entirely, so queries that explicitly ask for status=resolved or status=all would no longer be faithful after startup.

So my current read is:

  • direction: yes
  • worth merging: yes
  • but this needs one small correction first so startup hydration preserves the expected searchable corpus shape

Once that is fixed and rebased on the latest main, I’m happy to take another pass quickly.

Owner

AVIDS2 commented Mar 29, 2026

Thanks for iterating on this — the startup direction still looks right, and I agree with the goal of getting lexical/BM25 search available before a full embedding rebuild finishes.

One blocker is still present though: prepareSearchIndex() currently hydrates through hydrateIndex(), and hydrateIndex() still only inserts observations whose status is active.

That means after startup / hot reload, resolved and archived observations are still missing from the in-memory index, so searches using status=resolved or status=all will be incomplete even though the persisted corpus contains them.

So from the maintainer side, I’m not comfortable merging this yet as-is. If you can fix that part and rebase on the latest main (which now includes #29), this should be in much better shape.

Remove the active-only filter in hydrateIndex() so resolved and
archived observations are indexed at startup. Status filtering
belongs at query time (searchObservations), not index time.

Add 4-case hydrate-index test covering:
- indexes active + resolved + archived observations
- stores status field faithfully for per-status queries
- skips malformed observations gracefully
- idempotent re-hydration is a no-op
@RaviTharuma force-pushed the fix/nonblocking-startup-index branch from 0e2cd3c to f8e7315 on March 29, 2026 at 22:06
@RaviTharuma
Contributor Author

Rebased on current main (including #29) and added the hydration fix you requested:

What changed:

  • Removed the if ((obs.status ?? 'active') !== 'active') continue; filter in hydrateIndex() so resolved/archived observations are indexed at startup. Status filtering stays where it belongs — at query time in searchObservations().
  • Added tests/store/hydrate-index.test.ts with 4 test cases:
    1. Indexes active + resolved + archived observations (all statuses)
    2. Stores status field faithfully — per-status queries return correct results
    3. Skips malformed observations gracefully (no crash)
    4. Idempotent re-hydration is a no-op (calling twice doesn't duplicate)
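
The "filter at query time, not index time" principle can be illustrated with a small sketch. All of this is simplified: hydrateIndex and searchObservations exist in the codebase, but the signatures, the Map-based index, and the substring match below are illustrative assumptions, not the real implementations.

```typescript
type Status = "active" | "resolved" | "archived";

interface Observation {
  id: string;
  text: string;
  status?: Status;
}

// Index time: insert every well-formed observation regardless of status,
// so resolved/archived memories are searchable right after startup.
function hydrateIndex(observations: Observation[]): Map<string, Observation> {
  const index = new Map<string, Observation>();
  for (const obs of observations) {
    if (!obs.id || typeof obs.text !== "string") continue; // skip malformed entries
    index.set(obs.id, obs); // note: no status filter here
  }
  return index;
}

// Query time: the status filter is applied against the full index,
// so status=resolved and status=all stay faithful to the persisted corpus.
function searchObservations(
  index: Map<string, Observation>,
  query: string,
  status: Status | "all" = "active",
): Observation[] {
  return [...index.values()].filter(
    (obs) =>
      obs.text.includes(query) &&
      (status === "all" || (obs.status ?? "active") === status),
  );
}
```

Re-hydration is naturally idempotent here because Map.set overwrites by id rather than appending duplicates.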

All 4 hydration tests pass. Full suite shows no regressions (883 pass, same baseline).

@AVIDS2
Copy link
Copy Markdown
Owner

AVIDS2 commented Mar 30, 2026

Thanks for the update — the startup direction is still the right one, and the active-only hydration issue I called out earlier is fixed.

I do still see one follow-up problem before I'd merge this:

prepareSearchIndex() now queues only active observation IDs into vectorMissingIds. Because persisted observations do not carry stored embeddings, startup hydration rebuilds the lexical index for all statuses, but only active observations are eligible for async vector backfill afterwards. That means status=resolved / status=all queries can become permanently asymmetric after restart: resolved/archived memories remain lexical-only while active memories regain hybrid/vector behavior.

So from the maintainer side my current read is:

  • startup availability: improved
  • active-only lexical index regression: fixed
  • remaining blocker: non-active observations still miss post-startup vector recovery

If you want to keep the lexical-first startup model, I think this needs one more pass so the backfill policy for non-active observations is explicit rather than an accidental regression.

Development

Successfully merging this pull request may close these issues.

Startup should not block MCP availability on full embedding rebuild