From 73975ad59344fc82204125dabc5120dc5695898b Mon Sep 17 00:00:00 2001 From: Tyler Eveland Date: Sun, 29 Mar 2026 22:21:49 -0500 Subject: [PATCH 1/3] fix: harden Audrey lifecycle and recall diagnostics --- README.md | 1 + docs/plans/roadmap-status-2026-03-29.md | 47 ++++++++++++++++++++++ docs/production-readiness.md | 1 + mcp-server/index.js | 40 +++++++++++++++---- mcp-server/serve.js | 25 +++++++++--- src/audrey.js | 7 ++++ src/fts.js | 6 +-- src/recall.js | 53 ++++++++++++++----------- tests/audrey.test.js | 18 +++++++++ tests/mcp-server.test.js | 28 +++++++++++++ tests/multi-agent.test.js | 13 ++++++ tests/recall.test.js | 12 ++++++ tests/serve.test.js | 31 +++++++++++++++ types/index.d.ts | 11 ++++- 14 files changed, 252 insertions(+), 41 deletions(-) create mode 100644 docs/plans/roadmap-status-2026-03-29.md diff --git a/README.md b/README.md index 9502f97..79e062e 100644 --- a/README.md +++ b/README.md @@ -110,6 +110,7 @@ const memories = await brain.recall('stripe rate limits', { const dream = await brain.dream(); const briefing = await brain.greeting({ context: 'debugging stripe' }); +await brain.waitForIdle(); brain.close(); ``` diff --git a/docs/plans/roadmap-status-2026-03-29.md b/docs/plans/roadmap-status-2026-03-29.md new file mode 100644 index 0000000..4445251 --- /dev/null +++ b/docs/plans/roadmap-status-2026-03-29.md @@ -0,0 +1,47 @@ +# Audrey Roadmap Status - 2026-03-29 + +This note replaces stale assumptions from the earlier `codex.md` roadmap with the current repo state. + +## Current State + +- Multi-agent memory is already shipped. +- FTS-backed keyword search and hybrid retrieval are already shipped. +- TypeScript declarations are already shipped. +- REST API, dashboard, hooks integration, benchmarking, and CI are already shipped. + +The roadmap should no longer treat those as future phases. The highest-value work now is production correctness, operator clarity, and benchmark credibility. 
+
+## Phase 0 Re-Evaluation
+
+Original bug list status:
+
+- `encode()` background work was tracked via `_pending`, but server and CLI shutdown paths still did not wait for that work to finish.
+- `importMemories` snapshot validation is already in place.
+- `recall()` degraded gracefully, but failure metadata was still too quiet for REST operators.
+- Consolidation no longer uses raw `BEGIN IMMEDIATE`; it already uses `better-sqlite3` transactions.
+- `parseBody` already guards against double-settle behavior.
+
+## This Pass
+
+- Added `Audrey.waitForIdle()` so production callers can drain tracked background work before shutdown or restore.
+- Updated REST restore and process shutdown flows to wait for idle work before closing the database.
+- Exposed `partialFailure` and `errors` on recall results and surfaced that metadata through the REST API.
+- Fixed FTS keyword-search agent attribution so keyword-only multi-agent recall preserves the correct agent namespace.
+- Added regression coverage for lifecycle draining, shutdown waiting, recall partial failures, and keyword-only multi-agent attribution.
+
+## Recommended Next Passes
+
+1. Clean the public docs and roadmap copy.
+   The current README and some planning docs still contain mojibake artifacts that hurt first contact.
+
+2. Make benchmark claims externally reproducible.
+   Add first-party LoCoMo and LongMemEval adapters under `memorybench` or fold them into this repo in a reproducible way.
+
+3. Tighten restore and import contracts.
+   Add explicit schema validation for snapshot versions and optional fields, then test malformed snapshots more aggressively.
+
+4. Improve operational visibility.
+   Add structured request logging and request IDs to the REST server, then expose recall failure counts in `/analytics`.
+
+5. Harden the SDK shutdown story.
+   Decide whether `close()` itself should eventually become async, or whether `waitForIdle()` remains the explicit graceful-shutdown contract.
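The drain-before-close contract described in the note above can be sketched in isolation. This is a minimal stand-in under stated assumptions, not the code from `src/audrey.js`: `PendingTracker`, `trackAsync`, and `shutdownGracefully` are hypothetical names, and only the `waitForIdle()` loop mirrors what this patch adds.

```javascript
// Hypothetical stand-in for the tracking pattern; not the real Audrey class.
class PendingTracker {
  constructor() {
    this._pending = new Set();
    this.closed = false;
  }

  // Register background work and forget it once it settles, whether it
  // fulfilled or rejected. Using then(forget, forget) avoids creating a
  // derived promise that could reject unhandled.
  trackAsync(promise) {
    this._pending.add(promise);
    const forget = () => this._pending.delete(promise);
    promise.then(forget, forget);
    return promise;
  }

  // Loop instead of awaiting once: work that settles may have tracked
  // follow-up work while we were waiting.
  async waitForIdle() {
    while (this._pending.size > 0) {
      await Promise.allSettled([...this._pending]);
    }
  }

  close() {
    this.closed = true;
  }
}

// Graceful shutdown: drain tracked work first, then close.
async function shutdownGracefully(tracker) {
  await tracker.waitForIdle();
  tracker.close();
}
```

Because `Promise.allSettled` never rejects, a failed background task does not abort the drain; whether shutdown should instead surface such failures is a design choice this sketch leaves open.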
diff --git a/docs/production-readiness.md b/docs/production-readiness.md index d3c737b..009b1f2 100644 --- a/docs/production-readiness.md +++ b/docs/production-readiness.md @@ -62,6 +62,7 @@ Guardrails: 8. Keep API keys, bearer tokens, and raw credentials out of encoded memory content. 9. Decide whether `private` memories are allowed for your use case and document who can create them. 10. Add application-level encryption, access control, logging, and retention policies around Audrey. +11. On graceful shutdown paths, call `await brain.waitForIdle()` before `brain.close()` so tracked background work drains cleanly. ## Operations Commands diff --git a/mcp-server/index.js b/mcp-server/index.js index beaa2c5..8c66e8e 100644 --- a/mcp-server/index.js +++ b/mcp-server/index.js @@ -49,6 +49,13 @@ export async function initializeEmbeddingProvider(provider) { } } +async function closeAudreyGracefully(audrey) { + if (audrey && typeof audrey.waitForIdle === 'function') { + await audrey.waitForIdle(); + } + audrey?.close(); +} + export const memoryEncodeToolSchema = { content: z.string() .max(MAX_MEMORY_CONTENT_LENGTH) @@ -122,7 +129,7 @@ async function reembed() { const counts = await reembedAll(audrey.db, audrey.embeddingProvider, { dropAndRecreate: dimensionsChanged }); console.log(`Done. 
Re-embedded: ${counts.episodes} episodes, ${counts.semantics} semantics, ${counts.procedures} procedures`); } finally { - audrey.close(); + await closeAudreyGracefully(audrey); } } @@ -170,7 +177,7 @@ async function dream() { ); console.log('[audrey] Dream complete.'); } finally { - audrey.close(); + await closeAudreyGracefully(audrey); } } @@ -276,7 +283,7 @@ async function greeting() { console.log(lines.join('\n')); } finally { - audrey.close(); + await closeAudreyGracefully(audrey); } } @@ -352,7 +359,7 @@ async function reflect() { ); console.log('[audrey] Dream complete.'); } finally { - audrey.close(); + await closeAudreyGracefully(audrey); } } @@ -430,7 +437,7 @@ async function recall() { console.log(JSON.stringify(output)); } finally { - audrey.close(); + await closeAudreyGracefully(audrey); } } @@ -622,7 +629,7 @@ async function snapshot() { console.log(''); console.log('To restore: npx audrey restore ' + outputPath); } finally { - audrey.close(); + await closeAudreyGracefully(audrey); } } @@ -685,7 +692,7 @@ async function restore() { console.log(`[audrey] Restored: ${restored.episodic} episodes, ${restored.semantic} semantics, ${restored.procedural} procedures`); console.log('[audrey] Restore complete.'); } finally { - audrey.close(); + await closeAudreyGracefully(audrey); } } @@ -937,6 +944,25 @@ export function registerShutdownHandlers(processRef, audrey, logger = console.er } if (!closed) { closed = true; + if (typeof audrey?.waitForIdle === 'function') { + Promise.resolve(audrey.waitForIdle()) + .catch(err => { + logger(`[audrey-mcp] shutdown wait error: ${err.message || String(err)}`); + exitCode = exitCode === 0 ? 1 : exitCode; + }) + .finally(() => { + try { + audrey.close(); + } catch (err) { + logger(`[audrey-mcp] shutdown error: ${err.message || String(err)}`); + exitCode = exitCode === 0 ? 
1 : exitCode; + } + if (typeof processRef.exit === 'function') { + processRef.exit(exitCode); + } + }); + return; + } try { audrey.close(); } catch (err) { diff --git a/mcp-server/serve.js b/mcp-server/serve.js index 9c3b058..c8f92f6 100644 --- a/mcp-server/serve.js +++ b/mcp-server/serve.js @@ -214,6 +214,13 @@ function route(method, pathname) { return `${method} ${pathname}`; } +async function drainAndCloseAudrey(audrey) { + if (audrey && typeof audrey.waitForIdle === 'function') { + await audrey.waitForIdle(); + } + audrey?.close(); +} + /** * Creates an HTTP server wrapping an Audrey instance. * @param {Audrey} audrey - The Audrey instance to serve @@ -312,7 +319,11 @@ export function createAudreyServer(audrey, options = {}) { const { query, ...opts } = body; if (requestAgent) opts.agent = requestAgent; const results = await ctx.audrey.recall(query, opts); - json(res, 200, { results }); + json(res, 200, { + results, + partialFailure: Boolean(results.partialFailure), + errors: results.errors ?? 
[], + }); break; } @@ -371,8 +382,8 @@ export function createAudreyServer(audrey, options = {}) { json(res, 501, { error: 'Restore not available: no audreyFactory configured' }); return; } - ctx.audrey.close(); const dbPath = ctx.audrey.db?.name; + await drainAndCloseAudrey(ctx.audrey); if (dbPath) { const dir = dirname(dbPath); for (const f of ['audrey.db', 'audrey.db-wal', 'audrey.db-shm']) { @@ -456,9 +467,13 @@ export async function startServer(options = {}) { const shutdown = () => { console.log('\n[audrey] Shutting down...'); - server._ctx.audrey.close(); - server.close(); - process.exit(0); + void drainAndCloseAudrey(server._ctx.audrey) + .catch(err => { + console.error('[audrey] Shutdown drain failed:', err.message); + }) + .finally(() => { + server.close(() => process.exit(0)); + }); }; process.on('SIGINT', shutdown); process.on('SIGTERM', shutdown); diff --git a/src/audrey.js b/src/audrey.js index 91158ee..9721b28 100644 --- a/src/audrey.js +++ b/src/audrey.js @@ -44,6 +44,7 @@ import { detectResonance } from './affect.js'; * @property {string} [before] * @property {Record} [context] * @property {{ valence?: number, arousal?: number }} [mood] + * @property {'hybrid' | 'vector' | 'keyword'} [retrieval] * * @typedef {Object} RecallResult * @property {string} id @@ -170,6 +171,12 @@ export class Audrey extends EventEmitter { promise.finally(() => this._pending.delete(promise)); } + async waitForIdle() { + while (this._pending.size > 0) { + await Promise.allSettled([...this._pending]); + } + } + _emitValidation(id, params) { const p = validateMemory(this.db, this.embeddingProvider, { id, ...params }, { llmProvider: this.llmProvider, diff --git a/src/fts.js b/src/fts.js index 3135846..ef5f4fc 100644 --- a/src/fts.js +++ b/src/fts.js @@ -55,7 +55,7 @@ export function searchFTSEpisodes(db, query, limit = 30, agentFilter = null) { const agentClause = agentFilter ? 'AND e.agent = ?' : ''; const params = agentFilter ? 
[query, agentFilter, limit] : [query, limit]; return db.prepare(` - SELECT f.id, f.content, bm25(fts_episodes) AS rank + SELECT f.id, f.content, e.agent, bm25(fts_episodes) AS rank FROM fts_episodes f JOIN episodes e ON e.id = f.id WHERE fts_episodes MATCH ? @@ -70,7 +70,7 @@ export function searchFTSSemantics(db, query, limit = 30, agentFilter = null) { const agentClause = agentFilter ? 'AND s.agent = ?' : ''; const params = agentFilter ? [query, agentFilter, limit] : [query, limit]; return db.prepare(` - SELECT f.id, f.content, bm25(fts_semantics) AS rank + SELECT f.id, f.content, s.agent, bm25(fts_semantics) AS rank FROM fts_semantics f JOIN semantics s ON s.id = f.id WHERE fts_semantics MATCH ? @@ -85,7 +85,7 @@ export function searchFTSProcedures(db, query, limit = 30, agentFilter = null) { const agentClause = agentFilter ? 'AND p.agent = ?' : ''; const params = agentFilter ? [query, agentFilter, limit] : [query, limit]; return db.prepare(` - SELECT f.id, f.content, bm25(fts_procedures) AS rank + SELECT f.id, f.content, p.agent, bm25(fts_procedures) AS rank FROM fts_procedures f JOIN procedures p ON p.id = f.id WHERE fts_procedures MATCH ? 
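To illustrate why the added `agent` column matters, the sketch below maps FTS rows to keyword results and falls back to `'default'` only when a row carries no agent. `toKeywordResults` is a hypothetical helper (the patch inlines this mapping in `src/recall.js`), and the sample rows in the usage note are invented for illustration.

```javascript
// Hypothetical helper; assumes rows shaped like { id, content, agent, rank },
// which is what the updated SELECT statements return.
function toKeywordResults(rows, type, limit) {
  return rows
    .map(row => ({
      id: row.id,
      content: row.content,
      type,
      // SQLite FTS5 bm25() yields numerically smaller values for better
      // matches, so negating it gives a score where higher is better.
      score: -row.rank,
      // Preserve the stored agent namespace instead of hardcoding 'default'.
      agent: row.agent || 'default',
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Called with `[{ id: 'a', content: 'Alpha note', agent: 'agent-alpha', rank: -2.5 }]`, the helper keeps `agent-alpha` on the result, which is the behavior the new multi-agent regression test asserts.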
diff --git a/src/recall.js b/src/recall.js index da070a9..38520df 100644 --- a/src/recall.js +++ b/src/recall.js @@ -376,14 +376,7 @@ function knnProcedural(db, queryBuffer, candidateK, now, minConfidence, includeP return { results, matchedIds }; } -/** - * @param {import('better-sqlite3').Database} db - * @param {import('./embedding.js').EmbeddingProvider} embeddingProvider - * @param {string} query - * @param {{ minConfidence?: number, types?: string[], limit?: number, includeProvenance?: boolean, includeDormant?: boolean, tags?: string[], sources?: string[], after?: string, before?: string }} [options] - * @returns {AsyncGenerator<{ id: string, content: string, type: string, confidence: number, score: number, source: string, createdAt: string }>} - */ -export async function* recallStream(db, embeddingProvider, query, options = {}) { +async function runRecallQuery(db, embeddingProvider, query, options = {}) { const { minConfidence = 0, types, @@ -409,39 +402,39 @@ export async function* recallStream(db, embeddingProvider, query, options = {}) if (retrieval === 'keyword') { const ftsAvailable = hasFTSTables(db); if (!ftsAvailable) { - return; // No FTS tables, no keyword results + return { top: [], errors: [] }; } const sanitized = sanitizeFTSQuery(query); - if (!sanitized) return; + if (!sanitized) return { top: [], errors: [] }; const keywordResults = []; try { if (searchTypes.includes('episodic')) { for (const row of searchFTSEpisodes(db, sanitized, limit * 3, agentFilter)) { - keywordResults.push({ id: row.id, content: row.content, type: 'episodic', score: -row.rank, agent: 'default' }); + keywordResults.push({ id: row.id, content: row.content, type: 'episodic', score: -row.rank, agent: row.agent || 'default' }); } } if (searchTypes.includes('semantic')) { for (const row of searchFTSSemantics(db, sanitized, limit * 3, agentFilter)) { - keywordResults.push({ id: row.id, content: row.content, type: 'semantic', score: -row.rank, agent: 'default' }); + 
keywordResults.push({ id: row.id, content: row.content, type: 'semantic', score: -row.rank, agent: row.agent || 'default' }); } } if (searchTypes.includes('procedural')) { for (const row of searchFTSProcedures(db, sanitized, limit * 3, agentFilter)) { - keywordResults.push({ id: row.id, content: row.content, type: 'procedural', score: -row.rank, agent: 'default' }); + keywordResults.push({ id: row.id, content: row.content, type: 'procedural', score: -row.rank, agent: row.agent || 'default' }); } } } catch { // FTS query syntax error — fall through with whatever we have } keywordResults.sort((a, b) => b.score - a.score); - for (const entry of keywordResults.slice(0, limit)) { - entry.confidence = 1; - entry.source = 'keyword'; - entry.createdAt = now.toISOString(); - yield entry; - } - return; + const top = keywordResults.slice(0, limit).map(entry => ({ + ...entry, + confidence: 1, + source: 'keyword', + createdAt: now.toISOString(), + })); + return { top, errors: [] }; } const queryVector = await embeddingProvider.embed(query); @@ -546,6 +539,18 @@ export async function* recallStream(db, embeddingProvider, query, options = {}) } const top = applyResultGuards(query, allResults, limit); + return { top, errors }; +} + +/** + * @param {import('better-sqlite3').Database} db + * @param {import('./embedding.js').EmbeddingProvider} embeddingProvider + * @param {string} query + * @param {{ minConfidence?: number, types?: string[], limit?: number, includeProvenance?: boolean, includeDormant?: boolean, tags?: string[], sources?: string[], after?: string, before?: string }} [options] + * @returns {AsyncGenerator<{ id: string, content: string, type: string, confidence: number, score: number, source: string, createdAt: string }>} + */ +export async function* recallStream(db, embeddingProvider, query, options = {}) { + const { top, errors } = await runRecallQuery(db, embeddingProvider, query, options); for (const entry of top) { if (errors.length > 0) entry._recallErrors = 
errors; yield entry; @@ -560,9 +565,9 @@ export async function* recallStream(db, embeddingProvider, query, options = {}) * @returns {Promise>} */ export async function recall(db, embeddingProvider, query, options = {}) { - const results = []; - for await (const entry of recallStream(db, embeddingProvider, query, options)) { - results.push(entry); - } + const { top, errors } = await runRecallQuery(db, embeddingProvider, query, options); + const results = [...top]; + results.partialFailure = errors.length > 0; + results.errors = errors; return results; } diff --git a/tests/audrey.test.js b/tests/audrey.test.js index 70db26b..30eb8ee 100644 --- a/tests/audrey.test.js +++ b/tests/audrey.test.js @@ -107,6 +107,24 @@ describe('Audrey', () => { expect(emitted).toBe(true); }); + it('waitForIdle drains tracked background work', async () => { + let releasePending; + const pending = new Promise(resolve => { + releasePending = resolve; + }); + + brain._trackAsync(pending); + const wait = brain.waitForIdle(); + + await Promise.resolve(); + expect(brain._pending.size).toBe(1); + + releasePending(); + await wait; + + expect(brain._pending.size).toBe(0); + }); + it('runs consolidation', async () => { const result = await brain.consolidate(); expect(result).toHaveProperty('runId'); diff --git a/tests/mcp-server.test.js b/tests/mcp-server.test.js index 20c385a..c0748bb 100644 --- a/tests/mcp-server.test.js +++ b/tests/mcp-server.test.js @@ -261,6 +261,34 @@ describe('MCP lifecycle hardening', () => { expect(fakeProcess.exit).toHaveBeenCalledWith(0); }); + it('waits for pending Audrey work before exiting when waitForIdle is available', async () => { + const fakeProcess = new EventEmitter(); + fakeProcess.exit = vi.fn(); + + let releaseIdle; + const idle = new Promise(resolve => { + releaseIdle = resolve; + }); + const audrey = { + waitForIdle: vi.fn(() => idle), + close: vi.fn(), + }; + + registerShutdownHandlers(fakeProcess, audrey, vi.fn()); + fakeProcess.emit('SIGTERM'); + + 
expect(audrey.waitForIdle).toHaveBeenCalledOnce(); + expect(audrey.close).not.toHaveBeenCalled(); + expect(fakeProcess.exit).not.toHaveBeenCalled(); + + releaseIdle(); + await idle; + await new Promise(resolve => setImmediate(resolve)); + + expect(audrey.close).toHaveBeenCalledOnce(); + expect(fakeProcess.exit).toHaveBeenCalledWith(0); + }); + it('exits non-zero on unhandled rejections', () => { const fakeProcess = new EventEmitter(); fakeProcess.exit = vi.fn(); diff --git a/tests/multi-agent.test.js b/tests/multi-agent.test.js index 4d37450..b01e241 100644 --- a/tests/multi-agent.test.js +++ b/tests/multi-agent.test.js @@ -77,4 +77,17 @@ describe('multi-agent memory', () => { expect(results.length).toBe(0); audreyC.close(); }); + + it('keyword-only recall preserves agent attribution', async () => { + const results = await audreyA.recall('Alpha', { + limit: 10, + scope: 'agent', + retrieval: 'keyword', + }); + + expect(results.length).toBeGreaterThan(0); + for (const result of results) { + expect(result.agent).toBe('agent-alpha'); + } + }); }); diff --git a/tests/recall.test.js b/tests/recall.test.js index 15182ee..22ebaab 100644 --- a/tests/recall.test.js +++ b/tests/recall.test.js @@ -198,6 +198,18 @@ describe('recall', () => { expect(incremented).toBe(true); }); + it('surfaces partial failures when a recall path breaks', async () => { + db.exec('DROP TABLE vec_semantics'); + + const results = await recall(db, embedding, 'Stripe rate limit', { types: ['semantic'] }); + + expect(results).toHaveLength(0); + expect(results.partialFailure).toBe(true); + expect(results.errors).toEqual([ + expect.objectContaining({ type: 'semantic' }), + ]); + }); + // --- recallStream tests --- it('recallStream yields results as async generator', async () => { diff --git a/tests/serve.test.js b/tests/serve.test.js index 6dc1d4f..01cd80a 100644 --- a/tests/serve.test.js +++ b/tests/serve.test.js @@ -132,6 +132,37 @@ describe('Audrey REST API Server', () => { 
expect(res.data.error).toContain('query'); }); + it('POST /recall reports partial failures when a search path is unavailable', async () => { + const brokenDataDir = mkdtempSync(join(tmpdir(), 'audrey-serve-partial-')); + const brokenAudrey = new Audrey({ + dataDir: brokenDataDir, + agent: 'test-partial', + embedding: { provider: 'mock', dimensions: 64 }, + }); + const brokenServer = createAudreyServer(brokenAudrey); + await new Promise(resolve => brokenServer.listen(0, '127.0.0.1', resolve)); + + try { + brokenAudrey.db.exec('DROP TABLE vec_semantics'); + + const res = await request(brokenServer, 'POST', '/recall', { + query: 'server test memory', + types: ['semantic'], + }); + + expect(res.status).toBe(200); + expect(res.data.results).toEqual([]); + expect(res.data.partialFailure).toBe(true); + expect(res.data.errors).toEqual([ + expect.objectContaining({ type: 'semantic' }), + ]); + } finally { + brokenServer.close(); + brokenAudrey.close(); + rmSync(brokenDataDir, { recursive: true, force: true }); + } + }); + it('POST /dream runs consolidation cycle', async () => { const res = await request(server, 'POST', '/dream', {}); expect(res.status).toBe(200); diff --git a/types/index.d.ts b/types/index.d.ts index 3260c21..8e3db31 100644 --- a/types/index.d.ts +++ b/types/index.d.ts @@ -120,6 +120,7 @@ export interface RecallOptions { affect?: AffectParams; scope?: 'shared' | 'agent'; agent?: string; + retrieval?: 'hybrid' | 'vector' | 'keyword'; } export interface RecallResult { @@ -143,6 +144,11 @@ export interface RecallResult { _recallErrors?: Array<{ type: string; message: string }>; } +export type RecallResults = RecallResult[] & { + partialFailure?: boolean; + errors?: Array<{ type: string; message: string }>; +}; + // === Consolidation === export interface ConsolidateOptions { @@ -344,7 +350,7 @@ export class Audrey extends EventEmitter { constructor(config: AudreyConfig); encode(params: EncodeParams): Promise; - recall(query: string, options?: RecallOptions): 
Promise<RecallResult[]>;
+  recall(query: string, options?: RecallOptions): Promise<RecallResults>;
   recallStream(query: string, options?: RecallOptions): AsyncGenerator;
   consolidate(options?: ConsolidateOptions): Promise;
   dream(options?: DreamOptions): Promise;
@@ -358,6 +364,7 @@ export class Audrey extends EventEmitter {
   reflect(turns: string): Promise;
   startAutoConsolidate(intervalMs: number, options?: ConsolidateOptions): void;
   stopAutoConsolidate(): void;
+  waitForIdle(): Promise<void>;
   close(): void;
 }
 
@@ -369,7 +376,7 @@ export function readStoredDimensions(dataDir: string): number | null;
 
 // === Standalone Functions ===
 
-export function recall(db: Database, embeddingProvider: EmbeddingProvider, query: string, options?: RecallOptions): Promise<RecallResult[]>;
+export function recall(db: Database, embeddingProvider: EmbeddingProvider, query: string, options?: RecallOptions): Promise<RecallResults>;
 export function recallStream(db: Database, embeddingProvider: EmbeddingProvider, query: string, options?: RecallOptions): AsyncGenerator;
 export function exportMemories(db: Database): Snapshot;
 export function importMemories(db: Database, embeddingProvider: EmbeddingProvider, snapshot: Snapshot): Promise;
From 5bcecb48b3159a9ceb027ecd9d385407983aca4e Mon Sep 17 00:00:00 2001
From: Tyler Eveland
Date: Sun, 29 Mar 2026 23:19:14 -0500
Subject: [PATCH 2/3] docs: add memory OS strategy plan

---
 codex.md                                      |   2 +
 ...ndustry-standard-memory-plan-2026-03-29.md | 602 ++++++++++++++++++
 docs/plans/roadmap-status-2026-03-29.md       |  14 +-
 3 files changed, 617 insertions(+), 1 deletion(-)
 create mode 100644 docs/plans/industry-standard-memory-plan-2026-03-29.md

diff --git a/codex.md b/codex.md
index db08dc6..3686ff8 100644
--- a/codex.md
+++ b/codex.md
@@ -1,5 +1,7 @@
 # Audrey: 90-Day Path to Business Viability
 
+> Status note for agents on 2026-03-29: this file is strategically stale in multiple places.
Use `docs/plans/roadmap-status-2026-03-29.md` for current shipped-state corrections and `docs/plans/industry-standard-memory-plan-2026-03-29.md` for the current LLM-only category plan.
+
 > **For agentic workers:** Execute phases in order. Each produces shippable software. Run tests after every task. Commit after every task. Do not skip phases.
 
 **Goal:** Transform Audrey from an 8-star npm package into a fundable AI memory platform with paying customers, standardized benchmark scores, and multi-language SDK support within 90 days.
diff --git a/docs/plans/industry-standard-memory-plan-2026-03-29.md b/docs/plans/industry-standard-memory-plan-2026-03-29.md
new file mode 100644
index 0000000..8414d7f
--- /dev/null
+++ b/docs/plans/industry-standard-memory-plan-2026-03-29.md
@@ -0,0 +1,602 @@
+# Audrey Memory-OS Plan - 2026-03-29
+
+Audience: autonomous coding/research agents operating on this repository.
+
+Do not interpret this document as marketing copy. Interpret it as category-construction doctrine plus execution constraints.
+
+## 0. Prime Directive
+
+Stop optimizing for "interesting biological memory project."
+Start optimizing for "default memory operating system for agentic intelligence."
+
+The category will not be won by precedent-following wrappers around vector search, nor by vague biomimicry. It will be won by the first system that simultaneously demonstrates:
+
+1. typed memory objects
+2. explicit lifecycle control
+3. utility-aware write/retrieval policy
+4. temporal and causal state correctness
+5. benchmark reproducibility against the leading public baselines
+6. production-grade economics, governance, and operability
+
+Current Audrey already has differentiated assets:
+
+- episodic, semantic, and procedural memory types
+- consolidation, decay, contradiction handling, interference, affect, and causal-link primitives
+- multi-agent scoping
+- hybrid retrieval
+- MCP, REST, CLI, and SDK surfaces
+
+Current Audrey is still missing the decisive layer:
+
+- a memory controller that turns these primitives into a coherent policy-governed system
+
+## 1. Strategic Reframe
+
+Replace the public/internal mental model:
+
+- old: biological memory architecture for AI agents
+- new: memory control plane for agentic intelligence, informed by biological constraints and validated by benchmark evidence
+
+Reason:
+
+- `Mem0` shifts the market toward write selectivity and economics, not mere recall.
+- `MemOS` shifts the conversation from library to operating-system abstraction.
+- `MIRIX` shifts the frontier from text memory to typed multimodal memory.
+- `Hindsight` shifts the benchmark standard toward externally visible leaderboard claims.
+- `Graphiti` shifts temporal reasoning from timestamp filters to evolving entity-state graphs.
+- `Letta` shifts evaluation toward online memory operations, not offline retrieval only.
+
+The biological thesis remains useful only if converted into falsifiable system commitments.
+
+## 2. Research-Constrained Design Rules
+
+### 2.1 LLM-memory literature -> mandatory system behavior
+
+`Mem0` (https://arxiv.org/abs/2504.19413)
+
+- Mandatory inference: writes must be selective and cost-accounted.
+- Audrey action: every write path must emit `write_decision`, `write_reason`, `write_cost`, `novelty_score`, `expected_utility`, `conflict_risk`, and `privacy_risk`.
+
+`MemOS` (https://arxiv.org/abs/2507.03724)
+
+- Mandatory inference: memory must be lifecycle-managed as a first-class system substrate.
+- Audrey action: centralize write/promote/compress/reconsolidate/archive/evict policy in a controller layer instead of scattering it across `encode`, `consolidate`, `decay`, and ad hoc background tasks.
+
+`MIRIX` (https://arxiv.org/abs/2507.07957)
+
+- Mandatory inference: typed multimodal memory is now frontier-normal.
+- Audrey action: add first-class resource/artifact memory envelopes for files, screenshots, URLs, structured tool outputs, tables, and attachments.
+
+`EverMemOS` (https://arxiv.org/abs/2601.02163)
+
+- Mandatory inference: useful memory systems require atomic cells, scene-level composition, and reconstructive recollection.
+- Audrey action: insert an intermediate hierarchy between episodes and semantic principles.
+
+`MemRL` (https://arxiv.org/abs/2601.03192)
+
+- Mandatory inference: semantic similarity is an insufficient terminal scorer; utility must be learned from outcomes.
+- Audrey action: separate candidate generation from policy ranking. Rank memories by predicted downstream utility under task context.
+
+`MAGMA` (https://arxiv.org/abs/2601.03236)
+
+- Mandatory inference: a single retrieval path is structurally suboptimal.
+- Audrey action: route queries into semantic, temporal, causal, entity, procedural, and conflict-resolution sub-pipelines before fusion.
+
+`LongMemEval` (https://arxiv.org/abs/2410.10813)
+
+- Mandatory inference: external proof must include multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
+- Audrey action: make real LongMemEval execution part of Audrey's release gate.
+
+`LoCoMo` (https://github.com/snap-research/locomo)
+
+- Mandatory inference: long-horizon conversational memory requires externally comparable evaluation traces.
+- Audrey action: add a first-party LoCoMo adapter with frozen prompts, model configs, and artifact manifests.
+
+`Hindsight` (https://arxiv.org/abs/2512.12818)
+
+- Mandatory inference: public SOTA claims matter because they define who is taken seriously.
+- Audrey action: treat Hindsight as the near-term benchmark rival to beat on LongMemEval/LoCoMo style tasks.
+
+`Letta benchmark write-up` (https://www.letta.com/blog/benchmarking-ai-agent-memory)
+
+- Mandatory inference: memory must be graded on operations, not only recall.
+- Audrey action: add read/write/update/overwrite/delete/merge/abstain benchmark tracks.
+
+`Graphiti` (https://github.com/getzep/graphiti and https://blog.getzep.com/beyond-static-knowledge-graphs/)
+
+- Mandatory inference: temporal state changes need explicit graph semantics.
+- Audrey action: replace timestamp-only reasoning with validity intervals, state transitions, and evolving entity-property edges.
+
+### 2.2 Neuroscience -> mandatory controller behavior
+
+`Deconstruction of a memory engram reveals distinct ensembles recruited at learning` (Nature Neuroscience, March 11, 2026: https://www.nature.com/articles/s41593-026-02230-2)
+
+- Mandatory inference: a memory episode should not be treated as a uniform blob.
+- Audrey action: segment writes into phase-specific trace fragments (`prelude`, `salient event`, `outcome`, `response`) and maintain a "core recall subset" distinct from peripheral context.
+
+`Formation of an expanding memory representation in the hippocampus` (Nature Neuroscience, June 4, 2025: https://www.nature.com/articles/s41593-025-01986-3)
+
+- Mandatory inference: stability is accrued through reactivation, not assumed at write time.
+- Audrey action: add a stability state variable that increases when retrieval proves useful and decreases under interference/conflict.
+
+`Goal-specific hippocampal inhibition gates learning` (Nature, April 9, 2025: https://www.nature.com/articles/s41586-025-08868-5)
+
+- Mandatory inference: plasticity should spike around goal-relevant states, not across all experience.
+- Audrey action: detect goals, commitments, failures, corrections, and rewards; use these as write-gate amplifiers.
+
+`Systems consolidation reorganizes hippocampal engram circuitry` (Nature, May 14, 2025: https://www.nature.com/articles/s41586-025-08993-1)
+
+- Mandatory inference: episodic precision and semantic gist should co-exist and re-balance over time.
+- Audrey action: maintain parallel episodic and schema layers with deliberate migration policies rather than accidental summarization.
+
+`Sleep microstructure organizes memory replay` (Nature, January 1, 2025: https://www.nature.com/articles/s41586-024-08340-w)
+
+- Mandatory inference: replay should be partitioned into substates to reduce interference.
+- Audrey action: split background replay into `recent-fragile`, `schema-refresh`, `conflict-repair`, and `garbage-collection` jobs with different budgets.
+
+`Post-learning replay of hippocampal-striatal activity is biased by reward-prediction signals` (Nature Communications, November 24, 2025: https://www.nature.com/articles/s41467-025-65354-2)
+
+- Mandatory inference: replay priority should be driven by surprise and value delta, not by salience alone.
+- Audrey action: prioritize corrections, failed tool trajectories, preference flips, and unexpected outcomes.
+
+`Hippocampal output suppresses orbitofrontal cortex schema cell formation` (Nature Neuroscience, April 14, 2025: https://www.nature.com/articles/s41593-025-01928-z)
+
+- Mandatory inference: over-serving episodic detail can block schema induction.
+- Audrey action: throttle episode-heavy recall when repeated structure is detected; force schema extraction passes.
+
+`Constructing future behavior in the hippocampal formation through composition and replay` (Nature Neuroscience, March 10, 2025: https://www.nature.com/articles/s41593-025-01908-3)
+
+- Mandatory inference: reusable primitives plus replay support generalization into novel tasks.
+- Audrey action: factor memories into entities, tools, constraints, places, roles, and workflows; reconstruct scenes from those primitives at recall time.
+ +`Synaptic plasticity rules driving representational shifting in the hippocampus` (Nature Neuroscience, March 20, 2025: https://www.nature.com/articles/s41593-025-01894-6) + +- Mandatory inference: memory updates should be sparse, novelty-sensitive, and high-threshold. +- Audrey action: most recalls must not rewrite memory. Reconsolidation should require controller approval. + +`Theta-encoded information flow from dorsal CA1 to prelimbic cortex drives memory reconsolidation` (iScience, June 4, 2025: https://doi.org/10.1016/j.isci.2025.112821) + +- Mandatory inference: reconsolidation requires a window, not an unconditional rewrite path. +- Audrey action: only permit write-back after recall when contradiction pressure, novelty, confidence shift, and evidence support exceed threshold. + +`Exploring the neural underpinnings of semantic and perceptual false memory formation` (NeuroImage, January 30, 2026: https://pubmed.ncbi.nlm.nih.gov/41308786/) + +- Mandatory inference: semantic overlap and source-grounded recall are separable failure modes. +- Audrey action: separate semantic-match confidence from provenance-match confidence and increase abstention when they diverge. + +## 3. 
What Audrey Is Still Missing + +### 3.1 Control-plane gap + +Current repo state exposes high-quality primitives but still routes behavior through direct method calls: + +- `encode` +- `recall` +- `consolidate` +- `dream` +- `decay` +- `validate` + +Missing abstraction: + +- `MemoryController` +- `PolicyEngine` +- `ReplayScheduler` +- `ReconsolidationGate` +- `RetentionManager` +- `ObservationBus` + +### 3.2 Typed memory-object gap + +Current types are too coarse: + +- episodic +- semantic +- procedural + +Required type surface: + +- `trace`: raw event fragment +- `cell`: atomic memory unit extracted from one or more traces +- `scene`: compositional situation model +- `schema`: generalized reusable abstraction +- `procedure`: executable behavioral policy +- `entity_state`: time-varying property/value memory +- `causal_link`: cause/effect or mechanism edge +- `resource`: external artifact reference with modality metadata +- `working_set`: task-bounded short-horizon active memory +- `quarantined`: low-trust or poison-suspect memory object + +### 3.3 Temporal-state gap + +Current temporal handling is primarily: + +- timestamps +- before/after filtering +- recency-weighted scoring + +Required representation: + +- `subject` +- `predicate` +- `object/value` +- `valid_from` +- `valid_to` +- `observed_at` +- `superseded_by` +- `confidence` +- `source` +- `scope` + +Without this, Audrey cannot credibly own "what was true when" reasoning. + +### 3.4 Utility-learning gap + +Current `usage_count` and `last_used_at` are instrumentation, not policy. + +Required additions: + +- implicit reward signals from successful downstream task completion +- negative signals from bad recalls, contradictions, user corrections, and abstentions +- a learned or heuristically trained value estimator for write and retrieval ranking +- value-aware consolidation and value-aware forgetting + +### 3.5 Resource-memory gap + +Audrey currently reads as text-memory plus metadata. 
+ +Required additions: + +- artifact envelopes with modality and extractor metadata +- per-modality embedding/extraction backends +- artifact-grounded recall fusion +- provenance links from textual abstractions back to original artifacts + +### 3.6 Benchmark-proof gap + +Current benchmarking is good internal hygiene. It is not yet category-defining proof. + +Required public proof: + +- first-party reproducible LongMemEval +- first-party reproducible LoCoMo +- operation-level memory benchmark +- cost/latency/storage curves +- biological-mechanism ablations +- long-context comparison under equal budget +- third-party replication path + +## 4. Non-Negotiable Architecture Changes + +### 4.1 Add a controller layer + +Create: + +- `src/controller.js` +- `src/policy.js` +- `src/replay.js` +- `src/reconsolidate.js` +- `src/state-model.js` + +Controller responsibilities: + +- classify incoming observations +- decide write/no-write/defer/quarantine +- choose memory target type +- schedule replay/consolidation/reindexing +- manage retention and eviction +- manage reconsolidation after recall +- emit structured telemetry for all decisions + +No direct path should persist or mutate memory without a controller decision record. + +### 4.2 Introduce a hierarchy + +Mandatory hierarchy: + +1. `trace` + fine-grained event fragment, immutable +2. `cell` + atomic claim/intent/preference/tool outcome +3. `scene` + compositional event/task model +4. `schema` + abstract reusable pattern +5. `procedure` + executable policy or workflow + +Current `episode` maps closest to a mixture of `trace` and `scene`. Split it. 
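One way to picture the hierarchy in 4.2 is a sketch in which every object records its level and the objects it was derived from. The level names come from the list above. `LEVELS`, `createMemoryObject`, the field names, and the rule that parents must sit strictly lower in the hierarchy are assumptions for illustration, not existing repo APIs or decided policy.

```javascript
// Hierarchy order from section 4.2, lowest to highest abstraction.
const LEVELS = Object.freeze(['trace', 'cell', 'scene', 'schema', 'procedure']);

// Hypothetical factory: builds a memory object and records its lineage.
// Field names (id, level, payload, parentIds) are illustrative only.
function createMemoryObject({ id, level, payload, parents = [] }) {
  const rank = LEVELS.indexOf(level);
  if (rank === -1) {
    throw new Error(`unknown level: ${level}`);
  }
  if (level === 'trace' && parents.length > 0) {
    throw new Error('a trace is a raw fragment and has no parents');
  }
  // Assumed rule: an object may only derive from strictly lower levels.
  for (const parent of parents) {
    if (LEVELS.indexOf(parent.level) >= rank) {
      throw new Error(`${level} cannot derive from ${parent.level}`);
    }
  }
  const object = { id, level, payload, parentIds: parents.map((p) => p.id) };
  // Traces are immutable per the plan; a shallow freeze stands in for that here.
  return level === 'trace' ? Object.freeze(object) : object;
}
```

Splitting an existing `episode` would then mean emitting one or more `trace` objects for the raw fragments and a `scene` (optionally via `cell` objects) that lists them as parents, so lineage survives the migration.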
+ +### 4.3 Add query-intent routing + +Before retrieval, classify query into one or more intents: + +- fact lookup +- user preference +- temporal query +- causal query +- conflict resolution +- procedure recall +- entity state query +- artifact lookup +- schema/generalization query + +Then route into specialized sub-indexes: + +- vector semantic +- lexical exact-match +- temporal state graph +- causal graph +- entity index +- procedure index +- artifact index + +Fusion should occur after route-specific ranking, not before. + +### 4.4 Add reconsolidation discipline + +Retrieval must not automatically mutate memory. + +Mandatory reconsolidation preconditions: + +- recall confidence changed materially +- contradiction or correction pressure exists +- provenance support is sufficient +- query context matches the original scope well enough +- no poison/quarantine block is active + +All reconsolidation must preserve lineage: + +- parent versions +- merge/split history +- supersession graph +- reason code + +### 4.5 Add quarantine and source policy + +Low-trust memory must be segregated. + +Required policy fields: + +- source trust tier +- privacy classification +- tenant scope +- poison risk +- verification state +- approval requirement + +Required actions: + +- quarantine +- require-human-approval +- require-second-source +- soft-store-with-abstain-only + +## 5. Proof Stack Required For Category Leadership + +### 5.1 External benchmark program + +Implement: + +- `benchmarks/external/longmemeval/` +- `benchmarks/external/locomo/` +- `benchmarks/external/operations/` +- `benchmarks/external/cost/` +- `benchmarks/external/ablations/` + +Release gate must publish: + +- dataset version +- prompt templates +- model version +- embedding version +- hardware/runtime profile +- raw outputs +- scoring script version +- summary tables + +### 5.2 Ablation matrix + +Audrey cannot claim a biological advantage unless each mechanism can be toggled and measured. 
+ +Required ablations: + +- no consolidation +- no decay +- no contradiction handling +- no provenance-aware abstention +- no affect/context weighting +- no replay scheduler +- no utility scorer +- no temporal state graph +- no causal retrieval boost + +Evaluate each on: + +- LongMemEval capability breakdown +- LoCoMo +- operation benchmark +- cost/latency/storage overhead +- false-memory rate + +### 5.3 Long-context comparison + +Mandatory comparison groups: + +- brute-force long-context baseline +- vector-only baseline +- hybrid lexical+vector baseline +- Hindsight-style retain/recall/reflect baseline +- Audrey full system + +Compare under: + +- equal token budget +- equal wall-clock budget +- equal update frequency + +Required message: + +- Audrey is not just more "biological" +- Audrey is better under change, cheaper to update, and safer to trust + +## 6. Execution Order + +### Phase A: Benchmark legitimacy first + +Why first: + +- without external proof, architecture work remains easy to dismiss + +Tasks: + +1. implement real LongMemEval adapter +2. implement real LoCoMo adapter +3. add artifact manifests and frozen run configs +4. add operations benchmark for update/overwrite/delete/merge/abstain +5. publish cost curves against long-context and simple memory baselines + +Exit criteria: + +- Audrey can run `npm run bench:external` +- results are reproducible on a clean machine +- README can truthfully present external benchmark numbers + +### Phase B: Memory controller and typed object migration + +Tasks: + +1. add controller layer +2. split episode into trace/cell/scene +3. add lifecycle state machine +4. make all mutations controller-mediated +5. emit structured decision telemetry + +Exit criteria: + +- no write path bypasses controller +- every memory object carries lifecycle and provenance metadata + +### Phase C: Temporal + causal + entity-state retrieval + +Tasks: + +1. add entity-state tables with validity windows +2. add query router +3. 
integrate causal links into recall ranking +4. expose state-history queries over REST/MCP/SDK + +Exit criteria: + +- Audrey answers "what was true when" from state memory, not text search +- causal queries outperform hybrid text retrieval baselines + +### Phase D: Utility learning and replay scheduling + +Tasks: + +1. convert `usage_count` into reward signals +2. learn or heuristically update utility scores +3. partition replay into recent-fragile, schema-refresh, conflict-repair, and garbage-collection queues +4. use surprise and value delta to prioritize replay + +Exit criteria: + +- measured lift from utility-aware ranking +- replay budget measurably improves benchmark outcomes + +### Phase E: Resource/multimodal memory + +Tasks: + +1. add `resource` memory type +2. persist artifact metadata and references +3. attach extractor outputs to resources +4. support retrieval plans that fuse artifact and textual memories + +Exit criteria: + +- Audrey can ground answers in files/tool outputs/artifacts, not just text memories + +### Phase F: Governance and neutral trust + +Tasks: + +1. tenant isolation +2. audit log +3. retention/erasure enforcement +4. encryption integration hooks +5. third-party evaluation harness and replication guide + +Exit criteria: + +- enterprise objections shift from "is this serious?" to procurement and adoption questions + +## 7. File-Level Starting Points In This Repo + +Exploit existing assets instead of rewriting the system from scratch. 
+ +Primary surfaces: + +- `src/audrey.js` +- `src/recall.js` +- `src/db.js` +- `src/consolidate.js` +- `src/decay.js` +- `src/causal.js` +- `src/confidence.js` +- `src/interference.js` +- `src/affect.js` +- `src/import.js` +- `benchmarks/run.js` +- `benchmarks/cases.js` +- `docs/benchmarking.md` +- `mcp-server/serve.js` +- `mcp-server/index.js` + +Recommended insertion points: + +- controller hooks around `encode`, `recall`, `consolidate`, `dream` +- schema changes in `src/db.js` +- benchmark adapters under `benchmarks/external` +- telemetry surfaces through REST `/analytics` and MCP status outputs + +## 8. Do Not Waste Cycles On These Failure Modes + +- do not spend another major cycle polishing README rhetoric without new proof +- do not present internal synthetic benchmarks as category-defining evidence +- do not add more memory "types" without a controller and routing policy +- do not overfit to single-vector similarity improvements +- do not let retrieval mutate stored memory by default +- do not keep calling the system "biological" unless the mechanism is measurable + +## 9. Category-Winning Claim Audrey Should Eventually Earn + +Not current claim. Target claim. + +"Audrey is the first reproducibly benchmarked memory operating system for agents: typed, lifecycle-managed, utility-aware, temporally correct, causally grounded, and production-economical." + +Do not claim this before the proof stack exists. + +## 10. Immediate Next Moves + +Execute in this order: + +1. external benchmark adapters +2. ablation toggles for existing biological mechanisms +3. controller-layer scaffold +4. typed trace/cell/scene schema migration design +5. temporal entity-state model +6. utility-aware ranking +7. replay scheduler +8. resource memory + +If an implementation choice does not improve one of: + +- benchmark legitimacy +- controller coherence +- temporal correctness +- utility learning +- governance/economics + +it is probably not on the critical path. 
diff --git a/docs/plans/roadmap-status-2026-03-29.md b/docs/plans/roadmap-status-2026-03-29.md index 4445251..841552e 100644 --- a/docs/plans/roadmap-status-2026-03-29.md +++ b/docs/plans/roadmap-status-2026-03-29.md @@ -2,6 +2,8 @@ This note replaces stale assumptions from the earlier `codex.md` roadmap with the current repo state. +Canonical next-step strategy now lives in `docs/plans/industry-standard-memory-plan-2026-03-29.md`. + ## Current State - Multi-agent memory is already shipped. @@ -35,7 +37,7 @@ Original bug list status: The current README and some planning docs still contain mojibake artifacts that hurt first contact. 2. Make benchmark claims externally reproducible. - Add first-party LoCoMo and LongMemEval adapters under `memorybench` or fold them into this repo in a reproducible way. + Add first-party LoCoMo and LongMemEval adapters under `memorybench` or fold them into this repo in a reproducible way. This is now the top proof-stack requirement in `industry-standard-memory-plan-2026-03-29.md`. 3. Tighten restore and import contracts. Add explicit schema validation for snapshot versions and optional fields, then test malformed snapshots more aggressively. @@ -45,3 +47,13 @@ Original bug list status: 5. Harden the SDK shutdown story. Decide whether `close()` itself should eventually become async, or whether `waitForIdle()` remains the explicit graceful-shutdown contract. + +## Strategic Reframe + +The next competitive frame should be "memory control plane / memory OS" rather than "memory library with biological inspiration". 
The repo now has enough primitives to justify that direction, but it still needs: + +- real external benchmark proof +- controller-mediated lifecycle policy +- temporal/entity-state memory +- utility-aware replay and ranking +- typed resource memory From fb81177de79adc6b390b14edf68dc3ad0dc0ff58 Mon Sep 17 00:00:00 2001 From: Tyler Eveland Date: Mon, 30 Mar 2026 08:51:59 -0500 Subject: [PATCH 3/3] docs: mark strategy plan as canonical --- docs/plans/industry-standard-memory-plan-2026-03-29.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/plans/industry-standard-memory-plan-2026-03-29.md b/docs/plans/industry-standard-memory-plan-2026-03-29.md index 8414d7f..cceec6c 100644 --- a/docs/plans/industry-standard-memory-plan-2026-03-29.md +++ b/docs/plans/industry-standard-memory-plan-2026-03-29.md @@ -3,6 +3,7 @@ Audience: autonomous coding/research agents operating on this repository. Do not interpret this document as marketing copy. Interpret it as category-construction doctrine plus execution constraints. +When this plan conflicts with older roadmap prose, this plan wins. ## 0. Prime Directive