Enhance health_check tool with always-on counts, chunks check, and embedding reachability#72
Conversation
…bedding reachability - Return row counts unconditionally (not just in verbose mode) so agents get actionable diagnostics in a single call - Add chunks table check with count - Add memory_observations count alongside entities - Add proactive_insights count - Add embedding service reachability check (tries generateEmbedding) - Add provider field to embeddings check output - Update Zod output schema with observations and provider fields - Add tests for new functionality (16 total, up from 13) https://claude.ai/code/session_01JtWxbWQKXXxmE51EM7dmJh
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughHealth checks now invoke Sequence Diagram(s)sequenceDiagram
participant Client
participant HealthSvc as HealthService
participant DB as Database
participant Emb as EmbeddingProvider
Client->>HealthSvc: GET /health?verbose=(true|false)
HealthSvc->>DB: check DB connection & checkTable(documents, chunks, memory, conversations, insights)
DB-->>HealthSvc: table existence / accessibility
alt table accessible
alt verbose
HealthSvc->>DB: queryCount("SELECT count(*) FROM <table>")
else not verbose
HealthSvc->>DB: estimateRowCount(<table>)
end
DB-->>HealthSvc: count or estimate
else inaccessible
DB-->>HealthSvc: error -> HealthSvc marks check ok:false
end
HealthSvc->>DB: queryOne('chunks_pending') for insightQueue (non-verbose also)
DB-->>HealthSvc: pending count or error
alt embeddings configured
HealthSvc->>Emb: generateEmbedding("health check")
alt reachable
Emb-->>HealthSvc: vector -> embeddings.ok:true, include provider
else unreachable
Emb-->>HealthSvc: error -> embeddings.ok:false, set degraded
end
end
HealthSvc-->>Client: aggregated health JSON with checks: documents, chunks, memory, conversations, insights, insightQueue, embeddings
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/tools/__tests__/health.test.ts`:
- Around line 218-228: The test "returns degraded when embedding service is
unreachable" asserts embeddings.ok and embeddings.error but doesn't assert the
overall status; update the test (in the it block that calls callHealth(false)
and inspects result.structuredContent) to also expect
result.structuredContent?.status === 'degraded' (assuming you fixed the
implementation to set hasFailure on embedding failure); alternatively, if you
don't change the implementation, rename the test to "reports embeddings failure
when service is unreachable" to match the current assertions.
In `@src/tools/health.ts`:
- Around line 108-125: The embedding failure branch currently sets
checks.embeddings.ok = false but doesn't update the overall health flag; modify
the catch in the generateEmbedding call so it also sets the aggregate failure
marker (the same variable used elsewhere to compute status, e.g., hasFailure) to
true when an embedding error occurs, ensuring overall status becomes degraded;
update the catch that alters checks.embeddings.error and checks.embeddings.ok to
also set hasFailure = true (or the equivalent overall-status flag) so the health
endpoint reflects the embedding service outage.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7e778c30-5cb9-4899-8196-c820723ba1c9
📒 Files selected for processing (2)
src/tools/__tests__/health.test.tssrc/tools/health.ts
The catch block for generateEmbedding and the !embeddingsConfigured branch were not setting hasFailure = true, so an embedding service outage would leave overall status as 'healthy' instead of 'degraded'. Also added the missing status assertion in the corresponding test. https://claude.ai/code/session_01JtWxbWQKXXxmE51EM7dmJh
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/tools/health.ts (1)
233-249:⚠️ Potential issue | 🟠 MajorMark
insightQueuefailures as degraded.Unlike the other component checks, neither failure path updates
hasFailure, so the tool can still returnhealthywhilechecks.insightQueue.okisfalse.🔧 Proposed fix
if (!queueOk) { checks.insightQueue.error = "Table 'insight_queue' not accessible"; + hasFailure = true; } else { const row = await queryOne<{ chunks_pending: number }>( 'SELECT chunks_pending FROM insight_queue WHERE id = 1', ); @@ } catch (err) { checks.insightQueue = { ok: false, error: err instanceof Error ? err.message : 'Check failed', }; + hasFailure = true; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tools/health.ts` around lines 233 - 249, The insight_queue branch doesn't set the global failure flag so the service can be considered healthy despite checks.insightQueue.ok being false; after detecting a bad table (when queueOk is false) and inside the catch block where checks.insightQueue is set to ok: false, set the same failure marker used elsewhere (hasFailure = true) so the overall health is marked degraded/failing; update the logic around checkTable('insight_queue') and the catch handling that assigns checks.insightQueue to also flip hasFailure to true, referencing checks.insightQueue, hasFailure, checkTable and queryOne to locate the code to change.
🧹 Nitpick comments (1)
src/tools/health.ts (1)
139-139: Consider estimated counts on the default path.These exact
COUNT(*)probes now run on every health check. On largerdocuments,chunks,memory_*, andproactive_insightstables, that makes the diagnostic path itself expensive. If always-on counts are required, I'd return catalog estimates by default and reserve exact counts forverbose.Based on learnings: The
health_checktool is always registered and provides per-component diagnostics. Use it as the first step when diagnosing infrastructure issues.Also applies to: 157-157, 176-179, 199-201, 221-221
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tools/health.ts` at line 139, The health check currently runs expensive exact COUNT(*) queries (e.g., queryCount used to set checks.documents.count) on every run; change the default path to use inexpensive estimated counts (PG catalog reltuples or equivalent) for the large tables (documents, chunks, memory_* tables, proactive_insights) and only execute exact queryCount('SELECT count(*) ...') when the health check is invoked with verbose mode. Update the assignments at checks.documents.count and the analogous spots (the other occurrences you noted) to branch on the verbose flag (use an estimate helper or lightweight catalog query when !verbose, and call queryCount for exact numbers when verbose is true) so diagnostics remain cheap by default.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/tools/health.ts`:
- Around line 233-249: The insight_queue branch doesn't set the global failure
flag so the service can be considered healthy despite checks.insightQueue.ok
being false; after detecting a bad table (when queueOk is false) and inside the
catch block where checks.insightQueue is set to ok: false, set the same failure
marker used elsewhere (hasFailure = true) so the overall health is marked
degraded/failing; update the logic around checkTable('insight_queue') and the
catch handling that assigns checks.insightQueue to also flip hasFailure to true,
referencing checks.insightQueue, hasFailure, checkTable and queryOne to locate
the code to change.
---
Nitpick comments:
In `@src/tools/health.ts`:
- Line 139: The health check currently runs expensive exact COUNT(*) queries
(e.g., queryCount used to set checks.documents.count) on every run; change the
default path to use inexpensive estimated counts (PG catalog reltuples or
equivalent) for the large tables (documents, chunks, memory_* tables,
proactive_insights) and only execute exact queryCount('SELECT count(*) ...')
when the health check is invoked with verbose mode. Update the assignments at
checks.documents.count and the analogous spots (the other occurrences you noted)
to branch on the verbose flag (use an estimate helper or lightweight catalog
query when !verbose, and call queryCount for exact numbers when verbose is true)
so diagnostics remain cheap by default.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7485f842-cc11-441a-9b8c-318e808b2065
📒 Files selected for processing (2)
src/tools/__tests__/health.test.tssrc/tools/health.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/tools/tests/health.test.ts
- Set hasFailure = true when insight_queue table check fails or throws, matching the pattern used by every other table check - Use pg_class.reltuples for cheap estimated row counts in default mode; only run exact count(*) queries when verbose=true - Add estimateRowCount helper to health-helpers.ts - Add tests for insight_queue failure propagation and estimated vs exact count behavior (18 tests total, up from 16) https://claude.ai/code/session_01JtWxbWQKXXxmE51EM7dmJh
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/utils/health-helpers.ts (2)
40-43: Query may return rows from multiple schemas.The
relnamelookup doesn't include a schema filter, so if tables with identical names exist in different schemas (e.g.,public.documentsandarchive.documents), the query returns multiple rows androws[0]picks one non-deterministically.For the current health-check use case with well-known table names this is unlikely to cause issues, but consider joining with
pg_namespaceif stricter schema scoping is ever needed.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/health-helpers.ts` around lines 40 - 43, The query in health-helpers uses pgQuery to select reltuples by relname, which can return multiple rows when identical table names exist across schemas; update the query in the function using tableName to scope by schema (e.g., join pg_namespace and filter on nspname) or require/accept a schema-qualified identifier and use a schema-aware lookup (join pg_namespace on pg_class.relnamespace or use to_regclass with schema-qualified name) so the row returned is deterministic and references the correct table.
38-47: Consider adding try/catch for consistency withcheckTable.Unlike
checkTable(lines 12-17), this function doesn't wrap thepgQuerycall in a try/catch. IfpgQuerythrows (connection error, permission denied), the exception propagates to the caller where it becomes a generic "Check failed" message, losing the specific error context.Additionally,
checkTabledefensively callsisDatabaseConfigured()at line 10, but this function doesn't—though the caller inhealth.tsdoes gate ondbOk, adding the check here would be more consistent.♻️ Suggested refactor for consistency
/** Estimated row count from pg_class (cheap, no table scan). */ export async function estimateRowCount(tableName: string): Promise<number> { + if (!isDatabaseConfigured()) return -1; if (!SAFE_IDENTIFIER.test(tableName)) return -1; + try { const { rows } = await pgQuery<{ reltuples: string }>( 'SELECT reltuples FROM pg_class WHERE relname = $1', [tableName], ); if (!rows[0]) return -1; const est = Number(rows[0].reltuples); return est < 0 ? 0 : Math.round(est); + } catch { + return -1; + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/health-helpers.ts` around lines 38 - 47, The estimateRowCount function should mirror checkTable's defensive behavior: first call isDatabaseConfigured() and return -1 if it is false, and wrap the pgQuery invocation in a try/catch so DB errors don't leak as generic failures; on catch return -1 and log the error with context (use the same logger pattern as checkTable) so callers still get -1 but the real error is recorded. Ensure you update the estimateRowCount function (referencing pgQuery and isDatabaseConfigured) to implement this try/catch and early-config check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/utils/health-helpers.ts`:
- Around line 40-43: The query in health-helpers uses pgQuery to select
reltuples by relname, which can return multiple rows when identical table names
exist across schemas; update the query in the function using tableName to scope
by schema (e.g., join pg_namespace and filter on nspname) or require/accept a
schema-qualified identifier and use a schema-aware lookup (join pg_namespace on
pg_class.relnamespace or use to_regclass with schema-qualified name) so the row
returned is deterministic and references the correct table.
- Around line 38-47: The estimateRowCount function should mirror checkTable's
defensive behavior: first call isDatabaseConfigured() and return -1 if it is
false, and wrap the pgQuery invocation in a try/catch so DB errors don't leak as
generic failures; on catch return -1 and log the error with context (use the
same logger pattern as checkTable) so callers still get -1 but the real error is
recorded. Ensure you update the estimateRowCount function (referencing pgQuery
and isDatabaseConfigured) to implement this try/catch and early-config check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 92c63b14-b6c5-4c75-8bb2-66e756890952
📒 Files selected for processing (3)
src/tools/__tests__/health.test.tssrc/tools/health.tssrc/utils/health-helpers.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/tools/tests/health.test.ts
- src/tools/health.ts
…ndling - Scope pg_class query to public schema via pg_namespace join, avoiding ambiguity when identical table names exist across schemas - Add isDatabaseConfigured() early return matching checkTable's pattern - Wrap pgQuery in try/catch returning -1 on failure, preventing DB errors from propagating as unhandled exceptions https://claude.ai/code/session_01JtWxbWQKXXxmE51EM7dmJh
get actionable diagnostics in a single call
https://claude.ai/code/session_01JtWxbWQKXXxmE51EM7dmJh