telemetry: production /api/stats with persistent query_log + Vercel deploy fix #4
Conversation
The Vercel build log showed: `Bundle size (324.02 MB) exceeds limit. Enabling runtime dependency installation.` When that fallback triggers, deps get pip-installed at cold-start time inside the function rather than baked into the bundle. The install overran the cold-start budget on every cold lambda, so every request to a cold instance returned 500. The runtime logs confirm consistent 500s on / and /api/stats since the heavy deps were added.

Fix:
- Move boto3 (declared but never imported anywhere), google-cloud-aiplatform, google-cloud-discoveryengine, and langgraph from default dependencies to optional dependency groups (`agent`, `vertex`, `aws`) in pyproject.toml.
- Mirror the trim in requirements.txt so Vercel's auto-detection installs only the API-skeleton deps.
- Lazy-import langgraph inside `build_graph()` in nexusrag/agent/graph.py so module load no longer pulls langgraph into the boot path (sketched below).

Bundle size drops from 324MB to ~100MB, well under the 250MB cap. Runtime dependency installation is no longer needed; cold start imports cleanly. The /v1/run route still works on deployments where the optional `agent` extra is installed (e.g. Cloud Run or Fly.io for the actual LLM gateway). On Vercel, calling /v1/run will fail with a clean ImportError, which is the honest signal that LLM heavy-lifting is intentionally not on this serverless surface: Vercel hosts the API skeleton, telemetry, docs, and lightweight routes.
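For reference, a minimal sketch of the lazy-import pattern; the real `build_graph()` in nexusrag/agent/graph.py wires actual nodes, and the state schema and error message here are illustrative:

```python
# Sketch only: langgraph is imported inside the function so that merely
# importing this module never pulls langgraph into the serverless boot path.
def build_graph():
    try:
        from langgraph.graph import END, StateGraph
    except ImportError as exc:
        # Slim deploys (e.g. Vercel) without the `agent` extra land here,
        # giving /v1/run a clean, explicit failure instead of a boot crash.
        raise ImportError(
            "langgraph is not installed; install the `agent` extra to use /v1/run"
        ) from exc

    graph = StateGraph(dict)  # real state schema elided in this sketch
    # ... node/edge wiring elided ...
    return graph
```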
The bundle-slim commit moved boto3, google-cloud-aiplatform, and google-cloud-discoveryengine to optional dependency groups so the Vercel deploy fits under the 250MB cap. But two provider modules had top-level imports of the supporting libraries' exception types (`botocore.exceptions.BotoCoreError`/`ClientError`, `google.api_core.exceptions.GoogleAPICallError`/`RetryError`), breaking module load on the slim deploy.

Fix: wrap each import in try/except ImportError and define empty stub classes when the optional package isn't installed. The stubs are unreachable in normal flow because no code path that would raise them runs without the underlying SDK; they exist purely so the `isinstance(exc, ...)` guard expressions stay valid (a sketch follows below).

Sites:
- nexusrag/providers/retrieval/bedrock_kb.py — BotoCoreError, ClientError
- nexusrag/providers/retrieval/vertex_ai.py — GoogleAPICallError, RetryError

After this fix, the FastAPI app imports cleanly with only the slim default dependency set installed (verified locally: 399 routes register, no ModuleNotFoundError).
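A sketch of the guard as applied at the bedrock_kb.py site; the vertex_ai.py site mirrors it for the Google exception types:

```python
try:
    from botocore.exceptions import BotoCoreError, ClientError
except ImportError:
    # Slim deploy without the optional `aws` extra: define inert stand-ins so
    # expressions like `isinstance(exc, (BotoCoreError, ClientError))` still
    # evaluate. Nothing can raise these without the real SDK installed.
    class BotoCoreError(Exception):
        pass

    class ClientError(Exception):
        pass
```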
Implements the Tier-A telemetry contract from https://github.com/IgnazioDS/IgnazioDS/blob/main/TELEMETRY_SCHEMA.md for consumption by the Production Telemetry panel on https://eleventh.dev. The widget polls every 30s.

Schema changes:
- New alembic migration 0031_query_log adds the query_log table with columns (id, query_id, started_at, completed_at, latency_ms, retrieved_chunks, status) and a CHECK constraint on status. A descending index on completed_at supports the windowed aggregations without sequential scans.
- New domain model `QueryLog` (nexusrag/domain/models.py) registers the table on the existing Base.metadata so alembic autogeneration stays consistent with hand-written migrations.

Recording (write path):
- A small `record_query()` helper in nexusrag/persistence/repos/query_log.py performs the insert and computes latency_ms from started_at/completed_at.
- The existing `request_context_middleware` in nexusrag/apps/api/main.py is extended to schedule a fire-and-forget query_log write for every request whose path starts with /v1/run or /run. The DB write opens its own session via SessionLocal so the request path never blocks on telemetry, and any failure is swallowed with a warning log — telemetry must not break the request. (A condensed sketch follows this description.)
- Routes can populate request.state.retrieved_chunks to surface the actual chunk count for the row; otherwise the field is recorded as 0.

Aggregation (read path):
- nexusrag/persistence/repos/query_log.py also exposes `aggregate()`, which runs windowed counts, percentile_cont(0.50/0.95) over the 24h window for p50/p95 latency, AVG over 24h for retrieval size, COUNT(*) on chunks for indexed_chunks, and MAX(completed_at) for last_active_at. Each rollup is a separate cheap query so a transient failure in one doesn't blow up the whole response.

Route:
- nexusrag/apps/api/routes/stats.py registers GET /api/stats and OPTIONS /api/stats (CORS preflight, returns 204). The route is mounted outside the /v1 prefix so the public path stays canonical.
- main.py adds /api/stats to _LEGACY_EXEMPT_PREFIXES so the deprecation Sunset/Link headers don't get attached to a public endpoint.
- All counters are clamped by SAFETY_CAPS in the repo module to prevent runaway exposure (queries_total <= 10M, queries_24h <= 1M, etc).

Privacy:
- query_log stores ONLY id, query_id, timestamps, latency, chunk count, and a constrained status string. No prompt text, model output, tenant ids, or user ids ever land in this table.
- /api/stats returns ONLY counts and percentiles, never row-level data.

Failure mode:
- /api/stats never returns HTTP 5xx. On aggregator failure the route returns HTTP 200 with status="degraded", zeroed metrics, and an envelope that stays contract-compliant. Internal error messages are logged but never appear in the response body (a public endpoint must not leak detail).

Tests (nexusrag/tests/unit/test_stats_route.py):
- happy path: response shape matches the Tier-A contract, metrics surfaced
- degraded path: aggregator raises -> HTTP 200 with status=degraded, zero metrics, no internal error string in the body
- header coverage: Cache-Control, CORS-* headers present
- OPTIONS preflight: 204 with CORS headers
- field-type parametrization for the response envelope
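Returning to the write path above, a condensed sketch of the fire-and-forget recording, assuming the `SessionLocal` factory and `record_query()` helper named in the description (exact signatures are illustrative):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def _write_query_log(**row):
    """Persist one query_log row on a session of its own."""
    try:
        with SessionLocal() as session:  # never borrows the request's session
            record_query(session, **row)
            session.commit()
    except Exception:
        # Telemetry must not break the request: swallow and warn.
        logger.warning("query_log write failed", exc_info=True)

# Inside request_context_middleware, after the response is built:
#     if request.url.path.startswith(("/v1/run", "/run")):
#         asyncio.get_running_loop().create_task(_write_query_log(...))
```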
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f7d00fa3ef
```toml
agent = [
    "langgraph>=0.0.59",
```
Keep langgraph in base dependencies

This commit makes langgraph optional, but /v1/run still depends on it unconditionally via `build_graph()` at runtime. Our default runtime install path (`pip install .` in the Dockerfile) does not install extras, so deployments built from base dependencies will return stream errors for every run request once `from langgraph.graph import END, StateGraph` executes. Keep langgraph in dependencies, or make runtime installs explicitly include the `agent` extra.
```python
response = await call_next(request)
completed_at_dt = datetime.now(timezone.utc)
latency_ms = (time.monotonic() - start) * 1000.0
```
Measure query completion after SSE stream ends

The query log timestamps are captured immediately after `call_next`, but /run returns a `StreamingResponse` whose generation continues after this point, so `completed_at` and `latency_ms` are recorded before the actual query finishes. This makes the published p50/p95 latency metrics systematically too low for real streamed runs and breaks the intended end-to-end timing semantics.
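One way to get end-of-stream timing, sketched under the assumption that the middleware sees a Starlette streaming response with an async `body_iterator`; `record_completion` is a hypothetical callback, not existing code:

```python
from datetime import datetime, timezone

async def _timed_body(body_iterator, on_complete):
    """Yield the original SSE chunks, stamping completion when exhausted."""
    try:
        async for chunk in body_iterator:
            yield chunk
    finally:
        # Fires after the last event is sent (or the client disconnects),
        # so completed_at/latency_ms cover the whole streamed run.
        on_complete(datetime.now(timezone.utc))

# In the middleware, instead of stamping right after call_next():
#     response.body_iterator = _timed_body(response.body_iterator, record_completion)
```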
```python
started_at=started_at_dt,
completed_at=completed_at_dt,
retrieved_chunks=retrieved_chunks,
status="ok" if response.status_code < 400 else "error",
```
Derive query status from run outcome, not HTTP code

For /run, many failures are encoded as SSE error events while the HTTP status remains 200, so this status mapping marks failed runs as ok. That skews `last_active_at` (which filters on `status='ok'`) and any future success/error reporting from query_log. The status should be set from the stream's logical outcome, not only `response.status_code`.
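A possible shape for this, where the /run handler flags the logical outcome on request state; `run_failed` is a hypothetical field, not existing code:

```python
# In the /run event generator, when a failure is emitted as an SSE event:
#     request.state.run_failed = True
#     yield error_event  # HTTP status stays 200

# In the middleware, prefer the logical outcome over the HTTP code:
run_failed = getattr(request.state, "run_failed", False)
status = "error" if run_failed or response.status_code >= 400 else "ok"
```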
```python
chunks_attr = getattr(request.state, "retrieved_chunks", 0)
try:
    retrieved_chunks = int(chunks_attr)
except (TypeError, ValueError):
    retrieved_chunks = 0
```
Persist actual retrieved chunk counts

The middleware falls back to 0 unless `request.state.retrieved_chunks` is set, but there is no assignment to that request-state field in the request handlers, so every inserted row records `retrieved_chunks=0`. As a result, /api/stats will report `avg_retrieval_size` as zero regardless of real retrieval behavior.
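The missing assignment would be a one-liner in the retrieval step of the run handler; `retriever` and `chunks` here are illustrative names, not existing code:

```python
# After retrieval, surface the real count for the telemetry middleware:
chunks = await retriever.retrieve(query)
request.state.retrieved_chunks = len(chunks)
```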
Summary
Three commits on this branch:
- `b4c5651` fix: shrink Vercel bundle below 250MB cap to fix production 500. Heavy deps (`boto3`, `google-cloud-aiplatform`, `google-cloud-discoveryengine`, `langgraph`) moved to optional dependency groups; `langgraph` lazy-imported inside `build_graph()`. Bundle drops 324MB → ~100MB.
- `2eca850` fix: stub `botocore.exceptions` and `google.api_core.exceptions` exception types when those optional packages aren't installed, so module load on the slim deploy doesn't crash on `ImportError`.
- `f7d00fa` feat: Tier-A telemetry — alembic migration `0031_query_log` for the `query_log` table, FastAPI middleware that records `(query_id, started_at, completed_at, latency_ms, retrieved_chunks, status)` per `/v1/run` request, an aggregator that computes `queries_total`, `queries_24h`, `queries_7d`, `p50/p95_latency_ms`, `avg_retrieval_size`, `indexed_chunks`, `last_active_at`, and the public `/api/stats` route per the TELEMETRY_SCHEMA Tier-A contract.

Why two fix commits before the feat
The original prod 500 had two layers:
1. Bundle over the cap (`Bundle size (324.02 MB) exceeds limit. Enabling runtime dependency installation.`), with runtime installs timing out cold starts. Commit 1 fixes this.
2. Top-level `from botocore.exceptions import ...` / `from google.api_core.exceptions import ...` imports, breaking import on the slim runtime. Commit 2 stubs the exception types behind `try/except ImportError`.

Separately, a `model_validator` at `nexusrag/core/config.py:392` refuses to start the app with the default `UI_CURSOR_SECRET`. Resolved by setting the env var on the Vercel project (this is environment work, not code).

Database migration
Adds the `query_log` table with `(id, query_id, started_at, completed_at, latency_ms, retrieved_chunks, status)` and a descending index on `completed_at` for cheap windowed aggregations.
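The rough shape of the migration, assuming standard alembic ops and an `('ok', 'error')` status domain (the committed 0031_query_log may differ in types and naming):

```python
import sqlalchemy as sa
from alembic import op

def upgrade() -> None:
    op.create_table(
        "query_log",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("query_id", sa.String, nullable=False),
        sa.Column("started_at", sa.DateTime(timezone=True), nullable=False),
        sa.Column("completed_at", sa.DateTime(timezone=True), nullable=False),
        sa.Column("latency_ms", sa.Float, nullable=False),
        sa.Column("retrieved_chunks", sa.Integer, nullable=False),
        sa.Column("status", sa.String, nullable=False),
        sa.CheckConstraint("status IN ('ok', 'error')", name="ck_query_log_status"),
    )
    # Descending index so the windowed aggregations avoid sequential scans.
    op.create_index(
        "ix_query_log_completed_at_desc",
        "query_log",
        [sa.text("completed_at DESC")],
    )

def downgrade() -> None:
    op.drop_index("ix_query_log_completed_at_desc", table_name="query_log")
    op.drop_table("query_log")
```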
Privacy

`query_log` stores only id/timestamps/latency/chunk-count/status. No prompt text, model output, tenant id, user id, or IP ever lands there. The `/api/stats` endpoint never returns row-level data.

Failure mode
`/api/stats` never returns HTTP 5xx. On aggregator failure (DB down, schema drift, etc.) the route returns HTTP 200 with `status: "degraded"`, zeroed metrics, and the envelope stays contract-compliant. Internal error messages are logged but never appear in the response body.
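In code, the never-5xx envelope looks roughly like this; the field names follow the Tier-A metrics listed in the summary, `aggregate` stands in for the repo helper, and the exact envelope shape is an assumption:

```python
import logging
from fastapi import APIRouter
from fastapi.responses import JSONResponse

logger = logging.getLogger(__name__)
router = APIRouter()

_ZEROED = {
    "queries_total": 0, "queries_24h": 0, "queries_7d": 0,
    "p50_latency_ms": 0, "p95_latency_ms": 0,
    "avg_retrieval_size": 0, "indexed_chunks": 0, "last_active_at": None,
}

@router.get("/api/stats")
def get_stats() -> JSONResponse:
    try:
        body = {"schema_version": 1, "status": "ok", **aggregate()}
    except Exception:
        # Log the internals, leak nothing: public endpoint, always HTTP 200.
        logger.exception("stats aggregation failed")
        body = {"schema_version": 1, "status": "degraded", **_ZEROED}
    return JSONResponse(body)
```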
Tests

`nexusrag/tests/unit/test_stats_route.py` — happy path, degraded path (assertion: no internal error string in body), header coverage, OPTIONS preflight, field-type parametrization.
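The degraded-path test reduces to something like this, assuming FastAPI's TestClient; the `app` import and monkeypatch target path are illustrative:

```python
from fastapi.testclient import TestClient

def test_stats_degraded_never_5xx(monkeypatch):
    def boom():
        raise RuntimeError("internal detail that must not leak")

    # Force the aggregator to fail inside the route module.
    monkeypatch.setattr("nexusrag.apps.api.routes.stats.aggregate", boom)
    resp = TestClient(app).get("/api/stats")

    assert resp.status_code == 200
    assert resp.json()["status"] == "degraded"
    assert "internal detail" not in resp.text  # no leakage in the body
```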
Test plan

- Local: the app imports cleanly on the slim dependency set (no `ModuleNotFoundError`)
- Vercel build: bundle under the cap (no `Bundle size exceeds limit` warning)
- Set the `UI_CURSOR_SECRET` env var on the production project
- `curl https://nexusrag-lyart.vercel.app/api/stats` returns HTTP 200 with `schema_version: 1` and the Tier-A metric shape
- Verify the panel in the https://eleventh.dev console
- After `/v1/run` traffic: `queries_total`, `queries_24h` start incrementing