Real-time voice AI tutor for cocktail mixology.
Project notes:
- Product plan: `plan.md`
- Build/run workflow notes: `workflow.md`
- Real-time voice session with LiveKit room join + browser mic capture.
- Scenario catalog (`/scenarios`) with objective checklist shown live in the UI.
- Upload a cocktail image to generate a one-off custom lesson (`POST /sessions/from-image`).
- Objective claim pipeline with three sources: `live`, `classifier`, `judge`.
- Post-session summary + assessment + follow-ups.
- Learner profile/memory extraction and profile-aware prompt injection.
- Resume flow (same handle + scenario, with expiry window).
- Session metrics ingest + rollups.
- Past sessions list with quick resume from frontend.
Home and scenario selection:
Live session view:
Post-session review and feedback:
Beyond getting the core voice interaction working, we focused on what makes this a tutor rather than only a voice chatbot.
The brief asks for "a real-time voice AI tutor." The word carrying most weight is tutor, so we focused on three areas:
- Structure, not only conversation. Free-form "talk to a bartender AI" often stalls quickly. We use task-based sessions with a setting, a character, and explicit goals so learners always have direction.
- Cascade pipeline (`STT -> LLM -> TTS`). For cocktail tutoring, transcript meaning is the main signal; we do not need speech-to-speech-only features like detailed prosody scoring. Cascade gives lower coupling, easier debugging, and per-stage model control.
- Observability. Voice UX is highly sensitive to lag. We treat latency and reliability as product features, not only backend metrics.
We bias toward data-driven decisions: benchmark where possible, and avoid making architecture/model choices only on intuition.
We chose cocktail mixology because it is concrete, scenario-friendly, and personally motivating to practice.
Each cocktail lesson is defined as a scenario with measurable goals (for example, form goals like using "muddle"). At session end, we generate a summary and evaluate outcomes based on transcript meaning, not simple keyword matching.
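A scenario might look roughly like this (the field names are a hypothetical shape for illustration; the authored files in `scenarios/` may differ):

```python
# Hypothetical shape of a cocktail-lesson scenario with primary/stretch goals.
scenario = {
    "id": "old-fashioned-basics",
    "setting": "quiet speakeasy, early evening shift",
    "character": "patient head bartender",
    "objectives": [
        {"id": "use-muddle", "tier": "primary",
         "goal": 'Use the verb "muddle" while describing a step'},
        {"id": "explain-dilution", "tier": "stretch",
         "goal": "Explain why stirring dilutes the drink"},
    ],
}

def objective_ids(scenario: dict, tier: str) -> list[str]:
    """Collect objective ids for one tier (primary or stretch)."""
    return [o["id"] for o in scenario["objectives"] if o["tier"] == tier]
```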
- On load: handle input (editable), mic-permission status, join action, returner hint when profile exists.
- During a session: live objectives (primary + stretch), transcript, end-session control.
- After a session: assessment, missed-opportunities feedback, remembered-profile card, clear-data action.
We also support visual-first custom lessons via `POST /sessions/from-image`: upload a cocktail image, generate a one-off scenario with goals, and run the same tutoring loop (live objectives, assessment, memory).
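A client call to this endpoint can be sketched with the standard library; `build_from_image_request` is a hypothetical helper that builds the multipart request by hand and does not send it:

```python
import io
import urllib.request
import uuid

def build_from_image_request(base_url: str, handle: str, image_bytes: bytes,
                             filename: str = "drink.jpg") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST /sessions/from-image."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    def part(headers: str, payload: bytes) -> None:
        # Each part: boundary line, headers, blank line, payload.
        body.write(f"--{boundary}\r\n{headers}\r\n\r\n".encode())
        body.write(payload)
        body.write(b"\r\n")

    part('Content-Disposition: form-data; name="handle"', handle.encode())
    part(f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'
         "Content-Type: image/jpeg", image_bytes)
    body.write(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/sessions/from-image",
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```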
We chose a cascade pipeline (STT -> LLM -> TTS) for observability, modularity, and easier tuning/debugging, even though it adds latency compared with speech-to-speech.
We ran microbenchmarks (`benchmarks/results.md`, run date 2026-04-23):
- STT: `cartesia/ink-whisper` had the best latency (P50 414 ms) and is the recommended default in `.env.example`.
- LLM: `openai/gpt-4.1-mini` had much faster TTFT (P50 1450 ms cold) than `openai/gpt-5.3-chat-latest` (P50 2491 ms), with close judged quality (4.73 vs 4.80 / 5).
- TTS: `cartesia/sonic-turbo` remained the fastest option (P50 369 ms).
So for the live loop we prioritize responsiveness (`ink-whisper` + `gpt-4.1-mini` + `sonic-turbo`), while using stronger-but-slower models where latency is less user-visible (post-session judge and memory extraction).
Model config is env-driven. If `STT_MODEL` / `LLM_MODEL` / `TTS_MODEL` are unset, the code-level fallbacks are `deepgram/flux-general-en`, `openai/gpt-5.3-chat-latest`, and `cartesia/sonic-turbo`.
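The resolution order amounts to env-var first, then the code-level fallback (a minimal sketch; the helper name is illustrative):

```python
import os

# Code-level fallbacks as documented above; env vars override them.
DEFAULTS = {
    "STT_MODEL": "deepgram/flux-general-en",
    "LLM_MODEL": "openai/gpt-5.3-chat-latest",
    "TTS_MODEL": "cartesia/sonic-turbo",
}

def resolve_model(name: str) -> str:
    """Return the env-configured model, or the code-level default."""
    return os.environ.get(name) or DEFAULTS[name]
```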
- Objective progress is server-owned and idempotent (keyed by `session_id` + `objective_id`) to avoid duplicate/missed state during interruptions.
- Session close runs two post-session tasks in parallel: assessment generation and learner-memory extraction.
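The idempotent-progress idea can be sketched as an upsert keyed on `(session_id, objective_id)` (a minimal sketch assuming a SQLite table with a composite primary key; the real schema in `server/` may differ):

```python
import sqlite3

def record_claim(db: sqlite3.Connection, session_id: str,
                 objective_id: str, source: str) -> bool:
    """Insert a claim once per (session_id, objective_id); repeats are no-ops."""
    cur = db.execute(
        "INSERT INTO claims (session_id, objective_id, source) VALUES (?, ?, ?) "
        "ON CONFLICT(session_id, objective_id) DO NOTHING",
        (session_id, objective_id, source),
    )
    return cur.rowcount == 1  # True only on the first insert

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE claims (
    session_id   TEXT NOT NULL,
    objective_id TEXT NOT NULL,
    source       TEXT NOT NULL,
    PRIMARY KEY (session_id, objective_id)
)""")
```

A retried or duplicated claim (for example from a reconnect) then leaves state unchanged instead of double-counting progress.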
- Resumability restores unfinished sessions by handle + scenario with a recap-oriented prompt seed.
- Memory is intentionally lightweight and inspectable, with a one-click clear-data path.
- We kept advanced features out of v1 scope: auth/admin, pronunciation scoring, and complex mid-session hint systems.
This system is split into three cooperating parts: a browser frontend for voice UX,
a LiveKit worker (agent/) that runs the real-time STT -> LLM -> TTS loop and
objective tools, and a FastAPI server (server/) that owns durable state
(sessions/events/claims/assessments/memory) in SQLite and serves APIs/UI.
```mermaid
flowchart LR
    A["Frontend (browser)"] -->|"join + mic + UI events"| B["LiveKit Room"]
    B --> C["Agent Worker (agent/)"]
    C -->|"internal HTTP"| D["FastAPI Server (server/)"]
    D --> E["SQLite"]
    C -->|"objective/tool + metrics"| D
    D -->|"summary/assessment/profile"| A
```
Current design is intentionally simple for local/demo use: one FastAPI process, one SQLite file, and LiveKit-managed room/agent routing. The architecture boundaries are already split so scaling is mainly an infrastructure swap, not a product rewrite.
- SQLite write contention on `sessions`/`events`/`claims`/`assessments`/`metrics`.
- Post-session LLM jobs (assessment + memory extraction) during traffic spikes.
- Real-time model latency/cost under high concurrent session counts.
- Regional round-trip delay when users are far from the worker/provider region.
- Move stateful tables from SQLite to Postgres with connection pooling.
- Keep FastAPI stateless and run multiple replicas behind a load balancer.
- Scale LiveKit workers horizontally by active-room count.
- Move post-session tasks to an async queue (for example Redis/Kafka + workers) so real-time turns stay low-latency.
- Store high-volume telemetry in a metrics pipeline (not OLTP tables), keeping product-state writes separate from observability writes.
- Add per-handle and per-IP rate limits on session creation.
- Add jittered scheduling/backpressure for end-of-session jobs to avoid thundering-herd spikes.
- Run regional worker pools and route users to nearest region.
- Track SLOs: end-to-end turn latency, STT finalization time, LLM TTFT, TTS TTFB, and post-session queue delay.
This keeps the current tutoring flow and API contracts intact while removing single-node bottlenecks as traffic grows.
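The jittered-scheduling point above can be sketched as full-jitter exponential backoff for end-of-session jobs (parameters are illustrative):

```python
import random

def jittered_delay(base_s: float, attempt: int, cap_s: float = 60.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)].

    Spreading job start times this way avoids every session that ends at
    the same moment hitting the LLM provider simultaneously.
    """
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```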
- Create env file:

  ```sh
  cp .env.example .env
  # required for homework/demo: LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET
  # INTERNAL_API_KEY can stay default for local single-machine testing
  ```

- Build and start:

  ```sh
  docker compose up --build
  ```

- Open http://localhost:8000 and start a session.

Stop and remove the data volume:

```sh
docker compose down -v
```

Requires Python 3.12 and uv.
```sh
cp .env.example .env
set -a; source .env; set +a
mkdir -p data
export DATABASE_PATH="$(pwd)/data/tutor.sqlite"
export SERVER_BASE_URL="http://127.0.0.1:8000"
uv sync
uv run python -m agent.worker download-files  # one-time model assets
```

Run server and worker in separate terminals:

```sh
# terminal 1
set -a; source .env; set +a
export DATABASE_PATH="$(pwd)/data/tutor.sqlite"
uv run uvicorn server.main:app --reload
```

```sh
# terminal 2
set -a; source .env; set +a
export SERVER_BASE_URL="http://127.0.0.1:8000"
uv run python -m agent.worker dev
```

OpenAPI/docs are disabled in app config, so use routes directly.
Public routes:
- `GET /healthz`
- `GET /scenarios`
- `GET /scenarios/{scenario_id}`
- `POST /sessions`
- `POST /sessions/from-image` (multipart: `handle` + `image`)
- `POST /sessions/{session_id}/end`
- `GET /sessions/{session_id}`
- `GET /sessions/{session_id}/summary`
- `GET /sessions/{session_id}/metrics`
- `GET /learners/{handle}`
- `GET /learners/{handle}/sessions`
- `DELETE /learners/{handle}`
Internal agent routes (require `X-Internal-Key`):
- `POST /sessions/{session_id}/events`
- `POST /sessions/{session_id}/claim`
- `POST /sessions/{session_id}/metrics`
- `GET /sessions/{session_id}/seed`
- `GET /sessions/{session_id}/scenario`
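A worker-side call to an internal route can be sketched as follows (the helper name is hypothetical and the request is built but not sent; env defaults match those documented below):

```python
import json
import os
import urllib.request

def internal_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (not send) an internal POST carrying the X-Internal-Key header."""
    return urllib.request.Request(
        os.environ.get("SERVER_BASE_URL", "http://127.0.0.1:8000") + path,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "X-Internal-Key": os.environ.get("INTERNAL_API_KEY", "change-me"),
        },
        method="POST",
    )
```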
- `server/`: FastAPI app, SQLite persistence/migrations, routes, post-session pipeline.
- `agent/`: LiveKit worker, conversation loop, classifier, persona guard, metrics client.
- `frontend/`: vanilla HTML/CSS/JS app for topic selection, live session, and review/resume UX.
- `scenarios/`: authored learning scenarios/objectives.
- `benchmarks/`: Stage 4.5 model benchmark scripts (`python -m benchmarks [tts|stt|llm|all]`).
- `references/`: research scripts/data used during development.
- `media/`: static assets.
Required:
- `LIVEKIT_URL`
- `LIVEKIT_API_KEY`
- `LIVEKIT_API_SECRET`
Optional / security hardening:
- `INTERNAL_API_KEY` (defaults to `change-me`; set a unique value for shared deployments)
Model selection:
- `STT_MODEL`
- `LLM_MODEL`
- `TTS_MODEL`
- `JUDGE_MODEL`
- `MEMORY_MODEL`
- `SCENARIO_GEN_MODEL`
- `SCENARIO_GEN_TIMEOUT_S`
- `OBJECTIVE_CLASSIFIER_MODEL` (optional; defaults in code)
Runtime behavior:
- `MEMORY_ENABLED`
- `VAD_ACTIVATION_THRESHOLD`
- `VAD_MIN_SILENCE_MS`
- `DATABASE_PATH` (useful for local runs; Docker uses `/data/tutor.sqlite`)
- `SERVER_BASE_URL` (agent -> server base URL; must be `http://127.0.0.1:8000` for a local worker)
Security note: keep `.env` local and untracked. This repo tracks only `.env.example`.


