Real-time voice AI tutor for cocktail mixology.
Project notes:
- Product plan: `plan.md`
- Build/run workflow notes: `workflow.md`
- Real-time voice session with LiveKit room join + browser mic capture.
- Scenario catalog (`/scenarios`) with objective checklist shown live in the UI.
- Upload a cocktail image to generate a one-off custom lesson (`POST /sessions/from-image`).
- Objective claim pipeline with three sources: `live`, `classifier`, `judge`.
- Post-session summary + assessment + follow-ups.
- Learner profile/memory extraction and profile-aware prompt injection.
- Resume flow (same handle + scenario, with expiry window).
- Session metrics ingest + rollups.
- Past sessions list with quick resume from frontend.
Home and scenario selection:
Live session view:
Post-session review and feedback:
Beyond getting the core voice interaction working, we focused on what makes this a tutor rather than only a voice chatbot.
The brief asks for "a real-time voice AI tutor." The word carrying most weight is tutor, so we focused on three areas:
- Structure, not only conversation. Free-form "talk to a bartender AI" often stalls quickly. We use task-based sessions with a setting, a character, and explicit goals so learners always have direction.
- Cascade pipeline (`STT -> LLM -> TTS`). For cocktail tutoring, transcript meaning is the main signal; we do not need speech-to-speech-only features like detailed prosody scoring. Cascade gives lower coupling, easier debugging, and per-stage model control.
- Observability. Voice UX is highly sensitive to lag. We treat latency and reliability as product features, not only backend metrics.
We bias toward data-driven decisions: benchmark where possible, and avoid making architecture/model choices only on intuition.
We chose cocktail mixology because it is concrete, scenario-friendly, and personally motivating to practice.
Each cocktail lesson is defined as a scenario with measurable goals (for example, form goals like using "muddle"). At session end, we generate a summary and evaluate outcomes based on transcript meaning, not simple keyword matching.
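A scenario might look roughly like this (the field names are a hypothetical shape for illustration; the authored files in `scenarios/` may differ):

```python
# Hypothetical shape of a cocktail-lesson scenario with primary/stretch goals.
scenario = {
    "id": "old-fashioned-basics",
    "setting": "quiet speakeasy, early evening shift",
    "character": "patient head bartender",
    "objectives": [
        {"id": "use-muddle", "tier": "primary",
         "goal": 'Use the verb "muddle" while describing a step'},
        {"id": "explain-dilution", "tier": "stretch",
         "goal": "Explain why stirring dilutes the drink"},
    ],
}

def objective_ids(scenario: dict, tier: str) -> list[str]:
    """Collect objective ids for one tier (primary or stretch)."""
    return [o["id"] for o in scenario["objectives"] if o["tier"] == tier]
```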
- On load: handle input (editable), mic-permission status, join action, returner hint when profile exists.
- During a session: live objectives (primary + stretch), transcript, end-session control.
- After a session: assessment, missed-opportunities feedback, remembered-profile card, clear-data action.
We also support visual-first custom lessons via `POST /sessions/from-image`: upload a cocktail image, generate a one-off scenario with goals, and run the same tutoring loop (live objectives, assessment, memory).
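A client call to this endpoint can be sketched with the standard library; `build_from_image_request` is a hypothetical helper that builds the multipart request by hand and does not send it:

```python
import io
import urllib.request
import uuid

def build_from_image_request(base_url: str, handle: str, image_bytes: bytes,
                             filename: str = "drink.jpg") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST /sessions/from-image."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    def part(headers: str, payload: bytes) -> None:
        # Each part: boundary line, headers, blank line, payload.
        body.write(f"--{boundary}\r\n{headers}\r\n\r\n".encode())
        body.write(payload)
        body.write(b"\r\n")

    part('Content-Disposition: form-data; name="handle"', handle.encode())
    part(f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'
         "Content-Type: image/jpeg", image_bytes)
    body.write(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/sessions/from-image",
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```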
We chose a cascade pipeline (STT -> LLM -> TTS) for observability, modularity, and easier tuning/debugging, even though it adds latency compared with speech-to-speech.
We ran microbenchmarks (`benchmarks/results.md`, run date 2026-04-23):
- STT: `cartesia/ink-whisper` had the best latency (P50 414 ms) and is the recommended default in `.env.example`.
- LLM: `openai/gpt-4.1-mini` had much faster TTFT (P50 1450 ms cold) than `openai/gpt-5.3-chat-latest` (P50 2491 ms), with close judged quality (4.73 vs 4.80 / 5).
- TTS: `cartesia/sonic-turbo` remained the fastest option (P50 369 ms).
So for the live loop we prioritize responsiveness (`ink-whisper` + `gpt-4.1-mini` + `sonic-turbo`), while using stronger-but-slower models where latency is less user-visible (post-session judge and memory extraction).
Model config is env-driven. If `STT_MODEL` / `LLM_MODEL` / `TTS_MODEL` are unset, the code-level fallbacks are `deepgram/flux-general-en`, `openai/gpt-5.3-chat-latest`, and `cartesia/sonic-turbo`.
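The resolution order amounts to env-var first, then the code-level fallback (a minimal sketch; the helper name is illustrative):

```python
import os

# Code-level fallbacks as documented above; env vars override them.
DEFAULTS = {
    "STT_MODEL": "deepgram/flux-general-en",
    "LLM_MODEL": "openai/gpt-5.3-chat-latest",
    "TTS_MODEL": "cartesia/sonic-turbo",
}

def resolve_model(name: str) -> str:
    """Return the env-configured model, or the code-level default."""
    return os.environ.get(name) or DEFAULTS[name]
```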
- Objective progress is server-owned and idempotent (keyed by `session_id` + `objective_id`) to avoid duplicate/missed state during interruptions.
- Session close runs two post-session tasks in parallel: assessment generation and learner-memory extraction.
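The idempotent-progress idea can be sketched as an upsert keyed on `(session_id, objective_id)` (a minimal sketch assuming a SQLite table with a composite primary key; the real schema in `server/` may differ):

```python
import sqlite3

def record_claim(db: sqlite3.Connection, session_id: str,
                 objective_id: str, source: str) -> bool:
    """Insert a claim once per (session_id, objective_id); repeats are no-ops."""
    cur = db.execute(
        "INSERT INTO claims (session_id, objective_id, source) VALUES (?, ?, ?) "
        "ON CONFLICT(session_id, objective_id) DO NOTHING",
        (session_id, objective_id, source),
    )
    return cur.rowcount == 1  # True only on the first insert

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE claims (
    session_id   TEXT NOT NULL,
    objective_id TEXT NOT NULL,
    source       TEXT NOT NULL,
    PRIMARY KEY (session_id, objective_id)
)""")
```

A retried or duplicated claim (for example from a reconnect) then leaves state unchanged instead of double-counting progress.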
- Resumability restores unfinished sessions by handle + scenario with a recap-oriented prompt seed.
- Memory is intentionally lightweight and inspectable, with a one-click clear-data path.
- We kept advanced features out of v1 scope: auth/admin, pronunciation scoring, and complex mid-session hint systems.
This system is split into three cooperating parts: a browser frontend for voice UX,
a LiveKit worker (agent/) that runs the real-time STT -> LLM -> TTS loop and
objective tools, and a FastAPI server (server/) that owns durable state
(sessions/events/claims/assessments/memory) in SQLite and serves APIs/UI.
```mermaid
flowchart LR
    A["Frontend (browser)"] -->|"join + mic + UI events"| B["LiveKit Room"]
    B --> C["Agent Worker (agent/)"]
    C -->|"internal HTTP"| D["FastAPI Server (server/)"]
    D --> E["SQLite"]
    C -->|"objective/tool + metrics"| D
    D -->|"summary/assessment/profile"| A
```
Current design is intentionally simple for local/demo use: one FastAPI process, one SQLite file, and LiveKit-managed room/agent routing. The architecture boundaries are already split so scaling is mainly an infrastructure swap, not a product rewrite.
- SQLite write contention on `sessions`/`events`/`claims`/`assessments`/`metrics`.
- Post-session LLM jobs (assessment + memory extraction) during traffic spikes.
- Real-time model latency/cost under high concurrent session counts.
- Regional round-trip delay when users are far from the worker/provider region.
- Move stateful tables from SQLite to Postgres with connection pooling.
- Keep FastAPI stateless and run multiple replicas behind a load balancer.
- Scale LiveKit workers horizontally by active-room count.
- Move post-session tasks to an async queue (for example Redis/Kafka + workers) so real-time turns stay low-latency.
- Store high-volume telemetry in a metrics pipeline (not OLTP tables), keeping product-state writes separate from observability writes.
- Add per-handle and per-IP rate limits on session creation.
- Add jittered scheduling/backpressure for end-of-session jobs to avoid thundering-herd spikes.
- Run regional worker pools and route users to nearest region.
- Track SLOs: end-to-end turn latency, STT finalization time, LLM TTFT, TTS TTFB, and post-session queue delay.
This keeps the current tutoring flow and API contracts intact while removing single-node bottlenecks as traffic grows.
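The jittered-scheduling point above can be sketched as full-jitter exponential backoff for end-of-session jobs (parameters are illustrative):

```python
import random

def jittered_delay(base_s: float, attempt: int, cap_s: float = 60.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)].

    Spreading job start times this way avoids every session that ends at
    the same moment hitting the LLM provider simultaneously.
    """
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```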
- Create env file:

  ```sh
  cp .env.example .env
  # required for homework/demo: LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET
  # INTERNAL_API_KEY can stay default for local single-machine testing
  ```

- Build and start:

  ```sh
  docker compose up --build
  ```

- Open http://localhost:8000 and start a session.

Stop and remove the data volume:

```sh
docker compose down -v
```

Requires Python 3.12 and uv.
```sh
cp .env.example .env
set -a; source .env; set +a
mkdir -p data
export DATABASE_PATH="$(pwd)/data/tutor.sqlite"
export SERVER_BASE_URL="http://127.0.0.1:8000"
uv sync
uv run python -m agent.worker download-files  # one-time model assets
```

Run server and worker in separate terminals:

```sh
# terminal 1
set -a; source .env; set +a
export DATABASE_PATH="$(pwd)/data/tutor.sqlite"
uv run uvicorn server.main:app --reload
```

```sh
# terminal 2
set -a; source .env; set +a
export SERVER_BASE_URL="http://127.0.0.1:8000"
uv run python -m agent.worker dev
```

OpenAPI/docs are disabled in app config, so use routes directly.
Public routes:
- `GET /healthz`
- `GET /scenarios`
- `GET /scenarios/{scenario_id}`
- `POST /sessions`
- `POST /sessions/from-image` (multipart: `handle` + `image`)
- `POST /sessions/{session_id}/end`
- `GET /sessions/{session_id}`
- `GET /sessions/{session_id}/summary`
- `GET /sessions/{session_id}/metrics`
- `GET /learners/{handle}`
- `GET /learners/{handle}/sessions`
- `DELETE /learners/{handle}`
Internal agent routes (require `X-Internal-Key`):
- `POST /sessions/{session_id}/events`
- `POST /sessions/{session_id}/claim`
- `POST /sessions/{session_id}/metrics`
- `GET /sessions/{session_id}/seed`
- `GET /sessions/{session_id}/scenario`
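A worker-side call to an internal route can be sketched as follows (the helper name is hypothetical and the request is built but not sent; env defaults match those documented below):

```python
import json
import os
import urllib.request

def internal_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (not send) an internal POST carrying the X-Internal-Key header."""
    return urllib.request.Request(
        os.environ.get("SERVER_BASE_URL", "http://127.0.0.1:8000") + path,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "X-Internal-Key": os.environ.get("INTERNAL_API_KEY", "change-me"),
        },
        method="POST",
    )
```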
- `server/`: FastAPI app, SQLite persistence/migrations, routes, post-session pipeline.
- `agent/`: LiveKit worker, conversation loop, classifier, persona guard, metrics client.
- `frontend/`: vanilla HTML/CSS/JS app for topic selection, live session, and review/resume UX.
- `scenarios/`: authored learning scenarios/objectives.
- `benchmarks/`: Stage 4.5 model benchmark scripts (`python -m benchmarks [tts|stt|llm|all]`).
- `references/`: research scripts/data used during development.
- `media/`: static assets.
Required:
- `LIVEKIT_URL`
- `LIVEKIT_API_KEY`
- `LIVEKIT_API_SECRET`
Optional / security hardening:
- `INTERNAL_API_KEY` (defaults to `change-me`; set a unique value for shared deployments)
Model selection:
- `STT_MODEL`
- `LLM_MODEL`
- `TTS_MODEL`
- `JUDGE_MODEL`
- `MEMORY_MODEL`
- `SCENARIO_GEN_MODEL`
- `SCENARIO_GEN_TIMEOUT_S`
- `OBJECTIVE_CLASSIFIER_MODEL` (optional; defaults in code)
Runtime behavior:
- `MEMORY_ENABLED`
- `VAD_ACTIVATION_THRESHOLD`
- `VAD_MIN_SILENCE_MS`
- `DATABASE_PATH` (useful for local runs; Docker uses `/data/tutor.sqlite`)
- `SERVER_BASE_URL` (agent -> server base URL; must be `http://127.0.0.1:8000` for a local worker)
Security note: keep `.env` local and untracked. This repo tracks only `.env.example`.


