
speak-tutor

Real-time voice AI tutor for cocktail mixology.

Project notes:

  • Product plan: plan.md
  • Build/run workflow notes: workflow.md

Current capabilities

  • Real-time voice session with LiveKit room join + browser mic capture.
  • Scenario catalog (/scenarios) with objective checklist shown live in the UI.
  • Upload a cocktail image to generate a one-off custom lesson (POST /sessions/from-image).
  • Objective claim pipeline with three sources: live, classifier, judge.
  • Post-session summary + assessment + follow-ups.
  • Learner profile/memory extraction and profile-aware prompt injection.
  • Resume flow (same handle + scenario, with expiry window).
  • Session metrics ingest + rollups.
  • Past sessions list with quick resume from frontend.

Screenshots

  • Home and scenario selection
  • Live session view
  • Post-session review and feedback

Product choices (why this is a tutor, not just voice chat)

Beyond getting the core voice interaction working, we focused on what makes this a tutor rather than only a voice chatbot.

How we interpreted the brief

The brief asks for "a real-time voice AI tutor." The word carrying the most weight is tutor, so we focused on three areas:

  1. Structure, not only conversation. Free-form "talk to a bartender AI" often stalls quickly. We use task-based sessions with a setting, a character, and explicit goals so learners always have direction.
  2. Cascade pipeline (STT -> LLM -> TTS). For cocktail tutoring, transcript meaning is the main signal; we do not need speech-to-speech-only features like detailed prosody scoring. A cascade gives lower coupling, easier debugging, and per-stage model control (a minimal loop sketch follows this list).
  3. Observability. Voice UX is highly sensitive to lag. We treat latency and reliability as product features, not only backend metrics.
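
To make point 2 concrete, the whole cascade reduces to three awaited stages per turn. This is a minimal sketch with placeholder stage functions, not the actual agent/ code:

async def transcribe(audio: bytes) -> str:  # STT stage (placeholder)
    return "two ounces of rye, then muddle the sugar"

async def generate(transcript: str, history: list[str]) -> str:  # LLM stage (placeholder)
    return "Good. Add the bitters before you muddle."

async def synthesize(text: str) -> bytes:  # TTS stage (placeholder)
    return b"pcm-audio"

async def one_turn(audio: bytes, history: list[str]) -> bytes:
    # Each stage is a separate, swappable model call, so latency and errors
    # can be measured and debugged per stage -- the benefit claimed above.
    transcript = await transcribe(audio)
    reply = await generate(transcript, history)
    history += [transcript, reply]
    return await synthesize(reply)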

Decision discipline

We bias toward data-driven decisions: we benchmark where possible and avoid making architecture/model choices on intuition alone.

Product direction

We chose cocktail mixology because it is concrete, scenario-friendly, and personally motivating to practice.

Each cocktail lesson is defined as a scenario with measurable goals (for example, form goals such as correctly using the term "muddle"). At session end, we generate a summary and evaluate outcomes based on transcript meaning, not simple keyword matching.
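
The authored files under scenarios/ are the source of truth for the schema; purely as an illustration, a scenario with measurable goals might look roughly like this (field and class names are hypothetical):

from dataclasses import dataclass, field

@dataclass
class Objective:
    id: str
    description: str
    tier: str  # "primary" or "stretch", matching the live objectives UI

@dataclass
class Scenario:
    id: str
    setting: str    # where the session takes place
    character: str  # who the tutor plays
    objectives: list[Objective] = field(default_factory=list)

old_fashioned = Scenario(
    id="old-fashioned-basics",
    setting="a quiet hotel bar",
    character="a patient head bartender",
    objectives=[
        Objective("use-muddle", 'Correctly use the term "muddle"', "primary"),
        Objective("name-garnish", "Name a fitting garnish", "stretch"),
    ],
)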

Learner surfaces

  • On load: handle input (editable), mic-permission status, join action, returner hint when profile exists.
  • During a session: live objectives (primary + stretch), transcript, end-session control.
  • After a session: assessment, missed-opportunities feedback, remembered-profile card, clear-data action.

Additional tutoring capability

We also support visual-first custom lessons via POST /sessions/from-image: upload a cocktail image, generate a one-off scenario with goals, and run the same tutoring loop (live objectives, assessment, memory).
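
A minimal client call against a locally running server (multipart field names handle and image match the API surface below; the response shape is whatever the server returns, so inspect it):

import requests

with open("negroni.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/sessions/from-image",
        data={"handle": "alex"},  # learner handle
        files={"image": ("negroni.jpg", f, "image/jpeg")},  # cocktail photo
    )
resp.raise_for_status()
print(resp.json())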

Tradeoffs we made

We chose a cascade pipeline (STT -> LLM -> TTS) for observability, modularity, and easier tuning/debugging, even though it adds latency compared with speech-to-speech.

We ran microbenchmarks (benchmarks/results.md, run date 2026-04-23):

  • STT: cartesia/ink-whisper had the best latency (P50 414 ms) and is the recommended default in .env.example.
  • LLM: openai/gpt-4.1-mini had much lower TTFT (P50 1450 ms cold) than openai/gpt-5.3-chat-latest (P50 2491 ms), with close judged quality (4.73 vs 4.80 out of 5).
  • TTS: cartesia/sonic-turbo remained the fastest option (P50 369 ms).

So for the live loop we prioritize responsiveness (ink-whisper + gpt-4.1-mini + sonic-turbo), while using stronger-but-slower models where latency is less user-visible (post-session judge and memory extraction).

Model config is env-driven. If STT_MODEL / LLM_MODEL / TTS_MODEL are unset, code-level fallbacks are deepgram/flux-general-en, openai/gpt-5.3-chat-latest, and cartesia/sonic-turbo.
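
In effect, the selection logic amounts to this (a sketch using the fallbacks named above; the real lookup lives in the app config):

import os

# Code-level fallbacks described above; override any of these in .env.
STT_MODEL = os.getenv("STT_MODEL", "deepgram/flux-general-en")
LLM_MODEL = os.getenv("LLM_MODEL", "openai/gpt-5.3-chat-latest")
TTS_MODEL = os.getenv("TTS_MODEL", "cartesia/sonic-turbo")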

Additional engineering notes (moved from plan.md)

  • Objective progress is server-owned and idempotent (keyed on session_id + objective_id) to avoid duplicate or missed state during interruptions; a sketch of the idea follows this list.
  • Session close runs two post-session tasks in parallel: assessment generation and learner-memory extraction.
  • Resumability restores unfinished sessions by handle + scenario with a recap-oriented prompt seed.
  • Memory is intentionally lightweight and inspectable, with a one-click clear-data path.
  • We kept advanced features out of v1 scope: auth/admin, pronunciation scoring, and complex mid-session hint systems.
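
On the first note: one way to get that idempotency in SQLite is a composite primary key on (session_id, objective_id) plus insert-or-ignore. An illustrative sketch, not the actual migration (table and column names are hypothetical):

import sqlite3

conn = sqlite3.connect("data/tutor.sqlite")
conn.execute(
    """CREATE TABLE IF NOT EXISTS objective_claims (
           session_id   TEXT NOT NULL,
           objective_id TEXT NOT NULL,
           source       TEXT NOT NULL,   -- live | classifier | judge
           PRIMARY KEY (session_id, objective_id)
       )"""
)

def record_claim(session_id: str, objective_id: str, source: str) -> None:
    # Re-sending the same claim after a reconnect or retry is a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO objective_claims VALUES (?, ?, ?)",
        (session_id, objective_id, source),
    )
    conn.commit()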

Architecture overview

This system is split into three cooperating parts: a browser frontend for voice UX, a LiveKit worker (agent/) that runs the real-time STT -> LLM -> TTS loop and objective tools, and a FastAPI server (server/) that owns durable state (sessions/events/claims/assessments/memory) in SQLite and serves APIs/UI.

flowchart LR
    A["Frontend (browser)"] -->|"join + mic + UI events"| B["LiveKit Room"]
    B --> C["Agent Worker (agent/)"]
    C -->|"internal HTTP"| D["FastAPI Server (server/)"]
    D --> E["SQLite"]
    C -->|"objective/tool + metrics"| D
    D -->|"summary/assessment/profile"| A

Scaling considerations

Current design is intentionally simple for local/demo use: one FastAPI process, one SQLite file, and LiveKit-managed room/agent routing. The architecture boundaries are already split so scaling is mainly an infrastructure swap, not a product rewrite.

Where bottlenecks appear first

  1. SQLite write contention on sessions/events/claims/assessments/metrics.
  2. Post-session LLM jobs (assessment + memory extraction) during traffic spikes.
  3. Real-time model latency/cost under high concurrent session counts.
  4. Regional round-trip delay when users are far from the worker/provider region.

Path to ~10,000 concurrent sessions

  1. Move stateful tables from SQLite to Postgres with connection pooling.
  2. Keep FastAPI stateless and run multiple replicas behind a load balancer.
  3. Scale LiveKit workers horizontally by active-room count.
  4. Move post-session tasks to an async queue (for example Redis/Kafka + workers) so real-time turns stay low-latency (sketched after this list).
  5. Store high-volume telemetry in a metrics pipeline (not OLTP tables), keeping product-state writes separate from observability writes.
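
For step 4, the enqueue/consume split can be small; a Redis sketch in which the queue name, payload, and handler are all hypothetical:

import json
import redis

r = redis.Redis()  # assumes a reachable Redis instance

def enqueue_post_session(session_id: str) -> None:
    # Called at session close; the real-time path returns immediately.
    r.lpush("post_session_jobs", json.dumps({"session_id": session_id}))

def run_assessment_and_memory(session_id: str) -> None:
    ...  # the two post-session tasks (assessment + memory extraction)

def worker_loop() -> None:
    while True:
        _key, raw = r.brpop("post_session_jobs")  # blocks until a job arrives
        run_assessment_and_memory(json.loads(raw)["session_id"])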

Operational safeguards for scale

  1. Add per-handle and per-IP rate limits on session creation.
  2. Add jittered scheduling/backpressure for end-of-session jobs to avoid thundering-herd spikes (see the sketch after this list).
  3. Run regional worker pools and route users to nearest region.
  4. Track SLOs: end-to-end turn latency, STT finalization time, LLM TTFT, TTS TTFB, and post-session queue delay.
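
Safeguard 2 can be as simple as a randomized delay in front of each end-of-session job (a sketch with an arbitrary 5 s window, not the current code):

import asyncio
import random
from collections.abc import Awaitable, Callable

async def schedule_post_session(job: Callable[[], Awaitable[None]]) -> None:
    # Spread simultaneous session-end bursts over a small random window so
    # downstream LLM providers never see every job land at the same instant.
    await asyncio.sleep(random.uniform(0.0, 5.0))
    await job()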

This keeps the current tutoring flow and API contracts intact while removing single-node bottlenecks as traffic grows.

Run (Docker, recommended)

  1. Create env file:

    cp .env.example .env
    # required for homework/demo: LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET
    # INTERNAL_API_KEY can stay default for local single-machine testing
  2. Build and start:

    docker compose up --build
  3. Open http://localhost:8000 and start a session.

Stop and remove data volume:

docker compose down -v

Run (local, no Docker)

Requires Python 3.12 and uv.

cp .env.example .env
set -a; source .env; set +a
mkdir -p data
export DATABASE_PATH="$(pwd)/data/tutor.sqlite"
export SERVER_BASE_URL="http://127.0.0.1:8000"
uv sync
uv run python -m agent.worker download-files   # one-time model assets

Run server and worker in separate terminals:

# terminal 1
set -a; source .env; set +a
export DATABASE_PATH="$(pwd)/data/tutor.sqlite"
uv run uvicorn server.main:app --reload
# terminal 2
set -a; source .env; set +a
export SERVER_BASE_URL="http://127.0.0.1:8000"
uv run python -m agent.worker dev

API surface

OpenAPI docs are disabled in the app config, so call the routes below directly.

Public routes:

  • GET /healthz
  • GET /scenarios
  • GET /scenarios/{scenario_id}
  • POST /sessions
  • POST /sessions/from-image (multipart: handle + image)
  • POST /sessions/{session_id}/end
  • GET /sessions/{session_id}
  • GET /sessions/{session_id}/summary
  • GET /sessions/{session_id}/metrics
  • GET /learners/{handle}
  • GET /learners/{handle}/sessions
  • DELETE /learners/{handle}

Internal agent routes (require X-Internal-Key):

  • POST /sessions/{session_id}/events
  • POST /sessions/{session_id}/claim
  • POST /sessions/{session_id}/metrics
  • GET /sessions/{session_id}/seed
  • GET /sessions/{session_id}/scenario
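
An internal call differs from a public one only by the X-Internal-Key header, whose value is INTERNAL_API_KEY (see Environment variables below). For example:

import os
import requests

session_id = "example-session-id"  # a real id comes from POST /sessions

resp = requests.get(
    f"http://localhost:8000/sessions/{session_id}/seed",
    headers={"X-Internal-Key": os.environ["INTERNAL_API_KEY"]},
)
resp.raise_for_status()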

Repository layout

  • server/ FastAPI app, SQLite persistence/migrations, routes, post-session pipeline.
  • agent/ LiveKit worker, conversation loop, classifier, persona guard, metrics client.
  • frontend/ vanilla HTML/CSS/JS app for topic selection, live session, and review/resume UX.
  • scenarios/ authored learning scenarios/objectives.
  • benchmarks/ Stage 4.5 model benchmark scripts (python -m benchmarks [tts|stt|llm|all]).
  • references/ research scripts/data used during development.
  • media/ static assets.

Environment variables

Required:

  • LIVEKIT_URL
  • LIVEKIT_API_KEY
  • LIVEKIT_API_SECRET

Optional / security hardening:

  • INTERNAL_API_KEY (defaults to change-me; set a unique value for shared deployments)

Model selection:

  • STT_MODEL
  • LLM_MODEL
  • TTS_MODEL
  • JUDGE_MODEL
  • MEMORY_MODEL
  • SCENARIO_GEN_MODEL
  • SCENARIO_GEN_TIMEOUT_S
  • OBJECTIVE_CLASSIFIER_MODEL (optional; defaults in code)

Runtime behavior:

  • MEMORY_ENABLED
  • VAD_ACTIVATION_THRESHOLD
  • VAD_MIN_SILENCE_MS
  • DATABASE_PATH (useful for local runs; Docker uses /data/tutor.sqlite)
  • SERVER_BASE_URL (agent -> server base URL; must be http://127.0.0.1:8000 for local worker)

Security note: keep .env local and untracked. This repo tracks only .env.example.
