Merged
Binary file added Assets/Landing-page-dark_mode.png
Binary file added Assets/Landing_page_light-mode.png
Binary file added Assets/Working.png
68 changes: 55 additions & 13 deletions README.md
@@ -1,6 +1,32 @@
# AI Powered Live Doubt Manager

A production-grade system that helps teachers manage live YouTube teaching sessions at scale. It polls the YouTube live chat in real time, uses Gemini AI to classify and cluster student questions, generates answers, and delivers them back to the teacher's dashboard over WebSocket — all while optionally posting responses directly into the stream.
> Real-time semantic clustering of YouTube live chat via Gemini embeddings, pgvector cosine search, and a 6-stage Redis worker pipeline — with RAG-augmented answer generation and WebSocket delivery.

> **Portfolio / demo project.** Built to demonstrate full-stack architecture with AI pipelines, real-time WebSockets, and worker-based processing. Not deployed to production.

- **Online nearest-centroid clustering** — each incoming question is embedded (768-dim Gemini vectors), compared against existing cluster centroids via pgvector cosine distance, and assigned or seeded into a new cluster in a single atomic transaction — no batch reprocessing.
- **Teacher-scoped RAG retrieval** — answer generation queries only the documents uploaded by the session's owner, using the cluster centroid as the search vector so retrieved context matches the cluster's theme, not just one question's phrasing.
- **Circuit-breaker-protected worker pipeline** — six independent workers connected by Redis ZSET queues with priority scoring, DLQ after 3 retries, and a circuit breaker on every Gemini call that trips open on sustained failures and exports state to Prometheus.

Teachers running live YouTube sessions are bombarded with chat messages — most are noise, but buried in the flood are genuine student questions. This system watches the live chat, uses Gemini AI to find and cluster those questions, and generates grounded answers using teacher-uploaded materials. The result is a real-time dashboard that turns an unreadable chat stream into an organized, actionable Q&A feed — built for educators who teach at scale.
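The join-or-seed decision in the first bullet above reduces to one nearest-neighbor query plus a threshold check. A minimal sketch, assuming pgvector's `<=>` cosine-distance operator on the `clusters.centroid_embedding` column named in the docs, and the 0.65 similarity threshold from `docs/workers/clustering.md` elsewhere in this PR; `NEAREST_CLUSTER_SQL` and `should_join` are illustrative names, not the project's API:

```python
from typing import Optional

# Similarity threshold per docs/workers/clustering.md (cosine similarity,
# where similarity = 1 - cosine distance). Illustrative sketch only.
SIMILARITY_THRESHOLD = 0.65

# `<=>` is pgvector's cosine-distance operator; ordering by it gives the
# nearest centroid first.
NEAREST_CLUSTER_SQL = """
SELECT id, 1 - (centroid_embedding <=> CAST(:emb AS vector)) AS similarity
FROM clusters
WHERE session_id = :sid
ORDER BY centroid_embedding <=> CAST(:emb AS vector)
LIMIT 1
"""

def should_join(similarity: Optional[float],
                threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Join the nearest cluster only if it is similar enough; else seed a new one."""
    return similarity is not None and similarity >= threshold
```

In the worker, this query and the resulting insert or centroid update would run inside a single transaction, which is what makes the assignment atomic.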

## Screenshots

<table>
<tr>
<td><img src="Assets/Landing_page_light-mode.png" alt="Landing page — light mode" width="100%"></td>
<td><img src="Assets/Landing-page-dark_mode.png" alt="Landing page — dark mode" width="100%"></td>
</tr>
<tr>
<td colspan="2" align="center"><em>Landing page — light &amp; dark mode</em></td>
</tr>
</table>

<p align="center">
<img src="Assets/Working.png" alt="Live dashboard" width="100%">
<br>
<em>Live dashboard — real-time question clustering, AI answers, and YouTube integration</em>
</p>

## Stack

@@ -13,6 +39,14 @@ A production-grade system that helps teachers manage live YouTube teaching sessi
| Browser | Chrome extension (TypeScript + Vite) |
| Infrastructure | Docker Compose (local), Terraform (cloud), Prometheus + Grafana (observability) |

## How It Works

1. **Connect** — Teacher links their YouTube live stream via OAuth and starts a session
2. **Ingest** — A polling worker pulls new chat messages into a Redis queue every second
3. **Classify & Embed** — Gemini labels each message as a question or not, then generates a 768-dim embedding vector
4. **Cluster** — Nearest-centroid grouping via pgvector clusters similar questions together; answer generation triggers at milestone counts (3, 10, 25)
5. **Answer & Deliver** — RAG-augmented answers are generated from teacher-uploaded documents, pushed to the dashboard over WebSocket, and optionally posted back to YouTube chat
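The queue mechanics connecting these stages can be sketched as follows. Redis is replaced with a tiny in-memory sorted set so the sketch runs standalone; only the ZSET-with-priority shape and the 3-retry DLQ rule come from this README, while the scoring formula and payload fields are illustrative assumptions:

```python
import json
import time

class MiniZSet:
    """In-memory stand-in for one Redis sorted set (ZADD / ZPOPMIN)."""
    def __init__(self):
        self._items: dict[str, float] = {}

    def zadd(self, member: str, score: float) -> None:
        self._items[member] = score

    def zpopmin(self):
        if not self._items:
            return None
        member = min(self._items, key=self._items.get)
        return member, self._items.pop(member)

MAX_RETRIES = 3  # per the README: tasks move to the DLQ after 3 retries

def enqueue(queue: MiniZSet, payload: dict, priority: int = 0) -> None:
    # Lower score pops first; a priority bump simply shifts the score down.
    queue.zadd(json.dumps(payload), time.time() - priority * 1000)

def process_one(queue: MiniZSet, dlq: MiniZSet, handler):
    """Pop one task; on failure, retry up to MAX_RETRIES, then dead-letter it."""
    item = queue.zpopmin()
    if item is None:
        return None
    payload = json.loads(item[0])
    try:
        return handler(payload)
    except Exception:
        payload["retries"] = payload.get("retries", 0) + 1
        enqueue(dlq if payload["retries"] >= MAX_RETRIES else queue, payload)
        return None
```

Each of the six workers would pop from its own queue and enqueue the result onto the next stage's queue.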

## Architecture

@@ -46,14 +80,16 @@ Comments flow from YouTube → Redis workers → Gemini AI for classification an

## Features

- **Real-time question clustering** — student comments are embedded and clustered live using nearest-centroid algorithm with milestone triggers
- **RAG-augmented answers** — AI-generated answers grounded in teacher-uploaded documents (PDF, DOCX, TXT)
- **YouTube integration** — polls live chat, posts answers directly back to YouTube
- **Content moderation** — Gemini-powered filtering before classification and before YouTube posting
- **WebSocket dashboard** — real-time updates with exponential backoff reconnection and 100-message cap
- **Teacher isolation** — every data endpoint enforces ownership; RAG retrieval is scoped per teacher
- **Observability** — Prometheus metrics, circuit breaker pattern on all Gemini calls, structured logging
- **Scheduled maintenance** — automatic daily quota reset and hourly expired token cleanup
| Feature | What it does |
|---|---|
| Real-time question clustering | Student comments are embedded and clustered live using a nearest-centroid algorithm with milestone triggers |
| RAG-augmented answers | AI-generated answers grounded in teacher-uploaded documents (PDF, DOCX, TXT) |
| YouTube integration | Polls live chat, posts answers directly back to YouTube |
| Content moderation | Gemini-powered filtering before classification and before YouTube posting |
| WebSocket dashboard | Real-time updates with exponential backoff reconnection and 100-message cap |
| Teacher isolation | Every data endpoint enforces ownership; RAG retrieval is scoped per teacher |
| Observability | Prometheus metrics, circuit breaker pattern on all Gemini calls, structured logging |
| Scheduled maintenance | Automatic daily quota reset and hourly expired token cleanup |
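The circuit breaker named in the Observability row typically has the shape below. A sketch with assumed threshold and cooldown values; the project's actual parameters and its Prometheus state export are not shown in this diff:

```python
import time

class CircuitBreaker:
    """Fail fast when a downstream API (here: Gemini) is persistently down.

    Threshold and cooldown are illustrative defaults, not project settings.
    """
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: calls suspended")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip open
            raise
        self.failures = 0  # a success closes the circuit
        return result
```

On sustained failures the breaker trips open and rejects calls immediately instead of hammering the API; one trial call is allowed after the cooldown elapses.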

## Quick Start

@@ -142,11 +178,17 @@ This opens a tmux session with 9 panes: backend API, 6 AI workers, scheduler, an
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `POST` | `/api/v1/auth/register` | Register a new teacher |
| `POST` | `/api/v1/auth/login` | Authenticate, returns JWT |
| `GET` | `/api/v1/auth/me` | Get current authenticated teacher |
| `GET` | `/api/v1/sessions` | List teacher's sessions |
| `POST` | `/api/v1/sessions` | Create a new session |
| `GET` | `/api/v1/clusters` | List question clusters for a session |
| `POST` | `/api/v1/dashboard/approve` | Approve an AI-generated answer |
| `POST` | `/api/v1/sessions` | Create a new streaming session |
| `GET` | `/api/v1/sessions/{id}/clusters` | List question clusters for a session |
| `GET` | `/api/v1/sessions/{id}/analytics` | Get aggregate session analytics |
| `POST` | `/api/v1/dashboard/sessions/{id}/manual-question` | Submit a manual question |
| `POST` | `/api/v1/dashboard/answers/{id}/approve` | Approve an AI-generated answer |
| `GET` | `/api/v1/dashboard/sessions/{id}/stats` | Get session stats |
| `POST` | `/api/v1/rag/documents` | Upload a document for RAG retrieval |
| `GET` | `/api/v1/youtube/auth/url` | Start YouTube OAuth flow |
| `WS` | `/ws/{session_id}` | Real-time event stream |
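The dashboard's reconnection policy for `/ws/{session_id}` (exponential backoff, per the Features table) comes down to a capped delay schedule. The base and cap values here are illustrative assumptions:

```python
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based), capped.

    Real clients usually add random jitter on top to avoid thundering herds.
    """
    return min(cap, base * (2 ** attempt))
```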

@@ -163,7 +205,7 @@ make test # run tests
## Known Limitations

- **No production deployment config** — docker-compose is development-oriented; nginx and production Dockerfile are not included
- **Chrome extension** — functional but not published to the Chrome Web Store
- **Chrome extension** — functional but currently in testing
- **YouTube quota** — the YouTube Data API v3 has daily quota limits; high-traffic sessions may hit limits
- **Single-region** — no multi-region or horizontal scaling configuration

6 changes: 4 additions & 2 deletions backend/alembic/env.py
@@ -10,8 +10,10 @@
pool,
)

# Add parent directory to path
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
# Add backend/ and project root to path so both app and workers are importable
_backend_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, _backend_dir)
sys.path.insert(0, os.path.dirname(_backend_dir))

from app.core.config import settings
from app.db.base import Base
@@ -0,0 +1,24 @@
"""make youtube_video_id nullable

Revision ID: e5f6a7b8c9d0
Revises: 6fe440076f64, d4e5f6a7b8c9
Create Date: 2026-03-18 12:00:00.000000

"""

from alembic import op

# revision identifiers, used by Alembic.
revision = "e5f6a7b8c9d0"
down_revision = ("6fe440076f64", "d4e5f6a7b8c9")
branch_labels = None
depends_on = None


def upgrade() -> None:
op.alter_column("streaming_sessions", "youtube_video_id", nullable=True)


def downgrade() -> None:
op.execute("UPDATE streaming_sessions SET youtube_video_id = '' WHERE youtube_video_id IS NULL")
op.alter_column("streaming_sessions", "youtube_video_id", nullable=False)
2 changes: 1 addition & 1 deletion backend/app/db/models/streaming_session.py
@@ -27,7 +27,7 @@ class StreamingSession(Base):

id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4, index=True)
teacher_id = Column(UUID(as_uuid=True), ForeignKey("teachers.id", ondelete="CASCADE"), nullable=False, index=True)
youtube_video_id = Column(String(255), nullable=False, index=True)
youtube_video_id = Column(String(255), nullable=True, index=True)
title = Column(String(500), nullable=True)
description = Column(Text, nullable=True)
is_active = Column(Boolean, default=True, nullable=False)
4 changes: 2 additions & 2 deletions backend/app/schemas/session.py
@@ -16,7 +16,7 @@
class SessionCreate(BaseModel):
"""Session creation schema."""

youtube_video_id: str
youtube_video_id: Optional[str] = None
title: Optional[str] = None
description: Optional[str] = None

@@ -35,7 +35,7 @@ class SessionResponse(BaseModel):

id: UUID
teacher_id: UUID
youtube_video_id: str
youtube_video_id: Optional[str]
title: Optional[str]
description: Optional[str]
is_active: bool
8 changes: 4 additions & 4 deletions docs/README.md
@@ -97,10 +97,10 @@ Complete listing of every file in this docs repository:
**Workers**
- [workers/overview.md](workers/overview.md) — QueueManager, retry, DLQ, runner.py
- [workers/classification.md](workers/classification.md) — ClassificationPayload → is_question
- [workers/embeddings.md](workers/embeddings.md) — EmbeddingPayload → 1536-dim pgvector
- [workers/clustering.md](workers/clustering.md) — Cosine similarity at 0.8 → Cluster CRUD
- [workers/embeddings.md](workers/embeddings.md) — EmbeddingPayload → 768-dim pgvector
- [workers/clustering.md](workers/clustering.md) — Cosine similarity at 0.65 → Cluster CRUD
- [workers/answer-generation.md](workers/answer-generation.md) — RAG + LLM → Answer record
- [workers/trigger-monitor.md](workers/trigger-monitor.md) — Count/interval thresholds → clustering dispatch
- [workers/trigger-monitor.md](workers/trigger-monitor.md) — **Stub** (not implemented, not started by start_dev.sh)
- [workers/youtube-polling.md](workers/youtube-polling.md) — ThreadPoolExecutor, chat_id cache, dedup
- [workers/youtube-posting.md](workers/youtube-posting.md) — YouTubePostingPayload → YouTube API

@@ -132,7 +132,7 @@ Complete listing of every file in this docs repository:
- [infra/configuration-reference.md](infra/configuration-reference.md) — Master table of ALL env vars

**Observability**
- [observability/metrics.md](observability/metrics.md) — All 6 Prometheus metrics with labels + PromQL
- [observability/metrics.md](observability/metrics.md) — All Prometheus metrics with labels + PromQL
- [observability/logging.md](observability/logging.md) — Log levels, JSON format, RequestContextMiddleware
- [observability/dashboards.md](observability/dashboards.md) — 4 Grafana dashboards: panels, import
- [observability/alerting.md](observability/alerting.md) — prometheus/rules.yml, thresholds, incident checklist
14 changes: 7 additions & 7 deletions docs/SYSTEM_DESIGN_REPORT.md
@@ -69,9 +69,9 @@ Comments (M) ──▶ (1) Clusters [SET NULL on cluster delete]

| Model | Column | Dimensions | Index | Purpose |
|-------|--------|-----------|-------|---------|
| `comments` | `embedding` | Vector(768) | None (sequential scan) | Semantic similarity for clustering |
| `clusters` | `centroid_embedding` | Vector(768) | None (sequential scan) | Running centroid for nearest-centroid matching |
| `rag_documents` | `embedding` | Vector(768) | None (sequential scan) | Cosine distance retrieval for answer grounding |
| `comments` | `embedding` | Vector(768) | HNSW (cosine_ops, m=16, ef=64) | Semantic similarity for clustering |
| `clusters` | `centroid_embedding` | Vector(768) | HNSW (cosine_ops, m=16, ef=64) | Running centroid for nearest-centroid matching |
| `rag_documents` | `embedding` | Vector(768) | HNSW (cosine_ops, m=16, ef=64) | Cosine distance retrieval for answer grounding |

**Why 768 dimensions (not Gemini's native 3072)**: 768 is the balance point — reduces storage and index cost by ~75% while retaining sufficient semantic fidelity for question clustering (not fine-grained retrieval). The code explicitly normalizes embeddings post-generation because Google requires normalization for non-3072 dimensions.
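That post-generation normalization step is just an L2 rescale of the vector. A minimal sketch (the function name is illustrative):

```python
import math

def normalize(vec: list) -> list:
    """Scale an embedding to unit L2 norm (no-op for the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0.0 else [x / norm for x in vec]
```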

@@ -333,11 +333,11 @@ Pub/sub delivers only to currently connected subscribers. API restart = lost eve

**Mitigation**: Replace pub/sub with Redis Streams (`XADD`/`XREAD` with consumer groups). Streams persist messages and support "read from last acknowledged" semantics.

### 7.3 No pgvector Index on Vector Columns — **SEVERITY: MEDIUM (time-bomb)**
### 7.3 pgvector HNSW Index Tuning — **SEVERITY: LOW (resolved)**

Clustering worker's `ORDER BY centroid_embedding <=> :emb` performs sequential scan. Fine at O(100) clusters. Becomes a latency cliff at O(10,000+).
HNSW indexes have been added to all three vector columns (`comments.embedding`, `clusters.centroid_embedding`, `rag_documents.embedding`) using `vector_cosine_ops` with tuning parameters `m=16, ef_construction=64`. This resolves the original sequential scan concern.

**Mitigation**: Add HNSW index on `clusters.centroid_embedding` and `rag_documents.embedding`. Migration files exist for HNSW indexes but may not cover all vector columns.
**Remaining risk**: At very high scale (O(100K+) vectors per table), HNSW `ef_search` may need tuning for query latency. Monitor clustering worker query times.
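The remaining tuning knob is pgvector's query-time parameter `hnsw.ef_search` (higher values trade query latency for recall). A sketch of setting it per connection before the clustering query; the value of 100 is an illustrative starting point and `with_ef_search` is not the project's API:

```python
def with_ef_search(db, ef_search: int) -> None:
    """Apply pgvector's HNSW recall knob on this connection.

    `db` is any handle with an `execute(sql)` method; the statement text
    follows pgvector's documented `SET hnsw.ef_search` syntax.
    """
    db.execute(f"SET hnsw.ef_search = {int(ef_search)}")

# Illustrative starting point before the worker's
# ORDER BY centroid_embedding <=> :emb query:
DEFAULT_EF_SEARCH = 100
```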

### 7.4 YouTube Quota Exhaustion — **SEVERITY: MEDIUM**

@@ -445,7 +445,7 @@ Workers (100 connections) → PgBouncer (20 connections) → PostgreSQL
| Embeddings | Local model fallback (e.g., `all-MiniLM-L6-v2`) | Gemini cost > $100/day |
| YouTube quota | Quota tiering (premium teachers get 50K/day) | Teachers hitting limits daily |
| Worker autoscaling | Scale on queue depth (>100 tasks → spawn pod) | Sustained backlog |
| pgvector | Add HNSW index on vector columns | >10K clusters per session |
| pgvector | Tune HNSW `ef_search` parameter | >100K vectors per table |
| Redis | Redis Cluster (sharding) | >1GB memory or >50K ops/sec |
| API | Multiple uvicorn instances behind load balancer | >500 concurrent WebSocket connections |
