diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 0000000..011a657 --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,425 @@ +# NetColab Architecture Design + +## Overview + +NetColab is a **collaborative agentic workspace** where multiple users share a real-time session to jointly steer and interact with AI models. The architecture must support: + +- Multi-user real-time presence and message sync +- Multi-model AI routing (Claude, GPT-5, Gemini) +- Queued/batched prompt workflows +- Version history and session auditability +- Scalable room/workspace management + +--- + +## System Architecture + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ Client Layer │ +│ React 18 + TypeScript + Vite │ +│ React Query (server state) · Socket.io-client (real-time) │ +│ shadcn/ui · TailwindCSS · Framer Motion │ +└────────────────────────────┬─────────────────────────────────────┘ + │ HTTPS / WSS +┌────────────────────────────▼─────────────────────────────────────┐ +│ API Server (BFF) │ +│ Node.js + Hono (TypeScript-native, edge-ready) │ +│ ├── REST endpoints (/api/v1/...) │ +│ ├── WebSocket server (Socket.io) │ +│ ├── Auth middleware (JWT) │ +│ └── Rate limiting · CORS · Input validation │ +└──────┬──────────────────┬────────────────────┬───────────────────┘ + │ │ │ +┌──────▼──────┐ ┌────────▼────────┐ ┌───────▼──────────┐ +│ Auth │ │ Room / Chat │ │ AI Proxy │ +│ Service │ │ Service │ │ Service │ +│ │ │ │ │ │ +│ JWT issue │ │ CRUD rooms │ │ Route by model │ +│ Refresh │ │ Message store │ │ Anthropic API │ +│ tokens │ │ Version log │ │ OpenAI API │ +│ Sessions │ │ Queue flush │ │ Google AI API │ +└──────┬──────┘ └────────┬────────┘ └───────────────────┘ + │ │ +┌──────▼──────────────────▼────────────────────────────────┐ +│ PostgreSQL Database │ +│ users · rooms · room_participants · messages · versions │ +└──────────────────────────────────────────────────────────┘ + │ │ +┌──────▼──────┐ ┌────────▼────────┐ +│ Redis │ │ Object Storage │ +│ │ │ (file uploads, │ +│ Pub/Sub │ │ attachments) │ +│ Sessions │ │ S3-compatible │ +│ Presence │ └─────────────────┘ +│ Caching │ +└─────────────┘ +``` + +--- + +## Technology Stack + +### Frontend (existing) +| Layer | Choice | Reason | +|-------|--------|--------| +| Framework | React 18 + TypeScript | Already in place | +| Build tool | Vite | Already in place | +| UI components | shadcn/ui (Radix) + TailwindCSS | Already in place | +| Server state | TanStack React Query v5 | Already in place; pairs cleanly with REST APIs | +| Animations | Framer Motion | Already in place | +| Real-time | Socket.io-client | Pairs with Socket.io server; handles reconnection automatically | +| Forms | React Hook Form + Zod | Already in place | +| Routing | React Router v6 | Already in place | + +### Backend (to be built) +| Layer | Choice | Reason | +|-------|--------|--------| +| Runtime | Node.js 22 (LTS) | Same ecosystem as frontend; large package ecosystem | +| Framework | **Hono** | TypeScript-first, fast, minimal; runs on Node.js HTTP server (note: Socket.io requires a raw Node HTTP server — edge/serverless deployment is not possible with this stack) | +| WebSockets | **Socket.io** | Rooms, namespaces, reconnection, presence broadcast built-in | +| Database | **PostgreSQL 16** | Relational integrity for rooms/messages/users; JSONB for flexible metadata | +| ORM | **Drizzle ORM** | TypeScript-native, schema-as-code, lightweight, great DX | +| Cache / PubSub | **Redis** | Real-time pub/sub for cross-instance message broadcast; session store; presence TTL | +| Auth | **JWT + refresh tokens** | Access tokens are stateless (short-lived, 15 min); refresh tokens are stored in Redis for revocation support | +| AI routing | **Anthropic SDK + OpenAI SDK + Google Generative AI SDK** | One SDK per provider; routed by model identifier in request | +| Package manager | **npm** (existing) | Consistent with frontend; Bun can be adopted later if performance warrants it | + +### Infrastructure +| Concern | Choice | +|---------|--------| +| Containerization | Docker + Docker Compose (dev), single Dockerfile per service | +| Hosting | Railway / Render / Fly.io (low-ops), or AWS ECS/GCP Cloud Run for scale | +| Object storage | Cloudflare R2 or AWS S3 | +| Secrets | Environment variables via `.env`; never committed | + +--- + +## Data Model + +### PostgreSQL Schema + +```sql +-- Users +CREATE TABLE users ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + email TEXT UNIQUE NOT NULL, + name TEXT NOT NULL, + avatar TEXT, -- image URL; NULL means use generated initials fallback + color TEXT NOT NULL CHECK (color IN ('cyan', 'pink', 'amber', 'violet', 'green')), + created_at TIMESTAMPTZ DEFAULT now() +); + +-- Chat Rooms +CREATE TABLE rooms ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + title TEXT NOT NULL, + emoji TEXT NOT NULL DEFAULT '💬', + created_by UUID NOT NULL REFERENCES users(id), + created_at TIMESTAMPTZ DEFAULT now(), + last_activity TIMESTAMPTZ DEFAULT now() +); + +-- Room membership +CREATE TABLE room_participants ( + room_id UUID REFERENCES rooms(id) ON DELETE CASCADE, + user_id UUID REFERENCES users(id) ON DELETE CASCADE, + joined_at TIMESTAMPTZ DEFAULT now(), + PRIMARY KEY (room_id, user_id) +); + +-- Messages +CREATE TABLE messages ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + room_id UUID NOT NULL REFERENCES rooms(id) ON DELETE CASCADE, + user_id UUID REFERENCES users(id), -- NULL for ai messages + content TEXT NOT NULL, + type TEXT NOT NULL CHECK (type IN ('user', 'ai', 'queued')), + -- 'queued' means staged but not yet sent to AI; becomes 'user' input context after flush + model TEXT, -- ai model identifier (e.g. "claude-sonnet-4-5") + created_at TIMESTAMPTZ DEFAULT now() +); + +-- Version / audit log +CREATE TABLE versions ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + room_id UUID NOT NULL REFERENCES rooms(id) ON DELETE CASCADE, + user_id UUID NOT NULL REFERENCES users(id), + action TEXT NOT NULL CHECK (action IN ('message_sent', 'message_queued', 'flush', 'participant_joined', 'participant_removed', 'room_updated')), + message_id UUID REFERENCES messages(id) ON DELETE SET NULL, -- SET NULL, not CASCADE, to preserve audit history + snapshot TEXT, -- optional JSON snapshot of message content at time of action + created_at TIMESTAMPTZ DEFAULT now() +); + +-- Indexes +CREATE INDEX idx_messages_room_id ON messages(room_id, created_at); +CREATE INDEX idx_versions_room_id ON versions(room_id, created_at); +CREATE INDEX idx_room_participants_user ON room_participants(user_id); +``` + +--- + +## REST API Design + +Base path: `/api/v1` + +### Authentication +| Method | Path | Description | +|--------|------|-------------| +| POST | `/auth/register` | Create user account | +| POST | `/auth/login` | Issue JWT + refresh token | +| POST | `/auth/refresh` | Rotate access token | +| POST | `/auth/logout` | Revoke refresh token | +| GET | `/auth/me` | Current user profile | + +### Rooms +| Method | Path | Description | +|--------|------|-------------| +| GET | `/rooms` | List user's rooms | +| POST | `/rooms` | Create new room | +| GET | `/rooms/:id` | Room details + participants | +| PATCH | `/rooms/:id` | Update title / emoji | +| DELETE | `/rooms/:id` | Delete room (owner only) | +| POST | `/rooms/:id/participants` | Invite user | +| DELETE | `/rooms/:id/participants/:userId` | Remove participant | + +### Messages +| Method | Path | Description | +|--------|------|-------------| +| GET | `/rooms/:id/messages` | Paginated message history | +| POST | `/rooms/:id/messages` | Post user or queued message | +| PATCH | `/rooms/:id/messages/:msgId` | Promote queued → user (partial flush) | +| POST | `/rooms/:id/flush` | Flush all queued → AI | + +### Versions +| Method | Path | Description | +|--------|------|-------------| +| GET | `/rooms/:id/versions` | Audit log for room | + +### AI Proxy +| Method | Path | Description | +|--------|------|-------------| +| POST | `/ai/complete` | Route prompt to selected model (SSE streaming) | + +All list endpoints support `?limit=20&cursor=` cursor-based pagination. + +--- + +## Real-Time Architecture (Socket.io) + +Clients join a Socket.io room matching their `roomId`. The server broadcasts events: + +``` +Client connects → authenticates with JWT via socket handshake + → joins room channel: socket.join(`room:${roomId}`) + +Events (server → clients in room): + message:new { message } — new user/ai/queued message + message:updated { message } — queued→user type change + presence:join { userId, name } — user entered the room + presence:leave { userId } — user left / disconnected + ai:typing { model } — AI is generating + ai:done { messageId } — AI response complete + +Events (client → server): + message:send { content, model, type } + message:queue { content, model } + room:flush { } +``` + +**Presence via Redis TTL:** +- On join, `SET presence:{roomId}:{userId} 1 EX 30` +- Heartbeat every 20s to renew TTL +- On expiry/disconnect, broadcast `presence:leave` + +**Multi-instance scaling:** +- Socket.io Redis adapter broadcasts events across all server instances +- Each instance connects to same Redis pub/sub channel + +--- + +## AI Proxy Service + +The AI Proxy routes requests to the correct provider SDK based on the `model` field: + +``` +model: "claude-sonnet-4-5" → Anthropic SDK +model: "claude-opus-4-5" → Anthropic SDK +model: "gpt-5" → OpenAI SDK +model: "gpt-5-mini" → OpenAI SDK +model: "gemini-2.5-flash" → Google Generative AI SDK +model: "gemini-2.5-pro" → Google Generative AI SDK +``` + +Responses are streamed as **Server-Sent Events (SSE)** from the `/ai/complete` endpoint to the requesting client. The server then broadcasts the completed (or chunk-by-chunk) AI response to all room members over the existing Socket.io room channel so every participant sees the output in real time. Streaming chunks should be buffered server-side if broadcasting per-chunk to avoid overwhelming Socket.io under high-throughput models. + +### Batch (flush) flow +1. Client calls `POST /rooms/:id/flush` +2. Server loads all `type=queued` messages for the room, sorted by `created_at` +3. Concatenates content with user attribution: `[UserA]: prompt1\n\n[UserB]: prompt2` +4. **Check combined token count against the selected model's context window limit before sending.** If over the limit, return a 422 with a descriptive error rather than letting the provider reject it. +5. Sends combined context to selected model +6. Saves AI response as a new `type=ai` message; records a `flush` audit entry in `versions` +7. Broadcasts `message:new` (AI response) to all room members via Socket.io +8. **Failure handling:** wrap steps 5–6 in a database transaction. If the AI call fails, no partial records are written; the queued messages remain intact for retry. + +--- + +## Project Structure (proposed) + +``` +NetColab/ +├── src/ # Frontend (existing) +│ ├── components/ +│ ├── pages/ +│ ├── data/ # Mock data → replaced by API calls +│ ├── hooks/ +│ ├── lib/ +│ └── types/ +│ +├── server/ # Backend (new) +│ ├── src/ +│ │ ├── index.ts # Entry point, Hono app + Socket.io +│ │ ├── routes/ +│ │ │ ├── auth.ts +│ │ │ ├── rooms.ts +│ │ │ ├── messages.ts +│ │ │ └── ai.ts +│ │ ├── services/ +│ │ │ ├── auth.service.ts +│ │ │ ├── room.service.ts +│ │ │ ├── message.service.ts +│ │ │ └── ai.service.ts +│ │ ├── db/ +│ │ │ ├── schema.ts # Drizzle schema +│ │ │ ├── client.ts # DB connection +│ │ │ └── migrations/ +│ │ ├── realtime/ +│ │ │ └── socket.ts # Socket.io event handlers +│ │ ├── middleware/ +│ │ │ ├── auth.ts # JWT verification +│ │ │ └── validate.ts # Zod request validation +│ │ └── lib/ +│ │ ├── redis.ts +│ │ └── env.ts # Validated env vars (zod) +│ ├── package.json +│ └── tsconfig.json +│ +├── docs/ +│ └── ARCHITECTURE.md # This file +│ +├── docker-compose.yml # Local dev: postgres + redis +├── .env.example +└── package.json # Root-level npm workspaces config (shared types between frontend and backend) +``` + +> **Shared types:** Define a `packages/shared` (or `server/src/types`) package exported via npm workspaces so that the frontend and backend can import the same TypeScript interfaces for API request/response payloads and Socket.io event payloads. This prevents type drift between the two layers. + +--- + +## Implementation Milestones + +### Phase 1 — Backend Foundation +- [ ] Scaffold `server/` with Hono + TypeScript +- [ ] Docker Compose for PostgreSQL + Redis +- [ ] Drizzle schema + migrations +- [ ] Auth endpoints (register, login, JWT refresh) + +### Phase 2 — Room & Message APIs +- [ ] CRUD for rooms and participants +- [ ] Message persistence (send, queue, paginated history) +- [ ] Version/audit log endpoint + +### Phase 3 — Real-Time +- [ ] Socket.io server integration +- [ ] Presence tracking with Redis TTL +- [ ] Room broadcast for new messages and presence events + +### Phase 4 — AI Integration +- [ ] Multi-model AI proxy service +- [ ] SSE streaming responses +- [ ] Batch flush endpoint + +### Phase 5 — Frontend Migration +- [ ] Replace mock data with React Query + REST calls +- [ ] Connect Socket.io client for live updates +- [ ] Add authentication UI (login/register) + +### Phase 6 — Polish & Production +- [ ] Rate limiting on AI endpoints +- [ ] E2E tests (Playwright: workspace create, chat, flush) +- [ ] CI/CD pipeline +- [ ] Deployment (Railway / Fly.io) + +--- + +## Scalability Considerations + +| Concern | Approach | +|---------|----------| +| Horizontal scaling | Stateless API servers; Redis adapter for Socket.io | +| AI cost control | Rate limit AI calls per room per minute; queue depth cap | +| Message volume | Cursor-based pagination; archive old messages to cold storage | +| DB connections | Connection pooling via PgBouncer or the underlying `postgres.js` driver's built-in pool (Drizzle itself does not provide pooling) | +| Real-time load | Redis pub/sub decouples message fan-out from API servers | + +--- + +## Security Considerations + +- JWT access tokens (15 min TTL) + refresh tokens (7 days, Redis-stored for revocation) +- All AI API keys server-side only +- Input validation with Zod on all endpoints +- Parameterized queries via Drizzle (no raw SQL interpolation) +- Rate limiting: 60 req/min per IP, 10 AI calls/min per room +- CORS restricted to known frontend origin +- HTTPS enforced in production + +--- + +## Potential Challenges & Points to Notice + +### 1. Hono + Socket.io Integration +Hono is designed around a request/response model and does not natively manage a raw `http.Server`. Socket.io requires control of the underlying Node.js `http.Server` to upgrade HTTP connections to WebSockets. You must create the `http.Server` explicitly, attach Hono as the request handler, and pass the same server instance to `new Server(httpServer)`. Deploying Hono to edge workers (Cloudflare Workers) is therefore **incompatible** with Socket.io — serverless/edge deployment must be ruled out for any instance running the WebSocket server. + +### 2. AI Streaming: SSE vs. WebSocket Fan-out +The `/ai/complete` endpoint streams via SSE to the requesting client, but all room participants need to see the AI response live. The server must act as a relay: consume the SSE stream from the AI provider, buffer or forward chunks, and broadcast them over the Socket.io room channel. Decide early whether to: +- Broadcast each token chunk (low latency but high Socket.io message rate) +- Buffer into sentences / flush intervals (smoother UX, slightly higher latency) + +Without this relay design the non-requesting participants only see the AI response after it is fully saved. + +### 3. Batch Flush: Context Window Limits +The flush flow concatenates all queued messages into a single prompt. Large rooms with many queued messages can easily exceed an AI model's context window (e.g. GPT-4o: 128k tokens, Gemini 2.5 Pro: 1M tokens, Claude: 200k tokens). Without explicit token counting (using `tiktoken` or provider-specific counting APIs) before the API call, the server will receive a cryptic provider error at runtime. Add a token budget check and return a user-friendly error or prompt the user to reduce the queue. + +### 4. Presence Heartbeat vs. TTL Timing +The presence TTL is set to 30 seconds and the heartbeat fires every 20 seconds. A single missed heartbeat (due to a brief network stall) leaves only 10 seconds before the TTL expires, causing a false `presence:leave` event. Recommendations: +- Increase the TTL to at least 3× the heartbeat interval (e.g. 60 s TTL / 20 s heartbeat) +- On reconnect, re-broadcast `presence:join` to reconcile state + +### 5. Multi-Instance Presence Race Conditions +When a user is connected to instance A and their TCP connection drops, instance A fires `presence:leave`. If the client simultaneously reconnects to instance B, both events travel through Redis pub/sub. Without a distributed lock or a reconciliation step (check Redis presence key before broadcasting leave), clients in the room can see a spurious leave/rejoin flicker. Use a small delay (e.g. 2–3 s) before broadcasting `presence:leave` and cancel it if a new connection for the same `userId`/`roomId` appears. + +### 6. Audit Trail Integrity on Message Deletion +The `versions` table references `messages(id)` with `ON DELETE SET NULL` (after the schema fix above). However, the full content of deleted messages is then lost from the audit trail. If regulatory or collaborative-replay requirements exist, store a `snapshot TEXT` column in `versions` to capture the message content at the time of the action. This is already reflected in the schema above. + +### 7. `last_activity` Update Pattern in `rooms` +The `last_activity` column must be kept up-to-date on every new message insert. Without a PostgreSQL trigger, the application must issue a separate `UPDATE rooms SET last_activity = now() WHERE id = $roomId` alongside every `INSERT INTO messages`. Consider adding a trigger to enforce this automatically and avoid drift if a direct DB insert bypasses the application layer. + +### 8. No OAuth / Social Login +The current auth design only covers email + password. Most users expect "Sign in with Google" (or GitHub). Retrofitting OAuth after the auth service is built is non-trivial: the `users` table needs a `provider` and `provider_id` column, and the JWT flow changes. Plan the schema for it from Phase 1 even if the UI for it ships in Phase 5. + +### 9. WebSocket Rate Limiting +REST endpoints are rate-limited but Socket.io events (`message:send`, `message:queue`) are not. A compromised or buggy client can flood the room with hundreds of events per second. Add per-socket rate limiting in the Socket.io middleware layer (e.g. `socket-ratelimiter` or a simple Redis counter keyed by `userId`). + +### 10. Shared Types Between Frontend and Backend +As the API grows, request/response payload types and Socket.io event shapes will be duplicated across `src/` (frontend) and `server/src/` (backend). Introduce an npm workspace (root `package.json` with `"workspaces": ["src", "server", "packages/shared"]`) and a `packages/shared` module from the start to keep types in sync. + +### 11. Testing Strategy Gap +Phases 1–5 define no unit or integration tests. E2E tests in Phase 6 will catch regression too late. Recommendation: +- Add unit tests for services (auth, AI routing, queue logic) from Phase 1 using Vitest +- Add integration tests for REST endpoints (using `supertest` or Hono's test helpers) from Phase 2 +- Reserve Playwright E2E for Phase 6 + +### 12. Model Identifier Fragility +The `model` column stores free-text identifiers (e.g. `"gpt-5"`). If a model is renamed or versioned by its provider, historical records will reference an identifier that no longer resolves. Consider pinning to canonical, versioned identifiers (e.g. `"claude-sonnet-4-5-20250915"`) and maintaining a mapping table or config file for display names.