A minimal, end-to-end Retrieval-Augmented Generation (RAG) demo built with Next.js. Upload a .txt file, ask a question, and get a grounded answer streamed back — powered by local embeddings and the You.com Express Agent.
Large language models are powerful, but they only know what was in their training data. When you need answers grounded in your documents — not general knowledge — you need RAG.
RAG solves this in three steps:
```
① INDEXING (per request)
   Your Document → [Chunk & Embed] → In-Memory Vector Store

② QUERYING
   Your Question → [Embed] → Question Vector
                                 ↓
                 Compare against Vector Store
                                 ↓
                 Top Chunks (most semantically similar passages)
                                 ↓
③ Your Question + Top Chunks → [You.com Express Agent] → Streamed Answer
```
The uploaded document is split into small, overlapping chunks of text. Each chunk is converted into a vector embedding — an array of numbers (e.g. [0.21, -0.87, 0.54, ...]) that represents the meaning of that text. Similar ideas produce similar numbers, which is what makes semantic search possible.
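"Similar ideas produce similar numbers" is usually measured with cosine similarity, the angle between two embedding vectors. A minimal sketch in TypeScript (illustrative only; the vector store performs the equivalent comparison internally):

```typescript
// Cosine similarity between two embedding vectors.
// Returns 1 for identical directions (same meaning), near 0 for unrelated text.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Two "similar ideas" point in nearly the same direction, so the score is close to 1:
console.log(cosineSimilarity([0.21, -0.87, 0.54], [0.25, -0.8, 0.5]));
```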
Unlike a traditional RAG system, this demo holds the index entirely in memory for the duration of the request — no database, no disk writes.
When a user asks a question, that question is run through the same embedding model to produce its own vector. The in-memory vector store compares it against all stored document vectors and returns the closest matches — the chunks most semantically related to the question.
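That comparison step amounts to a brute-force nearest-neighbor search. A sketch of the idea (the `Chunk` shape here is hypothetical; LlamaIndex's vector store does this for you):

```typescript
interface Chunk {
  text: string;
  embedding: number[];
}

// Score every stored chunk against the question vector and keep the best k.
function retrieveTopK(questionVec: number[], store: Chunk[], k: number): Chunk[] {
  const cosine = (a: number[], b: number[]): number => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };
  return [...store]
    .sort((x, y) => cosine(questionVec, y.embedding) - cosine(questionVec, x.embedding))
    .slice(0, k);
}
```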
This is why "What movies involve a kid and a father?" can surface content about Interstellar even if the word "father" never appears in the text.
The retrieved chunks are inserted into a prompt alongside the original question. The You.com Express Agent reads that context and streams a grounded answer back to the browser — without hallucinating facts it doesn't have.
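Assembling that grounded prompt is plain string formatting. A sketch of one possible shape (the exact instruction wording the demo sends lives in route.ts; this text is illustrative):

```typescript
// Format retrieved chunks into a numbered context block and prepend an
// instruction telling the agent to answer only from that context.
function buildPrompt(question: string, chunks: string[]): string {
  const context = chunks.map((c, i) => `${i + 1}. ${c}`).join("\n");
  return [
    "Answer the question using only the context below.",
    'If the context does not contain the answer, say "I don\'t know."',
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```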
You.com's Express Agent is a fast, capable LLM endpoint designed for agentic and RAG workflows. In this pipeline it handles the generation step — it receives the retrieved context and question, then streams a precise, grounded answer.
Key advantages:
- Streaming by default — responses stream token-by-token, keeping latency low
- No hallucination pressure — explicitly prompted to answer only from provided context
- No mandatory web search — tools are opt-in, so the agent stays focused on your documents
- Simple API — one SDK call with the `express` agent
```
app/
  page.tsx                        # UI: file upload, query input, streamed response
  api/
    query/
      route.ts                    # API route: embed → retrieve → generate pipeline
public/
  classroom-favorite-movies.txt   # Sample document
```
Accepts a POST request with { files, query, apiKey } and runs all three RAG steps server-side.
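A minimal sketch of that handler's core, assuming files arrive as `{ name, text }` objects (an assumption; the real pipeline calls are stubbed as comments):

```typescript
interface QueryBody {
  files: { name: string; text: string }[]; // assumed file shape
  query: string;
  apiKey: string;
}

// Core of the route handler; the exported POST(req) would do
// `return handleQuery(await req.json())`.
function handleQuery(body: QueryBody): Response {
  const { files, query, apiKey } = body;
  if (!query || !apiKey || !files?.length) {
    return new Response("files, query, and apiKey are required", { status: 400 });
  }
  // ① Chunk & embed `files` into an in-memory VectorStoreIndex
  // ② Embed `query` and retrieve the top chunks
  // ③ Prompt the Express Agent and stream its answer back
  return new Response(`ok: ${files.length} file(s), query "${query}"`);
}
```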
Embedding model: BAAI/bge-small-en-v1.5 runs entirely on-device via @llamaindex/huggingface — no data leaves your machine during the embedding step. The model is bundled with the deployment under /models so no download is needed at runtime.
In-memory index: Unlike a persistent RAG system, there's no disk storage. VectorStoreIndex.fromDocuments() is called without a storageContext, so the entire index lives in memory for the lifetime of the request. This keeps the architecture simple and stateless.
```ts
// No storageContext → stays entirely in memory, gone after the request
return VectorStoreIndex.fromDocuments(documents);
```

Chunking & retrieval: LlamaIndex's VectorStoreIndex handles splitting the document into chunks and running cosine-similarity search. retrieve() returns the top-3 most relevant chunks, which are formatted into a numbered context block and injected into the You.com prompt.
Streaming: The You.com Express Agent streams its response token-by-token. The route forwards that stream directly to the browser as a text/plain ReadableStream:
```ts
const stream = await you.agentsRuns({ agent: "express", input, stream: true });
for await (const chunk of stream) {
  if (chunk.data.type === "response.output_text.delta") {
    controller.enqueue(encoder.encode(chunk.data.response.delta));
  }
}
```

The browser handles everything else:
- API key input (passed directly in the request — no `.env.local` needed)
- File selection, including multiple files, via the File API (or loading the built-in example)
- Reading files as text and POSTing them to `/api/query`
- Reading the streamed response and appending tokens to the UI in real time
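That last step, reading the stream, is a short loop over the response body. A sketch, assuming an `onToken` callback that appends each decoded chunk to the UI:

```typescript
// Consume a streamed text/plain response body, invoking onToken per chunk.
async function readAnswerStream(
  body: ReadableStream<Uint8Array>,
  onToken: (token: string) => void,
): Promise<void> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    onToken(decoder.decode(value, { stream: true }));
  }
}
```

In the browser this would run after the fetch, e.g. `await readAnswerStream(res.body!, t => appendToUI(t))` (call shape hypothetical).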
1. Install dependencies
```bash
npm install
```

2. Run the dev server

```bash
npm run dev
```

Open localhost:3000.
3. Try it
Enter your You.com API key (get one at you.com/platform), upload one or more .txt files (or click "Use our example"), and ask a question. The example file is a short story about a fifth-grade class sharing their favorite movies — try asking:
- "What is Elijah's favorite movie?"
- "Which students like animated films?"
- "Who relates most to their favorite movie character?"
| Package | Purpose |
|---|---|
| `next` | App framework — UI and API routes |
| `llamaindex` | Document chunking, vector store, and similarity search |
| `@llamaindex/huggingface` | On-device embeddings via BAAI/bge-small-en-v1.5 |
| `@youdotcom-oss/sdk` | You.com Express Agent for answer generation |