A NestJS-based RAG (Retrieval-Augmented Generation) application using MongoDB for vector storage, Hugging Face embeddings, and Ollama for chat. It supports:
- User documents: Upload PDF/DOCX/TXT/MD, chunk and embed for RAG.
- Product catalog: Admin-managed product index (single or bulk via queue) for semantic search.
- RAG chat: Context is retrieved from both the user’s documents and the global product catalog; responses are generated with Ollama (standard or SSE streaming).
Auth is JWT-based; document and chat endpoints require a logged-in user; product management is superadmin-only.
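The dual-source retrieval described above (user documents plus the global product catalog) can be sketched as a cosine-similarity top-k search over two in-memory collections. This is an illustrative assumption, not the app's actual implementation: the names (`Chunk`, `topK`, `buildContext`), the k values, and the naive in-memory scan (MongoDB vector search in the real app) are all placeholders.

```typescript
// Sketch: merge RAG context from user documents and the product catalog.
// All names and the naive in-memory search are illustrative assumptions.
interface Chunk {
  text: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score every chunk against the query embedding and keep the best k.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((c) => ({ c, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.c);
}

// Context comes from both sources, as described above.
function buildContext(query: number[], userDocs: Chunk[], products: Chunk[]): string {
  const hits = [...topK(query, userDocs, 3), ...topK(query, products, 3)];
  return hits.map((h) => h.text).join("\n---\n");
}
```

The merged context string would then be prepended to the prompt sent to Ollama.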
- Docker & Docker Compose
- (Optional) Make — for the commands below
```
make up
# or: docker compose up -d
```

This starts the API, MongoDB, Redis (for the product indexing queue), Mongo Express, and Ollama. The API waits for MongoDB and Redis to be healthy.
The LLM model is not included in the image. Pull it after the stack is up:
```
make ollama-pull
# or: docker compose exec ollama ollama pull llama3.2
```

This pulls the default model (llama3.2). The first pull can take several minutes depending on your connection.
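Once the model is pulled, generation goes through Ollama's HTTP API (`POST /api/generate`, with `model`, `prompt`, and `stream` fields). A minimal sketch of building that request body follows; the helper name is illustrative, and only the default model name comes from this README.

```typescript
// Build a request body for Ollama's /api/generate endpoint.
// The endpoint and its fields (model, prompt, stream) are Ollama's HTTP API;
// the default model name mirrors this README's llama3.2.
function ollamaGenerateBody(prompt: string, model = "llama3.2", stream = false): string {
  return JSON.stringify({ model, prompt, stream });
}

// Usage against a running stack (not executed here):
// fetch("http://localhost:11434/api/generate", {
//   method: "POST",
//   body: ollamaGenerateBody("Hello"),
// });
```

With `stream: true`, Ollama returns incremental JSON lines, which is what the SSE streaming mode builds on.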
After the API container is running:
- The embedding model (e.g. `Xenova/all-MiniLM-L6-v2`) loads on application startup. The first request that needs embeddings may be slow; give the app a minute or two after the health check passes before heavy use.
- Ollama must have the model pulled (step 2) for chat and streaming to work.
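Before embedding, uploaded documents are split into chunks (as noted in the overview). A fixed-size chunker with overlap is one common approach; the sketch below is an assumption for illustration, and the size/overlap values are not the app's actual settings.

```typescript
// Split text into fixed-size chunks with overlap, so context that straddles
// a boundary still appears whole in at least one chunk. Sizes are illustrative.
function chunkText(text: string, size = 200, overlap = 50): string[] {
  if (size <= overlap) throw new Error("size must exceed overlap");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return chunks;
}
```

Each chunk would then be embedded individually and stored alongside its vector for retrieval.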
Check health:
```
curl -s http://localhost:3000/health
```

List installed Ollama models:

```
make ollama-models
# or: docker compose exec ollama ollama list
```

| Command | Description |
|---|---|
| `make up` | Start all services (API, MongoDB, Redis, Mongo Express, Ollama) |
| `make dev` | Start only infra + Ollama (no API) for local development |
| `make down` | Stop all services |
| `make ollama-pull` | Pull Ollama model `llama3.2` (run after `make up`) |
| `make ollama-models` | List installed Ollama models |
| `make logs` | Follow logs from all services |
| `make build` | Build the API Docker image |
| `make restart` | Restart all services |
| `make clean` | Stop everything and remove containers, volumes, images |
- Copy `.env.example` to `.env` and adjust as needed.
- MongoDB: `MONGODB_URI` for the app database.
- Redis: Required for the product-index queue (`REDIS_HOST`, `REDIS_PORT`). In Docker, the API uses `redis:6379`.
- Ollama: In Docker, the API uses `OLLAMA_BASE_URL=http://ollama:11434` (set in `compose.yaml`). For local runs, use `OLLAMA_BASE_URL=http://localhost:11434` and ensure Ollama is running and the model is pulled. Set `OLLAMA_MODEL` to the model name you pull (e.g. `llama3.2`, `phi3`, `gemma2:2b`).
- Ollama model (low-spec devices): If your machine has limited RAM/CPU, use a smaller model for better speed and stability. Try one of these (pull with `ollama pull <name>`, then set `OLLAMA_MODEL=<name>` in `.env`):
  - `phi3` (~2B) — good balance of quality and size, ~2 GB RAM.
  - `gemma2:2b` — small, instruction-tuned; ~1.5 GB.
  - `qwen2:0.5b` or `qwen2:1.5b` — very light; 0.5B is minimal, 1.5B a bit better.
  - `llama3.2:1b` — smaller than the default `llama3.2` (3B); less RAM.
  - `tinyllama` — 1.1B, very fast on weak hardware.

  Larger models (e.g. `llama3.2` 3B, `mistral` 7B) give better answers but need more RAM; if you see slow responses or OOM, switch to one of the smaller models above.
- JWT: `JWT_SECRET` and `JWT_EXPIRES_IN` for auth. Superadmin users (for admin/product endpoints) are seeded via the app (see the seed module).
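The variables above can be collected with plain `process.env` access. The sketch below is illustrative: the fallback values mirror the Docker defaults mentioned in this section where they exist, and the remaining defaults (such as the MongoDB database name) are assumptions.

```typescript
// Collect the configuration described above. Defaults are illustrative,
// not authoritative; the app's real config may differ.
interface AppConfig {
  mongodbUri: string;
  redisHost: string;
  redisPort: number;
  ollamaBaseUrl: string;
  ollamaModel: string;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  return {
    mongodbUri: env.MONGODB_URI ?? "mongodb://localhost:27017/rag", // db name assumed
    redisHost: env.REDIS_HOST ?? "redis",
    redisPort: Number(env.REDIS_PORT ?? 6379),
    ollamaBaseUrl: env.OLLAMA_BASE_URL ?? "http://ollama:11434",
    ollamaModel: env.OLLAMA_MODEL ?? "llama3.2",
  };
}

// Usage: const config = loadConfig(process.env);
```

In the real app this would typically live behind NestJS's `ConfigService` rather than raw `process.env` reads.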
- `docs/RAG_FLOW.md` — RAG flow, architecture (documents + product catalog + queue), and API behavior.
- Run `make up`, then `make ollama-pull` so chat works.
- Allow a short delay after the API is up for the embedding model to load.
- Redis must be running for bulk product indexing (queue).
- Use the Makefile as the main reference for run, pull, and debug commands.
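The bulk indexing path splits a product upload into Redis-backed queue jobs. The in-memory batching step can be sketched as follows; the batch size and the job name in the comment are assumptions, not the app's actual values.

```typescript
// Split a bulk product upload into fixed-size batches, one per queue job,
// so the worker indexes products incrementally. Batch size is illustrative.
function toBatches<T>(items: T[], batchSize = 100): T[][] {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Each batch would then be enqueued as one job on the Redis-backed queue,
// e.g. queue.add("index-products", { products: batch }) with BullMQ (assumption).
```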