Local-first RAG with verifiable citations — offline, no API keys, designed for knowledge workers.
DocAgent Studio is a retrieval-augmented generation engine that runs entirely on your machine. It ingests your PDFs, Markdown, and Notion exports into a local SQLite database, retrieves context using hybrid lexical + vector search, and generates answers with verifiable, source-referenced citations — all without sending a single byte to the cloud.
| | DocAgent Studio | Typical RAG frameworks |
|---|---|---|
| Runs offline | Yes — SQLite + local embeddings + Ollama | Usually requires cloud APIs |
| Verifiable citations | Every sentence cites a traceable `source_ref` (e.g. `md:notes.md#L9`, `pdf:paper.pdf#p3`) that you can inspect with `docagent show` | Citations are often just "chunk text" with no stable reference |
| Retrieval | Hybrid — SQLite FTS5 (lexical) fused with cosine-similarity embeddings via a tunable alpha | Typically vector-only or requires external services |
| No wrapper tax | Single `pip install`, zero config, no YAML chains, no API keys | Framework overhead, plugin systems, cloud key management |
| Self-correcting grounding | LLM output is validated against retrieved sources; ungrounded answers trigger a correction pass or fall back to extractive quotes | Trusts the model output as-is |
DocAgent Studio is not a framework — it is a complete, opinionated RAG engine built for people who want to search their own documents and get answers they can verify.
Queries hit both a full-text search index (FTS5, inside the same SQLite DB) and a local embedding index (fastembed + NumPy cosine similarity). Scores are fused with a configurable alpha weight, so you get the precision of keyword matching and the recall of semantic search in a single pass.
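The fusion step can be sketched as follows — a minimal illustration assuming per-index scores already normalized to `[0, 1]` and hypothetical chunk IDs, not DocAgent's actual internals:

```python
# Sketch of alpha-weighted score fusion between lexical (FTS5) and
# vector (cosine) retrieval results, keyed by chunk ID.
def fuse_scores(lexical, vector, alpha=0.5):
    """Combine per-chunk scores as alpha*vec + (1-alpha)*lex."""
    fused = {}
    for chunk_id in set(lexical) | set(vector):
        lex = lexical.get(chunk_id, 0.0)
        vec = vector.get(chunk_id, 0.0)
        fused[chunk_id] = alpha * vec + (1 - alpha) * lex
    # Highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = fuse_scores(
    lexical={"c1": 0.9, "c2": 0.2},
    vector={"c2": 0.8, "c3": 0.6},
    alpha=0.7,
)
```

With `alpha=0.7` the semantic score dominates, so a chunk found only lexically (`c1`) can still surface, but ranks below chunks with vector support.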
Every chunk in the database has a stable `source_ref` — e.g. `pdf:report.pdf#p7` or `md:notes.md#L42`. The LLM is required to cite these refs inline. You can verify any citation instantly:
```bash
docagent show --db ./data/docs.db --source-ref "md:notes.md#L9"
```

Build a lightweight knowledge graph from your corpus. Entities are extracted from chunks and linked by co-occurrence, enabling graph-based exploration of your documents:

```bash
docagent graph build --db ./data/docs.db
docagent graph query --db ./data/docs.db "Attachment"
```

Measure retrieval recall and citation coverage against a ground-truth eval set — no external tools needed:

```bash
docagent eval --db ./data/docs.db --eval ./eval/sample_eval.jsonl
```

A single `docagent` command covers the full workflow: `ingest`, `index`, `ask`, `search`, `show`, `eval`, `graph`, `stats`, `doctor`, and `serve`.
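The `source_ref` strings above follow a simple `kind:path#locator` shape. A hypothetical parser, inferred from the examples and not part of the DocAgent API:

```python
import re

# Sketch: split a ref like "pdf:report.pdf#p7" or "md:notes.md#L42"
# into (kind, path, locator). Format inferred from the README examples.
def parse_source_ref(ref):
    m = re.fullmatch(r"(\w+):(.+)#(.+)", ref)
    if not m:
        raise ValueError(f"not a source_ref: {ref!r}")
    return m.group(1), m.group(2), m.group(3)

kind, path, loc = parse_source_ref("pdf:report.pdf#p7")
```

A stable, parseable ref is what makes citations machine-checkable rather than decorative.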
```
       +----------------+
       |   User Query   |
       +-------+--------+
               |
      +--------+--------+
      |                 |
+-----+------+   +------+-----+
|    FTS5    |   | Embeddings |
|  (lexical) |   |  (vector)  |
+-----+------+   +------+-----+
      |                 |
      +--------+--------+
               |
    +----------+----------+
    |    Score Fusion     |
    |  a*vec + (1-a)*lex  |
    +----------+----------+
               |
    +----------+----------+
    |    Top-K Chunks     |
    |   w/ source_refs    |
    +----------+----------+
               |
    +----------+----------+
    |     Ollama LLM      |
    |  (local inference)  |
    +----------+----------+
               |
    +----------+----------+
    |   Grounding Check   |
    |  + Self-Correction  |
    +----------+----------+
               |
    +----------+----------+
    |    Cited Answer     |
    |    [source_ref]     |
    +---------------------+
```
All data lives in a single SQLite database + two NumPy sidecar files. No external services required.
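The vector side of retrieval is a brute-force cosine scan over the embedding matrix held in those NumPy files. A minimal sketch of that idea with toy vectors and illustrative names, not DocAgent's actual internals:

```python
import numpy as np

# Sketch: brute-force cosine similarity over an (n_chunks, dim) matrix,
# returning indices and scores of the top-k most similar chunks.
def top_k_cosine(query, embeddings, k=3):
    q = query / np.linalg.norm(query)
    m = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity per chunk
    order = np.argsort(scores)[::-1]    # best first
    return order[:k], scores[order[:k]]

emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, scores = top_k_cosine(np.array([1.0, 0.1]), emb, k=2)
```

For a few thousand chunks this is a single matrix-vector product — well under a millisecond — which is why no ANN index is needed.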
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
```

For the web UI: `pip install -e '.[web]'`

```bash
docagent ingest --input /path/to/your/docs --db ./data/docs.db
```

Supports `*.pdf`, `*.md`, `*.markdown`, `*.txt`. For Notion exports, unzip the Markdown export and point `--input` at the folder.

```bash
docagent index --db ./data/docs.db
```

```bash
ollama pull llama3.2:1b   # one-time setup
docagent ask --db ./data/docs.db "What did I write about attachment theory?"
```

```bash
docagent show --db ./data/docs.db --source-ref "md:notes.md#L9"
```

```bash
docagent serve --db ./data/docs.db
# Open http://127.0.0.1:8000
```

| Command | Description |
|---|---|
| `docagent ingest` | Ingest PDFs + Markdown into a local SQLite DB |
| `docagent index` | Build FTS5 + embedding indexes |
| `docagent ask` | Ask a question and get a cited answer |
| `docagent search` | Debug retrieval — show top-K chunks with scores |
| `docagent show` | Inspect a specific chunk by `source_ref` or `chunk_id` |
| `docagent eval` | Evaluate retrieval recall and citation coverage |
| `docagent graph build` | Build entity co-occurrence graph |
| `docagent graph query` | Explore entities and neighbors |
| `docagent stats` | Show corpus statistics |
| `docagent doctor` | Check local dependencies and print fixes |
| `docagent serve` | Launch the web UI |
| `docagent make-trainset` | Export instruction JSONL for LoRA/SFT |
| `docagent make-trainset-dir` | Export train/valid/test splits for MLX LoRA |
```bash
docagent doctor --db ./data/docs.db                      # check Ollama + DB health
docagent search --db ./data/docs.db "secure base" --k 5  # inspect retrieval
```

```bash
docagent graph build --db ./data/docs.db
docagent graph query --db ./data/docs.db "Attachment"
```

Builds a lightweight entity co-occurrence graph stored in SQLite. Entities are extracted via NLP heuristics and linked by shared chunk presence.
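The co-occurrence idea can be sketched in a few lines — hypothetical entity sets per chunk; the real extraction heuristics are internal to DocAgent:

```python
from itertools import combinations
from collections import Counter

# Sketch: link entities that appear in the same chunk, weighting
# each edge by how many chunks the pair shares.
def build_cooccurrence(chunks):
    """chunks: mapping of chunk_id -> set of extracted entity names."""
    edges = Counter()
    for entities in chunks.values():
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1
    return edges

edges = build_cooccurrence({
    "c1": {"Attachment", "Bowlby"},
    "c2": {"Attachment", "Bowlby", "Secure base"},
    "c3": {"Secure base"},
})
```

Edge weights then double as a crude relatedness score when exploring neighbors of an entity.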
Create an eval set as JSONL:

```json
{"question": "...", "answer": "...", "required_sources": ["pdf:foo.pdf#p12"]}
```

Run:

```bash
docagent eval --db ./data/docs.db --eval ./eval/sample_eval.jsonl
```

A starter eval set is included: `eval/sample_eval.jsonl`.
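The recall side of the metric can be illustrated as follows — a simplified sketch using the `required_sources` field from the JSONL format above; the exact scoring details are assumptions:

```python
# Sketch: retrieval recall = fraction of required sources that appear
# among the source_refs of the retrieved chunks for a question.
def retrieval_recall(required_sources, retrieved_refs):
    if not required_sources:
        return 1.0
    found = set(retrieved_refs)
    hits = sum(1 for ref in required_sources if ref in found)
    return hits / len(required_sources)

score = retrieval_recall(
    required_sources=["pdf:foo.pdf#p12", "md:notes.md#L9"],
    retrieved_refs=["md:notes.md#L9", "pdf:bar.pdf#p1"],
)
```

Averaging this over all eval questions gives a corpus-level recall figure you can track as you tune alpha or chunking.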
Export instruction datasets for LoRA/SFT fine-tuning:

```bash
# Single JSONL file
docagent make-trainset --db ./data/docs.db --out ./train.jsonl --n 500

# Train/valid/test splits (e.g., for MLX LoRA)
docagent make-trainset-dir --db ./data/docs.db --out-dir ./data/trainset --n 2000
```

Example MLX LoRA fine-tuning on Apple Silicon:

```bash
mlx_lm.lora \
  --model mlx-community/Llama-3.2-1B-Instruct-4bit \
  --train --data ./data/trainset \
  --iters 300 --batch-size 1 --learning-rate 1e-4 \
  --adapter-path ./data/adapters/docagent-lora \
  --grad-checkpoint
```

- SQLite as the single source of truth — chunks, FTS index, documents table, and graph all live in one `.db` file. Portable, inspectable, no server.
- Brute-force NumPy cosine similarity — fast enough for personal corpora (thousands of chunks). No FAISS/Chroma dependency.
- Self-correcting grounding loop — if the LLM produces ungrounded citations or URLs, a correction pass fires automatically. If that also fails, the system returns extractive quotes with citations rather than hallucinated text.
- Runs on a MacBook Air M2 with 8 GB RAM.
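The grounding loop can be sketched abstractly — hypothetical helper names and a toy citation regex; the real validation and correction logic is internal to DocAgent:

```python
import re

# Sketch: accept an answer only if every cited [source_ref] exists in the
# retrieved set; otherwise fall back to extractive quotes with citations.
CITE = re.compile(r"\[([a-z]+:[^\]]+)\]")

def grounded(answer, retrieved_refs):
    """True if the answer cites at least one ref and all were retrieved."""
    cites = CITE.findall(answer)
    return bool(cites) and all(ref in retrieved_refs for ref in cites)

def answer_or_fallback(answer, retrieved):
    """retrieved: mapping of source_ref -> chunk text."""
    if grounded(answer, set(retrieved)):
        return answer
    # Fallback: quote retrieved chunks verbatim, each with its citation.
    return "\n".join(f'"{text}" [{ref}]' for ref, text in retrieved.items())

retrieved = {"md:notes.md#L9": "A secure base supports exploration."}
ok = answer_or_fallback("Bowlby's idea [md:notes.md#L9].", retrieved)
bad = answer_or_fallback("See [pdf:ghost.pdf#p1].", retrieved)
```

The key design choice is that the fallback path degrades to verbatim quotes, so the worst case is an uninspired answer rather than a fabricated one.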
```bash
python -m unittest discover -s tests -p 'test_*.py'
```

See `docs/paper.md`.
MIT