HelpmateAI is a document-aware RAG system for long PDFs and DOCX files. It plans retrieval over document topology instead of treating every question as a flat dense top-k search. It is built for the questions where ordinary "chat with PDF" systems break: broad thesis conclusions, research-paper contributions, policy clauses, scattered evidence, weak retrieval, and citation-sensitive answers.
Live landing page: https://helpmateai.xyz; Workspace app: https://app.helpmateai.xyz
Most RAG demos retrieve the top chunks and hope the answer model can stitch them together. HelpmateAI treats retrieval as a planned workflow over a structured document map.
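The planned-workflow idea can be sketched as a tiny route planner over a document map. The region labels and the `global_summary_first` route name come from this README; the `DocumentMap` shape, the keyword cues, and the function names are illustrative assumptions, not the repo's actual planner.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentMap:
    # section_id -> region label, e.g. "introduction", "conclusion"
    regions: dict = field(default_factory=dict)

# Hypothetical cues that mark a question as document-global.
GLOBAL_CUES = ("conclusion", "conclusions", "contribution",
               "overall", "summary", "findings")

def plan_route(question: str, doc_map: DocumentMap) -> dict:
    """Pick a retrieval route from the question and document topology."""
    q = question.lower()
    if any(cue in q for cue in GLOBAL_CUES):
        # Anchor overview/findings/discussion/conclusion regions first,
        # before assembling raw chunk evidence.
        anchors = [sid for sid, region in doc_map.regions.items()
                   if region in {"overview", "findings",
                                 "discussion", "conclusion"}]
        return {"route": "global_summary_first", "anchor_sections": anchors}
    # Default: ordinary dense top-k over the whole document.
    return {"route": "dense_top_k", "anchor_sections": []}
```

The point of the sketch is that the route is decided before any vector search runs, so a "what are the conclusions?" question never degenerates into a flat top-k lookup.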
| Typical RAG failure | HelpmateAI behavior |
|---|---|
| "What are the conclusions?" returns a few random result paragraphs. | A dedicated global_summary_first route anchors overview, findings, discussion, and conclusion regions before assembling raw chunk evidence. |
| The model answers even when retrieval is weak. | Evidence is graded as strong, weak, or unsupported; unsupported questions stop before answer generation. |
| Section-scoped questions drift into the wrong chapter or policy region. | A bounded orchestrator can resolve explicit local scope to validated section IDs, with deterministic safety checks. |
| The right chunk appears in top-k but not at rank 1. | A spread-triggered, reorder-only evidence selector can promote stronger evidence without pruning away support. |
| Architecture changes are chosen by intuition. | The repo carries ADRs, ablations, and benchmark reports for retrieval, reranking, planning, abstention, and evidence selection. |
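The spread-triggered, reorder-only selector in the table above can be sketched as follows. The `Chunk` fields, the threshold value, and the trigger condition (reorder only when retrieval scores are tightly clustered) are assumptions for illustration; the repo's actual selector may differ. The invariant the sketch preserves is the one the table states: evidence is only reordered, never pruned.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    retrieval_score: float  # dense retriever score
    rerank_score: float     # reranker / evidence-grader score

def reorder_evidence(chunks, spread_threshold=0.15):
    """Reorder-only evidence selection (illustrative sketch).

    If retrieval scores are tightly clustered (low spread), the dense
    ranking is ambiguous, so stronger reranker evidence is promoted to
    the front. No chunk is ever dropped: the output is always a
    permutation of the input, so supporting context survives.
    """
    if not chunks:
        return []
    scores = [c.retrieval_score for c in chunks]
    spread = max(scores) - min(scores)
    if spread >= spread_threshold:
        # Retrieval is already confident; keep the original order.
        return list(chunks)
    # Low spread: promote by reranker score, preserving every chunk.
    return sorted(chunks, key=lambda c: c.rerank_score, reverse=True)
```

This is how "the right chunk is in top-k but not at rank 1" gets fixed without risking the opposite failure of pruning away the support the answer needed.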
| Workspace | Answer panel |
|---|---|
| ![]() | ![]() |
The latest saved held-out product-fit run is final_eval_suite_20260429_193058.json. It used 50 fixed questions across five public documents, ran HelpmateAI in native-context mode only, and judged answers with RAGAS using Gemini 2.5 Flash as the judge model plus OpenAI embeddings.
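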
That run showed the current shape of the system clearly:
| Metric | HelpmateAI |
|---|---|
| Questions | 50 |
| Answerable questions | 45 |
| Unsupported questions | 5 |
| Supported rate | 0.7200 |
| Answerable supported rate | 0.8000 |
| Unsupported abstention rate | 1.0000 |
| False support rate | 0.0000 |
| False abstention rate | 0.2000 |
| RAGAS faithfulness | 0.9334 |
| RAGAS faithfulness, attempted only | 0.9600 |
| RAGAS answer relevancy, attempted only | 0.7892 |
| RAGAS context precision, attempted only | 0.9093 |
The native-context scoring fixes an earlier evaluation-methodology issue where HelpmateAI generated from its full selected evidence while RAGAS judged against a clipped context payload. This latest run is the current HelpmateAI product score; earlier vendor rows remain useful as historical comparisons but should not be mixed into a single headline table without rerunning all systems under the same scoring mode.
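The support and abstention rates in the table compose from per-question records in a straightforward way. A minimal sketch, assuming hypothetical record fields (`answerable`, `abstained`, `supported`) rather than the harness's actual saved-report schema:

```python
def abstention_metrics(records):
    """Compute support/abstention rates from per-question results.

    Each record is a dict with (illustrative field names):
      answerable: ground truth — does the document support an answer?
      abstained:  did the system refuse to answer?
      supported:  was the produced answer judged grounded in evidence?
    """
    n = len(records)
    answerable = [r for r in records if r["answerable"]]
    unsupported = [r for r in records if not r["answerable"]]
    return {
        # Grounded answers over all questions.
        "supported_rate":
            sum(r.get("supported", False) for r in records) / n,
        # Grounded answers over answerable questions only.
        "answerable_supported_rate":
            sum(r.get("supported", False) for r in answerable) / len(answerable),
        # Refusals on questions the documents cannot support.
        "unsupported_abstention_rate":
            sum(r["abstained"] for r in unsupported) / len(unsupported),
        # Answers wrongly attempted on unsupported questions.
        "false_support_rate":
            sum(not r["abstained"] for r in unsupported) / len(unsupported),
        # Refusals wrongly issued on answerable questions.
        "false_abstention_rate":
            sum(r["abstained"] for r in answerable) / len(answerable),
    }
```

Under these definitions the table's numbers are mutually consistent: 36 of 50 questions supported (0.72), 36 of 45 answerable supported (0.80), all 5 unsupported questions abstained (1.00, hence 0.00 false support), and 9 of 45 answerable questions abstained (0.20).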
Evaluation is treated as part of the architecture, not a one-off demo. The current final-eval harness uses fixed public documents, fixed question manifests, answerable and intentionally unsupported questions, per-intent reporting, and saved machine-readable reports under docs/evals/reports/.
The latest held-out suite uses:
- public source documents recorded in final_eval_sources_20260428.md
- fixed draft questions in final_eval_questions.draft.json
- RAGAS scoring with a non-generator judge model where configured
- explicit abstention metrics alongside answer-quality metrics
- separate native-context and equalized-context modes for future product and controlled retrieval comparisons
- documented vendor comparison settings when OpenAI File Search or Vectara baselines are run
Full protocol details live in final_eval_protocol.md, with the broader evaluation plan in next_steps_and_final_eval_plan.md.
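The per-intent reporting mentioned above amounts to bucketing per-question results by intent label and aggregating each bucket. A minimal sketch, with field names (`intent`, `faithfulness`) assumed for illustration rather than taken from the saved report schema:

```python
from collections import defaultdict

def per_intent_report(records):
    """Average a per-question score within each intent bucket.

    Record fields (illustrative):
      intent:       question intent label, e.g. "global_summary"
      faithfulness: per-question score in [0, 1]
    """
    buckets = defaultdict(list)
    for r in records:
        buckets[r["intent"]].append(r["faithfulness"])
    return {
        intent: sum(scores) / len(scores)
        for intent, scores in sorted(buckets.items())
    }
```

Reporting per intent rather than only in aggregate is what lets a regression in, say, section-scoped questions show up even when the overall average barely moves.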
The retrieval core lives in src/ and stays framework-agnostic. backend/ exposes it through FastAPI upload, index, status, and ask endpoints. frontend/ ships the Next.js workspace UI. deploy/vps/ contains the Docker Compose and Caddy deployment path for the API, while the public app is split between landing, workspace, and backend surfaces.
Built with Next.js, FastAPI, pypdf, python-docx, optional Docling, ChromaDB, OpenAI, sentence-transformers, scikit-learn, optional Supabase persistence, optional hosted Chroma-compatible storage, Docker, and uv.
PDF extraction defaults to HELPMATE_PDF_EXTRACTOR=pypdf for reliability. DOCX extraction defaults to HELPMATE_DOCX_EXTRACTOR=python-docx. Set either extractor to docling only for local layout-parser experiments; production stays on the predictable local extractors.
Docling OCR is disabled by default (HELPMATE_DOCLING_OCR=false) to avoid unnecessary memory pressure on born-digital PDFs. It can be enabled for scanned PDFs when the runtime has enough memory. Docling runs with expanded Markdown tables and records OCR/table-mode metadata when explicitly enabled.
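The extractor defaults above can be read from the environment with plain `os.getenv` calls. The env var names and default values come from this README; the dispatch function itself is a sketch, not the repo's actual configuration code.

```python
import os

# Allowed values per the README; docling is opt-in for local experiments.
PDF_EXTRACTORS = {"pypdf", "docling"}
DOCX_EXTRACTORS = {"python-docx", "docling"}

def pick_extractor(kind: str) -> str:
    """Resolve the configured extractor for a document kind."""
    if kind == "pdf":
        choice = os.getenv("HELPMATE_PDF_EXTRACTOR", "pypdf")
        allowed = PDF_EXTRACTORS
    elif kind == "docx":
        choice = os.getenv("HELPMATE_DOCX_EXTRACTOR", "python-docx")
        allowed = DOCX_EXTRACTORS
    else:
        raise ValueError(f"unsupported document kind: {kind!r}")
    if choice not in allowed:
        raise ValueError(f"unknown extractor {choice!r} for {kind}")
    return choice

def docling_ocr_enabled() -> bool:
    # OCR stays off by default to avoid memory pressure on
    # born-digital PDFs; enable it only for scanned input.
    return os.getenv("HELPMATE_DOCLING_OCR", "false").lower() == "true"
```

Keeping the defaults on the predictable local extractors means a fresh deployment behaves the same with no environment configured at all.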
HelpmateAI is strongest on grounded long-document QA, policy questions, thesis/report navigation, and citation-visible answers. The hardest remaining cases are the broadest academic synthesis prompts on noisy journal-style PDFs, plus broader held-out coverage for orchestrated local-scope behavior.



