Robust, model-agnostic toolkit to rewrite/summarize/extract academic content with optional RAG (retrieval-augmented generation) over your own PDF library. Works with OpenAI and OpenAI-compatible local endpoints (Ollama, LM Studio, vLLM, oobabooga’s OpenAI API mode).
- 📥 Ingest PDFs into a local ChromaDB index with token-aware chunking (prevents 512-token embedding errors).
- 🔎 RAG: retrieve top-K relevant snippets per chunk and auto-inject them into prompts.
- ✍️ Multi-task processor: rewrite, edit+review, extract structured data, outline, figure suggestions, etc.
- 🧩 Local LLMs: point the OpenAI SDK at a localhost `/v1` endpoint via `OPENAI_BASE_URL`.
- 📡 Optional streaming (CLI flag) for live output in interactive runs.
- 🧮 Safe token counting (`tiktoken`) and overlap-aware chunking for long inputs.
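"Safe" token counting here mostly means handling model names `tiktoken` doesn't recognize (e.g., local LLMs). A minimal sketch of that pattern, not necessarily the script's exact code:

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count tokens, falling back to a generic encoding for unknown models."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name (e.g., a local LLM): use a reasonable default
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))
```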
```text
.
├── paper_rewriter_plus.py   # main CLI script (RAG + streaming capable)
├── requirements.txt         # Python dependencies
├── README.md                # this file
├── input_pdfs/              # put your PDFs here (for ingest)
├── ragdb/                   # ChromaDB persistent dir (created on ingest)
└── outputs/                 # generated text outputs (created on process)
```
Python 3.10+ recommended. Create a virtual environment to keep things clean.
```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

`sentence-transformers` installs a default CPU build of torch. For GPU acceleration, install a platform-specific build from [pytorch.org](https://pytorch.org) and then re-run `pip install -r requirements.txt`.
```bash
export OPENAI_API_KEY=sk-...   # required
# Optional: choose the model at runtime with --model
```

For local OpenAI-compatible endpoints (e.g., Ollama, vLLM, LM Studio, oobabooga OpenAI API):

```bash
export OPENAI_API_KEY=sk-local   # any non-empty string
export OPENAI_BASE_URL=http://127.0.0.1:5000/v1
# or Ollama: export OPENAI_BASE_URL=http://localhost:11434/v1
```

Some servers ignore `model`; supply the exact loaded model name if needed.
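To sanity-check an endpoint outside the script, a minimal sketch with the OpenAI Python SDK (the model name `llama3` is a placeholder for whatever your server has loaded):

```python
from openai import OpenAI

# The SDK picks up OPENAI_API_KEY and OPENAI_BASE_URL from the environment,
# so the same code works against OpenAI or a local server.
client = OpenAI()

resp = client.chat.completions.create(
    model="llama3",  # placeholder: use the name your server reports
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```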
1. Prepare input PDFs
   - Put the articles you want the model to consider in `./input_pdfs/`.
2. Ingest to Chroma (build the RAG index)
   ```bash
   python paper_rewriter_plus.py ingest \
     --inputs ./input_pdfs \
     --chroma_path ./ragdb \
     --collection papers \
     --embed_model "BAAI/bge-large-en-v1.5" \
     --chunk_chars 1000 --chunk_overlap 30 \
     --embed_token_limit 480
   ```

   - `--embed_token_limit` keeps each embedding chunk at ≤480 tokens (safe for BGE-large's 512-token limit).
   - Telemetry is disabled automatically to avoid noisy warnings.
3. Rewrite a document with RAG enabled
   ```bash
   python paper_rewriter_plus.py process \
     --input my_paper.pdf \
     --outdir outputs \
     --task rewrite_academic \
     --model gpt-4o-mini \
     --rag --chroma_path ./ragdb --collection papers \
     --k 6 --context_tokens 1200
   ```

   This produces:

   - `outputs/my_paper.rewrite_academic.stitched.txt` – the concatenated result
   - `outputs/my_paper.rewrite_academic.partXXX.txt` – per-chunk outputs
4. (Optional) Stream tokens during processing

   ```bash
   python paper_rewriter_plus.py process ... --stream
   ```

   Streaming is for UX; output files are written only after each chunk completes.
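Conceptually, `--stream` corresponds to the OpenAI SDK's streaming mode; a minimal sketch (not the script's exact code):

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Rewrite this paragraph..."}],
    stream=True,  # yields incremental deltas instead of one final message
)
for event in stream:
    # Some final events carry no content delta, so guard before printing
    if event.choices and event.choices[0].delta.content:
        print(event.choices[0].delta.content, end="", flush=True)
```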
- `rewrite_academic` – high-quality rewrite with citation markers `[n]`
- `edit_and_review` – polish + section title + notes
- `summary_cim` – ≤650-char executive summary
- `extract_quanti` – table of chemicals/materials/devices/p-values
- `outline` – hierarchical outline (H1–H4)
- `figure_suggestions` – suggested figure slots + graded captions
- `citation_scout` – add `[Ref Needed]` and search keywords
- `peer_review` – reviewer-style critique

You can run multiple tasks by repeating `--task` (e.g., `--task rewrite_academic --task figure_suggestions`) or by comma-separating them.
- Ingest: PDFs are converted to text, split with a token-aware splitter so each chunk fits the embedding model's positional limit, then embedded with Sentence-Transformers (default: `BAAI/bge-large-en-v1.5`) and stored in Chroma.
- Process: for each input chunk, the script fetches top-K snippets from Chroma and injects them at the top of the prompt as a `[RAG context]` block, bounded by `--context_tokens`.
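The token-aware split is what keeps chunks under the embedding model's limit. A minimal sketch of the idea, using `tiktoken`'s `cl100k_base` encoding as a stand-in tokenizer (the script's actual splitter may differ):

```python
import tiktoken

def token_aware_chunks(text: str, limit: int = 480, overlap: int = 30):
    """Yield chunks of at most `limit` tokens, carrying `overlap` tokens
    of trailing context into the next chunk."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = limit - overlap
    for start in range(0, len(tokens), step):
        yield enc.decode(tokens[start:start + limit])
        if start + limit >= len(tokens):
            break
```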
- Default: `BAAI/bge-large-en-v1.5` (fast, strong, 512-token limit). Keep `--embed_token_limit` around 480.
- Larger-context embedding models exist; once you pick one, raise `--embed_token_limit` accordingly.
```bash
python paper_rewriter_plus.py ingest \
  --inputs ./input_pdfs \
  --chroma_path ./ragdb \
  --collection papers \
  --embed_model BAAI/bge-large-en-v1.5 \
  --chunk_chars 1500 --chunk_overlap 200 \
  --embed_token_limit 480
```
```bash
python paper_rewriter_plus.py query \
  --chroma_path ./ragdb \
  --collection papers \
  --embed_model BAAI/bge-large-en-v1.5 \
  --k 8 \
  "your search text here"
```
```bash
python paper_rewriter_plus.py process \
  --input my_paper.pdf \
  --outdir outputs \
  --task rewrite_academic \
  --model gpt-4o-mini \
  --rag --chroma_path ./ragdb --collection papers \
  --k 6 --context_tokens 1200 \
  --stream   # optional
```
`RuntimeError: The size of tensor a (>512) must match tensor b (512)`

- Cause: the embedding model has a 512-token limit and a chunk exceeded it.
- Fix: token-aware ingest is already enabled; keep `--embed_token_limit` ≤ 512 (use the default 480 for BGE-large).
Chroma telemetry warnings (`capture() takes 1 positional argument ...`)

- Telemetry is disabled in code. If you still see noise, set:

```bash
export CHROMADB_TELEMETRY_DISABLED=1
```
Local endpoint ignores `model`

- Supply the exact model name used by your server; some backends ignore the value while others enforce it.
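Most OpenAI-compatible servers expose `/v1/models`, so you can list the names a server will accept (assuming your backend implements that route):

```python
from openai import OpenAI

client = OpenAI()  # respects OPENAI_BASE_URL
for m in client.models.list():
    print(m.id)  # pass one of these to --model
```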
GPU out-of-memory during embeddings

- Use a smaller embedding model (e.g., `bge-small-en-v1.5`) or run with CPU torch.
Slow ingest

- Reduce `--chunk_chars` or increase the batch size in code (see `B = 256`).
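For reference, Sentence-Transformers accepts a batch size directly; whether the script batches manually or passes `B` through is an assumption, but the effect is the same as:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
chunks = ["chunk one ...", "chunk two ..."]  # your ingest chunks

# Larger batch sizes speed up encoding until you hit GPU/CPU memory limits
embeddings = model.encode(chunks, batch_size=256, show_progress_bar=True)
```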
- Keep your RAG library specific to your domain (papers you cite or build upon).
- Use `figure_suggestions` after `rewrite_academic` to plan visuals; iterate.
- Run `citation_scout` to generate search terms for missing references.
- Config file (YAML) for repeatable runs.
- Pluggable retrieval filters (by source, year, keywords).
- Support for long-context embedding models (raise `--embed_token_limit`).
Licensed under the MIT License.