TopHumanWriting is a backend-only Python package (CLI + library) for “exemplar-alignment writing audit”:
- Build a local PDF exemplar library (50–200 PDFs, zh/en/mixed)
- Audit a target PDF and output white-box results:
  - what looks unlike the exemplars, where, and why
  - exemplar evidence (PDF + page)
  - optional "rewrite templates" (controlled, temperature=0)
- Optional CiteCheck: author-year citation accuracy with evidence paragraphs
This PyPI distribution intentionally does not ship the web UI (to keep the package lean).
- Minimal: `pip install tophumanwriting`
- With RAG (required for exemplar retrieval): `pip install "tophumanwriting[rag]"` (default: Chroma), or `pip install "tophumanwriting[rag-faiss]"` (FAISS)
- With optional syntax checks: `pip install "tophumanwriting[syntax]"`
- Everything for the backend: `pip install "tophumanwriting[all]"`
- (Optional) Configure an OpenAI-compatible LLM API (used by the LLM review + CiteCheck); see the sketch just below this step:
  - `TOPHUMANWRITING_LLM_API_KEY` (fallback: `SKILL_LLM_API_KEY`, `OPENAI_API_KEY`)
  - `TOPHUMANWRITING_LLM_BASE_URL` (fallback: `SKILL_LLM_BASE_URL`, `OPENAI_BASE_URL`; usually ends with `/v1`)
  - `TOPHUMANWRITING_LLM_MODEL` (fallback: `SKILL_LLM_MODEL`, `OPENAI_MODEL`)
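If you prefer to keep the configuration in code, here is a minimal sketch of setting the same variables from Python before the library is used (all values are placeholders; exporting them from your shell works equally well):

```python
import os

# Placeholder values; substitute your provider's real key, base URL, and model name.
os.environ["TOPHUMANWRITING_LLM_API_KEY"] = "sk-your-key"
os.environ["TOPHUMANWRITING_LLM_BASE_URL"] = "https://api.example.com/v1"
os.environ["TOPHUMANWRITING_LLM_MODEL"] = "your-model-name"
```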
- Run end-to-end (build once if needed → audit): `thw run --paper main.pdf --exemplars reference_papers --max-llm-tokens 200000`
- Output: the CLI prints an export folder that contains `result.json` + `report.md`.
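A minimal sketch of inspecting that export from Python; the export path is hypothetical (use the folder the CLI prints), and nothing is assumed about the JSON schema beyond the top level being an object:

```python
import json
from pathlib import Path

# Hypothetical export folder; use the path printed by `thw run`.
export_dir = Path("TopHumanWriting_data/audit/exports/2024-01-01_main")

result = json.loads((export_dir / "result.json").read_text(encoding="utf-8"))
print(sorted(result.keys()))                       # top-level keys of the result

report = (export_dir / "report.md").read_text(encoding="utf-8")
print(report[:500])                                # preview the human-readable report
```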
Budgeting:
- Use `--max-llm-tokens` to hard-cap total LLM usage per run (LLM review + CiteCheck).
- (Optional) Provide `--cost-per-1m-tokens` and `--max-cost` to show an approximate cost in reports (unitless; depends on your pricing).
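As a rough illustration, assuming the report's estimate simply scales the token count by the price you pass in (the exact formula is not guaranteed here):

```python
# Back-of-the-envelope worst case for one run.
max_llm_tokens = 200_000
cost_per_1m_tokens = 2.0      # in whatever unit your provider bills

approx_max_cost = max_llm_tokens / 1_000_000 * cost_per_1m_tokens
print(approx_max_cost)        # 0.4, in the same (unitless) currency
```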
- Download the semantic embedder model (once): `thw models download-semantic`
- Build exemplar library artifacts (slow, one-time): `thw library build --name reference_papers --pdf-root reference_papers`
- Run audits repeatedly (fast reuse): `thw audit run --paper main.pdf --library reference_papers --max-llm-tokens 200000`
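Because the library artifacts are reused, repeated audits are easy to script; here is a sketch that shells out to the CLI for every draft in a hypothetical `drafts/` folder, assuming the `reference_papers` library was already built:

```python
import subprocess
from pathlib import Path

# Audit every draft against the same prebuilt library.
for paper in sorted(Path("drafts").glob("*.pdf")):
    subprocess.run(
        [
            "thw", "audit", "run",
            "--paper", str(paper),
            "--library", "reference_papers",
            "--max-llm-tokens", "200000",
        ],
        check=True,
    )
```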
```python
from tophumanwriting import TopHumanWriting

thw = TopHumanWriting(exemplars="reference_papers")   # folder with PDFs
export = thw.run("main.pdf", max_llm_tokens=200000)   # fit if needed + audit
print(export.report_md_path)
```
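The project layout below also lists `TopHumanWriting.fit/audit/run` in `api.py`; a hedged sketch of the two-step variant, assuming `fit()` builds the library artifacts and `audit()` mirrors `run()`'s keyword arguments (verify the actual signatures before relying on this):

```python
from tophumanwriting import TopHumanWriting

thw = TopHumanWriting(exemplars="reference_papers")

# Hypothetical split of run() into its two stages; signatures are assumptions.
thw.fit()                                                # build library artifacts once
export = thw.audit("main.pdf", max_llm_tokens=200000)    # reuse them for each audit
print(export.report_md_path)
```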
TopHumanWriting stores reusable artifacts under a writable data directory, `TopHumanWriting_data/`:

- `settings.json` (optional LLM config)
- `libraries/*.json` (library manifests / stats)
- `libraries/<name>.sentences.json` + `libraries/<name>.embeddings.npy`
- `rag/<library>/` (RAG index)
- `cite/<library>/` (citation bank)
- `audit/exports/` (export bundles)

Override the location with `TOPHUMANWRITING_DATA_DIR` (the legacy `AIWORDDETECTOR_DATA_DIR` also works).
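For example, a one-line sketch of redirecting the data directory before the library creates or reads any artifacts (the target path is hypothetical):

```python
import os

os.environ["TOPHUMANWRITING_DATA_DIR"] = "/data/thw"   # hypothetical location
```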
- Only text-based PDFs are supported (scanned PDFs are out of scope).
- If you change the semantic model but keep an old index, rebuild the library to see changes.
- Bump the version in `pyproject.toml`
- Build: `python -m build`
- Verify: `python -m twine check dist/*`
- Upload: `python -m twine upload dist/tophumanwriting-<version>*`
TopHumanWriting is a backend Python package (CLI + SDK) for a comparative, white-box audit that checks how closely your writing follows top exemplar papers in the same field:
- You provide a local PDF exemplar library (50–200 papers; Chinese, English, or mixed)
- The target PDF gets an end-to-end audit with white-box results:
  - where the text differs from the exemplars, why it differs, and which exemplar passage to consult (PDF + page)
  - optional reusable rewrite templates / sentence skeletons (temperature=0, kept as non-divergent as possible)
- Optional CiteCheck: verifies whether author-year citations are accurate or misattributed (with evidence paragraphs)
To keep this PyPI package lean, the web frontend is not included.
- Minimal install: `pip install tophumanwriting`
- Retrieval stack (required for exemplar retrieval): `pip install "tophumanwriting[rag]"` (default: Chroma), or `pip install "tophumanwriting[rag-faiss]"` (FAISS)
- Optional syntax checks: `pip install "tophumanwriting[syntax]"`
- Everything the backend needs in one go: `pip install "tophumanwriting[all]"`
- (Optional) Configure an OpenAI-compatible LLM API (used by the divide-and-conquer LLM audit + CiteCheck); the sketch below shows how the documented fallbacks resolve:
  - `TOPHUMANWRITING_LLM_API_KEY` (fallback: `SKILL_LLM_API_KEY`, `OPENAI_API_KEY`)
  - `TOPHUMANWRITING_LLM_BASE_URL` (fallback: `SKILL_LLM_BASE_URL`, `OPENAI_BASE_URL`; usually ends with `/v1`)
  - `TOPHUMANWRITING_LLM_MODEL` (fallback: `SKILL_LLM_MODEL`, `OPENAI_MODEL`)
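A sketch of how the documented fallback order resolves (the first variable that is set wins); the package's own resolution logic may differ in detail:

```python
import os

def resolve(*names):
    # Return the value of the first environment variable that is set and non-empty.
    return next((os.environ[n] for n in names if os.environ.get(n)), None)

api_key  = resolve("TOPHUMANWRITING_LLM_API_KEY", "SKILL_LLM_API_KEY", "OPENAI_API_KEY")
base_url = resolve("TOPHUMANWRITING_LLM_BASE_URL", "SKILL_LLM_BASE_URL", "OPENAI_BASE_URL")
model    = resolve("TOPHUMANWRITING_LLM_MODEL", "SKILL_LLM_MODEL", "OPENAI_MODEL")
print(bool(api_key), base_url, model)
```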
- Run end-to-end with a single command (the library is built automatically when needed): `thw run --paper main.pdf --exemplars reference_papers --max-llm-tokens 200000`
- Output: the CLI prints an export directory containing `result.json` + `report.md`.
Budgeting notes:
- Use `--max-llm-tokens` to hard-cap the total LLM tokens of a single run (it covers both the divide-and-conquer LLM audit and citation checking).
- (Optional) `--cost-per-1m-tokens` + `--max-cost` are only used to show an estimated cost in reports (units are up to you).
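Assuming the estimate is linear in tokens, `--max-cost` implies a token ceiling; how the CLI enforces the cap internally is not specified here:

```python
max_cost = 1.0                 # same (arbitrary) unit as --cost-per-1m-tokens
cost_per_1m_tokens = 2.0

implied_token_ceiling = max_cost / cost_per_1m_tokens * 1_000_000
print(int(implied_token_ceiling))   # 500000 tokens before the estimate reaches max-cost
```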
- Download the semantic model once: `thw models download-semantic`
- Build the exemplar library artifacts (slow, one-time): `thw library build --name reference_papers --pdf-root reference_papers`
- Audit repeatedly (reusing the index): `thw audit run --paper main.pdf --library reference_papers --max-llm-tokens 200000`
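Here is a sketch that rebuilds only when the documented library artifact is missing and otherwise jumps straight to the fast audit; paths assume the default `TopHumanWriting_data/` directory (adjust if `TOPHUMANWRITING_DATA_DIR` is set):

```python
import subprocess
from pathlib import Path

library = "reference_papers"
sentences = Path("TopHumanWriting_data") / "libraries" / f"{library}.sentences.json"

# Slow one-time build only when its main artifact is absent.
if not sentences.exists():
    subprocess.run(
        ["thw", "library", "build", "--name", library, "--pdf-root", "reference_papers"],
        check=True,
    )

# Fast, repeatable audit that reuses the index.
subprocess.run(
    ["thw", "audit", "run", "--paper", "main.pdf",
     "--library", library, "--max-llm-tokens", "200000"],
    check=True,
)
```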
```python
from tophumanwriting import TopHumanWriting

thw = TopHumanWriting(exemplars="reference_papers")
export = thw.run("main.pdf", max_llm_tokens=200000)
print(export.report_md_path)
```

TopHumanWriting writes its reusable artifacts to a data directory, `TopHumanWriting_data/`:

- `settings.json`
- `libraries/*.json`
- `libraries/<name>.sentences.json` + `libraries/<name>.embeddings.npy`
- `rag/<library>/` (retrieval index)
- `cite/<library>/` (citation bank)
- `audit/exports/` (export bundles)

The location can be overridden with the `TOPHUMANWRITING_DATA_DIR` environment variable (the legacy `AIWORDDETECTOR_DATA_DIR` is still supported).
- Only text-based PDFs with selectable text are supported; scanned PDFs are out of scope.
- After switching the semantic model, rebuild the exemplar library artifacts; otherwise results may look unchanged.
```
TopHumanWriting/
├── tophumanwriting/        # PyPI package (CLI + sklearn-style API)
│   ├── api.py              # TopHumanWriting.fit/audit/run
│   ├── cli.py              # `thw` entrypoint
│   ├── models.py           # semantic model download/status
│   ├── _version.py
│   └── locales/
├── aiwd/                   # audit core (RAG/citecheck/LLM reviews)
├── ai_word_detector.py     # legacy module (kept for compatibility)
├── pyproject.toml
├── MANIFEST.in
└── README.md
```
MIT License - See LICENSE