quickthink is a local-first inference control layer that helps small models produce more reliable structured outputs with latency-aware routing.
It currently ships as a lightweight scaffolding layer for local LLMs with three modes:
- `lite` (default): one-pass inline plan prefix + answer in a single generation
- `two_pass`: separate plan call, then answer call
- `direct`: no planning pass; raw prompt sent to the model
The plan can be logged as metadata while hidden from normal UI output.
QuickThink is designed to be easy for both humans and agents to classify and adopt:
- local LLM routing for local-first inference pipelines
- small model optimization for constrained hardware and low-latency workflows
- latency-aware inference via routing, bypass, and planning-budget controls
- structured output reliability through strict planning grammar and eval gates
- Ollama middleware for practical local deployment
- agent runtime compatibility for CLI and automation-driven execution contexts
What this is:
- A local middleware layer for Ollama-backed LLM calls.
- A small CLI for planned-answer generation, routing diagnostics, and local benchmarking.
- A canonical eval harness for reproducible project-level quality checks.
What this is not:
- Not a hosted API service.
- Not a model training framework.
- Not a replacement for full agent orchestration platforms.
Small/local models are fast but often underperform on multi-step tasks.
quickthink adds a strict planning pass (6-16 keyword tokens by default) to improve response quality without full verbose reasoning traces.
- Ollama-first integration
- Model profiles: `qwen2.5:1.5b`, `mistral:7b`, `gemma3:27b`
- Three execution modes: `lite` (default), `two_pass`, `direct`
- Preset routing profiles: `fast`, `balanced`, `strict`
- Lane policy: `default` or `strict_safe` (routes strict-format tasks to the direct path)
- Hidden plan by default, with optional plan display/logging
- Bypass mode for short prompts (latency control)
- Adaptive routing (`skip`, `12-token`, and `max-token` planning lanes)
- Strict plan grammar: `g:<...>;c:<...>;s:<...>;r:<...>`
- Local eval UI server (`quickthink ui`) at `http://127.0.0.1:7860`
- Canonical eval harness: run → judge → validate → report
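The strict plan grammar can be machine-checked before a plan is accepted. Below is a minimal sketch of such a validator; the field semantics (goal, constraints, steps, risks) and the function name are illustrative assumptions, not taken from the quickthink source.

```python
import re

# Hypothetical validator for the strict plan grammar g:<...>;c:<...>;s:<...>;r:<...>.
# Field meanings (goal/constraints/steps/risks) are assumed for illustration.
PLAN_RE = re.compile(r"^g:(?P<g>[^;]+);c:(?P<c>[^;]+);s:(?P<s>[^;]+);r:(?P<r>[^;]+)$")

def parse_plan(line: str):
    """Return the four plan fields as a dict, or None if the line is malformed."""
    m = PLAN_RE.match(line.strip())
    return m.groupdict() if m else None
```

A well-formed plan such as `g:learn_sql;c:3_steps;s:select,join,group;r:scope_creep` parses into its four fields; anything else is rejected, which is the property a strict grammar buys you.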
```
python -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
```
Prerequisite: install and start Ollama locally.
```
# 1) Clone and enter repo
git clone https://github.com/hermes-labs-ai/quickthink.git quickthink
cd quickthink

# 2) Create env and install
python -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'

# 3) Pull one supported model
ollama pull qwen2.5:1.5b

# 4) Run your first command
quickthink ask "Give me a 3-step plan to learn SQL basics" --model qwen2.5:1.5b
```
If this command works, your local setup is ready.
- Docs index: `docs/README.md`
- First-time setup: `docs/GETTING_STARTED.md`
- Common failures and fixes: `docs/TROUBLESHOOTING.md`
- Known limitations: `docs/KNOWN_LIMITATIONS.md`
- Quick demo script: `docs/demo/QUICK_DEMO.md`
- OSS readiness scorecard: `docs/release/OSS_READINESS_SCORECARD_2026-02-25.md`
- OSS standards alignment (with external references): `docs/release/OSS_STANDARDS_ALIGNMENT_2026.md`
- Agent operating notes: `AGENTS.md`
```
src/quickthink/        Runtime package (CLI, engine, prompts, routing, UI server)
scripts/eval_harness/  Canonical evaluation pipeline (run/judge/validate/report)
scripts/evals/         Legacy smoke/demo helpers (non-canonical)
scripts/demo/          One-command local demo runner
docs/evals/            Prompt sets, rubrics, harness specs, deployment gate notes
docs/release/          Release process and repository audit notes
tests/                 Unit tests for runtime and harness safety checks
```
See the full architecture + publishability audit: `docs/release/REPO_STRUCTURE_AND_PUBLISHABILITY_AUDIT_2026-02-20.md`.
Canonical project workflows:
- `scripts/eval_harness/*`: maintained evaluation pipeline for run/judge/validate/report.
- `scripts/demo/quickstart.sh`: canonical end-to-end local smoke/demo flow.

Legacy helpers (kept for compatibility and ad-hoc smoke checks):
- `scripts/evals/*`: non-canonical helpers; do not treat them as the release-gate source of truth.

When in doubt, use `scripts/eval_harness/*` and `scripts/demo/quickstart.sh`.
List supported profiles:
```
quickthink list-models
```
List preset routing profiles:
```
quickthink list-presets
```
Show officially supported compatibility models:
```
quickthink compatibility
```
Ask with compressed planning:
```
quickthink ask "How would a cow round up a border collie?" --model qwen2.5:1.5b --preset balanced
```
Show the plan in the terminal:
```
quickthink ask "How would a cow round up a border collie?" --model mistral:7b --show-plan
```
Switch to two-pass mode:
```
quickthink ask "How would a cow round up a border collie?" --mode two_pass --show-route --show-plan
```
Show routing diagnostics:
```
quickthink ask "Design a robust parser with tradeoffs and a JSON output schema" --show-route --show-plan
```
Optional continuity hint (tiny, off by default):
```
quickthink ask "Continue the previous structure" --continuity-hint "ctx:prior_goal,format_json"
```
Strict-format-safe lane policy (routes strict-format tasks to the direct path first):
```
quickthink ask "json only: {\"ok\":true,\"why\":\"short\"}" --lane-policy strict_safe --show-route
```
Benchmark with the strict-safe lane policy:
```
quickthink bench "Answer with YES or NO only: Is 2+2=4?" --lane-policy strict_safe --runs 3
```
Log plan + metrics as JSONL metadata:
```
quickthink ask "Design a tiny retry strategy" --log-file ./logs/quickthink.jsonl
```
Benchmark all three modes (lite, two_pass, direct):
```
quickthink bench "Design a robust parser for CSV with malformed quotes" --model qwen2.5:1.5b --runs 3
```
Run the full local demo setup and artifact generation:
```
bash scripts/demo/quickstart.sh
```
It does:
- Python env + package install
- `ollama pull` for supported models
- Sample A/B/C eval run
- Result validation
- Markdown/HTML report generation
- Compatibility snapshot update

For a one-minute terminal walkthrough command set, see `docs/demo/QUICK_DEMO.md`.
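The JSONL metadata written by `--log-file` is one JSON object per line, so it can be inspected with a few lines of Python. A minimal sketch; the record field name `plan` is an assumption here, so check your actual log lines for the real schema.

```python
import json

def load_records(path):
    """Read a JSONL log file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def plans(records):
    """Pull out the logged plan strings (field name assumed for illustration)."""
    return [r["plan"] for r in records if "plan" in r]
```

This is how hidden plans stay auditable: the UI never shows them, but the log file retains them for review.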
Optional environment flags:
- `QUICKTHINK_PRESET=fast|balanced|strict`
- `QUICKTHINK_LIMIT=<n>` (number of prompts from the canonical set)
- `QUICKTHINK_RUNS=<n>`
- `QUICKTHINK_RUN_JUDGE=1` (switch the judge backend from `rule` to `ollama`)
- `QUICKTHINK_JUDGE_MODEL=<model>`
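As a sketch of how flags like these are typically consumed, the snippet below reads them with fallbacks. The default values (`balanced`, `0`, `1`) are illustrative assumptions, not quickthink's actual defaults.

```python
import os

def read_flags(env=os.environ):
    """Read QUICKTHINK_* env flags; defaults here are assumptions for illustration."""
    return {
        "preset": env.get("QUICKTHINK_PRESET", "balanced"),
        "limit": int(env.get("QUICKTHINK_LIMIT", "0")),  # 0 = full prompt set (assumed)
        "runs": int(env.get("QUICKTHINK_RUNS", "1")),
        "judge": "ollama" if env.get("QUICKTHINK_RUN_JUDGE") == "1" else "rule",
        "judge_model": env.get("QUICKTHINK_JUDGE_MODEL"),
    }
```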
For common setup/runtime failures and fixes, see docs/TROUBLESHOOTING.md.
Canonical report flow:
```
python3 scripts/eval_harness/run_suite.py \
  --prompt-set docs/evals/prompt_set.jsonl \
  --out docs/evals/results/run-<timestamp>.jsonl \
  --manifest-out docs/evals/results/manifest-<timestamp>.json \
  --runs 3

python3 scripts/eval_harness/judge_suite.py \
  --prompt-set docs/evals/prompt_set.jsonl \
  --results docs/evals/results/run-<timestamp>.jsonl \
  --out docs/evals/results/judged-<timestamp>.jsonl \
  --backend rule

python3 scripts/eval_harness/validate_judged_results.py \
  --path docs/evals/results/judged-<timestamp>.jsonl

python3 scripts/eval_harness/report_suite.py \
  --runs docs/evals/results/run-<timestamp>.jsonl \
  --judged docs/evals/results/judged-<timestamp>.jsonl \
  --out-json docs/evals/results/report-<timestamp>.json \
  --out-md docs/evals/results/report-<timestamp>.md \
  --out-html docs/evals/results/report-<timestamp>.html
```
Legacy helpers in `scripts/evals/*` remain available for smoke/demo use only.
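The judged JSONL output can be aggregated into a simple pass rate before the report step, for example as a quick sanity check. A hedged sketch; the `verdict` field name and `pass`/`fail` values are assumptions, so adapt them to the schema `judge_suite.py` actually emits.

```python
import json

def pass_rate(lines):
    """Fraction of judged records marked as passing (field name assumed)."""
    records = [json.loads(line) for line in lines if line.strip()]
    passed = sum(1 for r in records if r.get("verdict") == "pass")
    return passed / len(records) if records else 0.0
```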
- Supported models are fixed to: `qwen2.5:1.5b`, `mistral:7b`, `gemma3:27b`.
- Experimental evaluations may include additional models (for example `llama3.2:latest`) in deployment-gate or variant-gate workflows. Treat those as research lanes unless they are promoted into `SUPPORTED_MODELS` in the runtime config.
- Regenerate the matrix + snapshot with:
```
python3 scripts/evals/compat_matrix_snapshot.py
```
Launch the local web UI (for eval/scaffolding testing):
```
quickthink ui
```
Then open http://127.0.0.1:7860 if it does not open automatically.
UI lane control: the `Lane policy` dropdown supports `default` and `strict_safe` for single-prompt runs and 3-mode comparisons.
UI eval safety gates:
- Preflight is required before any eval run (`validate_prompt_set.py` must return `status=OK`).
- Run-file ingestion is blocked unless `validate_results.py` returns `status=OK`.
- The UI displays validator output and the dataset SHA256 for reproducible/comparable runs.
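A dataset SHA256 like the one the UI displays is a standard file hash; the sketch below computes one incrementally so large prompt sets do not need to fit in memory. Whether quickthink hashes the raw file bytes exactly this way is an assumption.

```python
import hashlib

def dataset_sha256(path, chunk=65536):
    """SHA256 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        while block := fh.read(chunk):
            h.update(block)
    return h.hexdigest()
```

Two runs are comparable only if their dataset digests match, which is the point of surfacing the hash in the UI.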
- p50 overhead target: <80ms
- p95 overhead target: <200ms
Tune by reducing plan budgets and enabling prompt bypass.
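Checking the overhead targets above from measured per-call overheads is a one-liner with a percentile helper. This uses a simple nearest-rank percentile as a sketch; quickthink's own measurement method may differ.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers (pct in 0..100)."""
    xs = sorted(samples)
    k = max(0, min(len(xs) - 1, round(pct / 100 * (len(xs) - 1))))
    return xs[k]

# Hypothetical per-call overheads in milliseconds.
overheads_ms = [42, 55, 61, 70, 75, 78, 90, 110, 150, 190]
meets_p50 = percentile(overheads_ms, 50) < 80
meets_p95 = percentile(overheads_ms, 95) < 200
```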
Free/Open source:
- Local middleware + SDK + CLI
Paid:
- Hosted eval dashboards
- Team policy/profile management
- Managed observability and support
Included in this public repository:
- runtime source code (`src/quickthink`)
- reusable evaluation harness (`scripts/eval_harness`, `docs/evals` prompt/spec files)
- tests and release process notes

Excluded from public tracking:
- internal multi-agent comms logs
- generated eval result dumps and ad-hoc local traces
- private experiment workspaces under `experiments-local/`
- Keep version tracks isolated in `codex/*` branches.
- Merge to `main` only after benchmarks and notes are updated.
- See `docs/VERSION_NOTES.md` for version-to-version differences.
Install (editable + dev):
```
python -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
```
Test:
```
PYTHONPATH=src .venv/bin/pytest -q
```
Lint (basic syntax/import sanity):
```
python -m compileall src tests scripts
```
Release docs + checklist:
```
make release-check VERSION=x.y.z
```
Follow:
- `docs/release/RELEASE_CHECKLIST.md`
- `docs/release/RELEASE_PROCESS.md`
- `docs/release/SUPPLY_CHAIN_BASELINE_2026.md`
- This does not guarantee better answers for every prompt.
- Gains are model/task dependent; run evals before claiming improvements.
- Hidden planning should remain auditable in logs for transparency.
Apache-2.0
Hermes Labs builds AI audit infrastructure for teams deploying AI agents in regulated environments. All tools are released as open-source software — MIT or Apache-2.0, no SaaS tier. The audit work is paid; the code is not.
hermes-labs.ai
| Layer | Tool | Description |
|---|---|---|
| Static audit | lintlang | Agent-config static lint (HERM + H1-H7) |
| Static audit | rule-audit | Rule-logic audit: contradictions + gaps |
| Static audit | scaffold-lint | Scaffold budget + technique stacking |
| Static audit | intent-verify | Spec-drift checks |
| Runtime observability | little-canary | Prompt injection detection |
| Runtime observability | suy-sideguy | Runtime policy guard |
| Runtime observability | colony-probe | Prompt confidentiality audit |
| Regression & scoring | hermes-jailbench | Jailbreak regression benchmark |
| Regression & scoring | agent-convergence-scorer | N-agent output consistency |
| Supporting infra | claude-router | Model-tier + scaffold router |
| Supporting infra | quickthink | Compressed planning scaffold for local LLMs |
| Supporting infra | langstate | Scaffold-aware context compression |
| Supporting infra | agent-gorgon | Tool-fabrication defense for Claude Code |
| Supporting infra | zer0dex | Dual-layer agent memory |
| Supporting infra | forgetted | Mid-conversation incognito |
| Dev tools | repo-audit | Launch-readiness auditor |
| Dev tools | quick-gate-python | Python quality gate |
| Dev tools | quick-gate-js | JS/TS quality gate |
| Dev tools | csv-quality-gate | CSV preflight validation |