A tiny UNIX-style tool for prompt compression built on LLMLingua-2: it preserves the meaning of a prompt while cutting its token count, saving latency and money.
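For orientation, here is a minimal sketch of the kind of LLMLingua-2 call the tool wraps, assuming the upstream `llmlingua` package; the model id and device chosen below are illustrative and not necessarily what tinyprompt uses by default.

```python
# Minimal sketch of an LLMLingua-2 compression call (assumed dependency: the
# upstream `llmlingua` package). tinyprompt's actual wiring may differ.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",  # assumed checkpoint
    use_llmlingua2=True,
    device_map="cpu",  # or "cuda"/"mps" if available
)

result = compressor.compress_prompt("your very long prompt ...", rate=0.7)  # keep ~70% of tokens
print(result["compressed_prompt"])                              # shortened prompt
print(result["origin_tokens"], "->", result["compressed_tokens"])
```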
# with uv (recommended)
uv pip install -e .
# or with pip
pip install -e .

Requires Python 3.9+.
# compress from stdin (default ratio 0.7)
echo "long prompt" | tinyprompt
# compress a file
tinyprompt --input prompt.txt
# compress from clipboard (macOS)
pbpaste | tinyprompt
# JSON output (prints {"compressed_prompt": ...})
tinyprompt --input prompt.txt --format json
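To consume the JSON output from another program, a call like the following works; only the documented flags and the compressed_prompt key are assumed, plus tinyprompt being on your PATH.

```python
# Call tinyprompt and read the compressed prompt from its JSON output.
# The flags (--input, --format json) and the "compressed_prompt" key come from
# the usage above; everything else here is just an illustration.
import json
import subprocess

out = subprocess.run(
    ["tinyprompt", "--input", "prompt.txt", "--format", "json"],
    capture_output=True, text=True, check=True,
)
compressed = json.loads(out.stdout)["compressed_prompt"]
print(compressed)
```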
# faster defaults on CPU
tinyprompt --input prompt.txt --fast --cpu
# keep a warm server running
tinyprompt --serve --port 8012
# forward CLI to the warm server
tinyprompt --input prompt.txt --server-url http://127.0.0.1:8012
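A rough sketch of that warm-server workflow from a script, using only the flags shown above; the fixed sleep is a stand-in for a proper readiness check, not part of the tool.

```python
# Start a warm server once, then route later compressions through it so the
# model stays loaded. Flags are from this README; the sleep is a placeholder.
import subprocess
import time

server = subprocess.Popen(["tinyprompt", "--serve", "--port", "8012"])
time.sleep(10)  # crude: give the model time to load

try:
    out = subprocess.run(
        ["tinyprompt", "--input", "prompt.txt",
         "--server-url", "http://127.0.0.1:8012"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout)
finally:
    server.terminate()
```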
Options:

- `--ratio FLOAT` (default 0.7): target compression level
- `--target-tokens INT`: aim for a fixed token budget (overrides ratio)
- `--input PATH|TEXT`: file path or literal text (use `-` for stdin)
- `--format {text,json}`: output format
- `--fast`: speed-tuned defaults (works great on CPU)
- `--cpu`: force CPU (ignore CUDA/MPS)
- `--threads INT`: limit CPU threads
- `--cache-dir PATH`: set HF cache location
- `--offline`: use local cache only (no downloads)
- Server: `--serve`, `--port`, `--server-url`
Environment variables:

- `TINYPROMPT_MODEL` – override model id
- `TINYPROMPT_RATIO` – default ratio when `--ratio` not passed
- `TINYPROMPT_PORT` – default port for `--serve`
- `HF_HOME`, `TRANSFORMERS_CACHE` – Hugging Face cache dir
- `HF_HUB_OFFLINE`, `TRANSFORMERS_OFFLINE` – offline mode
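A hypothetical sketch of how these defaults might resolve; the fallback model id and port below are illustrative guesses, and only the 0.7 ratio default comes from this README.

```python
import os

# Hypothetical resolution of the environment-variable defaults above.
# The real precedence logic lives in tinyprompt's source; the fallback model id
# and port here are assumptions for illustration.
model = os.environ.get("TINYPROMPT_MODEL",
                       "microsoft/llmlingua-2-xlm-roberta-large-meetingbank")
ratio = float(os.environ.get("TINYPROMPT_RATIO", "0.7"))  # used when --ratio is absent
port = int(os.environ.get("TINYPROMPT_PORT", "8012"))     # used when --port is absent
print(model, ratio, port)
```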
- First run is slow? Models download once to your HF cache. Reuse with `--serve`.
- Port in use? Pick a different `--port`.
- Need fully offline? Run once online, then use `--offline`.
- Want fewer tokens, not a ratio? Use `--target-tokens`.
uv pip install -e .[test]
uv run pytest -q