Never get a surprise API bill again.
A local HTTP proxy that sits between your code and any OpenAI-compatible API. It counts tokens in every request and blocks calls that would exceed your configured budget.
You're building a feature that calls an LLM. You run a test loop. You forget to add a break. You wake up to a $40 bill for 2 million tokens. Or worse — a runaway agent keeps retrying a broken prompt and burns through your monthly budget in an hour.
token-budget-proxy is a one-command local proxy that enforces hard token limits on every API call. No SDK changes, no code changes — just point your OPENAI_BASE_URL at the proxy.
```
Your code → http://localhost:8080 → token-budget-proxy → api.openai.com
                                          |
                                          ├─ counts tokens in request
                                          ├─ checks against budget
                                          ├─ blocks if over limit (HTTP 429)
                                          └─ or forwards if within budget
```
The proxy:
- Intercepts every POST to `/v1/chat/completions` or `/v1/completions`
- Estimates the token count from the request body (prompt + `max_tokens`)
- Checks against your configured limits (per-request, per-minute, session total)
- Either forwards the request to the real API or returns a `429` with a clear error message
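The estimation step can be sketched as follows. This is a minimal illustration of the idea; `estimate_request_tokens` is a hypothetical helper name, not the library's actual internals:

```python
import json
import math

def estimate_request_tokens(body: bytes) -> int:
    """Approximate the token cost of a /v1/chat/completions request body."""
    req = json.loads(body)
    # The proxy's approximation: 1 token ≈ 4 characters of prompt text
    prompt_chars = sum(len(m.get("content", "")) for m in req.get("messages", []))
    prompt_tokens = math.ceil(prompt_chars / 4)
    # Reserve the full completion budget the caller asked for
    return prompt_tokens + req.get("max_tokens", 0)

body = json.dumps({
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "x" * 400}],
    "max_tokens": 100,
}).encode()
print(estimate_request_tokens(body))  # 100 prompt tokens + 100 reserved = 200
```

Counting `max_tokens` up front is deliberately pessimistic: a request is blocked if its worst-case total would bust the budget, even if the actual completion comes back shorter.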
```shell
pip install token-budget-proxy
```

No external dependencies. Works with Python 3.8+.
```shell
# Basic: block any single request over 2000 tokens
tokenproxy start --max-tokens-per-request 2000 --api-key sk-...

# Full budget control
tokenproxy start \
  --max-tokens-per-request 4096 \
  --max-tokens-per-minute 20000 \
  --max-tokens-total 100000 \
  --api-key sk-...
```

```shell
# Environment variable (works with any OpenAI SDK)
export OPENAI_BASE_URL=http://127.0.0.1:8080/v1
export OPENAI_API_KEY=sk-...   # still needed for the proxy to forward
python your_script.py
```

Or in code:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="sk-...",
)
```

```
[INFO] Token-budget proxy listening on http://127.0.0.1:8080
[INFO] Upstream: https://api.openai.com/v1
[INFO] Per-request limit: 4096 tokens
[INFO] POST /v1/chat/completions — prompt=312t max_out=512t
[INFO] POST /v1/chat/completions — prompt=891t max_out=1024t
[BLOCK] /v1/chat/completions — Request would use 5200 tokens, exceeding per-request limit of 4096.
```
```
--host HOST                   Bind address (default: 127.0.0.1)
--port PORT                   Listen port (default: 8080)
--upstream URL                Real API base URL (default: https://api.openai.com/v1)
--api-key KEY                 API key (or set OPENAI_API_KEY)
--max-tokens-per-request N    Hard limit per single request (default: 4096)
--max-tokens-per-minute N     Rolling 60-second window budget (default: unlimited)
--max-tokens-total N          Session lifetime budget (default: unlimited)
--warn-only                   Log violations but forward requests anyway
--quiet                       Suppress per-request logging
```
| Mode | Flag | Behaviour |
|---|---|---|
| Per-request | `--max-tokens-per-request` | Blocks any single call over the limit |
| Per-minute | `--max-tokens-per-minute` | Rolling 60-second window (rate limiting) |
| Session total | `--max-tokens-total` | Hard cap for the entire proxy session |
| Warn-only | `--warn-only` | Logs violations but never blocks |
All three limits can be combined. The strictest matching limit wins.
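A rolling 60-second window can be tracked with a timestamped queue of spends. This is a sketch of the technique, not the library's actual implementation; `RollingWindowBudget` and `try_spend` are hypothetical names:

```python
import threading
import time
from collections import deque
from typing import Optional

class RollingWindowBudget:
    """Tracks tokens spent in the last `window_s` seconds; thread-safe."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._events = deque()  # (timestamp, tokens) pairs, oldest first
        self._lock = threading.Lock()

    def try_spend(self, tokens: int, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        with self._lock:
            # Evict events that have aged out of the window
            while self._events and now - self._events[0][0] >= self.window_s:
                self._events.popleft()
            used = sum(t for _, t in self._events)
            if used + tokens > self.limit:
                return False  # caller should answer HTTP 429
            self._events.append((now, tokens))
            return True

budget = RollingWindowBudget(limit=1000)
print(budget.try_spend(800, now=0.0))   # True: window empty
print(budget.try_spend(300, now=1.0))   # False: 800 + 300 > 1000
print(budget.try_spend(300, now=61.0))  # True: the first spend aged out
```

Combining limits is then just an AND over the individual checks, which is why the strictest one wins.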
Any API that follows the OpenAI REST format works:
| Provider | Base URL |
|---|---|
| OpenAI | https://api.openai.com/v1 |
| Groq | https://api.groq.com/openai/v1 |
| Together AI | https://api.together.xyz/v1 |
| Ollama | http://localhost:11434/v1 |
| LM Studio | http://localhost:1234/v1 |
| Any OpenAI-compatible | your URL |
```python
from tokenproxy import BudgetConfig, start_proxy

config = BudgetConfig(
    max_tokens_per_request=2000,
    max_tokens_per_minute=10000,
    upstream_url="https://api.openai.com/v1",
    upstream_api_key="sk-...",
)

# Blocking (runs until Ctrl+C)
start_proxy(config, port=8080)

# Non-blocking (background thread)
server = start_proxy(config, port=8080, block=False)
print(server.stats)
server.stop()
```

The proxy uses a character-based approximation (1 token ≈ 4 characters) that requires no external libraries. This is accurate enough for budget enforcement, though it may be off by 10–15% compared to an exact tokenizer.
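The approximation amounts to a ceiling division by four. The function name below matches the one in `tokenproxy/tokenizer.py`, but this body is an illustrative assumption:

```python
import math

def count_tokens_approx(text: str) -> int:
    """Character-based estimate: 1 token ≈ 4 characters, rounded up."""
    return math.ceil(len(text) / 4)

print(count_tokens_approx("Hello, world!"))  # 13 characters → 4 tokens
```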
To use exact counting, swap `count_tokens_approx` in `tokenproxy/tokenizer.py` for a call to tiktoken:

```python
import tiktoken

def count_tokens_approx(text: str) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))
```

```
token-budget-proxy/
├── tokenproxy/
│   ├── __init__.py       # Public API exports
│   ├── tokenizer.py      # Token counting (approx + request body parsing)
│   ├── budget.py         # BudgetConfig, BudgetTracker (thread-safe)
│   ├── server.py         # ProxyServer, HTTP handler, request forwarding
│   ├── cli.py            # CLI entry point
│   └── middleware/
│       └── __init__.py   # Placeholder for future middleware
├── tests/
│   ├── test_tokenizer.py
│   └── test_budget.py
├── docs/
│   └── index.html
└── pyproject.toml
```
See CONTRIBUTING.md. Good first issues are labelled in the issue tracker.
MIT © 2026 Jishanahmed AR Shaikh