# llm-toolkit

A practical Python library for developers working with Large Language Models.
Stop reinventing the wheel. This toolkit provides battle-tested utilities for the common challenges every AI developer faces: token counting, cost tracking, retry logic, caching, rate limiting, and output validation.
Building with LLMs means dealing with the same problems over and over:
- "How many tokens is this prompt?"
- "How much is this API call going to cost?"
- "The API timed outβnow what?"
- "I'm hitting rate limits constantly"
- "How do I reliably extract JSON from model outputs?"
- "I'm making the same expensive calls repeatedly"
This library solves all of these with clean, typed, well-tested code.
## Installation

```bash
pip install llm-toolkit
```

Or install from source:

```bash
git clone https://github.com/ThePagePage/llm-toolkit.git
cd llm-toolkit
pip install -e .
```

## Quick Start

### Token Counting

```python
from llm_toolkit import count_tokens

# Count tokens for any major model
tokens = count_tokens("Hello, how are you today?", model="gpt-4")
print(f"Token count: {tokens}")  # Token count: 7

# Works with messages too
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
tokens = count_tokens(messages, model="gpt-4")
```

### Cost Estimation & Tracking

```python
from llm_toolkit import estimate_cost, CostTracker

# Quick estimate
cost = estimate_cost(
    input_tokens=1000,
    output_tokens=500,
    model="gpt-4-turbo",
)
print(f"Estimated cost: ${cost:.4f}")

# Track costs across your application
tracker = CostTracker()

# Log each API call
tracker.log(model="gpt-4-turbo", input_tokens=1500, output_tokens=800)
tracker.log(model="gpt-3.5-turbo", input_tokens=3000, output_tokens=1200)

# Get summaries
print(tracker.summary())
# {
#     'total_cost': 0.0845,
#     'total_input_tokens': 4500,
#     'total_output_tokens': 2000,
#     'by_model': {...}
# }
```

### Retries with Exponential Backoff

```python
import openai

from llm_toolkit import retry_with_backoff, RetryConfig

# Simple retry with sensible defaults
@retry_with_backoff()
def call_api(prompt):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )

# Custom retry configuration
config = RetryConfig(
    max_retries=5,
    initial_delay=1.0,
    max_delay=60.0,
    exponential_base=2,
    retry_on=[openai.RateLimitError, openai.APITimeoutError],
)

@retry_with_backoff(config)
def robust_call(prompt):
    return openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
```

### Response Caching

```python
from llm_toolkit import ResponseCache

# In-memory cache (default)
cache = ResponseCache()

# File-based persistent cache
cache = ResponseCache(backend="file", path="./llm_cache")

# Redis cache for distributed systems
cache = ResponseCache(backend="redis", url="redis://localhost:6379")

# Use the cache
cache_key = cache.make_key(model="gpt-4", messages=messages, temperature=0)
if cached := cache.get(cache_key):
    response = cached
else:
    response = openai.chat.completions.create(...)
    cache.set(cache_key, response, ttl=3600)  # Cache for 1 hour
```

### Rate Limiting

```python
from llm_toolkit import RateLimiter, TokenBucket

# Simple rate limiter (10 requests per minute)
limiter = RateLimiter(requests_per_minute=10)

for prompt in prompts:
    limiter.wait()  # Blocks if necessary
    response = call_api(prompt)

# Token bucket for more control
bucket = TokenBucket(
    capacity=100,         # Max burst
    refill_rate=10,       # Tokens per second
    tokens_per_request=1,
)

# Async support (inside an async function)
async with limiter:
    response = await async_call_api(prompt)
```

### Output Parsing & Validation

```python
from pydantic import BaseModel

from llm_toolkit import extract_json, validate_output

# Extract JSON from messy LLM output
response_text = (
    "Sure! Here's the data you requested:\n"
    "```json\n"
    '{"name": "Alice", "age": 30, "city": "London"}\n'
    "```\n"
    "Let me know if you need anything else!"
)
data = extract_json(response_text)

# Validate against a Pydantic model
class Person(BaseModel):
    name: str
    age: int
    city: str

person = validate_output(response_text, Person)

# Extract multiple JSON objects from one string
text_with_multiple = 'First: {"a": 1} and second: {"b": 2}'
all_json = extract_json(text_with_multiple, multiple=True)
```
## Supported Models & Pricing
Token counting and cost estimation supports:
| Provider | Models |
|----------|--------|
| OpenAI | GPT-4, GPT-4 Turbo, GPT-4o, GPT-3.5 Turbo, o1, o1-mini |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus/Sonnet/Haiku |
| Google | Gemini 1.5 Pro/Flash, Gemini 1.0 Pro |
| Mistral | Mistral Large, Medium, Small, Mixtral |
| Meta | Llama 3.1 (405B, 70B, 8B) |
| Cohere | Command R+, Command R |
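Per-1K-token pricing works the same way across all of these providers. As a quick illustration of the arithmetic (the rates and the `cost_usd` helper below are made up for the example, not taken from this library's pricing table):

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_per_1k: float, output_per_1k: float) -> float:
    """Compute API cost from token counts and per-1K-token dollar rates."""
    return (input_tokens / 1000) * input_per_1k + (output_tokens / 1000) * output_per_1k


# 1,000 input tokens at $0.003/1K plus 500 output tokens at $0.006/1K
print(round(cost_usd(1000, 500, 0.003, 0.006), 6))  # 0.006
```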
Pricing is updated regularly. You can also provide custom pricing:
```python
from llm_toolkit import estimate_cost, set_custom_pricing

set_custom_pricing("my-fine-tuned-model", {
    "input": 0.003,   # $ per 1K input tokens
    "output": 0.006,  # $ per 1K output tokens
})

cost = estimate_cost(1000, 500, model="my-fine-tuned-model")
```

## Configuration

```python
from llm_toolkit import configure

configure(
    default_model="gpt-4-turbo",
    cache_backend="redis",
    cache_url="redis://localhost:6379",
    rate_limit_rpm=60,
    retry_max_attempts=3,
)
```

Environment variables are also supported:

```bash
LLM_TOOLKIT_DEFAULT_MODEL=gpt-4-turbo
LLM_TOOLKIT_CACHE_BACKEND=file
LLM_TOOLKIT_CACHE_PATH=./cache
```

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=llm_toolkit
```

## Project Structure

```
llm-toolkit/
├── llm_toolkit/
│   ├── __init__.py    # Public API exports
│   ├── tokens.py      # Token counting
│   ├── costs.py       # Cost estimation & tracking
│   ├── retry.py       # Retry with exponential backoff
│   ├── cache.py       # Response caching
│   ├── rate_limit.py  # Rate limiting utilities
│   ├── validation.py  # Output parsing & validation
│   └── models.py      # Model definitions & pricing
├── tests/
├── examples/
├── pyproject.toml
└── README.md
```
## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

Ideas for contributions:

- Add support for more model providers
- Improve token counting accuracy
- Add more caching backends
- Better async support
- Documentation improvements
## License

MIT License - see LICENSE for details.

## Acknowledgments

- tiktoken - OpenAI's token counting library
- tenacity - Inspiration for retry patterns
- The AI developer community for feedback and ideas

Built by developers, for developers. If this saves you time, consider giving it a ⭐!