Your AI bill is a routing problem.
The Problem • The Fix • Quickstart • How It Works • Who It's For • Competition • Docs
npx frugalroute — one command, running in seconds
You're burning money on AI and you know it.
Every request goes to the same expensive cloud model — whether it's a trivial FAQ or a complex reasoning task. Your "summarize this email" costs the same as your "analyze this legal contract." Your team hardcodes model: "gpt-4o" because switching models means rewriting code. And when the bill lands, nobody can tell you which requests actually needed that firepower.
The real cost isn't the model. It's the lack of decision-making between your app and the model.
Meanwhile, that M4 MacBook Pro sitting on your desk? It can run an 8B parameter model at 50+ tokens/sec. For free. Right now. For 80% of your prompts, that's more than enough.
But nobody's using it, because wiring up local models, fallback logic, cost tracking, and caching is a month of engineering you'll never get approved.
FrugalRoute is one line of config between your app and your models.
```python
# Before: hardcoded, expensive, blind
client = OpenAI(api_key="sk-...")

# After: routed, cached, tracked, learning
client = OpenAI(base_url="http://localhost:3100/v1", api_key="unused")
```

That's it. Same OpenAI SDK. Same code. FrugalRoute intercepts every request and makes a decision:
- Can a local model handle this? Run it on Ollama. Cost: $0.
- Seen this before? Return it from the semantic cache. Cost: $0. Latency: ~1ms.
- Needs more muscle? Escalate to the cloud — but only the cheapest cloud model that's capable enough.
- Learn from it. Every cloud call becomes training data. Next time, the local model handles it.
The more you use it, the less you spend.
```bash
curl http://localhost:3100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is dependency injection?"}]
  }'
```

That returned a standard OpenAI response. Your app doesn't know or care which model answered. FrugalRoute picked a local Llama model, skipped the cloud entirely, and logged the cost as $0.00.
```
 Your app                             Models
┌────────┐         ┌──────────────────────────────┐
│ OpenAI │──HTTP──▶│ FrugalRoute                  │──▶ Ollama (local, free)
│  SDK   │         │                              │──▶ OpenAI (cloud, metered)
│        │◀──JSON──│ :3100/v1/chat/completions    │──▶ Anthropic (cloud, metered)
└────────┘         └──────────────────────────────┘
```
Every request flows through a priority chain — cheapest first, escalate only when necessary.
```
Semantic Cache ──hit?──▶ return instantly ($0)
      │ miss
Keyword Classifier ──obvious?──▶ route directly (<1ms)
      │ uncertain
Embedding Classifier ──▶ classify intent (~4ms)
      │
Local Model (Ollama) ──confident?──▶ return ($0)
      │ low confidence
Bigger Local Model ──confident?──▶ return ($0)
      │ still low
Cloud Model (cheapest capable) ──▶ return ($$)
      │
Collect training pair ──▶ distill into local models
```
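In TypeScript terms, the chain behaves roughly like the sketch below. Every helper name (`semanticCache`, `classifyKeyword`, `runLocal`, and so on) is hypothetical, standing in for FrugalRoute's internals:

```typescript
// Illustrative sketch of the priority chain, not FrugalRoute's actual internals.
// All declared helpers below are hypothetical; local model names are examples.
type LocalResult = { text: string; confidence: number };

declare const semanticCache: {
  lookup(prompt: string): Promise<string | null>;
  store(prompt: string, text: string): Promise<void>;
};
declare function classifyKeyword(prompt: string): string | null;
declare function classifyEmbedding(prompt: string): Promise<string>;
declare function thresholdFor(intent: string): number;
declare function runLocal(model: string, prompt: string): Promise<LocalResult>;
declare function cheapestCapable(intent: string): string;
declare function runCloud(model: string, prompt: string): Promise<{ text: string }>;
declare function collectTrainingPair(prompt: string, text: string): Promise<void>;

async function route(prompt: string): Promise<string> {
  const cached = await semanticCache.lookup(prompt);      // tier 1: cache hit, $0
  if (cached) return cached;

  // Cheap keyword pass first; fall back to embeddings when uncertain.
  const intent = classifyKeyword(prompt) ?? (await classifyEmbedding(prompt));

  for (const model of ["llama3.2", "llama3.1:8b"]) {      // tier 2: local models, $0
    const { text, confidence } = await runLocal(model, prompt);
    if (confidence >= thresholdFor(intent)) {
      await semanticCache.store(prompt, text);
      return text;
    }
  }

  const { text } = await runCloud(cheapestCapable(intent), prompt); // tier 3: cloud, $$
  await collectTrainingPair(prompt, text);                // feeds the distillation flywheel
  await semanticCache.store(prompt, text);
  return text;
}
```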
The confidence threshold isn't static — it adapts per capability based on real performance data. Summarization might need 0.7 confidence locally. Code generation might need 0.95. FrugalRoute figures this out from your traffic.
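As an illustration only (this is not FrugalRoute's documented algorithm), per-capability adaptation could look like a clamped nudge over observed local-model outcomes:

```typescript
// Illustrative only: adapt each capability's local-confidence bar from outcomes.
// ALPHA, the 0.8 starting point, and the clamp range are all assumptions.
const thresholds = new Map<string, number>(); // capability -> required confidence
const ALPHA = 0.05;                           // adaptation rate

function recordOutcome(capability: string, localAnswerWasGood: boolean): void {
  const current = thresholds.get(capability) ?? 0.8;
  // Successes nudge the bar down (more traffic stays local);
  // failures push it up (escalate to the cloud sooner).
  const nudged = localAnswerWasGood ? current - ALPHA : current + 2 * ALPHA;
  thresholds.set(capability, Math.min(0.99, Math.max(0.5, nudged)));
}
```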
This is what no other router does.
Every time FrugalRoute escalates to the cloud, it captures the prompt and response as a training pair. Over time, you run the distillation pipeline, and your local models absorb the capabilities they used to delegate. Cloud spend decreases. Automatically.
```
Traffic ──▶ Local model fails ──▶ Cloud handles it
                                         │
             Training pair collected ◀───┘
                        │
             Local model fine-tuned
                        │
Next time: local model handles it ──▶ $0
```
The integrity layer (based on TruthKeeper research) ensures you never train on stale, contradicted, or low-quality data. Every training pair is dependency-tracked and integrity-verified before it touches your models.
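A rough sketch of the capture step, assuming hypothetical `judgeQuality` and `passesIntegrity` helpers (the real judge agent and TruthKeeper APIs live inside FrugalRoute, and this schema is invented for illustration):

```typescript
// Sketch under assumed names; stores pairs locally via Bun's built-in SQLite.
import { Database } from "bun:sqlite";

declare function judgeQuality(prompt: string, response: string): Promise<number>;
declare function passesIntegrity(prompt: string, response: string): Promise<boolean>;

const db = new Database("training-pairs.db");
db.run(`CREATE TABLE IF NOT EXISTS pairs
        (prompt TEXT, response TEXT, score REAL, verified INTEGER)`);

async function capturePair(prompt: string, response: string): Promise<void> {
  const score = await judgeQuality(prompt, response);   // quality-scored by a judge
  const ok = await passesIntegrity(prompt, response);   // integrity-verified first
  db.query("INSERT INTO pairs VALUES (?, ?, ?, ?)")
    .run(prompt, response, score, ok ? 1 : 0);
  // A later distillation run consumes only verified, high-scoring rows.
}
```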
You're shipping fast and watching costs. FrugalRoute gives you GPT-4-level output on a ramen budget. Local models handle the bulk — cloud kicks in only when it matters. No infra team required. You'll love: zero-config start, auto-learning, cost tracking per feature.

You need governance, auditability, and vendor independence. FrugalRoute gives you per-key budgets, A/B testing across providers, full request provenance, and Prometheus metrics — without touching a single line of application code. You'll love: virtual API keys, guardrails pipeline, budget enforcement, self-hosted deployment.

You're tired of manually benchmarking models. FrugalRoute profiles your hardware, learns which models excel at what, and auto-adjusts routing weights from real traffic. The distillation pipeline means your local models get smarter over time — automatically. You'll love: judge agent, multi-sampling, TruthKeeper integrity, hardware auto-profiling.
Install from npm (recommended):

```bash
npm install -g frugalroute
frugalroute
```

Or run without installing:

```bash
npx frugalroute
# or with bun
bunx frugalroute
```

From source:

```bash
git clone https://github.com/SimplyLiz/FrugalRoute && cd FrugalRoute
bun install
cp .env.example .env
bun run dev
```

Pull at least one local model and the embedding model:

```bash
ollama pull llama3.2
ollama pull nomic-embed-text
```

Point any OpenAI client at http://localhost:3100/v1 and set model to "auto".
Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3100/v1", api_key="unused")
r = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain monads like I'm five"}]
)
print(r.choices[0].message.content)
```

TypeScript

```typescript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:3100/v1", apiKey: "unused" });
const r = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Explain monads like I'm five" }],
});
console.log(r.choices[0].message.content);
```

curl

```bash
curl http://localhost:3100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Explain monads like I'\''m five"}]}'
```

Ruby

```ruby
require "openai"

client = OpenAI::Client.new(uri_base: "http://localhost:3100/v1", access_token: "unused")
r = client.chat(parameters: {
  model: "auto",
  messages: [{ role: "user", content: "Explain monads like I'm five" }]
})
puts r.dig("choices", 0, "message", "content")
```

Go

```go
cfg := openai.DefaultConfig("unused")
cfg.BaseURL = "http://localhost:3100/v1"
client := openai.NewClientWithConfig(cfg)
resp, _ := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{ // error handling elided
	Model:    "auto",
	Messages: []openai.ChatCompletionMessage{{Role: "user", Content: "Explain monads like I'm five"}},
})
fmt.Println(resp.Choices[0].Message.Content)
```

All clients hit the same endpoint. FrugalRoute picks the model, runs inference, returns OpenAI-shaped JSON.
Every LLM gateway proxies requests. None of them think about them.
| Capability | liteLLM | OpenRouter | Portkey | Bifrost | FrugalRoute |
|---|---|---|---|---|---|
| OpenAI-compatible drop-in | Yes | Yes | Yes | Yes | Yes |
| Routes by capability, not model name | | | | | Yes |
| Local-first (Ollama, Apple Silicon) | | | | | Yes |
| Semantic intent classification | | | | | Yes |
| Confidence-based escalation cascade | | | | | Yes |
| Two-tier semantic cache | | | Simple | | Yes |
| Learns from traffic, self-improves | | | | | Yes |
| Distills cloud into local models | | | | | Yes |
| Hardware auto-profiling | | | | | Yes |
| Budget enforcement per key/session | Partial | | Partial | | Yes |
| A/B testing across models | | | | | Yes |
| MCP tool interoperability | | | | Partial | Yes |
| Self-hosted, no vendor lock-in | Yes | | Yes | Yes | Yes |
liteLLM is a great proxy. It connects 100+ providers behind one API. But it doesn't know what your prompt needs — you still pick the model. No local tier, no caching, no learning.
OpenRouter is a managed marketplace. Not self-hosted. Your data leaves your network.
Portkey has solid reliability features — retries, fallbacks, circuit breaking. But it routes by provider weight, not by prompt intent. No local models. No distillation.
Bifrost is fast (11 µs overhead). But it's a load balancer, not a router. It doesn't understand what your request needs.
They move traffic. FrugalRoute makes decisions.
```ini
# .env
PORT=3100
OLLAMA_BASE_URL=http://localhost:11434
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
EMBEDDING_MODEL=nomic-embed-text
DEFAULT_MAX_COST_PER_REQUEST=0.01
```

```yaml
# config/models.yaml
aliases:
  fast: gemma3-4b
  smart: claude-sonnet-4-20250514
  cheap: llama3.2
```

Full configuration reference: docs/user/configuration.mdx
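With the aliases above in place, any client can target a tier by name instead of a model ID, for example:

```typescript
// Aliases route through the same endpoint as "auto"; just name the tier.
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:3100/v1", apiKey: "unused" });
const r = await client.chat.completions.create({
  model: "cheap", // resolves to llama3.2 via config/models.yaml
  messages: [{ role: "user", content: "Draft a polite follow-up email" }],
});
console.log(r.choices[0].message.content);
```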
| Guide | What it covers |
|---|---|
| Getting Started | Install, first request, connect existing clients |
| Architecture | Module map, request flow, design principles |
| Routing | Classification, escalation, bidding, weight adjustment |
| Caching | Two-tier semantic cache, adaptive thresholds |
| Cost Management | Estimation, tracking, budget enforcement |
| Configuration | Env vars, routes, models, budgets, thresholds |
| Deployment | Docker, production hardening, hardware profiling |
| Tools & MCP | Tool registry, MCP integration, format conversion |
| Distillation | Training flywheel, TruthKeeper integrity |
| API Reference | Complete HTTP endpoint reference |
Do I need Ollama installed?
For local inference, yes. FrugalRoute uses Ollama as its local model backend. Without it, requests route straight to cloud providers — which still gives you caching, cost tracking, and budget enforcement, but you miss the free local tier.
What models does it support?
Any model Ollama can run (Llama, Mistral, Gemma, Phi, Qwen, DeepSeek, etc.), plus OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5) and Anthropic (Claude Opus, Sonnet, Haiku). Adding a new provider is one adapter interface.
Does it work with streaming?
Yes. Full SSE streaming with heartbeat keepalive, compatible with the OpenAI streaming format. Set "stream": true in your request — same as you would with OpenAI directly.
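For example, with the TypeScript SDK:

```typescript
// Streaming through FrugalRoute, using the standard OpenAI SDK stream API.
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:3100/v1", apiKey: "unused" });
const stream = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Write a haiku about caching" }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```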
What's the routing overhead?
Keyword classification adds <1ms. Embedding-based classification adds ~4ms. Cache hits return in ~1ms. The routing decision itself is negligible compared to model inference time.
Can I force a specific model?
Yes. Set model to any registered model name (e.g., "gpt-4o", "llama3.2") instead of "auto". FrugalRoute will route directly to that model while still tracking cost and logging the request. You can also use aliases like "fast", "smart", or "cheap".
Is my data sent anywhere?
FrugalRoute is fully self-hosted. Local model requests never leave your machine. Cloud requests go directly to OpenAI/Anthropic — FrugalRoute never proxies through a third-party service. Training pairs for distillation are stored locally in SQLite.
How does distillation actually work?
When a request escalates to a cloud model, the prompt-response pair is captured, quality-scored by a judge agent, and stored locally. Running `bun run distill` feeds verified pairs into a fine-tuning pipeline for your local models. The TruthKeeper integrity layer ensures only high-quality, non-contradicted data is used. See Distillation docs.
What about function calling / tool use?
Supported. FrugalRoute's MCP tool registry unifies tools across MCP, OpenAI, and Anthropic formats. Tool calls are routed to the correct backend automatically.
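For example, passing a standard OpenAI-format tool definition (the weather tool below is illustrative, not one FrugalRoute ships):

```typescript
// Tool use through FrugalRoute with an OpenAI-format tool definition.
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:3100/v1", apiKey: "unused" });
const r = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "What's the weather in Berlin?" }],
  tools: [{
    type: "function",
    function: {
      name: "get_weather",
      description: "Get current weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  }],
});
console.log(r.choices[0].message.tool_calls);
```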
Bun + Hono + Ollama + TypeScript
445 tests. 1,196 assertions. Two production dependencies.
```bash
git clone https://github.com/SimplyLiz/FrugalRoute && cd FrugalRoute
bun install
bun test            # run all 445 tests
bun run dev         # start dev server with hot reload
bun run lint        # lint with Biome
bun run benchmark   # run hardware benchmarks
bun run calibrate   # calibrate keyword classifier thresholds
```

PRs welcome. Please run `bun run check` (lint + tests) before submitting.
PolyForm Small Business License 1.0.0 — free for individuals, small businesses (<100 people, <1M EUR revenue), nonprofits, and open source projects.
Commercial license for larger organizations: lisa@tastehub.io
Stop overpaying for AI. Start routing.
```bash
bunx frugalroute
```