
FrugalRoute


Your AI bill is a routing problem.


The Problem · The Fix · Quickstart · How It Works · Who It's For · Competition · Docs


npx frugalroute — one command, running in seconds


The Problem

You're burning money on AI and you know it.

Every request goes to the same expensive cloud model — whether it's a trivial FAQ or a complex reasoning task. Your "summarize this email" costs the same as your "analyze this legal contract." Your team hardcodes model: "gpt-4o" because switching models means rewriting code. And when the bill lands, nobody can tell you which requests actually needed that firepower.

The real cost isn't the model. It's the lack of decision-making between your app and the model.

Meanwhile, that M4 MacBook Pro sitting on your desk? It can run an 8B parameter model at 50+ tokens/sec. For free. Right now. For 80% of your prompts, that's more than enough.

But nobody's using it, because wiring up local models, fallback logic, cost tracking, and caching is a month of engineering you'll never get approved.

The Fix

FrugalRoute is one line of config between your app and your models.

# Before: hardcoded, expensive, blind
client = OpenAI(api_key="sk-...")

# After: routed, cached, tracked, learning
client = OpenAI(base_url="http://localhost:3100/v1", api_key="unused")

That's it. Same OpenAI SDK. Same code. FrugalRoute intercepts every request and makes a decision:

  1. Can a local model handle this? Run it on Ollama. Cost: $0.
  2. Seen this before? Return it from the semantic cache. Cost: $0. Latency: ~1ms.
  3. Needs more muscle? Escalate to the cloud — but only the cheapest cloud model that's capable enough.
  4. Learn from it. Every cloud call becomes training data. Next time, the local model handles it.

The more you use it, the less you spend.

curl http://localhost:3100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is dependency injection?"}]
  }'

That returned a standard OpenAI response. Your app doesn't know or care which model answered. FrugalRoute picked a local Llama model, skipped the cloud entirely, and logged the cost as $0.00.
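The four-step decision above can be condensed into a self-contained toy. Everything here is illustrative — the stubs, names, and the 0.7 threshold are invented for the sketch, not FrugalRoute's internals:

```typescript
// Toy version of the cascade: cache → local → cloud → collect training data.
// All model calls are stubbed; nothing here is FrugalRoute's real API.
type Routed = { content: string; cost: number; source: "cache" | "local" | "cloud" };

const cache = new Map<string, string>();
const trainingPairs: Array<{ prompt: string; response: string }> = [];

// Stub: pretend short prompts are easy for the local 8B model.
const runLocal = (p: string) => ({ text: `local: ${p}`, confidence: p.length < 40 ? 0.9 : 0.3 });
const runCloud = (p: string) => ({ text: `cloud: ${p}`, cost: 0.002 });

function route(prompt: string): Routed {
  const hit = cache.get(prompt);
  if (hit) return { content: hit, cost: 0, source: "cache" };   // step 2: $0, ~1ms

  const local = runLocal(prompt);                               // step 1: Ollama, $0
  if (local.confidence >= 0.7) {
    cache.set(prompt, local.text);
    return { content: local.text, cost: 0, source: "local" };
  }

  const cloud = runCloud(prompt);                               // step 3: cheapest capable cloud
  trainingPairs.push({ prompt, response: cloud.text });         // step 4: fuel for distillation
  cache.set(prompt, cloud.text);
  return { content: cloud.text, cost: cloud.cost, source: "cloud" };
}
```

A short question resolves locally; asking it again hits the cache; a long analytical prompt escalates to the cloud and leaves a training pair behind.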

How It Works

    Your app                                                    Models
   ┌────────┐         ┌──────────────────────────────┐
   │ OpenAI │──HTTP──▶│         FrugalRoute          │──▶  Ollama     (local, free)
   │  SDK   │         │                              │──▶  OpenAI     (cloud, metered)
   │        │◀──JSON──│  :3100/v1/chat/completions   │──▶  Anthropic  (cloud, metered)
   └────────┘         └──────────────────────────────┘

The Cascade

Every request flows through a priority chain — cheapest first, escalate only when necessary.

  Semantic Cache ──hit?──▶ return instantly ($0)
       │ miss
  Keyword Classifier ──obvious?──▶ route directly (<1ms)
       │ uncertain
  Embedding Classifier ──▶ classify intent (~4ms)
       │
  Local Model (Ollama) ──confident?──▶ return ($0)
       │ low confidence
  Bigger Local Model ──confident?──▶ return ($0)
       │ still low
  Cloud Model (cheapest capable) ──▶ return ($$)
       │
  Collect training pair ──▶ distill into local models

The confidence threshold isn't static — it adapts per capability based on real performance data. Summarization might need 0.7 confidence locally. Code generation might need 0.95. FrugalRoute figures this out from your traffic.
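One plausible shape for that adaptation, as a rough sketch (the step size and clamp bounds here are invented, not FrugalRoute's actual algorithm):

```typescript
// Illustrative per-capability threshold adaptation. When a locally-answered
// request later proves inadequate, raise the bar; when it holds up, lower it.
const thresholds = new Map<string, number>([
  ["summarize", 0.7],
  ["codegen", 0.95],
]);

function recordOutcome(capability: string, localWasGoodEnough: boolean): number {
  const current = thresholds.get(capability) ?? 0.8;
  const step = 0.02; // invented learning rate
  const next = localWasGoodEnough ? current - step : current + step;
  const clamped = Math.min(0.99, Math.max(0.5, next)); // never fully trust or distrust
  thresholds.set(capability, clamped);
  return clamped;
}
```

Over enough traffic, each capability settles at the lowest threshold that still keeps quality acceptable — which is exactly where the cloud spend stops.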

The Flywheel

This is what no other router does.

Every time FrugalRoute escalates to the cloud, it captures the prompt and response as a training pair. Over time, you run the distillation pipeline, and your local models absorb the capabilities they used to delegate. Cloud spend decreases. Automatically.

  Traffic ──▶ Local model fails ──▶ Cloud handles it
                                         │
              Training pair collected ◀───┘
                     │
              Local model fine-tuned
                     │
              Next time: local model handles it ──▶ $0

The integrity layer (based on TruthKeeper research) ensures you never train on stale, contradicted, or low-quality data. Every training pair is dependency-tracked and integrity-verified before it touches your models.
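In spirit, the capture-and-gate step looks something like this (the shapes and the 0.8 quality bar are hypothetical, not TruthKeeper's real interface):

```typescript
// Illustrative flywheel capture: keep only judged, non-contradicted pairs.
type Pair = { prompt: string; response: string; quality: number };

const verified: Pair[] = [];

function captureEscalation(pair: Pair, contradicted: boolean): boolean {
  // Integrity gate: discard stale/contradicted or low-quality data
  // before it can reach the distillation pipeline.
  if (contradicted || pair.quality < 0.8) return false;
  verified.push(pair);
  return true;
}
```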


Who It's For

Startups & Small Teams

You're shipping fast and watching costs. FrugalRoute gives you GPT-4-level output on a ramen budget. Local models handle the bulk — cloud kicks in only when it matters. No infra team required.

You'll love: Zero-config start, auto-learning, cost tracking per feature.

Enterprise & Platform Teams

You need governance, auditability, and vendor independence. FrugalRoute gives you per-key budgets, A/B testing across providers, full request provenance, and Prometheus metrics — without touching a single line of application code.

You'll love: Virtual API keys, guardrails pipeline, budget enforcement, self-hosted deployment.

AI/ML Engineers

You're tired of manually benchmarking models. FrugalRoute profiles your hardware, learns which models excel at what, and auto-adjusts routing weights from real traffic. The distillation pipeline means your local models get smarter over time — automatically.

You'll love: Judge agent, multi-sampling, TruthKeeper integrity, hardware auto-profiling.


Quickstart

Install from npm (recommended):

npm install -g frugalroute
frugalroute

Or run without installing:

npx frugalroute
# or with bun
bunx frugalroute

From source:

git clone https://github.com/SimplyLiz/FrugalRoute && cd FrugalRoute
bun install
cp .env.example .env
bun run dev

Pull at least one local model and the embedding model:

ollama pull llama3.2
ollama pull nomic-embed-text

Point any OpenAI client at http://localhost:3100/v1 and set model to "auto".

Python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3100/v1", api_key="unused")
r = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain monads like I'm five"}]
)
print(r.choices[0].message.content)
TypeScript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:3100/v1", apiKey: "unused" });
const r = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Explain monads like I'm five" }],
});
console.log(r.choices[0].message.content);
curl
curl http://localhost:3100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Explain monads like I'\''m five"}]}'
Ruby
require "openai"

client = OpenAI::Client.new(uri_base: "http://localhost:3100/v1", access_token: "unused")
r = client.chat(parameters: {
  model: "auto",
  messages: [{ role: "user", content: "Explain monads like I'm five" }]
})
puts r.dig("choices", 0, "message", "content")
Go
cfg := openai.DefaultConfig("unused")
cfg.BaseURL = "http://localhost:3100/v1"
client := openai.NewClientWithConfig(cfg)

resp, _ := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
    Model:    "auto",
    Messages: []openai.ChatCompletionMessage{{Role: "user", Content: "Explain monads like I'm five"}},
})
fmt.Println(resp.Choices[0].Message.Content)

All clients hit the same endpoint. FrugalRoute picks the model, runs inference, returns OpenAI-shaped JSON.


What's Under the Hood

Routing & Classification

  • Semantic intent classification via embeddings (nomic-embed-text)
  • Sub-1ms keyword pre-classifier for obvious cases
  • Composite scoring with cascade confidence
  • Capability matching — models declare strengths, requests state needs
  • Multi-model sampling with judge or majority voting
  • A/B testing with weighted traffic splits
  • Sticky sessions for multi-turn conversation consistency
  • Agent-specific routing strategies

Performance & Reliability

  • Two-tier cache: exact-match LRU + vector similarity
  • PeakEWMA latency tracking — routes around degraded providers
  • Error-type aware circuit breaker (429 vs 500 vs timeout)
  • Full SSE streaming with heartbeat keepalive
  • Graceful shutdown with in-flight request draining
  • Hardware auto-profiling (Apple Silicon, CUDA, ROCm)
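The PeakEWMA idea fits in a few lines. This is a simplified sketch (a production tracker also decays the retained peak over time):

```typescript
// Simplified peak-EWMA: jump to latency spikes immediately, forgive them slowly.
// A provider that just served a 900ms response keeps a high estimate for a
// while, so the router steers traffic elsewhere until it recovers.
class PeakEwma {
  private estimate = 0;
  constructor(private readonly alpha = 0.2) {}
  observe(latencyMs: number): number {
    this.estimate =
      latencyMs > this.estimate
        ? latencyMs // spike: track the peak instantly
        : this.alpha * latencyMs + (1 - this.alpha) * this.estimate; // decay gradually
    return this.estimate;
  }
}

const providerLatency = new PeakEwma();
[50, 55, 900, 60].forEach((ms) => providerLatency.observe(ms));
// the estimate stays well above the median for a while after the spike
```

The asymmetry is the point: a plain EWMA averages a spike away almost immediately, which is exactly when you should be routing around the provider.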

Cost & Governance

  • Real-time cost tracking per request, key, session, and tag
  • Pre-flight budget enforcement — stops before it spends
  • Cache-aware pricing in routing decisions
  • Virtual API keys with independent limits per team
  • Token bucket rate limiting per key
  • Windowed budgets with configurable time windows
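Per-key token-bucket limiting, reduced to its core (an illustrative sketch, not the shipped code):

```typescript
// Each virtual key gets a bucket: bursts up to `capacity`, then a steady refill.
class TokenBucket {
  private tokens: number;
  private last = Date.now();
  constructor(private readonly capacity: number, private readonly refillPerSec: number) {
    this.tokens = capacity;
  }
  tryTake(): boolean {
    const now = Date.now();
    this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // throttled — surface to the caller as HTTP 429
  }
}

const teamKey = new TokenBucket(3, 1); // burst of 3 requests, refills 1/sec
```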

Learning & Distillation

  • Routing weights adapt from real success/failure signals
  • Judge agent for structural quality evaluation
  • Distillation pipeline: cloud responses train local models
  • TruthKeeper integrity layer prevents stale training data
  • Epistemic state tracking (Supported / Hypothesis / Contested)
  • Conversation compaction for long context management

Operations

  • Model aliases: fast, smart, cheap — decouple code from models
  • Prometheus metrics (frugalroute_*)
  • YAML model config (config/models.yaml)
  • OpenAPI spec at /openapi.json
  • One-command calibration tooling

Extensibility

  • MCP tool registry (MCP + OpenAI + Anthropic tools, unified)
  • Guardrails pipeline for pre/post content filtering
  • Provider adapters: Ollama, OpenAI, Anthropic
  • Plug in new providers by implementing one interface
  • Bidding/auction system for ambiguous routing decisions
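"One interface" plausibly means something like the following — a hypothetical shape for illustration; check the source for the real signature:

```typescript
// Hypothetical provider adapter shape. A new backend plugs in by answering
// two questions: which models do you serve, and how do you chat?
interface ProviderAdapter {
  readonly name: string;
  supports(model: string): boolean;
  chat(req: {
    model: string;
    messages: { role: "system" | "user" | "assistant"; content: string }[];
  }): Promise<{ content: string; inputTokens: number; outputTokens: number }>;
}

// Trivial adapter that echoes the last user message — useful as a test double.
const echoAdapter: ProviderAdapter = {
  name: "echo",
  supports: (model) => model.startsWith("echo-"),
  async chat(req) {
    const last = req.messages[req.messages.length - 1];
    return { content: last.content, inputTokens: 0, outputTokens: 0 };
  },
};
```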

The Competition

Every LLM gateway proxies requests. None of them thinks about what those requests actually need.

                                        liteLLM   OpenRouter   Portkey   Bifrost   FrugalRoute
  OpenAI-compatible drop-in               Yes        Yes         Yes       Yes        Yes
  Routes by capability, not model name                                                Yes
  Local-first (Ollama, Apple Silicon)                                                 Yes
  Semantic intent classification                                                      Yes
  Confidence-based escalation cascade                                                 Yes
  Two-tier semantic cache                                       Simple               Yes
  Learns from traffic, self-improves                                                  Yes
  Distills cloud into local models                                                    Yes
  Hardware auto-profiling                                                             Yes
  Budget enforcement per key/session     Partial               Partial               Yes
  A/B testing across models                                                           Yes
  MCP tool interoperability              Partial                                      Yes
  Self-hosted, no vendor lock-in          Yes                    Yes       Yes        Yes

liteLLM is a great proxy. It connects 100+ providers behind one API. But it doesn't know what your prompt needs — you still pick the model. No local tier, no caching, no learning.

OpenRouter is a managed marketplace. Not self-hosted. Your data leaves your network.

Portkey has solid reliability features — retries, fallbacks, circuit breaking. But it routes by provider weight, not by prompt intent. No local models. No distillation.

Bifrost is fast (11 µs overhead). But it's a load balancer, not a router. It doesn't understand what your request needs.

They move traffic. FrugalRoute makes decisions.


Configuration

# .env
PORT=3100
OLLAMA_BASE_URL=http://localhost:11434
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
EMBEDDING_MODEL=nomic-embed-text
DEFAULT_MAX_COST_PER_REQUEST=0.01
# config/models.yaml
aliases:
  fast: gemma3-4b
  smart: claude-sonnet-4-20250514
  cheap: llama3.2

Full configuration reference: docs/user/configuration.mdx


Documentation

Guide What it covers
Getting Started Install, first request, connect existing clients
Architecture Module map, request flow, design principles
Routing Classification, escalation, bidding, weight adjustment
Caching Two-tier semantic cache, adaptive thresholds
Cost Management Estimation, tracking, budget enforcement
Configuration Env vars, routes, models, budgets, thresholds
Deployment Docker, production hardening, hardware profiling
Tools & MCP Tool registry, MCP integration, format conversion
Distillation Training flywheel, TruthKeeper integrity
API Reference Complete HTTP endpoint reference

FAQ

Do I need Ollama installed?

For local inference, yes. FrugalRoute uses Ollama as its local model backend. Without it, requests route straight to cloud providers — which still gives you caching, cost tracking, and budget enforcement, but you miss the free local tier.

What models does it support?

Any model Ollama can run (Llama, Mistral, Gemma, Phi, Qwen, DeepSeek, etc.), plus OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5) and Anthropic (Claude Opus, Sonnet, Haiku). Adding a new provider is one adapter interface.

Does it work with streaming?

Yes. Full SSE streaming with heartbeat keepalive, compatible with the OpenAI streaming format. Set "stream": true in your request — same as you would with OpenAI directly.

What's the routing overhead?

Keyword classification adds <1ms. Embedding-based classification adds ~4ms. Cache hits return in ~1ms. The routing decision itself is negligible compared to model inference time.

Can I force a specific model?

Yes. Set model to any registered model name (e.g., "gpt-4o", "llama3.2") instead of "auto". FrugalRoute will route directly to that model while still tracking cost and logging the request. You can also use aliases like "fast", "smart", or "cheap".

Is my data sent anywhere?

FrugalRoute is fully self-hosted. Local model requests never leave your machine. Cloud requests go directly to OpenAI/Anthropic — FrugalRoute never proxies through a third-party service. Training pairs for distillation are stored locally in SQLite.

How does distillation actually work?

When a request escalates to a cloud model, the prompt-response pair is captured, quality-scored by a judge agent, and stored locally. Running bun run distill feeds verified pairs into a fine-tuning pipeline for your local models. The TruthKeeper integrity layer ensures only high-quality, non-contradicted data is used. See Distillation docs.

What about function calling / tool use?

Supported. FrugalRoute's MCP tool registry unifies tools across MCP, OpenAI, and Anthropic formats. Tool calls are routed to the correct backend automatically.


Built With

Bun + Hono + Ollama + TypeScript

445 tests. 1,196 assertions. Two production dependencies.


Contributing

git clone https://github.com/SimplyLiz/FrugalRoute && cd FrugalRoute
bun install
bun test           # run all 445 tests
bun run dev        # start dev server with hot reload
bun run lint       # lint with Biome
bun run benchmark  # run hardware benchmarks
bun run calibrate  # calibrate keyword classifier thresholds

PRs welcome. Please run bun run check (lint + tests) before submitting.


License

PolyForm Small Business License 1.0.0: free for individuals, small businesses (<100 people, <1M EUR revenue), nonprofits, and open source projects.

Commercial license for larger organizations: lisa@tastehub.io


Stop overpaying for AI. Start routing.

bunx frugalroute

About

Capability-centric, local-first LLM routing layer
