
FrugalRoute


Your AI bill is a routing problem.


The Problem · The Fix · Quickstart · How It Works · Who It's For · Competition · Docs


npx frugalroute — one command, running in seconds


The Problem

You're burning money on AI and you know it.

Every request goes to the same expensive cloud model — whether it's a trivial FAQ or a complex reasoning task. Your "summarize this email" costs the same as your "analyze this legal contract." Your team hardcodes model: "gpt-4o" because switching models means rewriting code. And when the bill lands, nobody can tell you which requests actually needed that firepower.

The real cost isn't the model. It's the lack of decision-making between your app and the model.

Meanwhile, that M4 MacBook Pro sitting on your desk? It can run an 8B parameter model at 50+ tokens/sec. For free. Right now. For 80% of your prompts, that's more than enough.

But nobody's using it, because wiring up local models, fallback logic, cost tracking, and caching is a month of engineering you'll never get approved.

The Fix

FrugalRoute is one line of config between your app and your models.

# Before: hardcoded, expensive, blind
client = OpenAI(api_key="sk-...")

# After: routed, cached, tracked, learning
client = OpenAI(base_url="http://localhost:3100/v1", api_key="unused")

That's it. Same OpenAI SDK. Same code. FrugalRoute intercepts every request and makes a decision:

  1. Can a local model handle this? Run it on Ollama. Cost: $0.
  2. Seen this before? Return it from the semantic cache. Cost: $0. Latency: ~1ms.
  3. Needs more muscle? Escalate to the cloud — but only the cheapest cloud model that's capable enough.
  4. Learn from it. Every cloud call becomes training data. Next time, the local model handles it.

The more you use it, the less you spend.

curl http://localhost:3100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is dependency injection?"}]
  }'

That returned a standard OpenAI response. Your app doesn't know or care which model answered. FrugalRoute picked a local Llama model, skipped the cloud entirely, and logged the cost as $0.00.
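The four-step decision above can be condensed into a self-contained toy. Everything here is illustrative — the stubs, names, and the 0.7 threshold are invented for the sketch, not FrugalRoute's internals:

```typescript
// Toy version of the cascade: cache → local → cloud → collect training data.
// All model calls are stubbed; nothing here is FrugalRoute's real API.
type Routed = { content: string; cost: number; source: "cache" | "local" | "cloud" };

const cache = new Map<string, string>();
const trainingPairs: Array<{ prompt: string; response: string }> = [];

// Stub: pretend short prompts are easy for the local 8B model.
const runLocal = (p: string) => ({ text: `local: ${p}`, confidence: p.length < 40 ? 0.9 : 0.3 });
const runCloud = (p: string) => ({ text: `cloud: ${p}`, cost: 0.002 });

function route(prompt: string): Routed {
  const hit = cache.get(prompt);
  if (hit) return { content: hit, cost: 0, source: "cache" };   // step 2: $0, ~1ms

  const local = runLocal(prompt);                               // step 1: Ollama, $0
  if (local.confidence >= 0.7) {
    cache.set(prompt, local.text);
    return { content: local.text, cost: 0, source: "local" };
  }

  const cloud = runCloud(prompt);                               // step 3: cheapest capable cloud
  trainingPairs.push({ prompt, response: cloud.text });         // step 4: fuel for distillation
  cache.set(prompt, cloud.text);
  return { content: cloud.text, cost: cloud.cost, source: "cloud" };
}
```

A short question resolves locally; asking it again hits the cache; a long analytical prompt escalates to the cloud and leaves a training pair behind.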

How It Works

    Your app                                                    Models
   ┌────────┐         ┌──────────────────────────────┐
   │ OpenAI │──HTTP──▶│         FrugalRoute          │──▶  Ollama     (local, free)
   │  SDK   │         │                              │──▶  OpenAI     (cloud, metered)
   │        │◀──JSON──│  :3100/v1/chat/completions   │──▶  Anthropic  (cloud, metered)
   └────────┘         └──────────────────────────────┘

The Cascade

Every request flows through a priority chain — cheapest first, escalate only when necessary.

  Semantic Cache ──hit?──▶ return instantly ($0)
       │ miss
  Keyword Classifier ──obvious?──▶ route directly (<1ms)
       │ uncertain
  Embedding Classifier ──▶ classify intent (~4ms)
       │
  Local Model (Ollama) ──confident?──▶ return ($0)
       │ low confidence
  Bigger Local Model ──confident?──▶ return ($0)
       │ still low
  Cloud Model (cheapest capable) ──▶ return ($$)
       │
  Collect training pair ──▶ distill into local models

The confidence threshold isn't static — it adapts per capability based on real performance data. Summarization might need 0.7 confidence locally. Code generation might need 0.95. FrugalRoute figures this out from your traffic.
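One plausible shape for that adaptation, as a rough sketch (the step size and clamp bounds here are invented, not FrugalRoute's actual algorithm):

```typescript
// Illustrative per-capability threshold adaptation. When a locally-answered
// request later proves inadequate, raise the bar; when it holds up, lower it.
const thresholds = new Map<string, number>([
  ["summarize", 0.7],
  ["codegen", 0.95],
]);

function recordOutcome(capability: string, localWasGoodEnough: boolean): number {
  const current = thresholds.get(capability) ?? 0.8;
  const step = 0.02; // invented learning rate
  const next = localWasGoodEnough ? current - step : current + step;
  const clamped = Math.min(0.99, Math.max(0.5, next)); // never fully trust or distrust
  thresholds.set(capability, clamped);
  return clamped;
}
```

Over enough traffic, each capability settles at the lowest threshold that still keeps quality acceptable — which is exactly where the cloud spend stops.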

The Flywheel

This is what no other router does.

Every time FrugalRoute escalates to the cloud, it captures the prompt and response as a training pair. Over time, you run the distillation pipeline, and your local models absorb the capabilities they used to delegate. Cloud spend decreases. Automatically.

  Traffic ──▶ Local model fails ──▶ Cloud handles it
                                         │
              Training pair collected ◀───┘
                     │
              Local model fine-tuned
                     │
              Next time: local model handles it ──▶ $0

The integrity layer (based on TruthKeeper research) ensures you never train on stale, contradicted, or low-quality data. Every training pair is dependency-tracked and integrity-verified before it touches your models.
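In spirit, the capture-and-gate step looks something like this (the shapes and the 0.8 quality bar are hypothetical, not TruthKeeper's real interface):

```typescript
// Illustrative flywheel capture: keep only judged, non-contradicted pairs.
type Pair = { prompt: string; response: string; quality: number };

const verified: Pair[] = [];

function captureEscalation(pair: Pair, contradicted: boolean): boolean {
  // Integrity gate: discard stale/contradicted or low-quality data
  // before it can reach the distillation pipeline.
  if (contradicted || pair.quality < 0.8) return false;
  verified.push(pair);
  return true;
}
```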


Who It's For

Startups & Small Teams

You're shipping fast and watching costs. FrugalRoute gives you GPT-4-level output on a ramen budget. Local models handle the bulk — cloud kicks in only when it matters. No infra team required.

You'll love: Zero-config start, auto-learning, cost tracking per feature.

Enterprise & Platform Teams

You need governance, auditability, and vendor independence. FrugalRoute gives you per-key budgets, A/B testing across providers, full request provenance, and Prometheus metrics — without touching a single line of application code.

You'll love: Virtual API keys, guardrails pipeline, budget enforcement, self-hosted deployment.

AI/ML Engineers

You're tired of manually benchmarking models. FrugalRoute profiles your hardware, learns which models excel at what, and auto-adjusts routing weights from real traffic. The distillation pipeline means your local models get smarter over time — automatically.

You'll love: Judge agent, multi-sampling, TruthKeeper integrity, hardware auto-profiling.


Quickstart

Install from npm (recommended):

npm install -g frugalroute
frugalroute

Or run without installing:

npx frugalroute
# or with bun
bunx frugalroute

From source:

git clone https://github.com/SimplyLiz/FrugalRoute && cd FrugalRoute
bun install
cp .env.example .env
bun run dev

Pull at least one local model and the embedding model:

ollama pull llama3.2
ollama pull nomic-embed-text

Point any OpenAI client at http://localhost:3100/v1 and set model to "auto".

Python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3100/v1", api_key="unused")
r = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain monads like I'm five"}]
)
print(r.choices[0].message.content)
TypeScript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:3100/v1", apiKey: "unused" });
const r = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Explain monads like I'm five" }],
});
console.log(r.choices[0].message.content);
curl
curl http://localhost:3100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Explain monads like I'\''m five"}]}'
Ruby
require "openai"

client = OpenAI::Client.new(uri_base: "http://localhost:3100/v1", access_token: "unused")
r = client.chat(parameters: {
  model: "auto",
  messages: [{ role: "user", content: "Explain monads like I'm five" }]
})
puts r.dig("choices", 0, "message", "content")
Go
cfg := openai.DefaultConfig("unused")
cfg.BaseURL = "http://localhost:3100/v1"
client := openai.NewClientWithConfig(cfg)

resp, _ := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
    Model:    "auto",
    Messages: []openai.ChatCompletionMessage{{Role: "user", Content: "Explain monads like I'm five"}},
})
fmt.Println(resp.Choices[0].Message.Content)

All clients hit the same endpoint. FrugalRoute picks the model, runs inference, returns OpenAI-shaped JSON.


What's Under the Hood

Routing & Classification

  • Semantic intent classification via embeddings (nomic-embed-text)
  • Sub-1ms keyword pre-classifier for obvious cases
  • Composite scoring with cascade confidence
  • Capability matching — models declare strengths, requests state needs
  • Multi-model sampling with judge or majority voting
  • A/B testing with weighted traffic splits
  • Sticky sessions for multi-turn conversation consistency
  • Agent-specific routing strategies

Performance & Reliability

  • Two-tier cache: exact-match LRU + vector similarity
  • PeakEWMA latency tracking — routes around degraded providers
  • Error-type aware circuit breaker (429 vs 500 vs timeout)
  • Full SSE streaming with heartbeat keepalive
  • Graceful shutdown with in-flight request draining
  • Hardware auto-profiling (Apple Silicon, CUDA, ROCm)
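The PeakEWMA idea fits in a few lines. This is a simplified sketch (a production tracker also decays the retained peak over time):

```typescript
// Simplified peak-EWMA: jump to latency spikes immediately, forgive them slowly.
// A provider that just served a 900ms response keeps a high estimate for a
// while, so the router steers traffic elsewhere until it recovers.
class PeakEwma {
  private estimate = 0;
  constructor(private readonly alpha = 0.2) {}
  observe(latencyMs: number): number {
    this.estimate =
      latencyMs > this.estimate
        ? latencyMs // spike: track the peak instantly
        : this.alpha * latencyMs + (1 - this.alpha) * this.estimate; // decay gradually
    return this.estimate;
  }
}

const providerLatency = new PeakEwma();
[50, 55, 900, 60].forEach((ms) => providerLatency.observe(ms));
// the estimate stays well above the median for a while after the spike
```

The asymmetry is the point: a plain EWMA averages a spike away almost immediately, which is exactly when you should be routing around the provider.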

Cost & Governance

  • Real-time cost tracking per request, key, session, and tag
  • Pre-flight budget enforcement — stops before it spends
  • Cache-aware pricing in routing decisions
  • Virtual API keys with independent limits per team
  • Token bucket rate limiting per key
  • Windowed budgets with configurable time windows
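Per-key token-bucket limiting, reduced to its core (an illustrative sketch, not the shipped code):

```typescript
// Each virtual key gets a bucket: bursts up to `capacity`, then a steady refill.
class TokenBucket {
  private tokens: number;
  private last = Date.now();
  constructor(private readonly capacity: number, private readonly refillPerSec: number) {
    this.tokens = capacity;
  }
  tryTake(): boolean {
    const now = Date.now();
    this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // throttled — surface to the caller as HTTP 429
  }
}

const teamKey = new TokenBucket(3, 1); // burst of 3 requests, refills 1/sec
```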

Learning & Distillation

  • Routing weights adapt from real success/failure signals
  • Judge agent for structural quality evaluation
  • Distillation pipeline: cloud responses train local models
  • TruthKeeper integrity layer prevents stale training data
  • Epistemic state tracking (Supported / Hypothesis / Contested)
  • Conversation compaction for long context management

Operations

  • Model aliases: fast, smart, cheap — decouple code from models
  • Prometheus metrics (frugalroute_*)
  • YAML model config (config/models.yaml)
  • OpenAPI spec at /openapi.json
  • One-command calibration tooling

Extensibility

  • MCP tool registry (MCP + OpenAI + Anthropic tools, unified)
  • Guardrails pipeline for pre/post content filtering
  • Provider adapters: Ollama, OpenAI, Anthropic
  • Plug in new providers by implementing one interface
  • Bidding/auction system for ambiguous routing decisions
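"One interface" plausibly means something like the following — a hypothetical shape for illustration; check the source for the real signature:

```typescript
// Hypothetical provider adapter shape. A new backend plugs in by answering
// two questions: which models do you serve, and how do you chat?
interface ProviderAdapter {
  readonly name: string;
  supports(model: string): boolean;
  chat(req: {
    model: string;
    messages: { role: "system" | "user" | "assistant"; content: string }[];
  }): Promise<{ content: string; inputTokens: number; outputTokens: number }>;
}

// Trivial adapter that echoes the last user message — useful as a test double.
const echoAdapter: ProviderAdapter = {
  name: "echo",
  supports: (model) => model.startsWith("echo-"),
  async chat(req) {
    const last = req.messages[req.messages.length - 1];
    return { content: last.content, inputTokens: 0, outputTokens: 0 };
  },
};
```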

The Competition

Every LLM gateway proxies requests. None of them thinks about what those requests actually need.

                                        liteLLM   OpenRouter   Portkey   Bifrost   FrugalRoute
  OpenAI-compatible drop-in               Yes        Yes         Yes       Yes        Yes
  Routes by capability, not model name                                                Yes
  Local-first (Ollama, Apple Silicon)                                                 Yes
  Semantic intent classification                                                      Yes
  Confidence-based escalation cascade                                                 Yes
  Two-tier semantic cache                                       Simple               Yes
  Learns from traffic, self-improves                                                  Yes
  Distills cloud into local models                                                    Yes
  Hardware auto-profiling                                                             Yes
  Budget enforcement per key/session     Partial               Partial               Yes
  A/B testing across models                                                           Yes
  MCP tool interoperability              Partial                                      Yes
  Self-hosted, no vendor lock-in          Yes                    Yes       Yes        Yes

liteLLM is a great proxy. It connects 100+ providers behind one API. But it doesn't know what your prompt needs — you still pick the model. No local tier, no caching, no learning.

OpenRouter is a managed marketplace. Not self-hosted. Your data leaves your network.

Portkey has solid reliability features — retries, fallbacks, circuit breaking. But it routes by provider weight, not by prompt intent. No local models. No distillation.

Bifrost is fast (11 µs overhead). But it's a load balancer, not a router. It doesn't understand what your request needs.

They move traffic. FrugalRoute makes decisions.


Configuration

# .env
PORT=3100
OLLAMA_BASE_URL=http://localhost:11434
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
EMBEDDING_MODEL=nomic-embed-text
DEFAULT_MAX_COST_PER_REQUEST=0.01
# config/models.yaml
aliases:
  fast: gemma3-4b
  smart: claude-sonnet-4-20250514
  cheap: llama3.2

Full configuration reference: docs/user/configuration.mdx


Documentation

Guide What it covers
Getting Started Install, first request, connect existing clients
Architecture Module map, request flow, design principles
Routing Classification, escalation, bidding, weight adjustment
Caching Two-tier semantic cache, adaptive thresholds
Cost Management Estimation, tracking, budget enforcement
Configuration Env vars, routes, models, budgets, thresholds
Deployment Docker, production hardening, hardware profiling
Tools & MCP Tool registry, MCP integration, format conversion
Distillation Training flywheel, TruthKeeper integrity
API Reference Complete HTTP endpoint reference

FAQ

Do I need Ollama installed?

For local inference, yes. FrugalRoute uses Ollama as its local model backend. Without it, requests route straight to cloud providers — which still gives you caching, cost tracking, and budget enforcement, but you miss the free local tier.

What models does it support?

Any model Ollama can run (Llama, Mistral, Gemma, Phi, Qwen, DeepSeek, etc.), plus OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5) and Anthropic (Claude Opus, Sonnet, Haiku). Adding a new provider is one adapter interface.

Does it work with streaming?

Yes. Full SSE streaming with heartbeat keepalive, compatible with the OpenAI streaming format. Set "stream": true in your request — same as you would with OpenAI directly.

What's the routing overhead?

Keyword classification adds <1ms. Embedding-based classification adds ~4ms. Cache hits return in ~1ms. The routing decision itself is negligible compared to model inference time.

Can I force a specific model?

Yes. Set model to any registered model name (e.g., "gpt-4o", "llama3.2") instead of "auto". FrugalRoute will route directly to that model while still tracking cost and logging the request. You can also use aliases like "fast", "smart", or "cheap".

Is my data sent anywhere?

FrugalRoute is fully self-hosted. Local model requests never leave your machine. Cloud requests go directly to OpenAI/Anthropic — FrugalRoute never proxies through a third-party service. Training pairs for distillation are stored locally in SQLite.

How does distillation actually work?

When a request escalates to a cloud model, the prompt-response pair is captured, quality-scored by a judge agent, and stored locally. Running bun run distill feeds verified pairs into a fine-tuning pipeline for your local models. The TruthKeeper integrity layer ensures only high-quality, non-contradicted data is used. See Distillation docs.

What about function calling / tool use?

Supported. FrugalRoute's MCP tool registry unifies tools across MCP, OpenAI, and Anthropic formats. Tool calls are routed to the correct backend automatically.


Built With

Bun + Hono + Ollama + TypeScript

445 tests. 1,196 assertions. Two production dependencies.


Contributing

git clone https://github.com/SimplyLiz/FrugalRoute && cd FrugalRoute
bun install
bun test           # run all 445 tests
bun run dev        # start dev server with hot reload
bun run lint       # lint with Biome
bun run benchmark  # run hardware benchmarks
bun run calibrate  # calibrate keyword classifier thresholds

PRs welcome. Please run bun run check (lint + tests) before submitting.


License

PolyForm Small Business License 1.0.0: free for individuals, small businesses (<100 people, <1M EUR revenue), nonprofits, and open source projects.

Commercial license for larger organizations: lisa@tastehub.io


Stop overpaying for AI. Start routing.

bunx frugalroute

About

Capability-centric, local-first LLM routing layer
