A multi-provider LLM abstraction layer with automatic failover, graduated circuit breakers, cost tracking, and intelligent retry. Built for Cloudflare Workers but runs anywhere with a standard fetch API. Extracted from a production orchestration platform spanning 80K+ LOC across multiple services.
- Multi-provider failover -- OpenAI, Anthropic, Cloudflare Workers AI, Cerebras, and Groq behind a single interface
- Graduated circuit breaker -- 4-state machine (closed / degraded / recovering / open) with probabilistic traffic routing prevents cascading failures
- Exponential backoff retry -- configurable delays, jitter, and per-error-class behavior
- Cost tracking and optimization -- per-provider cost attribution, budget alerts with CreditLedger, automatic routing to cheaper providers
- Declarative model catalog -- semantic model metadata drives recommendations, provider defaults, and fallback routing
- Rate limit enforcement -- CreditLedger tracks RPM/RPD/TPM/TPD per provider; factory skips providers that exceed limits
- Streaming with fallback -- SSE streaming on all providers; factory-level streaming routes through the same circuit-breaker and fallback chain as non-streaming requests
- Tool/function calling -- OpenAI, Anthropic, Cerebras, and Cloudflare tool use with unified response format
- Tool-use loop helper -- `generateResponseWithTools` owns the request → parse → execute → repeat cycle with iteration caps, cost limits, and abort signal support
- Provider-agnostic cache hints -- `LLMRequest.cache` translates to provider-native caching (Anthropic `cache_control` breakpoints; automatic on OpenAI/Groq/Cerebras); cached token counts normalized into `TokenUsage`
- Schema drift detection -- envelope validation on every provider response; streaming frames validated per-chunk; `SchemaDriftError` routes through the fallback chain and fires the `onSchemaDrift` hook
- Schema canary -- `runCanaryCheck` / `extractShape` / `compareShapes` for comparing live response shapes against committed golden fixtures
- Image generation -- Cloudflare Workers AI (SDXL, FLUX) and Google Gemini
- Health monitoring -- per-provider health checks, metrics, and circuit breaker state
- Structured logging -- injectable `Logger` interface; silent by default, opt-in to console or custom loggers
- Zero runtime dependencies -- no transitive dependency tree to audit
Install from npm:

```bash
npm install @stackbilt/llm-providers
```

```ts
import { LLMProviders, MODELS } from '@stackbilt/llm-providers';

const llm = new LLMProviders({
  openai: { apiKey: process.env.OPENAI_API_KEY },
  anthropic: { apiKey: process.env.ANTHROPIC_API_KEY },
  cloudflare: { ai: env.AI }, // Cloudflare Workers AI binding
  defaultProvider: 'auto',
  costOptimization: true,
  enableCircuitBreaker: true,
});

const response = await llm.generateResponse({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Summarize the circuit breaker pattern.' },
  ],
  maxTokens: 1000,
  temperature: 0.7,
});

console.log(response.message);
console.log(`Provider: ${response.provider}, Cost: $${response.usage.cost}`);
```

Or auto-discover providers from the environment:

```ts
import { LLMProviders } from '@stackbilt/llm-providers';
// Scans env for ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY,
// CEREBRAS_API_KEY, and AI binding -- configures only what's present
const llm = LLMProviders.fromEnv(env, {
  costOptimization: true,
  enableCircuitBreaker: true,
});
```

| Provider | Models | Streaming | Tools | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o Mini, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo | Yes | Yes | Default: gpt-4o-mini |
| Anthropic | Claude Opus 4.6, Sonnet 4.6, Sonnet 4, Haiku 4.5, 3.7 Sonnet, 3.5 Sonnet/Haiku, 3 Opus/Sonnet | Yes | Yes | Default: claude-haiku-4-5-20251001 |
| Cloudflare | Gemma 4 26B, Llama 4 Scout, GPT-OSS 120B, LLaMA 3.x, Mistral 7B, Qwen 1.5, TinyLlama, and more | Yes | GPT-OSS, Gemma 4, Llama 4 Scout | Default is request-aware and catalog-driven |
| Cerebras | LLaMA 3.1 8B, LLaMA 3.3 70B, ZAI-GLM 4.7, Qwen 3 235B | Yes | GLM/Qwen only | ~2,200 tok/s |
| Groq | LLaMA 3.3 70B Versatile, LLaMA 3.1 8B Instant, GPT-OSS 120B | Yes | LLaMA 3.3 70B, GPT-OSS 120B | Ultra-fast inference |
Per-provider configuration shapes:

```ts
// OpenAI
{ apiKey: 'sk-...', organization: 'org-...', project: 'proj-...' }

// Anthropic
{ apiKey: 'sk-ant-...', version: '2023-06-01' }

// Cloudflare Workers AI
{ ai: env.AI, accountId: '...' }

// Cerebras
{ apiKey: 'csk-...' }

// Groq
{ apiKey: 'gsk_...' }
```

The library is silent by default. Opt in to logging by passing a `Logger`:
```ts
import { LLMProviders, consoleLogger } from '@stackbilt/llm-providers';

const llm = new LLMProviders({
  anthropic: { apiKey: '...', logger: consoleLogger },
  logger: consoleLogger, // factory-level logging
});
```

Or implement your own `Logger` interface (`debug`, `info`, `warn`, `error`).
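For example, a minimal custom logger that drops debug/info noise and keeps warnings and errors. This is a sketch: the console-style `(message, ...args)` method signatures are an assumption, so check the exported `Logger` type for the exact shape.

```ts
import { LLMProviders, type Logger } from '@stackbilt/llm-providers';

// Sketch: silence debug/info, forward warn/error with a prefix.
// Assumes console-style (message, ...args) signatures.
const quietLogger: Logger = {
  debug: () => {},
  info: () => {},
  warn: (message, ...args) => console.warn(`[llm] ${message}`, ...args),
  error: (message, ...args) => console.error(`[llm] ${message}`, ...args),
};

const llm = new LLMProviders({
  anthropic: { apiKey: '...' },
  logger: quietLogger,
});
```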
Each provider gets a graduated circuit breaker that routes traffic away from failing providers with probabilistic degradation.
| State | Behavior |
|---|---|
| Closed | 100% traffic to primary. Failures increment counter. |
| Degraded | Traffic splits probabilistically (90% → 70% → 40% → 10%) as failures accumulate. |
| Recovering | Success steps traffic back up one level at a time. |
| Open | 0% traffic. After resetTimeout ms, failures decay and traffic resumes. |
Default: 5-step degradation curve [1.0, 0.9, 0.7, 0.4, 0.1], 60s reset timeout, 5-minute monitoring window.
```ts
import { CircuitBreakerManager } from '@stackbilt/llm-providers';

const manager = new CircuitBreakerManager({
  failureThreshold: 5,
  resetTimeout: 60000,
  monitoringPeriod: 300000,
  degradationCurve: [1.0, 0.9, 0.7, 0.4, 0.1],
});

const breaker = manager.getBreaker('openai');
console.log(breaker.getHealth());
```

Track monthly spend and enforce per-provider rate limits with `CreditLedger`:

```ts
import { CreditLedger, LLMProviders } from '@stackbilt/llm-providers';
const ledger = new CreditLedger({
  budgets: [
    { provider: 'openai', monthlyBudget: 50, rateLimits: { rpm: 60, rpd: 10000 } },
    { provider: 'anthropic', monthlyBudget: 100 },
  ],
});

// Threshold alerts fire at 80%, 90%, 95% utilization
ledger.on((event) => {
  if (event.type === 'threshold_crossed') {
    console.warn(`${event.provider}: ${event.tier} — ${event.utilizationPct.toFixed(0)}% of budget`);
  }
});

const llm = new LLMProviders({
  openai: { apiKey: '...' },
  anthropic: { apiKey: '...' },
  costOptimization: true,
  ledger, // Factory enforces rate limits and tracks spend
});
```

Model selection is driven by a declarative catalog rather than a hardcoded fallback array. The selector intersects:
- requested use case and capabilities
- configured providers
- circuit breaker state (`CLOSED`, `DEGRADED`, `RECOVERING`, `OPEN`)
- CreditLedger utilization and projected burn/depletion pressure
The catalog also distinguishes active, compatibility, and retired models. Retired IDs can remain exported for compatibility, but they are not recommendation targets.
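As a hypothetical sketch, you could audit which catalog entries are still recommendation targets. The record shape and the `lifecycle` values here are assumptions based on the `ModelCatalogEntry` description; verify against the exported types.

```ts
import { MODEL_CATALOG } from '@stackbilt/llm-providers';

// Assumes MODEL_CATALOG is keyed by model ID, with entries carrying a
// lifecycle field of 'active' | 'compatibility' | 'retired'.
const activeIds = Object.entries(MODEL_CATALOG)
  .filter(([, entry]) => entry.lifecycle === 'active')
  .map(([id]) => id);

console.log(activeIds);
```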
```ts
import {
  MODEL_CATALOG,
  MODEL_RECOMMENDATIONS,
  getRecommendedModel,
  inferUseCaseFromRequest,
} from '@stackbilt/llm-providers';

const useCase = inferUseCaseFromRequest({
  messages: [{ role: 'user', content: 'Call the weather tool' }],
  tools: [{
    type: 'function',
    function: {
      name: 'get_weather',
      description: 'Get weather',
      parameters: { type: 'object' },
    },
  }],
});

const model = getRecommendedModel('TOOL_CALLING', ['cloudflare', 'openai']);
```

For runtime-aware recommendations from a configured instance:
```ts
const recommended = llm.getRecommendedModel({
  messages: [{ role: 'user', content: 'Summarize this incident' }],
  maxTokens: 800,
});
```

Customize when and how the factory falls back between providers:
```ts
const llm = new LLMProviders({
  openai: { apiKey: '...' },
  anthropic: { apiKey: '...' },
  cloudflare: { ai: env.AI },
  cerebras: { apiKey: '...' },
  fallbackRules: [
    { condition: 'rate_limit', fallbackProvider: 'cloudflare' },
    { condition: 'cost', threshold: 10, fallbackProvider: 'cloudflare' },
    { condition: 'error', fallbackProvider: 'anthropic' },
  ],
});
```

Default provider precedence remains Cloudflare → Cerebras → Groq → Anthropic → OpenAI, but actual dispatch is catalog-driven and can be reordered at runtime by request fit, circuit-breaker state, and ledger burn-rate pressure.
Structured error classes for each failure mode:
```ts
import {
  RateLimitError,
  QuotaExceededError,
  AuthenticationError,
  CircuitBreakerOpenError,
  TimeoutError,
} from '@stackbilt/llm-providers';

try {
  await llm.generateResponse(request);
} catch (error) {
  if (error instanceof RateLimitError) {
    // Automatic retry already attempted; consider switching providers
  } else if (error instanceof CircuitBreakerOpenError) {
    // Provider is temporarily disabled
  } else if (error instanceof AuthenticationError) {
    // Check API key -- will NOT trigger fallback
  }
}
```

Model IDs are exported as constants:

```ts
import { MODELS, getRecommendedModel } from '@stackbilt/llm-providers';
// Current-gen models
MODELS.CLAUDE_OPUS_4_6; // 'claude-opus-4-6-20250618'
MODELS.CLAUDE_SONNET_4_6; // 'claude-sonnet-4-6-20250618'
MODELS.CLAUDE_HAIKU_4_5; // 'claude-haiku-4-5-20251001'
MODELS.GPT_4O; // 'gpt-4o' (deprecated / compatibility only)
MODELS.GPT_4O_MINI; // 'gpt-4o-mini'
MODELS.CEREBRAS_ZAI_GLM_4_7; // 'zai-glm-4.7'
// Get best active model for a use case given available providers
const model = getRecommendedModel('COST_EFFECTIVE', ['openai', 'cloudflare']);
```

`generateResponseStream` uses the same provider-selection, circuit-breaker, and exhaustion-registry path as `generateResponse`. Pre-stream HTTP errors (401, 429, 503, circuit open) fall over to the next provider before emitting the first chunk.
```ts
const stream = await llm.generateResponseStream({
  messages: [{ role: 'user', content: 'Tell me a story.' }],
  model: 'claude-haiku-4-5-20251001',
});

const reader = stream.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  process.stdout.write(value); // string chunk
}
```

`generateResponseWithTools` owns the `generateResponse` → parse → execute → append → repeat cycle. It enforces iteration caps, cumulative cost limits, and abort-signal cancellation -- no boilerplate needed on the caller side.
```ts
import { LLMProviders, ToolLoopLimitError } from '@stackbilt/llm-providers';

const result = await llm.generateResponseWithTools(
  {
    messages: [{ role: 'user', content: 'What is 2 + 2 * 3?' }],
    tools: [{
      type: 'function',
      function: {
        name: 'calculate',
        description: 'Evaluate a math expression',
        parameters: { type: 'object', properties: { expr: { type: 'string' } }, required: ['expr'] },
      },
    }],
  },
  {
    execute: async (name, args) => {
      // Demo only -- never eval model-supplied input in production code
      if (name === 'calculate') return eval((args as { expr: string }).expr);
      throw new Error(`Unknown tool: ${name}`);
    },
  },
  { maxIterations: 5, maxCostUSD: 0.10 }
);

console.log(result.message); // final assistant response after tool execution
```

Pass a provider-agnostic cache hint on any request. The library translates it to the appropriate provider-native mechanism.
```ts
const response = await llm.generateResponse({
  messages: [{ role: 'user', content: 'Summarize the context.' }],
  systemPrompt: 'You are an expert at analyzing long documents. [... 10KB of stable context ...]',
  model: 'claude-haiku-4-5-20251001',
  cache: {
    strategy: 'provider-prefix', // mark the stable prefix for caching
    cacheablePrefix: 'auto',     // cache system prompt + tools (default)
  },
});

// Cached token counts are normalized in TokenUsage
console.log(response.usage.cacheReadInputTokens); // Anthropic cache hit tokens
console.log(response.usage.cachedInputTokens);    // OpenAI / Groq / Cerebras cache hit tokens
```

| Strategy | Behavior |
|---|---|
| `'off'` | No caching hints sent |
| `'provider-prefix'` | Mark stable prefix for provider-side caching |
| `'response'` | Enable AI Gateway response caching (via `GatewayMetadata`) |
| `'both'` | Both prefix and response caching |
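As a sketch of the other strategies: the field names below (`ttl`, `key`, `sessionId`) come from the `CacheHints` type, but their exact semantics -- including whether `ttl` is in seconds -- are assumptions to verify against your AI Gateway configuration.

```ts
// Hypothetical: cache the full response at the gateway, keyed per session,
// in addition to provider-side prefix caching.
const cached = await llm.generateResponse({
  messages: [{ role: 'user', content: 'List our pricing tiers.' }],
  cache: {
    strategy: 'both',
    ttl: 3600,               // assumed to be seconds
    key: 'pricing-tiers-v1', // hypothetical cache key
    sessionId: 'user-42',
  },
});
```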
Use the canary utilities to compare a live provider response against a committed golden fixture and detect API shape drift before it reaches production.
```ts
import {
  extractShape, compareShapes, runCanaryCheck,
} from '@stackbilt/llm-providers';

// 1. Load your committed golden fixture (flat path → type map)
import goldenShape from './fixtures/openai.json';

// 2. Fetch a raw response from the provider (your responsibility)
const liveResponse = await fetch('https://api.openai.com/v1/chat/completions', ...).then(r => r.json());

// 3. Check for drift
const report = runCanaryCheck('openai', goldenShape, liveResponse);

if (report.status === 'drift') {
  console.error('OpenAI response shape changed!', report.diff);
  // diff.added   -- new fields (additive, usually safe)
  // diff.removed -- missing fields (breaking, alert immediately)
  // diff.changed -- type-changed fields (breaking, alert immediately)
}
```

Generate your initial golden fixture from a known-good response:
```ts
import { extractShape } from '@stackbilt/llm-providers';
import fs from 'fs';

const shape = extractShape(knownGoodResponse);
fs.writeFileSync('fixtures/openai.json', JSON.stringify(shape, null, 2));
```

The package exports the following classes, utilities, and types.

| Class | Description |
|---|---|
| `LLMProviders` | High-level facade -- initialize providers, generate responses, check health |
| `LLMProviderFactory` | Lower-level factory with provider chain building, catalog-based routing, and fallback logic |
| `OpenAIProvider` | OpenAI GPT models (streaming, tools) |
| `AnthropicProvider` | Anthropic Claude models (streaming, tools) |
| `CloudflareProvider` | Cloudflare Workers AI (streaming, tools on GPT-OSS/Gemma 4/Llama 4, batch) |
| `CerebrasProvider` | Cerebras fast inference (streaming, tools on GLM/Qwen) |
| `GroqProvider` | Groq fast inference (streaming, tools on GPT-OSS/LLaMA 3.3 70B) |
| `BaseProvider` | Abstract base with shared resiliency, metrics, and cost calculation |
| Class / Export | Description |
|---|---|
| `CircuitBreaker` | Graduated 4-state circuit breaker with probabilistic degradation |
| `CircuitBreakerManager` | Manages circuit breakers across multiple providers |
| `RetryManager` | Exponential backoff retry with jitter |
| `CostTracker` | Per-provider cost accumulation and budget alerts |
| `CreditLedger` | Monthly budgets, rate limits, burn rate projection, threshold events |
| `CostOptimizer` | Static methods for optimal provider selection |
| `MODEL_CATALOG` | Declarative model metadata for routing and recommendation |
| `ImageProvider` | Multi-provider image generation (Cloudflare SDXL/FLUX, Google Gemini) |
| `extractShape` | Walk a raw API response into a flat path → type shape map |
| `compareShapes` | Diff two shape maps into `{ added, removed, changed }` |
| `runCanaryCheck` | One-shot canary: extract live shape, compare against golden, return `CanaryReport` |
| `validateSchema` | Low-level envelope validator (for custom provider authors) |
| Export | Description |
|---|---|
| `Logger` | Interface: `debug`, `info`, `warn`, `error` methods |
| `noopLogger` | Silent logger (default) |
| `consoleLogger` | Forwards to `console.*` (opt-in) |
| Type | Description |
|---|---|
| `LLMRequest` | Unified request: messages, model, temperature, tools, response_format, cache, lora |
| `LLMResponse` | Unified response: message, usage (with cost), provider, tool calls |
| `TokenUsage` | Token counts, cost, and cached token fields (`cachedInputTokens`, `cacheReadInputTokens`, `cacheCreationInputTokens`) |
| `CacheHints` | Cache strategy, key, ttl, sessionId, cacheablePrefix for provider-agnostic prompt caching |
| `ToolExecutor` | Interface for `generateResponseWithTools`: `execute(name, args) => Promise<unknown>` |
| `ToolLoopOptions` | Loop config: maxIterations, maxCostUSD, onIteration, abortSignal |
| `CanaryReport` | Schema canary result: provider, status (`'ok'` \| `'drift'`), diff |
| `ShapeMap` | Flat path → JSON-type map produced by `extractShape` |
| `ProviderFactoryConfig` | Factory config: provider configs, fallback rules, ledger, logger |
| `CostAnalytics` | Cost breakdown, total, and recommendations |
| `ProviderHealthEntry` | Health status, metrics, circuit breaker state, capabilities |
| `ModelCatalogEntry` | Declarative model metadata: provider, lifecycle, capabilities, use cases |
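For example, bounding a tool loop with both an iteration cap and an abort signal. This is a sketch reusing the `request` and `executor` from the tool-loop example above; the five-second budget is arbitrary.

```ts
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 5_000); // arbitrary 5s budget

try {
  const result = await llm.generateResponseWithTools(request, executor, {
    maxIterations: 3,
    abortSignal: controller.signal,
  });
  console.log(result.message);
} finally {
  clearTimeout(timer); // cancel the timer once the loop settles
}
```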
| Function | Description |
|---|---|
| `createLLMProviders(config)` | Create an `LLMProviders` instance |
| `createCostOptimizedLLMProviders(config)` | Create with cost optimization, circuit breakers, and retries enabled |
| `LLMProviders.fromEnv(env)` | Auto-discover providers from environment variables |
| `llm.generateResponse(request)` | Generate a response with provider selection and fallback |
| `llm.generateResponseStream(request)` | Streaming generation; fallback chain active before first chunk |
| `llm.generateResponseWithTools(request, executor, opts?)` | Managed tool-use loop with caps and abort-signal support |
| `llm.getRecommendedModel(request, useCase?)` | Runtime recommendation using configured providers, health, and ledger state |
| `getRecommendedModel(useCase, providers, context?)` | Pick the best active model for a use case |
| `runCanaryCheck(provider, golden, liveResponse)` | Compare live response shape against golden fixture |
| `extractShape(obj)` | Extract flat path → type map from any object |
| `retry(fn, config)` | One-shot retry wrapper for any async function |
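The `retry` helper can wrap any async function, not just LLM calls. A minimal sketch follows; the config field names (`maxRetries`, `baseDelay`, `jitter`) are assumptions, so check the exported retry config type for the exact names.

```ts
import { retry } from '@stackbilt/llm-providers';

// Hypothetical config shape -- verify field names against the library's types.
const data = await retry(
  () => fetch('https://api.example.com/flaky').then((r) => {
    if (!r.ok) throw new Error(`HTTP ${r.status}`); // surface HTTP failures so they are retried
    return r.json();
  }),
  { maxRetries: 3, baseDelay: 250, jitter: true },
);
```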
Apache-2.0
Built by Stackbilt.