
@stackbilt/llm-providers

A multi-provider LLM abstraction layer with automatic failover, graduated circuit breakers, cost tracking, and intelligent retry. Built for Cloudflare Workers but runs anywhere with a standard fetch API. Extracted from a production orchestration platform spanning 80K+ lines of code across multiple services.

Features

  • Multi-provider failover -- OpenAI, Anthropic, Cloudflare Workers AI, Cerebras, and Groq behind a single interface
  • Graduated circuit breaker -- 4-state machine (closed / degraded / recovering / open) with probabilistic traffic routing prevents cascading failures
  • Exponential backoff retry -- configurable delays, jitter, and per-error-class behavior
  • Cost tracking and optimization -- per-provider cost attribution, budget alerts with CreditLedger, automatic routing to cheaper providers
  • Declarative model catalog -- semantic model metadata drives recommendations, provider defaults, and fallback routing
  • Rate limit enforcement -- CreditLedger tracks RPM/RPD/TPM/TPD per provider; factory skips providers that exceed limits
  • Streaming with fallback -- SSE streaming on all providers; factory-level streaming routes through the same circuit-breaker and fallback chain as non-streaming requests
  • Tool/function calling -- OpenAI, Anthropic, Cerebras, and Cloudflare tool use with unified response format
  • Tool-use loop helper -- generateResponseWithTools owns the request → parse → execute → repeat cycle with iteration caps, cost limits, and abort signal support
  • Provider-agnostic cache hints -- LLMRequest.cache translates to provider-native caching (Anthropic cache_control breakpoints; automatic on OpenAI/Groq/Cerebras); cached token counts normalized into TokenUsage
  • Schema drift detection -- envelope validation on every provider response; streaming frames validated per-chunk; SchemaDriftError routes through fallback chain and fires onSchemaDrift hook
  • Schema canary -- runCanaryCheck / extractShape / compareShapes for comparing live response shapes against committed golden fixtures
  • Image generation -- Cloudflare Workers AI (SDXL, FLUX) and Google Gemini
  • Health monitoring -- per-provider health checks, metrics, and circuit breaker state
  • Structured logging -- injectable Logger interface; silent by default, opt-in to console or custom loggers
  • Zero runtime dependencies -- no transitive dependency tree to audit

Installation

npm install @stackbilt/llm-providers

Quick Start

import { LLMProviders, MODELS } from '@stackbilt/llm-providers';

const llm = new LLMProviders({
  openai: { apiKey: process.env.OPENAI_API_KEY },
  anthropic: { apiKey: process.env.ANTHROPIC_API_KEY },
  cloudflare: { ai: env.AI }, // Cloudflare Workers AI binding
  defaultProvider: 'auto',
  costOptimization: true,
  enableCircuitBreaker: true,
});

const response = await llm.generateResponse({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Summarize the circuit breaker pattern.' },
  ],
  maxTokens: 1000,
  temperature: 0.7,
});

console.log(response.message);
console.log(`Provider: ${response.provider}, Cost: $${response.usage.cost}`);

Auto-Discovery from Environment

import { LLMProviders } from '@stackbilt/llm-providers';

// Scans env for ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY,
// CEREBRAS_API_KEY, and AI binding — configures only what's present
const llm = LLMProviders.fromEnv(env, {
  costOptimization: true,
  enableCircuitBreaker: true,
});
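
In a Worker, fromEnv pairs naturally with the fetch handler's env parameter. A minimal sketch; the Env interface below is ordinary Workers boilerplate describing your own bindings, not part of this library:

import { LLMProviders } from '@stackbilt/llm-providers';

// The binding names here are assumptions about your wrangler config.
interface Env {
  AI: unknown;                 // Workers AI binding
  ANTHROPIC_API_KEY?: string;
  OPENAI_API_KEY?: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const llm = LLMProviders.fromEnv(env, { costOptimization: true });
    const result = await llm.generateResponse({
      messages: [{ role: 'user', content: 'Hello' }],
    });
    return Response.json({ message: result.message });
  },
};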

Providers

| Provider | Models | Streaming | Tools | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o Mini, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo | Yes | Yes | Default: gpt-4o-mini |
| Anthropic | Claude Opus 4.6, Sonnet 4.6, Sonnet 4, Haiku 4.5, 3.7 Sonnet, 3.5 Sonnet/Haiku, 3 Opus/Sonnet | Yes | Yes | Default: claude-haiku-4-5-20251001 |
| Cloudflare | Gemma 4 26B, Llama 4 Scout, GPT-OSS 120B, LLaMA 3.x, Mistral 7B, Qwen 1.5, TinyLlama, and more | Yes | GPT-OSS, Gemma 4, Llama 4 Scout | Default is request-aware and catalog-driven |
| Cerebras | LLaMA 3.1 8B, LLaMA 3.3 70B, ZAI-GLM 4.7, Qwen 3 235B | Yes | GLM/Qwen only | ~2,200 tok/s |
| Groq | LLaMA 3.3 70B Versatile, LLaMA 3.1 8B Instant, GPT-OSS 120B | Yes | LLaMA 3.3 70B, GPT-OSS 120B | Ultra-fast inference |

Provider Configuration

// OpenAI
{ apiKey: 'sk-...', organization: 'org-...', project: 'proj-...' }

// Anthropic
{ apiKey: 'sk-ant-...', version: '2023-06-01' }

// Cloudflare Workers AI
{ ai: env.AI, accountId: '...' }

// Cerebras
{ apiKey: 'csk-...' }

// Groq
{ apiKey: 'gsk_...' }

Logging

The library is silent by default. Opt in to logging by passing a Logger:

import { LLMProviders, consoleLogger } from '@stackbilt/llm-providers';

const llm = new LLMProviders({
  anthropic: { apiKey: '...', logger: consoleLogger },
  logger: consoleLogger, // factory-level logging
});

Or implement your own Logger interface (debug, info, warn, error).
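
A minimal sketch of a custom Logger that tags every entry with a service prefix. Only the four documented methods are implemented; the (message, ...args) signature is an assumption, so match it to the exported Logger type:

import type { Logger } from '@stackbilt/llm-providers';

// Hypothetical custom logger -- forwards to console with a prefix.
const taggedLogger: Logger = {
  debug: (message, ...args) => console.debug('[llm]', message, ...args),
  info: (message, ...args) => console.info('[llm]', message, ...args),
  warn: (message, ...args) => console.warn('[llm]', message, ...args),
  error: (message, ...args) => console.error('[llm]', message, ...args),
};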

Circuit Breaker

Each provider gets a graduated circuit breaker that routes traffic away from failing providers with probabilistic degradation.

| State | Behavior |
|---|---|
| Closed | 100% traffic to primary. Failures increment a counter. |
| Degraded | Traffic splits probabilistically (90% → 70% → 40% → 10%) as failures accumulate. |
| Recovering | Each success steps traffic back up one level at a time. |
| Open | 0% traffic. After resetTimeout ms, failures decay and traffic resumes. |

Default: 5-step degradation curve [1.0, 0.9, 0.7, 0.4, 0.1], 60s reset timeout, 5-minute monitoring window.

import { CircuitBreakerManager } from '@stackbilt/llm-providers';

const manager = new CircuitBreakerManager({
  failureThreshold: 5,
  resetTimeout: 60000,
  monitoringPeriod: 300000,
  degradationCurve: [1.0, 0.9, 0.7, 0.4, 0.1],
});

const breaker = manager.getBreaker('openai');
console.log(breaker.getHealth());

Cost Tracking & Budget Management

import { CreditLedger, LLMProviders } from '@stackbilt/llm-providers';

const ledger = new CreditLedger({
  budgets: [
    { provider: 'openai', monthlyBudget: 50, rateLimits: { rpm: 60, rpd: 10000 } },
    { provider: 'anthropic', monthlyBudget: 100 },
  ],
});

// Threshold alerts fire at 80%, 90%, 95% utilization
ledger.on((event) => {
  if (event.type === 'threshold_crossed') {
    console.warn(`${event.provider} crossed ${event.tier} threshold: ${event.utilizationPct.toFixed(0)}% of budget`);
  }
});

const llm = new LLMProviders({
  openai: { apiKey: '...' },
  anthropic: { apiKey: '...' },
  costOptimization: true,
  ledger, // Factory enforces rate limits and tracks spend
});

Model Catalog & Runtime Selection

Model selection is driven by a declarative catalog rather than a hardcoded fallback array. The selector intersects:

  • requested use case and capabilities
  • configured providers
  • circuit breaker state (CLOSED, DEGRADED, RECOVERING, OPEN)
  • CreditLedger utilization and projected burn/depletion pressure

The catalog also distinguishes active, compatibility, and retired models. Retired IDs can remain exported for compatibility, but they are not recommendation targets.

import {
  MODEL_CATALOG,
  MODEL_RECOMMENDATIONS,
  getRecommendedModel,
  inferUseCaseFromRequest
} from '@stackbilt/llm-providers';

const useCase = inferUseCaseFromRequest({
  messages: [{ role: 'user', content: 'Call the weather tool' }],
  tools: [{
    type: 'function',
    function: {
      name: 'get_weather',
      description: 'Get weather',
      parameters: { type: 'object' }
    }
  }]
});

const model = getRecommendedModel('TOOL_CALLING', ['cloudflare', 'openai']);

For runtime-aware recommendations from a configured instance:

const recommended = llm.getRecommendedModel({
  messages: [{ role: 'user', content: 'Summarize this incident' }],
  maxTokens: 800
});

Fallback Rules

Customize when and how the factory falls back between providers:

const llm = new LLMProviders({
  openai: { apiKey: '...' },
  anthropic: { apiKey: '...' },
  cloudflare: { ai: env.AI },
  cerebras: { apiKey: '...' },
  fallbackRules: [
    { condition: 'rate_limit', fallbackProvider: 'cloudflare' },
    { condition: 'cost', threshold: 10, fallbackProvider: 'cloudflare' },
    { condition: 'error', fallbackProvider: 'anthropic' },
  ],
});

Default provider precedence remains Cloudflare → Cerebras → Groq → Anthropic → OpenAI, but actual dispatch is catalog-driven and can be reordered at runtime by request fit, circuit-breaker state, and ledger burn-rate pressure.

Error Handling

Structured error classes for each failure mode:

import {
  RateLimitError,
  QuotaExceededError,
  AuthenticationError,
  CircuitBreakerOpenError,
  TimeoutError,
} from '@stackbilt/llm-providers';

try {
  await llm.generateResponse(request);
} catch (error) {
  if (error instanceof RateLimitError) {
    // Automatic retry already attempted; consider switching providers
  } else if (error instanceof CircuitBreakerOpenError) {
    // Provider is temporarily disabled
  } else if (error instanceof AuthenticationError) {
    // Check API key -- will NOT trigger fallback
  }
}

Note that AuthenticationError deliberately does not trigger the fallback chain: a bad key is a configuration problem, and silently switching providers would mask it.
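
The same backoff machinery is also available as a standalone wrapper via the exported retry helper (see Factory Functions below). A minimal sketch; the config field names here (maxRetries, baseDelay) are assumptions, so check the exported retry config type for the actual names:

import { retry } from '@stackbilt/llm-providers';

// One-shot exponential-backoff wrapper around any async function.
const response = await retry(
  () => llm.generateResponse(request),
  { maxRetries: 3, baseDelay: 500 }
);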

Model Constants

import { MODELS, getRecommendedModel } from '@stackbilt/llm-providers';

// Current-gen models
MODELS.CLAUDE_OPUS_4_6;         // 'claude-opus-4-6-20250618'
MODELS.CLAUDE_SONNET_4_6;       // 'claude-sonnet-4-6-20250618'
MODELS.CLAUDE_HAIKU_4_5;        // 'claude-haiku-4-5-20251001'
MODELS.GPT_4O;                  // 'gpt-4o' (deprecated / compatibility only)
MODELS.GPT_4O_MINI;             // 'gpt-4o-mini'
MODELS.CEREBRAS_ZAI_GLM_4_7;    // 'zai-glm-4.7'

// Get best active model for a use case given available providers
const model = getRecommendedModel('COST_EFFECTIVE', ['openai', 'cloudflare']);

Factory-Level Streaming

generateResponseStream uses the same provider-selection, circuit-breaker, and exhaustion-registry path as generateResponse. Pre-stream HTTP errors (401, 429, 503, circuit open) fail over to the next provider before the first chunk is emitted.

const stream = await llm.generateResponseStream({
  messages: [{ role: 'user', content: 'Tell me a story.' }],
  model: 'claude-haiku-4-5-20251001',
});

const reader = stream.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  process.stdout.write(value); // string chunk
}

Tool-Use Loop

generateResponseWithTools owns the generateResponse → parse → execute → append → repeat cycle. It enforces iteration caps, cumulative cost limits, and abort-signal cancellation — no boilerplate needed on the caller side.

import { LLMProviders, ToolLoopLimitError } from '@stackbilt/llm-providers';

const result = await llm.generateResponseWithTools(
  {
    messages: [{ role: 'user', content: 'What is 2 + 2 * 3?' }],
    tools: [{
      type: 'function',
      function: {
        name: 'calculate',
        description: 'Evaluate a math expression',
        parameters: { type: 'object', properties: { expr: { type: 'string' } }, required: ['expr'] }
      }
    }],
  },
  {
    execute: async (name, args) => {
      if (name === 'calculate') return eval((args as { expr: string }).expr); // demo only -- never eval untrusted input
      throw new Error(`Unknown tool: ${name}`);
    }
  },
  { maxIterations: 5, maxCostUSD: 0.10 }
);

console.log(result.message); // final assistant response after tool execution
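
When the loop hits maxIterations or maxCostUSD before the model produces a final answer, the ToolLoopLimitError imported above is the error to catch. A sketch, assuming it carries a standard Error message:

try {
  // request and executor as in the example above
  await llm.generateResponseWithTools(request, executor, { maxIterations: 5 });
} catch (error) {
  if (error instanceof ToolLoopLimitError) {
    // Iteration or cost cap reached before a final answer; degrade gracefully.
    console.warn('Tool loop terminated early:', error.message);
  } else {
    throw error;
  }
}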

Prompt Cache Hints

Pass a provider-agnostic cache hint on any request. The library translates it to the appropriate provider-native mechanism.

const response = await llm.generateResponse({
  messages: [{ role: 'user', content: 'Summarize the context.' }],
  systemPrompt: 'You are an expert at analyzing long documents. [... 10KB of stable context ...]',
  model: 'claude-haiku-4-5-20251001',
  cache: {
    strategy: 'provider-prefix',   // mark the stable prefix for caching
    cacheablePrefix: 'auto',       // cache system prompt + tools (default)
  },
});

// Cached token counts are normalized in TokenUsage
console.log(response.usage.cacheReadInputTokens);    // Anthropic cache hit tokens
console.log(response.usage.cachedInputTokens);       // OpenAI / Groq / Cerebras cache hit tokens

| Strategy | Behavior |
|---|---|
| 'off' | No caching hints sent |
| 'provider-prefix' | Mark the stable prefix for provider-side caching |
| 'response' | Enable AI Gateway response caching (via GatewayMetadata) |
| 'both' | Both prefix and response caching |

Schema Drift Canary

Use the canary utilities to compare a live provider response against a committed golden fixture and detect API shape drift before it reaches production.

import {
  extractShape, compareShapes, runCanaryCheck
} from '@stackbilt/llm-providers';

// 1. Load your committed golden fixture (flat path → type map)
import goldenShape from './fixtures/openai.json';

// 2. Fetch a raw response from the provider (your responsibility)
const liveResponse = await fetch('https://api.openai.com/v1/chat/completions', ...).then(r => r.json());

// 3. Check for drift
const report = runCanaryCheck('openai', goldenShape, liveResponse);

if (report.status === 'drift') {
  console.error('OpenAI response shape changed!', report.diff);
  // diff.added   — new fields (additive, usually safe)
  // diff.removed — missing fields (breaking, alert immediately)
  // diff.changed — type-changed fields (breaking, alert immediately)
}

Generate your initial golden fixture from a known-good response:

import { extractShape } from '@stackbilt/llm-providers';
import fs from 'fs';

const shape = extractShape(knownGoodResponse);
fs.writeFileSync('fixtures/openai.json', JSON.stringify(shape, null, 2));

API Reference

Core Classes

| Class | Description |
|---|---|
| LLMProviders | High-level facade -- initialize providers, generate responses, check health |
| LLMProviderFactory | Lower-level factory with provider chain building, catalog-based routing, and fallback logic |
| OpenAIProvider | OpenAI GPT models (streaming, tools) |
| AnthropicProvider | Anthropic Claude models (streaming, tools) |
| CloudflareProvider | Cloudflare Workers AI (streaming, tools on GPT-OSS/Gemma 4/Llama 4, batch) |
| CerebrasProvider | Cerebras fast inference (streaming, tools on GLM/Qwen) |
| GroqProvider | Groq fast inference (streaming, tools on GPT-OSS/LLaMA 3.3 70B) |
| BaseProvider | Abstract base with shared resiliency, metrics, and cost calculation |

Utilities

| Class / Export | Description |
|---|---|
| CircuitBreaker | Graduated 4-state circuit breaker with probabilistic degradation |
| CircuitBreakerManager | Manages circuit breakers across multiple providers |
| RetryManager | Exponential backoff retry with jitter |
| CostTracker | Per-provider cost accumulation and budget alerts |
| CreditLedger | Monthly budgets, rate limits, burn-rate projection, threshold events |
| CostOptimizer | Static methods for optimal provider selection |
| MODEL_CATALOG | Declarative model metadata for routing and recommendation |
| ImageProvider | Multi-provider image generation (Cloudflare SDXL/FLUX, Google Gemini) |
| extractShape | Walk a raw API response into a flat path → type shape map |
| compareShapes | Diff two shape maps into { added, removed, changed } |
| runCanaryCheck | One-shot canary: extract live shape, compare against golden, return CanaryReport |
| validateSchema | Low-level envelope validator (for custom provider authors) |

Logger

| Export | Description |
|---|---|
| Logger | Interface: debug, info, warn, error methods |
| noopLogger | Silent logger (default) |
| consoleLogger | Forwards to console.* (opt-in) |

Key Types

| Type | Description |
|---|---|
| LLMRequest | Unified request: messages, model, temperature, tools, response_format, cache, lora |
| LLMResponse | Unified response: message, usage (with cost), provider, tool calls |
| TokenUsage | Token counts, cost, and cached token fields (cachedInputTokens, cacheReadInputTokens, cacheCreationInputTokens) |
| CacheHints | Cache strategy, key, ttl, sessionId, cacheablePrefix for provider-agnostic prompt caching |
| ToolExecutor | Interface for generateResponseWithTools: execute(name, args) => Promise<unknown> |
| ToolLoopOptions | Loop config: maxIterations, maxCostUSD, onIteration, abortSignal |
| CanaryReport | Schema canary result: provider, status ('ok' or 'drift'), and diff |
| ShapeMap | Flat path → JSON-type map produced by extractShape |
| ProviderFactoryConfig | Factory config: provider configs, fallback rules, ledger, logger |
| CostAnalytics | Cost breakdown, total, and recommendations |
| ProviderHealthEntry | Health status, metrics, circuit breaker state, capabilities |
| ModelCatalogEntry | Declarative model metadata: provider, lifecycle, capabilities, use cases |

Factory Functions

| Function | Description |
|---|---|
| createLLMProviders(config) | Create an LLMProviders instance |
| createCostOptimizedLLMProviders(config) | Create with cost optimization, circuit breakers, and retries enabled |
| LLMProviders.fromEnv(env, config?) | Auto-discover providers from environment variables |
| llm.generateResponse(request) | Generate a response with provider selection and fallback |
| llm.generateResponseStream(request) | Streaming generation; fallback chain active before first chunk |
| llm.generateResponseWithTools(request, executor, opts?) | Managed tool-use loop with caps and abort-signal support |
| llm.getRecommendedModel(request, useCase?) | Runtime recommendation using configured providers, health, and ledger state |
| getRecommendedModel(useCase, providers, context?) | Pick the best active model for a use case |
| runCanaryCheck(provider, golden, liveResponse) | Compare live response shape against golden fixture |
| extractShape(obj) | Extract flat path → type map from any object |
| retry(fn, config) | One-shot retry wrapper for any async function |
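
A quick sketch of the cost-optimized convenience constructor, assuming it accepts the same provider config map as new LLMProviders(...):

import { createCostOptimizedLLMProviders } from '@stackbilt/llm-providers';

// Per the table above, this enables cost optimization, circuit breakers,
// and retries in one call. The config shape mirroring the LLMProviders
// constructor is an assumption.
const llm = createCostOptimizedLLMProviders({
  openai: { apiKey: process.env.OPENAI_API_KEY },
  anthropic: { apiKey: process.env.ANTHROPIC_API_KEY },
});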

License

Apache-2.0


Built by Stackbilt.
