OpenAI-compatible gateway for caching and cost control.
AI Cost Firewall is a lightweight OpenAI-compatible API gateway that reduces LLM API costs and latency by caching responses using exact matching and semantic similarity.
It sits between applications and LLM providers and forwards only necessary requests to the upstream API.
The project is developed and supported by the creators of VCAL Server.
LLM APIs are expensive and often receive repeated or semantically similar prompts.
Without caching, every request results in:
- unnecessary API calls
- increased token usage
- higher costs
- additional latency
AI Cost Firewall solves this by introducing a two-layer cache:
- Exact cache (Redis) -- instant responses for identical prompts
- Semantic cache (Qdrant) -- reuse answers for similar prompts
Only cache misses are forwarded to the upstream LLM provider.
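The lookup order can be sketched as follows. This is an illustration of the two-layer idea, not the firewall's actual code: an in-memory dict and a cosine-similarity scan stand in for Redis and Qdrant, and the 0.92 threshold mirrors the `semantic_similarity_threshold` default shown later in this guide.

```python
import math

SIMILARITY_THRESHOLD = 0.92  # mirrors semantic_similarity_threshold

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(prompt, embedding, exact_cache, semantic_cache):
    """Two-layer lookup: exact match first, then nearest embedding."""
    # Layer 1: exact cache (Redis in the real firewall)
    if prompt in exact_cache:
        return exact_cache[prompt], "exact_hit"
    # Layer 2: semantic cache (Qdrant in the real firewall)
    best_score, best_answer = 0.0, None
    for cached_vec, answer in semantic_cache:
        score = cosine(embedding, cached_vec)
        if score > best_score:
            best_score, best_answer = score, answer
    if best_answer is not None and best_score >= SIMILARITY_THRESHOLD:
        return best_answer, "semantic_hit"
    # Miss: only now would the request be forwarded upstream
    return None, "miss"
```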
The firewall behaves much like an "nginx for LLM APIs".
- OpenAI-compatible `/v1/chat/completions` endpoint
- Exact request caching (Redis)
- Semantic cache (Qdrant)
- Token and cost savings metrics
- Prometheus observability
- Docker deployment
- nginx-style configuration
- Hot configuration reload (`SIGHUP`)
- Lightweight Rust + Axum implementation
Client applications send requests to the firewall instead of directly to the LLM provider.
```mermaid
flowchart TD
  C[Client / SDK] --> F[AI Cost Firewall<br/>OpenAI-compatible API gateway]
  F --> R[Redis / Valkey<br/>Exact cache]
  F --> Q[Qdrant<br/>Semantic cache]
  F --> U[Upstream LLM API<br/>OpenAI-compatible]
  F --> P[Prometheus]
  P --> G[Grafana]
```
Full architecture documentation:
The fastest way to try AI Cost Firewall is using Docker Compose.
Install:
- Docker
- Docker Compose (included with Docker Desktop)
Verify installation:
```bash
docker --version
docker compose version
```

Download the example configuration:
```bash
curl -L https://raw.githubusercontent.com/vcal-project/ai-firewall/main/configs/ai-firewall.conf.example -o ai-firewall.conf
```

Edit the file and add your API keys:
```bash
nano ai-firewall.conf
```

You must specify the exact model names returned by the API, for example:

```
gpt-4o-mini-2024-07-18
```
Download the Docker Compose file:
```bash
curl -L https://raw.githubusercontent.com/vcal-project/ai-firewall/main/docker-compose.yml -o docker-compose.yml
```

Start the stack:
```bash
docker compose pull
docker compose up -d
```

View logs:

```bash
docker compose logs -f firewall
```

| Service | URL |
|---|---|
| Firewall API | http://localhost:8080 |
| Prometheus | http://localhost:9090 |
| Grafana | http://localhost:3000 |
The stack includes:
- AI Cost Firewall
- Redis
- Qdrant
- Prometheus
- Grafana
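Once the stack is running, send a test request through the firewall. The curl example below does this from the shell; the same call can be sketched in Python using only the standard library (the endpoint and model name follow this guide; replace the key placeholder with your own):

```python
import json
import urllib.request

FIREWALL_URL = "http://localhost:8080/v1/chat/completions"

def build_request(model, content, api_key):
    """Build an OpenAI-style chat completion request aimed at the firewall."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        FIREWALL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# To send the built request:
#   with urllib.request.urlopen(build_request(...)) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```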
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer <your-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-2024-07-18",
    "messages": [
      {"role": "user", "content": "Explain Redis briefly."}
    ]
  }'
```

Prometheus metrics are available at:
Example metrics:
```
aif_requests_total
aif_cache_exact_hits
aif_cache_semantic_hits
aif_cache_misses
aif_tokens_saved
aif_cost_saved_micro_usd
```
Token and cost savings are currently calculated only for `/v1/chat/completions`.
Embedding requests used internally for semantic caching are not included in these metrics in the current version.
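The figure behind `aif_cost_saved_micro_usd` can be reproduced from the per-model prices. A sketch of the arithmetic, assuming a cached response saves exactly the tokens of the avoided request (prices are the example rates from the configuration shown later in this guide):

```python
# USD per 1M tokens: (input, output), keyed by exact model name
MODEL_PRICES = {
    "gpt-4o-mini-2024-07-18": (0.15, 0.60),
    "gpt-4.1-mini-2025-04-14": (0.30, 1.20),
}

def cost_saved_micro_usd(model, input_tokens, output_tokens):
    """Micro-USD saved by serving a request from cache instead of upstream."""
    if model not in MODEL_PRICES:  # model_price matching is exact
        return 0
    in_price, out_price = MODEL_PRICES[model]
    usd = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return round(usd * 1_000_000)  # metric is exported in micro-USD
```

For example, a cached gpt-4o-mini answer that avoids 1,000 input and 500 output tokens saves 450 micro-USD.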
Clone the repository if you want to:
- explore the code
- modify configuration templates
- build the firewall locally
- contribute to the project
```bash
git clone https://github.com/vcal-project/ai-firewall.git
cd ai-firewall
```

Build the project:

```bash
cargo build --release
```

Run the firewall:

```bash
cargo run --release
```

AI Cost Firewall uses a simple nginx-style configuration format.
Example configuration:
```nginx
listen_addr 0.0.0.0:8080;
redis_url redis://redis:6379;
upstream_base_url https://api.openai.com;
upstream_api_key sk-your-api-key;
embedding_base_url https://api.openai.com;
embedding_api_key sk-your-api-key;
embedding_model text-embedding-3-small;
qdrant_url http://qdrant:6334;
qdrant_collection aif_semantic_cache;
qdrant_vector_size 1536;
cache_ttl_seconds 2592000;
request_timeout_seconds 120;
semantic_cache_enabled true;
semantic_similarity_threshold 0.92;

# Chat-completion pricing (USD per 1M tokens)
# model_price <model> <input_usd_per_1m_tokens> <output_usd_per_1m_tokens>;
model_price gpt-4o-mini-2024-07-18 0.15 0.60;
model_price gpt-4.1-mini-2025-04-14 0.30 1.20;
```
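The nginx-style format is simple enough to illustrate: each line is a directive name followed by its values, terminated by a semicolon, with `#` comments. A toy parser sketch (for illustration only; this is not the firewall's actual parser and it ignores edge cases such as quoting):

```python
def parse_config(text):
    """Parse 'directive value...;' lines into {name: [list of value lists]}."""
    directives = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        if not line.endswith(";"):
            raise ValueError(f"missing ';' in: {raw!r}")
        name, *values = line[:-1].split()
        # A directive may repeat (e.g. model_price), so collect a list
        directives.setdefault(name, []).append(values)
    return directives
```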
`model_price` matching is exact in v0.1.0.
If the API returns `gpt-4o-mini-2024-07-18`, the same name must appear in the configuration.
Full configuration reference: `docs/config-reference.md`
| Document | Description |
|---|---|
| docs/architecture.md | System architecture |
| docs/config-reference.md | Configuration directives |
| docs/faq.md | Frequently asked questions |
| docs/how-it-works.md | Request flow and caching logic |
| docs/quickstart.md | Full setup guide |
Contributions are welcome.
If you would like to contribute to AI Cost Firewall — whether through bug reports, feature suggestions, documentation improvements, or code — please see:
Before submitting a pull request, please open an issue to discuss the change.
We welcome improvements in:
- performance
- documentation
- testing
- integrations with LLM providers
- observability and metrics
AI Cost Firewall can optionally integrate with VCAL Server for advanced semantic caching and distributed vector storage.
VCAL Server project:
Apache License 2.0