This project provides a self-hosted FastAPI proxy server designed to integrate with any OpenAI-compatible API service. It implements a dual-model critique pipeline: an initial response is generated by a primary LLM, and then a second LLM (the critique model) refines or critiques this response. The proxy provides full OpenAI API compatibility for seamless integration with various IDEs and tools, allowing you to use local models (via Ollama), OpenAI's API, or any other OpenAI-compatible service.
- Universal OpenAI API Compatibility: Exposes a `/v1/chat/completions` endpoint that fully mirrors the OpenAI API, allowing easy integration with tools that expect this format. Works with Ollama, OpenAI, Azure OpenAI, Anthropic, and any other OpenAI-compatible service. Tested with Roo Cline, Continue.dev, and Open WebUI (see the client sketch after this list).
- Dual-Model Critique Pipeline: An initial response is generated by a primary LLM. A second LLM (the critique model) then refines this response. The critique model is guided by a detailed system prompt and the full context of the original request, enabling it to improve accuracy, adhere to formatting, and produce output suitable for direct IDE/tool consumption.
- Official OpenAI Pydantic Models: Leverages Pydantic models from the official `openai` Python library (v1.0+) for request validation and response structuring, ensuring type safety and compatibility.
- Flexible Backend Support: Configure any OpenAI-compatible API endpoint: local Ollama, OpenAI's API, Azure OpenAI, or custom LLM services.
- Dockerized: Runs as a Docker container using Docker Compose, with optional Ollama service for managing local LLMs.
- Configurable: Uses environment variables for easy configuration of models, prompts, and logging.
- Streaming Responses: Supports the OpenAI `stream` parameter and returns Server-Sent Events (SSE), enabling real-time token streaming in Continue.dev, Open WebUI, and other clients.
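As a quick illustration of the compatibility claim, any OpenAI client library can be pointed at the proxy. Below is a minimal sketch using the official `openai` Python package; it assumes the default port `3101`, that the proxy does not validate the API key, and reuses the model name from the curl example later in this README:

```python
# Sketch: any OpenAI-compatible client can talk to the proxy.
# Assumes the default port 3101; the api_key is arbitrary (assumed unchecked).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3101/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="hydracodedone",  # model name taken from the curl example below
    messages=[{"role": "user", "content": "What is Ruby?"}],
)
print(response.choices[0].message.content)
```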
It's important to note that the critique model (Model 2), guided by the CRITIQUE_SYSTEM_PROMPT, typically provides the core refined content. For instance, if Model 1's response includes conversational preamble or specific structural formatting (like markdown code blocks with file paths as seen in some IDE prompts), Model 2's refined output will often be the essential content itself (e.g., just the code), having stripped away the surrounding elements. This is by design to provide a clean, direct output for tools that consume the API.
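Conceptually, the pipeline reduces to two chained chat-completion calls. The sketch below is illustrative only: the function name, argument names, and the exact placement of the critique system prompt are assumptions, not the project's actual internals.

```python
# Illustrative sketch only: names and prompt placement are assumptions,
# not the project's actual internals.
from openai import AsyncOpenAI

async def critique_pipeline(
    client: AsyncOpenAI,
    messages: list[dict],
    primary_model: str,
    critique_model: str | None,
    critique_system_prompt: str,
) -> str:
    # Step 1: the primary model answers the original request.
    first = await client.chat.completions.create(
        model=primary_model, messages=messages
    )
    draft = first.choices[0].message.content or ""

    # If no critique model is configured, the draft is the final answer.
    if not critique_model:
        return draft

    # Step 2: the critique model sees the full conversation plus the draft,
    # guided by CRITIQUE_SYSTEM_PROMPT, and returns the refined response.
    refined = await client.chat.completions.create(
        model=critique_model,
        messages=[
            *messages,
            {"role": "assistant", "content": draft},
            {"role": "system", "content": critique_system_prompt},
        ],
    )
    return refined.choices[0].message.content or ""
```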
- Python 3.11+ (pytest for testing)
- Docker
- Docker Compose (V2 recommended, i.e., the `docker compose` command)
Create a `.env` file in the project root by copying `.env.example` (if provided) or creating it from scratch. Populate it with the following variables:
- `OPENAI_BASE_URL`: The base URL for any OpenAI-compatible service. This should be `http://your_openai_service:11434/v1` (for Ollama) or `https://api.openai.com/v1` (for OpenAI).
- `PRIMARY_MODEL_NAME`: The name of the model used to generate the initial response (e.g., `llama3.2`, `gpt-4`).
- `CRITIQUE_MODEL_NAME`: The name of the model used to critique the initial response (e.g., `deepseek-r1:8b`). If commented out or empty, the critique step is skipped.
- `CRITIQUE_SYSTEM_PROMPT`: The detailed system prompt used to guide the critique model. This prompt instructs Model 2 on how to analyze Model 1's response in the context of the entire original user request, focusing on correctness, completeness, and adherence to any implicit or explicit formatting requirements from the original request. The goal is for Model 2 to produce a polished, final response. The actual prompt is multi-line and should be defined in your `.env` file (see example below).
- `LOG_LEVEL`: The logging level for the application (e.g., `INFO`, `DEBUG`). Defaults to `INFO`.
Example `.env` file:

```env
OPENAI_BASE_URL="http://localhost:11434/v1"
PRIMARY_MODEL_NAME="llama3.2"
CRITIQUE_MODEL_NAME="deepseek-r1:8b"
CRITIQUE_SYSTEM_PROMPT="You are now in a critique and refinement phase.\nBased on the entire preceding conversation, including the user's original request and the last AI's response:\n1. Identify areas for improvement in the LAST AI's response. Focus on:\n - Correcting bugs, syntax errors, and typos.\n - Addressing logic issues.\n - Enhancing clarity, conciseness, and overall quality.\n - Ensuring the response fully addresses the user's original query.\n2. Provide a revised and improved response.\n3. CRUCIAL: Your revised response MUST strictly adhere to any output formatting, structural requirements, or specific instructions implied by the user's original request(s) earlier in the conversation.\nYour goal is to produce a polished version suitable for direct use by the user's IDE/tool.\nPlease provide ONLY the final, refined response according to these instructions."
LOG_LEVEL="INFO"
```
OPENAI_BASE_URL="http://localhost:11434/v1"
PRIMARY_MODEL_NAME="llama3.2"
CRITIQUE_MODEL_NAME="deepseek-r1:8b"
OPENAI_BASE_URL="https://api.openai.com/v1"
PRIMARY_MODEL_NAME="gpt-4"
CRITIQUE_MODEL_NAME="gpt-4"
OPENAI_BASE_URL="https://your-resource.openai.azure.com/openai/deployments/your-deployment/chat/completions?api-version=2023-12-01-preview"
PRIMARY_MODEL_NAME="gpt-4"
CRITIQUE_MODEL_NAME="gpt-4"
OPENAI_BASE_URL="https://api.anthropic-proxy.com/v1"
PRIMARY_MODEL_NAME="claude-3-sonnet-20240229"
CRITIQUE_MODEL_NAME="claude-3-haiku-20240307"
- Ensure Docker is running.
- Navigate to the project root directory in your terminal.
- Build and start the services using Docker Compose:

```bash
docker compose up -d --build
```

This command will build the `llm_proxy_service` image and start it.
- The LLM proxy service will be available at `http://localhost:3101` (or the port configured in `docker-compose.yaml`).
To stop the services:

```bash
docker compose down
```
- `GET /health`: Basic liveness probe.
- `GET /v1/health`: OpenAI-style health probe used by Continue.dev (returns the same `{"status":"ok"}`).
- `GET /v1/models`: Lists available model IDs in OpenAI List Models format.
- `POST /v1/chat/completions`: OpenAI-compatible chat completions endpoint.
  - Request Body: Follows the OpenAI ChatCompletion API schema (e.g., `model`, `messages` array).
  - Response Body: Follows the OpenAI ChatCompletion API schema, including choices and usage (usage stats are currently placeholders).
- `POST /chat/completions`: Alias without the `/v1` prefix for clients that call it directly.
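To sanity-check these endpoints, here is a short sketch using `httpx`; it assumes the proxy is reachable on the default port `3101` and reuses the model name from the curl example below:

```python
# Sketch: probing the endpoints above; assumes the default port 3101.
import httpx

BASE = "http://localhost:3101"

print(httpx.get(f"{BASE}/health").json())     # liveness probe -> {"status": "ok"}
print(httpx.get(f"{BASE}/v1/models").json())  # OpenAI List Models format

resp = httpx.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "hydracodedone",  # model name from the curl example below
        "messages": [{"role": "user", "content": "What is Ruby?"}],
    },
    timeout=120.0,  # local models can be slow to respond
)
print(resp.json()["choices"][0]["message"]["content"])
```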
The proxy fully supports streaming in the same way as the OpenAI API. Set `"stream": true` in your request and the response will be delivered as a `text/event-stream` where each line begins with `data: ` followed by a JSON chunk. The stream terminates with `data: [DONE]`.
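Streaming also works through the official `openai` Python client, which parses the `data:` chunks transparently. A minimal sketch, assuming the default port `3101` and that the proxy ignores the API key:

```python
# Sketch: consuming the SSE stream via the official openai client,
# which parses the "data: ..." chunks for you.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3101/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="hydracodedone",
    messages=[{"role": "user", "content": "What is Ruby?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```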
Example curl request:
```bash
curl -X POST "http://localhost:3101/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hydracodedone",
    "stream": true,
    "messages": [
      {"role": "user", "content": "What is Ruby?"}
    ]
  }'
```

Unit and integration tests (covering both non-streaming and streaming code paths) are written using pytest. The suite mocks Ollama HTTP calls with respx and patches streaming generators for fast, isolated testing.
- Ensure your virtual environment is activated and development dependencies are installed:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

- Run tests from the project root:

```bash
.venv/bin/python -m pytest
```

Or, if your `PATH` is set up correctly after activating the venv:

```bash
pytest
```
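For orientation, the sketch below shows roughly what a respx-mocked test can look like. The import path, backend URL, and response payload are assumptions, not the project's actual test code:

```python
# Hedged sketch: import path, backend URL, and payload are assumptions.
import httpx
import respx
from fastapi.testclient import TestClient

from app.main import app  # hypothetical module path for the FastAPI app

client = TestClient(app)

@respx.mock
def test_chat_completion_uses_mocked_backend():
    # Mock the upstream OpenAI-compatible endpoint (assumes the app was
    # configured with OPENAI_BASE_URL=http://localhost:11434/v1).
    respx.post("http://localhost:11434/v1/chat/completions").mock(
        return_value=httpx.Response(
            200,
            json={
                "id": "chatcmpl-test",
                "object": "chat.completion",
                "created": 0,
                "model": "llama3.2",
                "choices": [
                    {
                        "index": 0,
                        "message": {"role": "assistant", "content": "mocked"},
                        "finish_reason": "stop",
                    }
                ],
            },
        )
    )
    resp = client.post(
        "/v1/chat/completions",
        json={
            "model": "llama3.2",
            "messages": [{"role": "user", "content": "hi"}],
        },
    )
    assert resp.status_code == 200
```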