A lightweight Rust proxy for Ollama that intelligently adjusts request parameters to match each model's training configuration.
Some AI clients (like Elephas) send the same context length parameter for all models. This causes issues when:
- Embedding models trained with 8K context receive requests for 128K context
- Ollama warns: "requested context size too large for model"
- Models may perform poorly with incorrect parameters
This proxy sits between your client and Ollama, automatically:
- Detects which model is being requested
- Fetches the model's training context length (`n_ctx_train`)
- Adjusts `num_ctx` if it exceeds the model's capabilities
- Provides detailed logging of all modifications
- ✅ Prevents infinite generation - Auto-injects `num_predict` to limit output
- ✅ Smart chunking - Automatically splits large embedding inputs to prevent crashes
- ✅ Context safety caps - Configurable hard limits to prevent Ollama stalls
- ✅ Request timeouts - Prevents indefinite hangs with configurable limits
- ✅ Automatic parameter correction based on model metadata
- ✅ Request/response logging for debugging
- ✅ Model metadata caching for performance
- ✅ Extensible modifier framework for future enhancements
- ✅ Zero configuration for basic usage
```bash
cargo build --release
```

```bash
# Default: Listen on 127.0.0.1:11435, proxy to 127.0.0.1:11434
cargo run --release

# Or with custom settings:
OLLAMA_HOST=http://127.0.0.1:11434 PROXY_PORT=11435 RUST_LOG=info cargo run --release
```

Point your AI client (Elephas, etc.) to the proxy instead of Ollama directly:
Before: http://127.0.0.1:11434
After: http://127.0.0.1:11435
The proxy will log all requests and modifications:
```
📨 Incoming request: POST /v1/embeddings
📋 Request body: {
  "model": "nomic-embed-text",
  "input": "test",
  "options": {
    "num_ctx": 131072
  }
}
🔍 Detected model: nomic-embed-text
📊 Model metadata - n_ctx_train: 8192
⚠️ num_ctx (131072) exceeds model training context (8192)
✏️ Modified options.num_ctx: 131072 → 8192
🔧 ContextLimitModifier applied modifications
📬 Response status: 200 OK
```
Environment variables:
- `OLLAMA_HOST` - Target Ollama server (default: `http://127.0.0.1:11434`)
- `PROXY_PORT` - Port to listen on (default: `11435`)
- `RUST_LOG` - Log level: `error`, `warn`, `info`, `debug`, `trace` (default: `info`)
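As a rough illustration (not the proxy's actual code), these basic settings could be read at startup along the following lines; the struct and function names here are hypothetical, and `RUST_LOG` is assumed to be consumed by the logging framework rather than parsed by hand:

```rust
use std::env;

/// Hypothetical struct holding the basic proxy settings.
struct BasicConfig {
    ollama_host: String,
    proxy_port: u16,
}

fn basic_config_from_env() -> BasicConfig {
    // Fall back to the documented defaults when a variable is unset
    // or fails to parse. RUST_LOG is read by the logger itself.
    BasicConfig {
        ollama_host: env::var("OLLAMA_HOST")
            .unwrap_or_else(|_| "http://127.0.0.1:11434".to_string()),
        proxy_port: env::var("PROXY_PORT")
            .ok()
            .and_then(|p| p.parse().ok())
            .unwrap_or(11435),
    }
}
```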
Prevent Ollama stalls with large contexts:
- `MAX_CONTEXT_OVERRIDE` - Hard cap for context size regardless of model support (default: `16384`)
- `REQUEST_TIMEOUT_SECONDS` - Timeout for requests to Ollama (default: `120`)
Why This Matters:
Models may claim to support very large contexts (e.g., 131K tokens), but Ollama can stall or hang when actually processing them, especially with flash attention enabled. The MAX_CONTEXT_OVERRIDE provides a safety limit.
Recommended Settings:
```bash
# Conservative (most reliable)
MAX_CONTEXT_OVERRIDE=16384 REQUEST_TIMEOUT_SECONDS=120 cargo run --release

# Moderate (test with your hardware)
MAX_CONTEXT_OVERRIDE=32768 REQUEST_TIMEOUT_SECONDS=180 cargo run --release

# Aggressive (may cause stalls on some systems)
MAX_CONTEXT_OVERRIDE=65536 REQUEST_TIMEOUT_SECONDS=300 cargo run --release
```

Note: If requests time out, reduce `MAX_CONTEXT_OVERRIDE` before increasing the timeout.
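In effect, the context value forwarded to Ollama is the smallest of the client's requested `num_ctx`, the model's `n_ctx_train`, and `MAX_CONTEXT_OVERRIDE`. A minimal sketch of that decision (the function name is illustrative, not the proxy's actual API):

```rust
use std::env;

/// Illustrative: the num_ctx actually forwarded to Ollama never exceeds
/// the model's training context or the configured hard cap.
fn effective_num_ctx(requested: u64, n_ctx_train: u64) -> u64 {
    let hard_cap: u64 = env::var("MAX_CONTEXT_OVERRIDE")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(16384);
    requested.min(n_ctx_train).min(hard_cap)
}

// Example: a client requesting 131072 on an 8192-token model with the
// default cap of 16384 ends up with num_ctx = 8192.
```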
THE CRITICAL FIX FOR TIMEOUTS:
The proxy automatically injects num_predict into all chat requests to prevent infinite generation loops.
The Problem:
- Ollama's default `num_predict` is -1 (infinite)
- Without this parameter, models generate until they fill the entire context
- This causes "stalls" even with small contexts (4K)
- The model isn't stuck - it's generating millions of unwanted tokens!
How the Proxy Fixes This:
- Detects chat requests (those with a `messages` array)
- Checks if `num_predict` is already set
- If not set, injects `num_predict`:
  - Uses `max_tokens` from the request if available (e.g., 4096 from Elephas)
  - Otherwise defaults to 4096 tokens
- Logs the injection for transparency
Example:
```
// Your request:
{
  "model": "gpt-oss:20b",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 2048
}

// Proxy automatically adds:
{
  "model": "gpt-oss:20b",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 2048,
  "options": {
    "num_predict": 2048  // ← Added by proxy
  }
}
```

Why This Matters:
Without num_predict, a simple "say hello" request can generate for 3+ minutes, filling the entire context buffer with elaborations, examples, and repetitions until it crashes or times out.
Override if Needed:
If you want different generation limits, set num_predict explicitly in your request - the proxy preserves existing values.
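The injection logic described above boils down to something like the following sketch (using `serde_json`; the function name and exact fallbacks are illustrative, not the proxy's actual code):

```rust
use serde_json::{json, Value};

/// Sketch of the num_predict injection: only chat-style requests are
/// touched, and an existing options.num_predict is always preserved.
fn inject_num_predict(body: &mut Value) {
    // Only chat requests (those with a "messages" array) are considered.
    if !body.get("messages").map_or(false, |m| m.is_array()) {
        return;
    }
    // Respect an explicit num_predict already set by the client.
    if body
        .get("options")
        .and_then(|o| o.get("num_predict"))
        .is_some()
    {
        return;
    }
    // Prefer the request's max_tokens; otherwise fall back to 4096 tokens.
    let limit = body
        .get("max_tokens")
        .and_then(Value::as_u64)
        .unwrap_or(4096);
    // Create the options object if it doesn't exist yet, then set num_predict.
    if let Some(obj) = body.as_object_mut() {
        let options = obj.entry("options").or_insert_with(|| json!({}));
        options["num_predict"] = json!(limit);
    }
}
```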
For large embeddings inputs, the proxy can automatically chunk text to prevent Ollama memory errors:
- `MAX_EMBEDDING_INPUT_LENGTH` - Maximum characters per embedding input (default: `2000`)
- `ENABLE_AUTO_CHUNKING` - Enable automatic chunking for large inputs (default: `true`)
How Chunking Works:
When an embeddings request contains text longer than MAX_EMBEDDING_INPUT_LENGTH:
- The proxy splits the text into smaller chunks (with 10% overlap for context preservation)
- Each chunk is sent as a separate request to Ollama sequentially
- The proxy collects all embedding vectors
- Embeddings are averaged to create a single combined embedding
- The client receives one response, transparently
Example:
```bash
# Allow larger inputs before chunking (4000 characters)
MAX_EMBEDDING_INPUT_LENGTH=4000 cargo run --release

# Disable chunking (return error for large inputs)
ENABLE_AUTO_CHUNKING=false cargo run --release
```

Performance Considerations:
- Chunking processes sequentially to avoid memory pressure
- A 10,000 character input with 2000 char limit creates ~5 chunks
- Each chunk adds ~200-500ms latency (model dependent)
- For best performance, keep inputs under the limit when possible
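Conceptually, the chunk-and-average step described above boils down to character-based splitting with roughly 10% overlap, followed by an element-wise mean of the per-chunk vectors. The function names below are illustrative and the real implementation may differ in detail:

```rust
/// Split text into character-based chunks of at most `max_len`,
/// overlapping neighbouring chunks by ~10% to preserve boundary context.
fn chunk_text(text: &str, max_len: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() <= max_len {
        return vec![text.to_string()];
    }
    let overlap = max_len / 10;
    let step = max_len - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + max_len).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}

/// Element-wise average of the per-chunk embedding vectors, producing
/// the single vector returned to the client. Assumes at least one
/// vector and equal dimensionality across vectors.
fn average_embeddings(vectors: &[Vec<f32>]) -> Vec<f32> {
    let dim = vectors[0].len();
    let mut avg = vec![0.0f32; dim];
    for v in vectors {
        for (a, x) in avg.iter_mut().zip(v) {
            *a += x;
        }
    }
    for a in &mut avg {
        *a /= vectors.len() as f32;
    }
    avg
}
```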
Flash Attention is an optimization technique that speeds up inference and reduces memory usage. Ollama can enable it automatically for supported models.
Flash Attention is global only (environment variable), not per-request:
```bash
# Let Ollama decide (RECOMMENDED - unset the variable)
unset OLLAMA_FLASH_ATTENTION
ollama serve

# Explicitly enable (may cause issues with large contexts)
export OLLAMA_FLASH_ATTENTION=1
ollama serve

# Explicitly disable (may help with large context stalls)
export OLLAMA_FLASH_ATTENTION=0
ollama serve
```

Symptoms:
- Requests with large contexts (>60K tokens) stall indefinitely
- GPU shows "100% allocated" but 0% utilization in Activity Monitor
- Ollama process is running but not responding
- Client times out without receiving response
Why This Happens: Flash attention with very large contexts can trigger memory allocation deadlocks or exceed Metal's working set limits on macOS, especially with M-series chips.
Solutions:
- Unset flash attention (let Ollama decide per-model):

  ```bash
  unset OLLAMA_FLASH_ATTENTION
  pkill ollama
  ollama serve
  ```

- Reduce context size (use the proxy's safety cap):

  ```bash
  MAX_CONTEXT_OVERRIDE=16384 cargo run --release
  ```

- Test systematically to find your hardware's limits:

  ```bash
  ./test_context_limits.sh gpt-oss:20b
  ```
✅ DO:
- Keep `OLLAMA_FLASH_ATTENTION` unset (let Ollama auto-detect)
- Use `MAX_CONTEXT_OVERRIDE=16384` for reliability
- Test with `test_context_limits.sh` to find your system's sweet spot
- Monitor GPU utilization when testing large contexts
❌ DON'T:
- Set flash attention to `false` globally (disables it for all models)
- Use contexts >60K without testing first
- Assume a model's claimed context limit works reliably in practice
Symptoms:
- Embeddings requests return HTTP 500
- Ollama logs show `SIGABRT: abort` or `output_reserve: reallocating output buffer`
- Error occurs with large text inputs (> 5000 characters)
Cause: Ollama's embedding models crash when trying to allocate large buffers for very long inputs.
Solutions:
- Enable chunking (should be on by default):

  ```bash
  ENABLE_AUTO_CHUNKING=true cargo run --release
  ```

- Reduce chunk size if still seeing errors:

  ```bash
  MAX_EMBEDDING_INPUT_LENGTH=1500 cargo run --release
  ```

- Check Ollama logs for details:

  ```bash
  tail -f ~/.ollama/logs/server.log
  ```
Symptoms:
- Request returns HTTP 400
- Error message: "Input too large (X characters). Maximum is Y characters."
Cause: Input exceeds `MAX_EMBEDDING_INPUT_LENGTH` and chunking is disabled.
Solution: Enable chunking:
```bash
ENABLE_AUTO_CHUNKING=true cargo run --release
```

Symptoms:
- Embeddings take much longer than expected
- Logs show "Processing X chunks sequentially"
Cause: Large inputs are being chunked and processed sequentially.
This is expected behavior! Chunking prevents crashes but adds latency.
To improve speed:
- Reduce input size at the source
- Increase `MAX_EMBEDDING_INPUT_LENGTH` if your hardware can handle it
- Use a smaller/faster embeddings model
- Intercept: Proxy receives request from client
- Detect API Format: Determine if request uses OpenAI or native Ollama API
- Translate (if needed): Convert OpenAI `/v1/embeddings` → Ollama `/api/embed`
- Fetch Metadata: Query the Ollama API for the model's training parameters
- Inject Parameters: Add `options.num_ctx` with the correct value for the model
- Forward: Send the request to Ollama's native API (which accepts options)
- Translate Response: Convert Ollama response back to OpenAI format
- Return: Pass OpenAI-compatible response back to client
```
Client (Elephas)
  ↓ OpenAI API format (/v1/embeddings)
Proxy (Port 11435)
  ↓ Translates to native Ollama API (/api/embed)
  ↓ Injects options.num_ctx based on model
Ollama (Port 11434)
  ↓ Returns native response
Proxy
  ↓ Translates back to OpenAI format
Client receives OpenAI-compatible response
```
Key Innovation: The proxy acts as a translation layer, converting between OpenAI's API format (which doesn't support runtime options) and Ollama's native API (which does), enabling per-request parameter control without changing global settings.
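As a simplified sketch of that translation (assuming the incoming OpenAI-style body carries `model` and `input` fields, and that Ollama's `/api/embed` accepts those plus an `options` object; the function name is illustrative and error handling is omitted):

```rust
use serde_json::{json, Value};

/// Sketch: repackage an OpenAI-style /v1/embeddings body as an Ollama
/// /api/embed body, attaching the per-request options that the OpenAI
/// format has no field for.
fn to_ollama_embed_request(openai_body: &Value, num_ctx: u64) -> Value {
    json!({
        "model": openai_body["model"].clone(),
        "input": openai_body["input"].clone(),
        "options": { "num_ctx": num_ctx }
    })
}
```

The response path runs the same translation in reverse, so the client never sees the native format.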
The modifier framework is designed for easy extension:
```rust
pub trait ParameterModifier {
    fn modify(&self, json: &mut Value, metadata: &ModelMetadata) -> bool;
    fn name(&self) -> &str;
}
```

Add new modifiers in `src/modifier.rs` and register them in `apply_modifiers()`.
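For instance, a hypothetical modifier that fills in a default sampling temperature could look like the following (this modifier is an invented example, not one that ships with the proxy):

```rust
use serde_json::{json, Value};
// Assumes ParameterModifier and ModelMetadata from src/modifier.rs are in scope.

/// Hypothetical example: apply a default temperature when the client
/// didn't set one. Returns true only if the request was changed.
struct DefaultTemperatureModifier {
    default_temperature: f64,
}

impl ParameterModifier for DefaultTemperatureModifier {
    fn modify(&self, json_body: &mut Value, _metadata: &ModelMetadata) -> bool {
        // Leave explicit client settings untouched.
        let already_set = json_body
            .get("options")
            .and_then(|o| o.get("temperature"))
            .is_some();
        if already_set {
            return false;
        }
        if let Some(obj) = json_body.as_object_mut() {
            let options = obj.entry("options").or_insert_with(|| json!({}));
            options["temperature"] = json!(self.default_temperature);
            return true;
        }
        false
    }

    fn name(&self) -> &str {
        "DefaultTemperatureModifier"
    }
}
```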
```bash
cargo test
```

MIT