Shimmy exposes several interfaces for local LLM inference: an HTTP API, a WebSocket endpoint, and a command-line interface.
Endpoint: POST /api/generate
Request Body:
{
"model": "string", // Model name (required)
"prompt": "string", // Input prompt (required)
"max_tokens": 100, // Maximum tokens to generate (optional, default: 100)
"temperature": 0.7, // Sampling temperature (optional, default: 0.7)
"stream": false // Enable streaming response (optional, default: false)
}
Non-Streaming Response:
{
"choices": [
{
"text": "Generated text response",
"index": 0,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 20,
"total_tokens": 30
}
}
Streaming Response: Server-Sent Events with data chunks:
data: {"choices":[{"text":"Hello","index":0}]}
data: {"choices":[{"text":" world","index":0}]}
data: [DONE]
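
As an illustration of the request and response shapes above, here is a minimal Python sketch using only the standard library. It rests on several assumptions not stated in this document: that the HTTP API is served on the same host and port as the WebSocket URL (`localhost:11435`), and that a JSON `Content-Type` header is expected. The names `BASE_URL`, `build_generate_request`, `first_choice_text`, and `iter_sse_text` are hypothetical helpers, not part of Shimmy.

```python
import json
from typing import Iterable, Iterator
from urllib import request

# Assumption: the HTTP API listens on the same host and port as the
# WebSocket URL shown in this document.
BASE_URL = "http://localhost:11435"


def build_generate_request(model: str, prompt: str, *, max_tokens: int = 100,
                           temperature: float = 0.7,
                           stream: bool = False) -> request.Request:
    """Build (but do not send) a POST /api/generate request."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }
    return request.Request(
        BASE_URL + "/api/generate",
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the server expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_choice_text(response: dict) -> str:
    """Return the text of the first choice in a non-streaming response."""
    return response["choices"][0]["text"]


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from 'data: ...' lines until 'data: [DONE]'."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        yield chunk["choices"][0]["text"]
```

To send the request, pass the result of `build_generate_request` to `urllib.request.urlopen`. For a non-streaming call, decode the body with `json.load` and hand it to `first_choice_text`. For a streaming call, decode each line of the response as UTF-8 and feed the lines to `iter_sse_text`. Keeping the parsing separate from the network call makes it testable without a running server.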
Endpoint: GET /api/models
Response:
{
"models": [
{
"id": "default",
"name": "Default Model",
"description": "Base GGUF model"
}
]
}
Endpoint: GET /api/health
Response:
{
"status": "healthy",
"models_loaded": 1,
"memory_usage": "2.1GB"
}
Endpoint: ws://localhost:11435/ws/generate
Client message:
{
"model": "default",
"prompt": "Hello world",
"max_tokens": 50,
"temperature": 0.7
}
Server messages:
{"token": "Hello"}
{"token": " world"}
{"done": true}
# Start server
shimmy serve --bind 127.0.0.1:11435 --port 11435
# Generate text
shimmy generate --prompt "Hello" --max-tokens 50 --temperature 0.7
# List available models
shimmy list
# Probe model loading
shimmy probe [model-name]
# Show diagnostics
shimmy diag
Global options:
--verbose, -v: Enable verbose logging
--help, -h: Show help information
--version, -V: Show version information
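
For scripts that talk to a server started with `shimmy serve`, the model list, health report, and WebSocket messages documented earlier can each be handled by a small pure function. The sketch below is illustrative only. The function names are hypothetical, and the readiness rule (status `healthy` with at least one model loaded) is an assumption rather than documented behavior.

```python
import json
from typing import Iterable, List


def model_ids(models_response: dict) -> List[str]:
    """Return the ids listed in a GET /api/models response."""
    return [m["id"] for m in models_response.get("models", [])]


def is_ready(health_response: dict) -> bool:
    """Assumed readiness rule: status is "healthy" and a model is loaded."""
    return (health_response.get("status") == "healthy"
            and health_response.get("models_loaded", 0) > 0)


def collect_ws_tokens(frames: Iterable[str]) -> str:
    """Concatenate "token" values from WebSocket frames until {"done": true}."""
    parts = []
    for frame in frames:
        msg = json.loads(frame)
        if msg.get("done"):
            break  # the server signals the end of generation
        parts.append(msg.get("token", ""))
    return "".join(parts)
```

Each function takes already-decoded data, so the transport (an HTTP client or a WebSocket library of your choice) stays outside the parsing logic.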
All endpoints return errors in the same format:
{
"error": {
"code": "model_not_found",
"message": "The specified model was not found",
"details": "Model 'invalid-model' is not available"
}
}
Common error codes:
model_not_found: Requested model is not available
invalid_request: Request format is invalid
generation_failed: Text generation failed
server_error: Internal server error
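
Because the error object has a fixed shape, a client can turn it into an exception in one place. The following is a minimal sketch. `ShimmyAPIError` and `raise_for_error` are hypothetical names, and the fallback code `"unknown"` for a missing `code` field is an assumption, not something Shimmy defines.

```python
import json


class ShimmyAPIError(Exception):
    """Hypothetical exception wrapping the documented error object."""

    def __init__(self, code: str, message: str, details: str = ""):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details


def raise_for_error(body: str) -> dict:
    """Parse a JSON response body; raise ShimmyAPIError if it has an "error" key."""
    data = json.loads(body)
    err = data.get("error") if isinstance(data, dict) else None
    if err:
        raise ShimmyAPIError(
            err.get("code", "unknown"),  # assumption: fallback when code is absent
            err.get("message", ""),
            err.get("details", ""),
        )
    return data
```

Callers can then branch on `exc.code`, for example treating `model_not_found` as a configuration problem and surfacing `exc.details` to the user.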
Shimmy currently implements no rate limiting. For production use, consider placing it behind a reverse proxy that enforces rate limits.