Ovllm is a user-friendly wrapper around vLLM that provides Ollama-like simplicity for downloading and serving LLMs from HuggingFace. It's designed to work seamlessly with OpenWebUI.
- 🚀 One-command model downloads - Pull models directly from HuggingFace Hub
- 🎯 Ollama-like CLI - Familiar commands: `ovllm pull`, `ovllm run`, `ovllm serve`
- 🔌 OpenWebUI Ready - Drop-in replacement for the Ollama backend
- 📦 Model Management - List, remove, and manage downloaded models
- ⚡ vLLM Performance - Full vLLM inference engine with continuous batching
- 🛠️ OpenAI Compatible - Serve models via OpenAI-compatible API
Demo (terminal screenshots): typing an HF model name, selecting a loaded model, and error handling (work in progress).
- Python 3.9+
- CUDA-capable GPU (for GPU inference)
- HuggingFace account (for some models)
```bash
# Clone the repository
git clone https://github.com/FearL0rd/ovllm.git
cd ovllm

# Install dependencies
pip install -r requirements.txt

# Install Ovllm
pip install -e .
```

```bash
# Run test script
python -m ovllm --help

# Or test imports
python tests/test_install.py
```

```bash
ovllm run meta-llama/Llama-2-7b-chat-hf
```

This will automatically download the model if not present, then start an interactive chat session.

```bash
ovllm pull meta-llama/Llama-2-7b-chat-hf
ovllm serve
```

Set OpenWebUI's Ollama URL to `http://localhost:11434` or use Docker:
```bash
docker run -d \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main
```

| Command | Description |
|---|---|
| `ovllm run <model>` | Run a model interactively (auto-downloads if missing) |
| `ovllm pull <model>` | Download a model from HuggingFace |
| `ovllm serve` | Start the API server |
| `ovllm list` | List downloaded models |
| `ovllm rm <model>` | Remove a downloaded model |
| `ovllm show <model>` | Show model details |
| `ovllm ps` | Show running models |
Ovllm supports any model from HuggingFace Hub:
```bash
# Meta Llama models
ovllm pull meta-llama/Llama-2-7b-chat-hf

# Mistral AI models
ovllm pull mistralai/Mistral-7B-Instruct-v0.3

# Google Gemma models
ovllm pull google/gemma-7b-it

# Qwen models
ovllm pull Qwen/Qwen2.5-7B-Instruct

# Local models
ovllm run /path/to/local/model
```

Ovllm supports GGUF models with specific quantization levels. Specify the quantization suffix after the model ID:

```bash
# Llama-3.2 3B with Q4_K_M quantization
ovllm pull bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# Run with specific quantization
ovllm run bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# Other popular quantizations
ovllm pull bartowski/Llama-3.2-3B-Instruct-GGUF:Q5_K_M
ovllm pull bartowski/Llama-3.2-3B-Instruct-GGUF:Q6_K
ovllm pull bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```

Available quantizations depend on the specific GGUF model repository. Common options include:

- `Q4_K_M` - Good balance between size and quality
- `Q5_K_M` - Better quality, larger size
- `Q6_K` - High quality
- `Q8_0` - Near-lossless quality
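The `model:quant` naming convention can be illustrated with a small parser. This is a sketch of the naming scheme only, not ovllm's actual implementation, and it ignores edge cases such as Windows-style paths that contain `:`.

```python
from typing import Optional, Tuple

def split_model_ref(ref: str) -> Tuple[str, Optional[str]]:
    """Split an 'org/repo:QUANT' reference into (repo_id, quantization).

    Plain HF model IDs have no ':' suffix; GGUF references put the
    quantization level after the last ':'.
    """
    repo, sep, quant = ref.rpartition(":")
    return (repo, quant) if sep else (ref, None)

print(split_model_ref("bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M"))
# ('bartowski/Llama-3.2-3B-Instruct-GGUF', 'Q4_K_M')
print(split_model_ref("meta-llama/Llama-2-7b-chat-hf"))
# ('meta-llama/Llama-2-7b-chat-hf', None)
```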
Set environment variables or use CLI flags:
| Variable | Description | Default |
|---|---|---|
| `OVLLM_HOST` | Server host | `0.0.0.0` |
| `OVLLM_PORT` | Server port | `11434` |
| `OVLLM_MODELS_DIR` | Model storage directory | `~/.ovllm/models` |
| `OVLLM_LOG_LEVEL` | Logging level | `INFO` |

| Variable | Description | Default |
|---|---|---|
| `OVLLM_GPU_MEMORY` | GPU memory utilization (0.0-1.0) | `0.9` |
| `OVLLM_TENSOR_PARALLEL_SIZE` | Number of GPUs to use for tensor parallelism | `1` |
| `OVLLM_CPU_OFFLOAD_GB` | Max system RAM (in GiB) to use per GPU for offloading | `0.0` |
| `OVLLM_MAX_MODEL_LEN` | Maximum sequence length the model can handle | None (model default) |
| `OVLLM_MAX_NUM_SEQS` | Maximum number of sequences per iteration | `256` |
| `OVLLM_MAX_TOKENS` | Maximum number of tokens to generate | `256` |
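The performance variables above map naturally onto vLLM engine arguments (`gpu_memory_utilization`, `tensor_parallel_size`, etc.). A minimal sketch of that mapping, using the documented defaults; the variable-to-argument wiring here is an illustration, not ovllm's actual code:

```python
import os
from typing import Any, Dict, Optional

def engine_args_from_env() -> Dict[str, Any]:
    """Collect vLLM engine arguments from OVLLM_* environment variables,
    falling back to the documented defaults. Illustrative sketch only."""
    max_len: Optional[str] = os.environ.get("OVLLM_MAX_MODEL_LEN")
    return {
        "gpu_memory_utilization": float(os.environ.get("OVLLM_GPU_MEMORY", "0.9")),
        "tensor_parallel_size": int(os.environ.get("OVLLM_TENSOR_PARALLEL_SIZE", "1")),
        "cpu_offload_gb": float(os.environ.get("OVLLM_CPU_OFFLOAD_GB", "0.0")),
        "max_model_len": int(max_len) if max_len else None,
        "max_num_seqs": int(os.environ.get("OVLLM_MAX_NUM_SEQS", "256")),
    }

os.environ["OVLLM_GPU_MEMORY"] = "0.7"
print(engine_args_from_env()["gpu_memory_utilization"])  # 0.7
```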
Some models (like Meta's Llama series) require authentication and license acceptance:
1. Create a HuggingFace account at https://huggingface.co/login

2. Accept the model license (for gated models like Llama):
   - Visit the model page (e.g., https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
   - Click "Agree and access"

3. Get your access token:
   - Go to https://huggingface.co/settings/tokens
   - Create a new token with "read" permissions
   - Copy the token

4. Set the token as an environment variable:

   ```bash
   # Linux/macOS
   export HF_TOKEN=hf_xxxxx

   # Windows PowerShell
   $env:HF_TOKEN="hf_xxxxx"

   # Windows CMD
   set HF_TOKEN=hf_xxxxx
   ```

5. Run ovllm:

   ```bash
   ovllm pull meta-llama/Llama-3.2-3B-Instruct
   ```
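Under the hood, HuggingFace Hub authenticates gated downloads with a bearer token. A minimal sketch of how an `HF_TOKEN` is typically attached to Hub HTTP requests (illustrative only, not ovllm's code):

```python
import os
from typing import Dict

def hf_auth_headers() -> Dict[str, str]:
    # HuggingFace Hub expects "Authorization: Bearer <token>" on
    # requests for gated/private repositories; anonymous requests
    # simply omit the header.
    token = os.environ.get("HF_TOKEN")
    return {"Authorization": f"Bearer {token}"} if token else {}

os.environ["HF_TOKEN"] = "hf_xxxxx"  # placeholder token
print(hf_auth_headers())  # {'Authorization': 'Bearer hf_xxxxx'}
```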
Ovllm provides an OpenAI-compatible API:
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat completion |
| `POST /v1/completions` | Text completion |
| `GET /v1/models` | List available models |
| `POST /api/pull` | Pull a model |
| `GET /api/tags` | List local models |
| `DELETE /api/delete` | Delete a model |
| `POST /api/generate` | Generate completion (Ollama-compatible) |
| `POST /api/chat` | Chat completion (Ollama-compatible) |
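The Ollama-compatible endpoints accept Ollama's request shapes. A small sketch that builds such a body (assuming ovllm mirrors Ollama's `/api/generate` request format of `{"model": ..., "prompt": ..., "stream": ...}`, which this README does not spell out):

```python
import json

def generate_request(model: str, prompt: str, stream: bool = False) -> str:
    """Serialize a request body in the shape Ollama's /api/generate uses."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

print(generate_request("meta-llama/Llama-2-7b-chat-hf", "Hello!"))
```

POST the resulting body to `http://localhost:11434/api/generate` with `Content-Type: application/json`, e.g. via `curl` or `requests`.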
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
```

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

```bash
# Start Ovllm + OpenWebUI
docker-compose up -d
```

```bash
# Build image
docker build -t ovllm .

# Run Ovllm
docker run -d \
  -p 11434:11434 \
  -v ovllm_models:/root/.ovllm/models \
  --gpus all \
  ovllm

# Run OpenWebUI
docker run -d \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main
```

```
┌─────────────────────────────────────────────────────────────┐
│                         Ovllm CLI                           │
│             run / pull / serve / list / rm / show           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                       Model Manager                         │
│  - Downloads from HuggingFace                               │
│  - Manages local cache (~/.ovllm/models)                    │
│  - Handles authentication                                   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                        vLLM Engine                          │
│  - AsyncLLMEngine                                           │
│  - Continuous batching                                      │
│  - PagedAttention                                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  OpenAI-Compatible API                      │
│  - /v1/chat/completions                                     │
│  - /v1/completions                                          │
│  - /v1/models                                               │
└─────────────────────────────────────────────────────────────┘
```
Ensure you have the correct model ID from HuggingFace Hub:

```bash
ovllm pull meta-llama/Llama-2-7b-chat-hf
```

Reduce GPU memory usage:

```bash
export OVLLM_GPU_MEMORY=0.7
ovllm serve
```

For private models, set your HuggingFace token:

```bash
export HF_TOKEN=hf_xxx
ovllm pull private-model
```

Apache 2.0
- vLLM for the inference engine
- Ollama for the UX inspiration
- HuggingFace for the model hub





