Ovllm is a user-friendly wrapper around vLLM that provides Ollama-like simplicity for downloading and serving LLMs from HuggingFace. It's designed to work seamlessly with OpenWebUI.
- 🚀 One-command model downloads - Pull models directly from HuggingFace Hub
- 🎯 Ollama-like CLI - Familiar commands: `ovllm pull`, `ovllm run`, `ovllm serve`
- 🔌 OpenWebUI Ready - Drop-in replacement for the Ollama backend
- 📦 Model Management - List, remove, and manage downloaded models
- ⚡ vLLM Performance - Full vLLM inference engine with continuous batching
- 🛠️ OpenAI Compatible - Serve models via OpenAI-compatible API
Demo (terminal screenshots): typing an HF model name, selecting a loaded model, and error handling (work in progress).
- Python 3.9+
- CUDA-capable GPU (for GPU inference)
- HuggingFace account (for some models)
```bash
# Clone the repository
git clone https://github.com/FearL0rd/ovllm.git
cd ovllm

# Install dependencies
pip install -r requirements.txt

# Install Ovllm
pip install -e .
```

```bash
# Run test script
python -m ovllm --help

# Or test imports
python tests/test_install.py
```

```bash
ovllm run meta-llama/Llama-2-7b-chat-hf
```

This will automatically download the model if not present, then start an interactive chat session.

```bash
ovllm pull meta-llama/Llama-2-7b-chat-hf
ovllm serve
```

Set OpenWebUI's Ollama URL to `http://localhost:11434` or use Docker:
```bash
docker run -d \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main
```

| Command | Description |
|---|---|
| `ovllm run <model>` | Run a model interactively (auto-downloads if missing) |
| `ovllm pull <model>` | Download a model from HuggingFace |
| `ovllm serve` | Start the API server |
| `ovllm list` | List downloaded models |
| `ovllm rm <model>` | Remove a downloaded model |
| `ovllm show <model>` | Show model details |
| `ovllm ps` | Show running models |
Ovllm supports any model from HuggingFace Hub:
```bash
# Meta Llama models
ovllm pull meta-llama/Llama-2-7b-chat-hf

# Mistral AI models
ovllm pull mistralai/Mistral-7B-Instruct-v0.3

# Google Gemma models
ovllm pull google/gemma-7b-it

# Qwen models
ovllm pull Qwen/Qwen2.5-7B-Instruct

# Local models
ovllm run /path/to/local/model
```

Ovllm supports GGUF models with specific quantization levels. Specify the quantization suffix after the model ID:

```bash
# Llama-3.2 3B with Q4_K_M quantization
ovllm pull bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# Run with specific quantization
ovllm run bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# Other popular quantizations
ovllm pull bartowski/Llama-3.2-3B-Instruct-GGUF:Q5_K_M
ovllm pull bartowski/Llama-3.2-3B-Instruct-GGUF:Q6_K
ovllm pull bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```

Available quantizations depend on the specific GGUF model repository. Common options include:

- `Q4_K_M` - Good balance between size and quality
- `Q5_K_M` - Better quality, larger size
- `Q6_K` - High quality
- `Q8_0` - Near-lossless quality
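The `model:quant` naming convention can be illustrated with a small parser. This is a sketch of the naming scheme only, not ovllm's actual implementation, and it ignores edge cases such as Windows-style paths that contain `:`.

```python
from typing import Optional, Tuple

def split_model_ref(ref: str) -> Tuple[str, Optional[str]]:
    """Split an 'org/repo:QUANT' reference into (repo_id, quantization).

    Plain HF model IDs have no ':' suffix; GGUF references put the
    quantization level after the last ':'.
    """
    repo, sep, quant = ref.rpartition(":")
    return (repo, quant) if sep else (ref, None)

print(split_model_ref("bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M"))
# ('bartowski/Llama-3.2-3B-Instruct-GGUF', 'Q4_K_M')
print(split_model_ref("meta-llama/Llama-2-7b-chat-hf"))
# ('meta-llama/Llama-2-7b-chat-hf', None)
```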
Set environment variables or use CLI flags:
| Variable | Description | Default |
|---|---|---|
| `OVLLM_HOST` | Server host | `0.0.0.0` |
| `OVLLM_PORT` | Server port | `11434` |
| `OVLLM_MODELS_DIR` | Model storage directory | `~/.ovllm/models` |
| `OVLLM_LOG_LEVEL` | Logging level | `INFO` |

| Variable | Description | Default |
|---|---|---|
| `OVLLM_GPU_MEMORY` | GPU memory utilization (0.0-1.0) | `0.9` |
| `OVLLM_TENSOR_PARALLEL_SIZE` | Number of GPUs to use for tensor parallelism | `1` |
| `OVLLM_CPU_OFFLOAD_GB` | Max system RAM (in GiB) to use per GPU for offloading | `0.0` |
| `OVLLM_MAX_MODEL_LEN` | Maximum sequence length the model can handle | None (model default) |
| `OVLLM_MAX_NUM_SEQS` | Maximum number of sequences per iteration | `256` |
| `OVLLM_MAX_TOKENS` | Maximum number of tokens to generate | `256` |
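The performance variables above map naturally onto vLLM engine arguments (`gpu_memory_utilization`, `tensor_parallel_size`, etc.). A minimal sketch of that mapping, using the documented defaults; the variable-to-argument wiring here is an illustration, not ovllm's actual code:

```python
import os
from typing import Any, Dict, Optional

def engine_args_from_env() -> Dict[str, Any]:
    """Collect vLLM engine arguments from OVLLM_* environment variables,
    falling back to the documented defaults. Illustrative sketch only."""
    max_len: Optional[str] = os.environ.get("OVLLM_MAX_MODEL_LEN")
    return {
        "gpu_memory_utilization": float(os.environ.get("OVLLM_GPU_MEMORY", "0.9")),
        "tensor_parallel_size": int(os.environ.get("OVLLM_TENSOR_PARALLEL_SIZE", "1")),
        "cpu_offload_gb": float(os.environ.get("OVLLM_CPU_OFFLOAD_GB", "0.0")),
        "max_model_len": int(max_len) if max_len else None,
        "max_num_seqs": int(os.environ.get("OVLLM_MAX_NUM_SEQS", "256")),
    }

os.environ["OVLLM_GPU_MEMORY"] = "0.7"
print(engine_args_from_env()["gpu_memory_utilization"])  # 0.7
```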
Some models (like Meta's Llama series) require authentication and license acceptance:
1. Create a HuggingFace account at https://huggingface.co/login

2. Accept the model license (for gated models like Llama):
   - Visit the model page (e.g., https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
   - Click "Agree and access"

3. Get your access token:
   - Go to https://huggingface.co/settings/tokens
   - Create a new token with "read" permissions
   - Copy the token

4. Set the token as an environment variable:

   ```bash
   # Linux/macOS
   export HF_TOKEN=hf_xxxxx

   # Windows PowerShell
   $env:HF_TOKEN="hf_xxxxx"

   # Windows CMD
   set HF_TOKEN=hf_xxxxx
   ```

5. Run ovllm:

   ```bash
   ovllm pull meta-llama/Llama-3.2-3B-Instruct
   ```
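Under the hood, HuggingFace Hub authenticates gated downloads with a bearer token. A minimal sketch of how an `HF_TOKEN` is typically attached to Hub HTTP requests (illustrative only, not ovllm's code):

```python
import os
from typing import Dict

def hf_auth_headers() -> Dict[str, str]:
    # HuggingFace Hub expects "Authorization: Bearer <token>" on
    # requests for gated/private repositories; anonymous requests
    # simply omit the header.
    token = os.environ.get("HF_TOKEN")
    return {"Authorization": f"Bearer {token}"} if token else {}

os.environ["HF_TOKEN"] = "hf_xxxxx"  # placeholder token
print(hf_auth_headers())  # {'Authorization': 'Bearer hf_xxxxx'}
```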
Ovllm provides an OpenAI-compatible API:
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat completion |
| `POST /v1/completions` | Text completion |
| `GET /v1/models` | List available models |
| `POST /api/pull` | Pull a model |
| `GET /api/tags` | List local models |
| `DELETE /api/delete` | Delete a model |
| `POST /api/generate` | Generate completion (Ollama-compatible) |
| `POST /api/chat` | Chat completion (Ollama-compatible) |
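The Ollama-compatible endpoints accept Ollama's request shapes. A small sketch that builds such a body (assuming ovllm mirrors Ollama's `/api/generate` request format of `{"model": ..., "prompt": ..., "stream": ...}`, which this README does not spell out):

```python
import json

def generate_request(model: str, prompt: str, stream: bool = False) -> str:
    """Serialize a request body in the shape Ollama's /api/generate uses."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

print(generate_request("meta-llama/Llama-2-7b-chat-hf", "Hello!"))
```

POST the resulting body to `http://localhost:11434/api/generate` with `Content-Type: application/json`, e.g. via `curl` or `requests`.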
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
```

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

```bash
# Start Ovllm + OpenWebUI
docker-compose up -d
```

```bash
# Build image
docker build -t ovllm .

# Run Ovllm
docker run -d \
  -p 11434:11434 \
  -v ovllm_models:/root/.ovllm/models \
  --gpus all \
  ovllm

# Run OpenWebUI
docker run -d \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main
```

```
┌─────────────────────────────────────────────────────────────┐
│                         Ovllm CLI                           │
│             run / pull / serve / list / rm / show           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                       Model Manager                         │
│  - Downloads from HuggingFace                               │
│  - Manages local cache (~/.ovllm/models)                    │
│  - Handles authentication                                   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                        vLLM Engine                          │
│  - AsyncLLMEngine                                           │
│  - Continuous batching                                      │
│  - PagedAttention                                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  OpenAI-Compatible API                      │
│  - /v1/chat/completions                                     │
│  - /v1/completions                                          │
│  - /v1/models                                               │
└─────────────────────────────────────────────────────────────┘
```
Ensure you have the correct model ID from HuggingFace Hub:

```bash
ovllm pull meta-llama/Llama-2-7b-chat-hf
```

Reduce GPU memory usage:

```bash
export OVLLM_GPU_MEMORY=0.7
ovllm serve
```

For private models, set your HuggingFace token:

```bash
export HF_TOKEN=hf_xxx
ovllm pull private-model
```

Apache 2.0
- vLLM for the inference engine
- Ollama for the UX inspiration
- HuggingFace for the model hub





