Super simple Python connectors for llama.cpp, including vision models (Gemma 3, Qwen2-VL). Compile llama.cpp and run!
This project provides lightweight Python connectors to easily interact with llama.cpp models, supporting both standard text models and multimodal vision models (currently Gemma 3 and Qwen2-VL). It creates a simple framework to build applications on top of llama.cpp while handling the complexity of model configuration, server management, and inference.
The idea behind it is simple: to offer a minimalistic environment for Python coders to directly interact with llama.cpp without intermediaries like Ollama or LMStudio. Just go to HuggingFace, download your models, use the connectors and have fun!
- 🚀 Easy to use: Only two Python classes to interact with local LLMs: `LlamaServerConnector` and `LlamaVisionConnector`
- 🖼️ Vision model support: Ready-to-use connectors for Gemma 3 and Qwen2-VL vision models
- 🔄 OpenAI-compatible API: Use the `LlamaServerConnector` with the OpenAI Python client
- ⚙️ Configurable: Simple JSON-based configuration for all model parameters
- 🐳 Docker ready: Build once, prepare your container, run `docker commit`, and your LLM-powered app is done
- 🧠 Great for pros: A perfect sandbox for those familiar with llama.cpp!
This component provides an OpenAI-compatible server interface for text-based models:
- Provides methods to start, manage, and kill `llama-server` instances directly from Python
- Automatically starts and manages a llama.cpp server
- Finds available ports dynamically
- Configures model parameters from JSON
- Provides a simple API to send prompts and get responses
- Compatible with the OpenAI Python client
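The dynamic port discovery mentioned above can be sketched in a few lines. This is a minimal illustration (the connector's actual implementation may differ), using the common trick of binding to port 0 so the OS assigns a free port:

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an available TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 -> the OS picks any free port
        return s.getsockname()[1]

port = find_free_port()
print(f"llama-server could then be launched with --port {port}")
```

The returned port is free at the moment of the call; a robust implementation would still handle the (rare) race where another process grabs it before the server starts.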
For multimodal vision models (Gemma 3, Qwen2-VL):
- Process images with text prompts
- Automatically handles configuration for different vision models
- Asynchronous API for efficient processing
- Supports custom prompts or prompt files
Simplifies dependency management and deployment:
- Multi-stage build for optimal image size
- Automatically detects CUDA architecture
- Sets up all required dependencies
- Creates a ready-to-use environment with Python and libraries
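The CUDA architecture detection can be sketched roughly as below. This is an illustration, not the actual build script: it assumes a driver recent enough to support `nvidia-smi --query-gpu=compute_cap`, and falls back to a placeholder value when no GPU is present:

```shell
#!/bin/sh
# Query the GPU's compute capability (e.g. "8.6" for an RTX 30-series card);
# fall back to a placeholder for illustration when no GPU is available
if command -v nvidia-smi >/dev/null 2>&1; then
    CAP=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
else
    CAP="8.6"  # placeholder value, no GPU detected
fi

# Convert "8.6" into the "86" form expected by CMAKE_CUDA_ARCHITECTURES
ARCH=$(printf '%s' "$CAP" | tr -d '. ')
echo "Detected CUDA architecture: ${ARCH}"
# cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="${ARCH}"
```

Pinning `CMAKE_CUDA_ARCHITECTURES` to a single value is what keeps the compilation fast and the resulting image small: the build skips code generation for every other GPU generation.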
The easiest way to get started is with the Docker container.
A big part of why I built this was to have a very simple llama.cpp/Python sandbox that can track new releases of the llama.cpp repository faster than bindings such as llama-cpp-python can.
The simple Dockerfile here should do the trick. (Important note: the Dockerfile assumes you are using NVIDIA GPUs.)
```bash
# Clone the repository
git clone https://github.com/yourusername/llama-cpp-connector.git
cd llama-cpp-connector

# Run the build script
chmod +x build-docker-container.sh
./build-docker-container.sh

# Run the container with GPU access
docker run --gpus all -it llama-cpp-connector:latest
```

The `build-docker-container.sh` script automatically detects the CUDA compute capability of your GPU and compiles llama.cpp specifically for that architecture, keeping compilation as fast as possible and the container size minimal.
If you prefer not to use Docker:

1. Compile llama.cpp from source and make sure all binaries are in your PATH:

   ```bash
   git clone https://github.com/ggml-org/llama.cpp.git
   cd llama.cpp
   mkdir build && cd build
   cmake .. -DLLAMA_CUDA=ON
   cmake --build . --config Release
   # Add the built binaries to your PATH
   ```

2. Install the Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```
```python
from llama_server_connector import LlamaServerConnector

# Initialize the server with a text model
connector = LlamaServerConnector(
    config_path="config/models.json",
    model_key="DEEPSEEK-R1-QWEN-14B"
)

# Get a response
## Note: the get_response method only returns the response string.
## If you want an OpenAI completions object, send an OpenAI request to the server at http://{host}:{port}/v1
response = connector.get_response("Explain quantum computing in simple terms")
print(response)

# When done
connector.kill_server()
```

```python
import asyncio
from llama_vision_connector import LlamaVisionConnector

async def process_image():
    # Initialize with a vision model
    vision = LlamaVisionConnector(
        config_path="config/models.json",
        model_key="GEMMA3_12B"
    )

    # Process an image with the default or a custom prompt
    ## Note: the get_response method only returns the response string.
    ## Vision models in llama.cpp are currently handled via the CLI interface,
    ## so this is NOT an OpenAI-compatible interaction.
    description = await vision.get_response(
        "path/to/image.jpg",
        prompt="Describe this image in detail"
    )
    print(description)

# Run the async function
asyncio.run(process_image())
```

Models are configured in `config/models.json`:
```json
{
  "MODELS": {
    "GEMMA3_12B": {
      "CLI_CMD": "llama-gemma3-cli",
      "MODEL_PATH": "models/gemma-3-12b-it-Q6_K_L.gguf",
      "MMPROJ_PATH": "models/mmproj-gemma3-12b-it-f32.gguf",
      "TEMPERATURE": 0.3,
      "NUM_LAYERS_TO_GPU": 99
      // Other parameters...
    }
    // Additional models...
  }
}
```

While optional, the Docker container is the heart of this project, providing:
- Automatic CUDA detection: Builds with the optimal settings for your GPU
- Compiled binaries: Builds llama.cpp from source with all optimizations
- Python environment: Pre-configured with all (minimal) required libraries
- Ready-to-use framework: Just add your models and start developing
- Eliminates dependency issues: All libraries and tools are pre-installed
- GPU-optimized: Builds specifically for your GPU architecture
- Reproducible: Same environment on any machine
- Easy model management: Just mount your model folder
- Portable: Run anywhere Docker and CUDA are supported
- Always pulls latest llama.cpp: No more waiting for llama-cpp-python to update!
To use your own models, just mount your models directory:
```bash
docker run --gpus all -v /path/to/your/models:/workspace/models -it llama-cpp-connector:latest
```

1. Place your GGUF model files in the `models/` directory
2. Update `config/models.json` with your model configuration
3. Use the appropriate connector based on model type (text or vision)
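For illustration, here is one way a `models.json` entry could be read in Python. `load_model_config` is a hypothetical helper, not part of the connectors' API (and note that strict JSON parsers reject `//` comments, so keep the real file comment-free):

```python
import json

def load_model_config(config_path: str, model_key: str) -> dict:
    """Load one model's settings from a models.json-style config file."""
    with open(config_path) as f:
        config = json.load(f)
    models = config["MODELS"]
    if model_key not in models:
        raise KeyError(f"Unknown model key {model_key!r}; available: {list(models)}")
    return models[model_key]

# Example:
# cfg = load_model_config("config/models.json", "GEMMA3_12B")
# print(cfg["MODEL_PATH"], cfg.get("TEMPERATURE"))
```

Keeping all model parameters in one JSON file means you can switch models or tweak sampling settings without touching any Python code.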
- Simplified integration: No need to directly interface with llama.cpp's C++ code
- Simplified configuration: Change model behavior just by editing a JSON script
- Docker-first approach: Consistent environment across development and deployment
- Minimal dependencies: Just have llama.cpp binaries and you're done
- OpenAI-compatible: Use familiar APIs with local models
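As a closing illustration of that OpenAI compatibility: once `LlamaServerConnector` has started a server, any OpenAI-style client can talk to it. The sketch below builds a standard `/v1/chat/completions` request using only the standard library; the base URL and model name are placeholders to substitute with your own values:

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "local-model") -> dict:
    """Build a standard OpenAI chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
    }

def ask(base_url: str, prompt: str) -> str:
    """POST the payload to an OpenAI-compatible /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires a running llama-server):
# print(ask("http://localhost:8080", "Explain quantum computing in one sentence"))
```

The official `openai` Python client works the same way: point its `base_url` at the local server and use `chat.completions.create` as you would against the hosted API.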