
VLM Node

VLM Node is designed to process images captured by smart glasses in retail environments. By leveraging Vision Language Models (VLM) and Large Language Models (LLM), it intelligently determines when specific tasks start and end within the store, enabling automated task tracking and analysis.

Community

Discord

Features

  • Vision Language Model Integration: Uses Ollama for image analysis
  • Job Queue System: Asynchronous job processing with PostgreSQL backend
  • REST API: HTTP API for job submission and status tracking
  • Docker Support: Full containerization with Docker Compose
  • Kubernetes Ready: Helm charts for production deployment

Prerequisites

Before you begin, make sure you have the following installed on your system:

  • Docker and Docker Compose
  • Ollama (for running local language/vision models)

Configuration

Environment Variables

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| POSTGRES_URL | PostgreSQL connection string | - | Yes |
| VLM_MODEL | Vision model used for analyzing images and detecting task events | moondream:1.8b | Yes |
| LLM_MODEL | Language model used for interpreting and reasoning about detected events | llama3:latest | Yes |
| OLLAMA_HOST | Ollama server URL | http://localhost:11434 | Yes |
| DATA_DIR | Directory for storing data | - | Yes |
| API_URL | External API URL | - | Yes |
| DDS_URL | Data delivery service URL | - | Yes |
| CLIENT_ID | Client identifier; any string that helps us identify you | vlm-node | Yes |
| POSEMESH_EMAIL | Email for external service | - | Yes |
| POSEMESH_PASSWORD | Password for external service | - | Yes |
| IMAGE_BATCH_SIZE | Number of images to process in a batch | 5 | No |
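
For reference, a sample .env.local might look like the following. Every value here is a placeholder and must be adapted to your environment; in particular, API_URL and DDS_URL depend on the external services you connect to.

# Sample .env.local (placeholder values only)
POSTGRES_URL=postgres://user:password@localhost:5432/vlm_node
VLM_MODEL=moondream:1.8b
LLM_MODEL=llama3:latest
OLLAMA_HOST=http://localhost:11434
DATA_DIR=./data
API_URL=https://api.example.com
DDS_URL=https://dds.example.com
CLIENT_ID=vlm-node
POSEMESH_EMAIL=you@example.com
POSEMESH_PASSWORD=changeme
IMAGE_BATCH_SIZE=5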

Model Configuration

The system supports various Ollama models. Browse the available models at https://ollama.com/search and check each model's supported inputs; the vision model must accept image input.

To use different models, update the VLM_MODEL and LLM_MODEL environment variables.
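
For example, you can pre-pull the models before starting the services. This is a minimal sketch using the Ollama CLI; the model names below are simply the defaults listed above, and any suitable model from the Ollama library can be substituted.

# Pre-download the default models on the host running Ollama
ollama pull moondream:1.8b
ollama pull llama3:latest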

Production Considerations

  • Use a managed PostgreSQL database
  • Set up proper SSL/TLS certificates
  • Configure resource limits and requests
  • Set up monitoring and logging
  • Use secrets management for sensitive data
  • Configure backup strategies

Quick Start

Using Docker Compose (Recommended)

  1. Clone the repository:
git clone git@github.com:aukilabs/vlm-node.git
cd vlm-node
  2. Set up environment variables: Create a .env.local file with your configuration:
POSEMESH_EMAIL=
POSEMESH_PASSWORD=
  3. Start all services:

(Optional) For GPU support, install the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation

make docker-cpu
# or
make docker-gpu

Note:
shm_size should be at least the size of the model, ideally 1.5–2× for safety.

In docker-compose.yml, adjust the shm_size parameter under the ollama-gpu service. For example:

  ollama-gpu:
    ...
    shm_size: 16gb

Set this value according to the requirements of the models you intend to use. Insufficient shared memory may cause model loading or inference to fail.

  4. Verify the setup:
# Check if services are running
docker compose ps

# Test the API
curl "http://localhost:8080/api/v1/jobs?limit=10"

Local Development

1. Start the Server

# starts Ollama in CPU-only mode by default
make server

2. Start the Worker

# starts Ollama in CPU-only mode by default
make worker

Usage

Submitting Jobs

Submit a job using the REST API:

curl -X POST http://localhost:8080/api/v1/jobs \
    -H "Content-Type: application/json" \
    -d '{
        "job_type": "task_timing_v1",
        "query": {"ids": []},
        "domain_id": "",
        "input": {
            "prompt": "Analyze this image for task completion",
            "webhook_url": "",
            "vlm_prompt": "Describe what you see in this image"
        }
    }'

Checking Job Status

# List all jobs
curl "http://localhost:8080/api/v1/jobs?limit=100"

# Get specific job details
curl "http://localhost:8080/api/v1/jobs/{job_id}"

Real-Time Image Inference

You can perform real-time image inference by connecting to the WebSocket endpoint at ws://localhost:8080/api/v1/ws (or wss://domain.com/api/v1/ws for secure connections).

Note: None of the images or results are persisted.

Protocol Overview:

  • Image Upload: Send image data as binary messages over the WebSocket. The server will process images in batches of size IMAGE_BATCH_SIZE or after a 10-second timeout, whichever comes first.
  • Prompt Submission: Send the prompt as a UTF-8 encoded text message.
  • Server Response: The server returns inference results as binary WebSocket messages containing a JSON object: {"done": <bool>, "response": <string>}

Note:
If your client is not written in JavaScript (where the browser handles this automatically), you must respond to the server's ping messages with pong messages to keep the connection alive.

Example (TypeScript):

let websocketInstance: WebSocket | null = null;

export function initializeWebSocket(): WebSocket {
    const url = process.env.COMPUTE_NODE_URL;
    if (!url) {
        throw new Error("COMPUTE_NODE_URL environment variable is not set");
    }
    if (websocketInstance) {
        return websocketInstance;
    }
    websocketInstance = new WebSocket(url);
    console.log("WebSocket URL: ", url);

    websocketInstance.onopen = () => {
        console.log("WebSocket connected");
        websocketInstance.send("Describe the art work you see in the photo.");
    }

    let response = "";
    websocketInstance.onmessage = (event) => {
        try {
            let bufferPromise: Promise<ArrayBuffer>;
            if (event.data instanceof ArrayBuffer) {
                bufferPromise = Promise.resolve(event.data);
            } else if (event.data instanceof Blob) {
                bufferPromise = event.data.arrayBuffer();
            } else {
                bufferPromise = Promise.resolve(new TextEncoder().encode(event.data).buffer);
            }

            bufferPromise.then((buffer) => {
                // Try to decode as UTF-8 string
                let text: string;
                try {
                    text = new TextDecoder("utf-8").decode(buffer);
                } catch (e) {
                    console.error("Failed to decode WebSocket binary message as UTF-8", e);
                    return;
                }
                // Try to parse as JSON
                try {
                    const parsed = JSON.parse(text);
                    if (
                        typeof parsed === "object" &&
                        parsed !== null &&
                        typeof parsed.response === "string" &&
                        typeof parsed.done === "boolean"
                    ) {
                        response += parsed.response;
                        if (parsed.done) {
                            console.log("Compute Node response done:", response);
                            response = "";
                        }
                    } else {
                        console.warn("Received message is not in expected format:", parsed);
                    }
                } catch (e) {
                    console.error("Failed to parse WebSocket message as JSON", e, "Raw text:", text);
                }
            });
        } catch (err) {
            console.error("Error handling WebSocket binary message", err);
        }
    }
    websocketInstance.onclose = () => {
        websocketInstance = null;
        console.log("WebSocket closed");
    }
    websocketInstance.onerror = (event) => {
        console.error("WebSocket error", event);
    }
    return websocketInstance;
}

// PhotoData is assumed here to carry a filename and the raw image bytes.
interface PhotoData {
    filename: string;
    buffer: ArrayBuffer;
}

export function sendPhotoToComputeNode(photo: PhotoData): void {
    // Reuse the existing connection, or open one if none exists yet.
    const ws = websocketInstance ?? initializeWebSocket();
    console.log("[STREAMING] Sending photo to Compute Node", photo.filename);
    ws.send(photo.buffer);
}

API Reference

Jobs Endpoint

  • GET /api/v1/jobs - List jobs
  • POST /api/v1/jobs - Create a new job
  • GET /api/v1/jobs/{id} - Get job details
  • PUT /api/v1/jobs/{id} - Retry a job

Troubleshooting

Common Issues

  1. Server doesn't start: If the model has not been pulled into Ollama yet, the server may need to download it first, which can take some time. To check the progress, run docker compose logs -f ollama-cpu and look for messages indicating that the model is being downloaded.
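
The commands below show one way to check on this; the exec command assumes the CPU Ollama service name from docker-compose.yml (use ollama-gpu for the GPU setup).

# Follow Ollama's logs to watch the model download
docker compose logs -f ollama-cpu

# List the models already pulled inside the Ollama container
docker compose exec ollama-cpu ollama list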

Logs

View logs for specific services:

docker compose logs server
docker compose logs worker
docker compose logs ui
docker compose logs postgres
docker compose logs ollama-gpu
docker compose logs ollama-cpu
