LLM Lab (llama.cpp + MLX + React)

This project serves LLMs via llama.cpp's llama-server (GGUF format) and MLX Python API (MLX format), providing a React UI for configuration, chat, and monitoring.

  • Client (UI): frontend/ (React + Vite)
  • Server (Inference/Router):
    • llama.cpp/build/bin/llama-server (GGUF models, Router Mode recommended)
    • mlx/server-python-direct.py (MLX models, Python FastAPI + mlx_lm library)
  • Desktop (optional): Electron wrapper (npm run desktop)

Project Structure

Architecture Overview

This project consists of 4 independent servers and a frontend client:

  1. GGUF Server (Port 8080): Uses llama.cpp's llama-server to serve GGUF format models
  2. MLX Server (Port 8081): Uses MLX Python API (mlx_lm) to serve MLX format models
  3. Authentication Server (Port 8082): Handles login/logout/setup; runs independently of the model servers
  4. Client Server Manager (Port 8083): Automatically manages model servers in client-only mode
  5. Frontend (Port 5173): React + Vite based web UI
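
Once everything is running (see Run (Development) below), each service can be probed directly on its port; these are the health/status endpoints documented in the API summary:

# Assumes the full stack has been started and a model is configured
curl -s http://localhost:8080/health        # GGUF server (llama-server)
curl -s http://localhost:8081/health        # MLX server
curl -s http://localhost:8082/auth/status   # authentication server
# Port 8083 (Client Server Manager) only exposes POST /api/save-config
# Port 5173 serves the React UI: open http://localhost:5173 in a browser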

Directory Structure

llm-server/
├─ frontend/                      # React (Vite) client
│  ├─ src/
│  │  ├─ components/              # React components
│  │  ├─ pages/                   # Page components
│  │  ├─ contexts/                # React Context (Auth, etc.)
│  │  └─ services/                # API service layer
│  └─ package.json
│
├─ llama.cpp/                     # llama.cpp (Git submodule)
│  └─ build/bin/llama-server      # GGUF server executable
│
├─ mlx/                           # MLX server (Python-based)
│  ├─ server-python-direct.py    # MLX HTTP/WebSocket server (port 8081, FastAPI)
│  ├─ requirements.txt            # Python dependencies
│  ├─ venv/                       # Python virtual environment
│  └─ models/                      # MLX model directory
│
├─ native/                        # Metal VRAM monitoring native module
│  ├─ src/
│  │  └─ vram_monitor.cpp         # macOS Metal VRAM monitoring
│  └─ binding.gyp                 # node-gyp build configuration
│
├─ auth-server.js                 # Authentication server (port 8082)
├─ start-client-server.js          # Client server manager (port 8083)
├─ mlx-verify-proxy.js            # MLX model verification proxy (port 8084)
│
├─ config.json                     # Client model configuration (localStorage sync)
├─ models-config.json              # Server-side model load options (router mode)
├─ .auth.json                      # Super admin password hash (PBKDF2, gitignore)
│
├─ main.js / preload.js            # (Optional) Electron wrapper
└─ package.json                    # Run/build scripts

Server Components

1. GGUF Server (Port 8080)

  • Executable: llama.cpp/build/bin/llama-server
  • Native Library: llama.cpp (C++ implementation)
  • Functionality: GGUF format model loading and inference
  • Startup: Auto-started by start-client-server.js or manually executed

2. MLX Server (Port 8081)

  • Server File: mlx/server-python-direct.py (FastAPI-based Python HTTP/WebSocket server)
  • Framework: FastAPI + Uvicorn
  • Functionality: MLX format model loading and inference with real-time WebSocket streaming
  • Startup: Auto-started by start-client-server.js (uses venv Python) or manually executed
  • Note: The Python FastAPI-based server uses the mlx_lm library to reliably load models and perform inference, supporting real-time streaming via WebSocket.

3. Authentication Server (Port 8082)

  • Server File: auth-server.js
  • Functionality: User authentication (login/logout/setup), super admin account management
  • Independence: Runs independently from model servers (login works even if model servers are down)

4. Client Server Manager (Port 8083)

  • Server File: start-client-server.js

  • Role: Central process that automatically starts and supervises the model servers in client-only mode (frontend running in a browser without Electron)

  • Key Features:

    1. Automatic Server Startup/Management

      • Reads config.json on initial load and automatically starts both GGUF and MLX servers
      • Monitors config.json file changes (fs.watchFile)
      • Starts appropriate server based on model format (GGUF → port 8080, MLX → port 8081)
    2. Frontend Configuration Synchronization

      • Handles model configuration save requests from frontend via /api/save-config endpoint
      • Saves received configuration to config.json file
      • Automatically starts servers if needed after saving configuration
    3. Server Process Management

      • GGUF Server: Spawns and manages llama.cpp/build/bin/llama-server process
      • MLX Server: Spawns and manages mlx/server-python-direct.py Python process (uses venv)
      • Handles server process termination and restart
    4. Client Mode Support

      • Required when running only the frontend in a browser, without Electron
      • The frontend cannot start servers directly, so a separate Node.js process manages them
      • Acts as a bridge between frontend and servers
  • Use Cases:

    • Development: Auto-starts with frontend when running npm run client:all
    • Production: Run as a separate process to allow frontend to control servers

5. Frontend (Port 5173)

  • Framework: React + Vite
  • Functionality:
    • Model configuration and management UI
    • Chat interface
    • Real-time performance monitoring
    • Server log streaming

Native Modules

llama.cpp

  • Location: llama.cpp/ (Git submodule)
  • Purpose: GGUF model loading and inference
  • Build: Uses CMake to generate llama-server executable

MLX Python Library

  • Installation: pip install mlx-lm mlx
  • Purpose: MLX model loading and inference (Apple Silicon optimized)
  • Integration: Used directly in mlx/server-python-direct.py (FastAPI server)

Native VRAM Monitor

  • Location: native/src/vram_monitor.cpp
  • Purpose: macOS Metal VRAM usage monitoring
  • Build: Uses node-gyp to build as Node.js native module

Key Features

  • Dual Model Format Support (GGUF/MLX)
    • GGUF Format: Uses llama.cpp's llama-server for GGUF models (Port 8080)
    • MLX Format: Uses MLX Python API (mlx_lm) for MLX models (Port 8081, Apple Silicon optimized)
    • Dual Server Architecture: Both servers run simultaneously; frontend automatically selects the correct port based on model format
    • No server restart required when switching models - frontend simply changes the endpoint
  • Login (Super Admin)
    • Separate authentication server (Port 8082) - login works even if model servers are down
    • Create a super-admin account on first run → then log in on subsequent runs
    • Password is stored in .auth.json as a PBKDF2 hash (no plaintext storage)
  • Model Configuration & Management
    • Add/edit/delete multiple models by Model ID (= model name in router mode)
    • Model format selection (GGUF/MLX) with automatic path validation
    • When saving settings, the UI sends model load settings to the server and applies them via router unload/load
  • GGUF Metadata (Quantization) Display
    • Reads GGUF via POST /gguf-info and shows a summary of quantization / tensor types / QKV types
    • When loading settings, if a model ID exists, metadata is fetched automatically and the summary is shown
  • MLX Python Implementation
    • Python mlx_lm library integration via FastAPI server
    • Automatic model loading, sharding, and weight merging
    • Built-in tokenization support
    • Sampling strategies (Temperature, Top-P, Repetition Penalty)
    • Multiple streaming options:
      • HTTP POST with SSE streaming (/chat, /completion)
      • WebSocket real-time streaming (/chat/ws)
    • Real-time metrics streaming via WebSocket (/metrics/stream)
    • Real-time log streaming via WebSocket (/logs/stream)
    • Model loading progress tracking with memory usage monitoring
    • llama.cpp API compatibility (/completion endpoint)
  • Real-time Performance Metrics (Push-based)
    • Updates VRAM / memory / CPU / token speed via GET /metrics/stream (SSE) without polling
  • GPU “Utilization (Compute Busy %)”
    • The current GPU gauge in the UI shows VRAM occupancy (%), not actual GPU compute utilization.
    • llama.cpp’s default metrics do not expose a cross-platform “GPU busy %”, so additional implementation is required for Linux production.
    • (Future) On Linux, these approaches are practical (see the command sketch after this list):
      • NVIDIA: NVML (e.g., nvmlDeviceGetUtilizationRates, nvmlDeviceGetMemoryInfo) to collect gpuUtil%/vramUsed/vramTotal
      • AMD: rocm-smi / libdrm / sysfs-based collection
      • Intel: intel_gpu_top / sysfs-based collection
    • Recommended implementation location: integrate into server-side metrics collection (server task) and expose via /metrics and /metrics/stream.
  • Guide Page
    • Provides example cards for Curl / JS / React / Python / Java / C# / C++ (collapsed by default)
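
Referring to the GPU utilization item above: on an NVIDIA Linux host, the counters a future collector would gather (compute busy %, VRAM used/total) can be previewed on the command line with nvidia-smi, which wraps NVML. This is only a reference for what a server-side collector should report, not part of the current implementation:

# Prints GPU compute utilization (%) and used/total VRAM (MiB), refreshed every second
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
  --format=csv,noheader,nounits -l 1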

Server Architecture

The application uses a multi-server architecture with separate ports for different services:

Default Ports

  • GGUF Server: Port 8080 (default)
    • Uses llama.cpp's llama-server for GGUF models
    • Started via start-client-server.js with --port 8080 flag
  • MLX Server: Port 8081 (default)
    • Uses MLX Python API (mlx_lm) server for MLX models
    • Configured in mlx/server-python-direct.py (default: 8081, via PORT env var)
  • Authentication Server: Port 8082 (default)
    • Handles login/logout/setup independently
    • Runs even when model servers are down
  • Client Server Manager: Port 8083 (default)
    • Manages model servers in client-only mode
    • Receives config updates from frontend via /api/save-config

Port Configuration

  • GGUF Server Port: Can be changed via LLAMA_PORT environment variable (default: 8080)
  • MLX Server Port: Configured via PORT environment variable in mlx/server-python-direct.py (default: 8081)
  • Authentication Server Port: Hardcoded in auth-server.js (default: 8082)
  • Client Server Manager Port: Hardcoded in start-client-server.js (default: 8083)

Both GGUF and MLX servers run simultaneously. The frontend automatically selects the correct port based on the selected model's format.

Server API Summary

llama.cpp Server (GGUF Models) - Port 8080

  • Health: GET /health
  • Models (Router Mode): GET /models
  • Model control (Router Mode)
    • POST /models/load / POST /models/unload
    • GET /models/config / POST /models/config (stored in server-side models-config.json)
  • Completion: POST /completion (streaming SSE)
  • Metrics
    • GET /metrics (Prometheus text)
    • GET /metrics/stream (SSE, for the real-time panel)
  • GGUF info: POST /gguf-info (server reads a GGUF file and returns metadata JSON)
  • Tokenization: POST /tokenize (text to tokens)
  • Logs: GET /logs/stream (SSE, server logs)
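
These endpoints can be exercised with curl against a llama-server running on port 8080. The /completion and /tokenize bodies below follow the upstream llama.cpp request shape ("stream": true selects SSE) and are illustrative rather than exhaustive:

curl -s http://localhost:8080/health
curl -s http://localhost:8080/models
curl -s -N -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world", "n_predict": 16, "stream": true}'
curl -s -X POST http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello, world"}'
curl -s -N http://localhost:8080/metrics/stream   # SSE feed used by the real-time panel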

MLX Server (MLX Models) - Port 8081

  • Health: GET /health - Server health status
  • Models: GET /models - List available models
  • Chat Completion:
    • POST /chat - HTTP POST with SSE streaming response
    • WebSocket /chat/ws - WebSocket-based real-time chat with token-by-token streaming
  • Completion (llama.cpp compatible): POST /completion - SSE streaming (compatible with llama.cpp API)
  • Metrics:
    • GET /metrics - Current VRAM and performance metrics (JSON)
    • WebSocket /metrics/stream - Real-time metrics streaming via WebSocket (for the real-time panel)
  • Tokenization: POST /tokenize - Text to tokens conversion
  • Logs: WebSocket /logs/stream - Real-time server logs streaming via WebSocket
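
A similar smoke test for the MLX server with curl; the /completion body reuses the llama.cpp-compatible shape, while the WebSocket routes (/chat/ws, /metrics/stream, /logs/stream) require a WebSocket client rather than curl:

curl -s http://localhost:8081/health
curl -s http://localhost:8081/models
curl -s http://localhost:8081/metrics             # current VRAM / performance snapshot (JSON)
curl -s -N -X POST http://localhost:8081/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello from MLX", "n_predict": 16}'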

Authentication Server - Port 8082

  • Status: GET /auth/status
  • Setup: POST /auth/setup (create super-admin account)
  • Login: POST /auth/login
  • Logout: POST /auth/logout
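
A minimal first-run flow against the authentication server. The request bodies are not documented here, so the JSON field names below (username/password) are assumptions; adjust them to the actual payloads expected by auth-server.js:

curl -s http://localhost:8082/auth/status
curl -s -X POST http://localhost:8082/auth/setup \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "change-me"}'   # hypothetical field names
curl -s -X POST http://localhost:8082/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "change-me"}'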

Client Server Manager - Port 8083

  • Save Config: POST /api/save-config
    • Model configuration save request from frontend
    • Request body: { models: [...], activeModelId: "..." }
    • Action: Saves to config.json file and automatically starts servers if needed
    • Response: { success: true }

Client Server Manager Role:

  • Automatically manages model servers in client-only mode (frontend running in a browser without Electron)
  • Acts as a bridge between frontend and servers, allowing frontend to control servers without Electron
  • Monitors config.json file and automatically starts/manages servers when model configuration changes
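
Saving a configuration by hand looks roughly like this. The top-level body shape ({ models, activeModelId }) is the documented one; the fields inside each model entry (id, format, path) are illustrative placeholders:

curl -s -X POST http://localhost:8083/api/save-config \
  -H "Content-Type: application/json" \
  -d '{
        "models": [
          { "id": "my-gguf-model", "format": "GGUF", "path": "./llama.cpp/models/my-gguf-model" }
        ],
        "activeModelId": "my-gguf-model"
      }'
# Expected response: { "success": true }; the manager writes config.json and starts servers as needed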

Installation

Common Requirements

  • Node.js: v18+
  • CMake + a working C/C++ toolchain

macOS

  • Xcode Command Line Tools:
xcode-select --install

Ubuntu/Debian

sudo apt update
sudo apt install -y build-essential cmake

Build

1) Install dependencies

npm install
npm install --prefix frontend

2) Build llama.cpp (llama-server)

cd llama.cpp
mkdir -p build && cd build
cmake ..
cmake --build . --config Release -j 8
cd ../..

3) MLX Server Setup (for MLX models)

The MLX server is Python FastAPI-based and supports real-time streaming via WebSocket.

# Create Python virtual environment and install dependencies
cd mlx
python3 -m venv venv
source venv/bin/activate  # macOS/Linux
# or venv\Scripts\activate  # Windows
pip install -r requirements.txt

# Run server (after activating venv)
python3 server-python-direct.py
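
# (Optional) sanity-check that the MLX packages installed before starting the server
pip show mlx mlx-lm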

Advantages:

  • ✅ mlx_lm automatically handles sharding, structure, and weight merging
  • ✅ Model structure updates via pip install -U mlx-lm only
  • ✅ Enhanced stability and maintainability
  • ✅ Full support for complex models like DeepSeek-MoE
  • ✅ WebSocket-based real-time streaming (tokens, metrics, logs)

Environment Variables:

# Specify model path (required)
export MLX_MODEL_PATH="./models/deepseek-moe-16b-chat-mlx-q4_0"
# Specify port (optional, default: 8081)
export PORT=8081
python3 server-python-direct.py

4) Build Native Modules

npm run build:native

This command builds:

  • native/: Metal VRAM monitor (macOS)

Run (Development)

Quick Start

The easiest way to run the entire project:

npm run client:all

This single command starts all required services:

  • Client Server Manager (port 8083): Automatically manages GGUF and MLX model servers
  • Auth Server (port 8082): Handles authentication (login/logout/setup)
  • Frontend (port 5173): React web UI

After starting, open your browser and navigate to: http://localhost:5173

The Client Server Manager will automatically start the GGUF server (port 8080) and MLX server (port 8081) based on your model configuration.

Detailed Run Options

Client Mode (Recommended - Standalone Frontend)

In client mode, start-client-server.js automatically manages both model servers (GGUF and MLX):

npm run client:all  # Run all services: frontend + client server manager + auth server

What gets started:

  • Client Server Manager (port 8083): Manages GGUF and MLX model servers
  • Auth Server (port 8082): Handles authentication (login/logout/setup)
  • Frontend (port 5173): React web UI

Individual service commands:

npm run client        # Run frontend only (port 5173)
npm run client:server # Run client server manager only (port 8083)
npm run client:auth   # Run authentication server only (port 8082)

Client Server Manager behavior:

  • On initial load, checks all models in config.json and starts both GGUF and MLX servers
  • On model change, saves config only without restarting servers (frontend automatically requests correct port)
  • Monitors config.json file changes and automatically manages servers

Note: When running npm run client:all, all three services (Client Server Manager, Auth Server, and Frontend) start together. The Auth Server is required for login functionality.

GGUF Server Only (Manual)

To run only the GGUF server manually:

npm run server

Default options (root package.json):

  • --port ${LLAMA_PORT:-8080}
  • --metrics
  • --models-dir "./llama.cpp/models"
  • --models-config "./models-config.json"
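
For reference, this is roughly the direct invocation that the npm script wraps (assuming llama-server was built in the step above):

./llama.cpp/build/bin/llama-server \
  --port "${LLAMA_PORT:-8080}" \
  --metrics \
  --models-dir "./llama.cpp/models" \
  --models-config "./models-config.json"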

Electron Mode (Optional)

For desktop application mode:

npm run desktop

This runs the Electron wrapper with the frontend bundled.


Configuration / Data Locations

Client configuration

  • In client-only mode (browser/Vite), model list and active model are stored in localStorage:
    • llmServerClientConfig
    • modelConfig (chat request parameters)

Server configuration

  • config.json: client-side model configuration (active model, model list, settings)
  • models-config.json: server-side per-model load options (e.g., contextSize, gpuLayers, modelFormat)
  • .auth.json: super-admin password hash file (PBKDF2, gitignored)

Model Directories

  • GGUF Models: llama.cpp/models/ (Model ID = directory name)
  • MLX Models: mlx/models/ (Model ID = directory name, must contain config.json and model weights)

Deployment (Packaging)

Electron packaging (optional):

npm run build

  • Builds the frontend production bundle (frontend/dist)
  • Packages via electron-builder
  • Includes the llama-server binary via extraResources

MLX Implementation Details

MLX Python API Integration

The MLX server uses Python's mlx_lm library:

  • Model Loading: Uses mlx_lm.load() which automatically handles model loading, sharding, and weight merging
  • Transformer Implementation: Handled by mlx_lm library
    • Multi-Head Attention
    • Feed Forward Network
    • Layer Normalization
  • Tokenization: Uses mlx_lm's built-in tokenizer
  • Sampling: Supports Temperature, Top-P, Repetition Penalty
  • Streaming:
    • Async token generation with multiple streaming options
    • HTTP POST with SSE streaming (/chat, /completion)
    • WebSocket real-time streaming (/chat/ws)
    • Real-time metrics and logs via WebSocket (/metrics/stream, /logs/stream)
  • API Compatibility:
    • Frontend UI integration via /chat endpoint
    • llama.cpp compatibility via /completion endpoint
    • WebSocket support for real-time applications

MLX Model Requirements

MLX model directory must contain:

  • config.json: Model configuration file
  • model.safetensors or *.gguf: Model weight files
  • tokenizer.json: Tokenization configuration (optional)
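
An illustrative layout that satisfies these requirements (directory and file names are placeholders); the directory name doubles as the Model ID used in the UI:

mlx/models/
└─ my-mlx-model/            # Model ID = "my-mlx-model"
   ├─ config.json           # required
   ├─ model.safetensors     # model weights
   └─ tokenizer.json        # optional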

MLX vs GGUF

Feature             GGUF (llama.cpp)      MLX
Platform            Cross-platform        macOS (Apple Silicon)
GPU Acceleration    CUDA/Metal/OpenCL     Metal (optimized)
Model Format        GGUF                  Safetensors (via mlx_lm)
Tokenization        Built-in              mlx_lm built-in tokenizer
Performance         High                  Very High (Apple Silicon)
Implementation      C++                   Python (mlx_lm)

Operations Tips

Reset super-admin (delete account)

To remove the super-admin account and re-create it, delete .auth.json:

rm -f ./.auth.json
# Restart authentication server (or restart the application)

The authentication server runs independently on port 8082, so login/setup works even if model servers are down.

Switching Between Model Formats

Dual Server Architecture: Both GGUF and MLX servers run simultaneously on different ports (8080 and 8081).

When you change the model in the dropdown:

  1. No server restart required - both servers are already running
  2. Frontend automatically selects the correct port based on model format:
    • GGUF models → Port 8080
    • MLX models → Port 8081
  3. All API calls (health check, chat, metrics) automatically use the correct endpoint
  4. PerformancePanel automatically reconnects to the correct metrics stream

Initial Load: On first load or page refresh, start-client-server.js automatically starts both servers if models are configured.

Client Server Manager Workflow:

  1. Frontend Start → Browser accesses http://localhost:5173
  2. Config Load → Frontend loads config from localStorage or calls /api/save-config
  3. Automatic Server Start → Client Server Manager reads config.json and starts required servers
    • If GGUF model exists → Start GGUF server on port 8080
    • If MLX model exists → Start MLX server on port 8081
  4. Model Switch → Frontend calls /api/save-config when model is selected
  5. Automatic Port Selection → Frontend automatically requests API to correct port based on selected model format
    • GGUF model selected → Use http://localhost:8080
    • MLX model selected → Use http://localhost:8081
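
In practice the request shape stays the same and only the port changes; for example, the llama.cpp-compatible /completion endpoint can be called on either server:

# GGUF model active → port 8080
curl -s -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" -d '{"prompt": "Hi", "n_predict": 8}'
# MLX model active → same request against port 8081
curl -s -X POST http://localhost:8081/completion \
  -H "Content-Type: application/json" -d '{"prompt": "Hi", "n_predict": 8}'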

Environment variables

  • Client
    • VITE_LLAMACPP_BASE_URL: API base URL (default http://localhost:8080)
  • Server
    • LLAMA_PORT: server port (default 8080)
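
As an example, pointing the stack at a non-default GGUF port could look like this (values are illustrative; assumes Vite's standard .env.local handling inside frontend/):

# Shell 1: GGUF server on a non-default port
export LLAMA_PORT=8090
npm run server
# Shell 2: tell the frontend where to find it, then start the UI
echo 'VITE_LLAMACPP_BASE_URL=http://localhost:8090' >> frontend/.env.local
npm run client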

Testing

MLX Server Testing

Automated test scripts are provided to test the MLX server's inference functionality.

Manual Server Execution

To run the MLX server manually:

cd mlx
source venv/bin/activate  # Activate virtual environment
python3 server-python-direct.py

The server runs on port 8081 by default.

Running Inference Tests

To run inference tests:

cd mlx
node test-inference.js

Test Script Behavior:

  1. Automatic Server Start: The test script automatically starts the MLX server if it's not already running.
  2. Health Check: Retries up to 3 times with 1-second intervals until the server is ready.
  3. Metrics Collection: Collects server performance metrics.
  4. Inference Request: Sends an inference request with a test prompt.
  5. Result Verification: Verifies inference results and records success/failure.

Test Configuration:

  • Server Port: 8081 (default)
  • Test Prompt: Configurable in test script
  • Health Check Retries: Maximum 3 attempts
  • Server Start Timeout: 30 seconds
  • Inference Timeout: 60 seconds

Test Results:

Upon completion, the test outputs:

  • Number of successful tests
  • Number of failed tests
  • Detailed logs for each test

Notes:

  • Ensure MLX models are properly configured in the mlx/models/ directory before running tests.
  • If the server is already running, the test script uses the existing server.
  • If the server crashes during testing, automatic restart is attempted.

GGUF Server Testing

GGUF server testing can be performed by sending requests directly to the running llama-server, for example with curl:

cd llama.cpp
# With llama-server running
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world", "n_predict": 10}'
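
A couple of additional quick checks against the same server (both endpoints are listed in the API summary above):

curl -s http://localhost:8080/health
curl -s http://localhost:8080/metrics | head      # Prometheus-format metrics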

Security Notes

The current UI login is a lightweight implementation intended for local development / single-user usage. For remote or multi-user production environments, you should additionally introduce TLS, proper authentication, durable session storage, and authorization controls.


(Future Work) Security Hardening Checklist

  • Transport security (TLS)
    • Expose externally via HTTPS only (reverse proxy/Ingress); consider mTLS for internal traffic
  • Authentication & session hardening
    • Strong random session tokens (e.g., 256-bit) + expiry/refresh (sliding/absolute) + server-side revocation on logout
    • Limit concurrent sessions; add delay/blocking on repeated auth failures (brute-force defense); audit logging
    • For browser deployment: consider cookie-based sessions (HttpOnly/Secure/SameSite) and CSRF protection
  • Authorization (RBAC) separation
    • Apply role-based policies for sensitive endpoints such as model management (/models/*, /models/config), log streaming (/logs/stream), and system metrics (/metrics*)
  • Password storage upgrade
    • Replace custom hashing with a standard KDF (recommended: Argon2id; alternatives: bcrypt/scrypt) + parameter upgrade strategy
  • Secrets & key management
    • Harden user_pw.json permissions (e.g., 600) and storage path; define backup/recovery procedures
    • Never log tokens/passwords/API keys; mask secrets in logs/errors
  • Rate limiting & resource limits
    • Apply per-IP/per-account rate limiting to /completion and streaming endpoints (/metrics/stream, /logs/stream)
    • Limit request body size; cap concurrent streams/requests (DoS mitigation)
  • Input validation & path safety
    • Restrict model paths/IDs to an allowlisted directory; prevent .. and symlink escape
    • Apply file size limits and timeouts when reading/parsing GGUF metadata
  • Operational hardening
    • Default bind to 127.0.0.1; expose externally only via a controlled proxy layer
    • Run as least-privileged user; use container isolation (AppArmor/SELinux); operate security updates and SCA scanning
