LLM Lab (llama.cpp + MLX + React)

This project serves LLMs via llama.cpp's llama-server (GGUF format) and MLX Python API (MLX format), providing a React UI for configuration, chat, and monitoring.

  • Client (UI): frontend/ (React + Vite)
  • Server (Inference/Router):
    • llama.cpp/build/bin/llama-server (GGUF models, Router Mode recommended)
    • mlx/server-python-direct.py (MLX models, Python FastAPI + mlx_lm library)
  • Desktop (optional): Electron wrapper (npm run desktop)

Project Structure

Architecture Overview

This project consists of 4 independent servers and a frontend client:

  1. GGUF Server (Port 8080): Uses llama.cpp's llama-server to serve GGUF format models
  2. MLX Server (Port 8081): Uses MLX Python API (mlx_lm) to serve MLX format models
  3. Authentication Server (Port 8082): Handles login/logout/setup; runs independently of the model servers
  4. Client Server Manager (Port 8083): Automatically manages model servers in client-only mode
  5. Frontend (Port 5173): React + Vite based web UI
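
Once everything is running (see Run (Development) below), each service can be probed directly on its port; these are the health/status endpoints documented in the API summary:

# Assumes the full stack has been started and a model is configured
curl -s http://localhost:8080/health        # GGUF server (llama-server)
curl -s http://localhost:8081/health        # MLX server
curl -s http://localhost:8082/auth/status   # authentication server
# Port 8083 (Client Server Manager) only exposes POST /api/save-config
# Port 5173 serves the React UI: open http://localhost:5173 in a browser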

Directory Structure

llm-server/
├─ frontend/                      # React (Vite) client
│  ├─ src/
│  │  ├─ components/              # React components
│  │  ├─ pages/                   # Page components
│  │  ├─ contexts/                # React Context (Auth, etc.)
│  │  └─ services/                # API service layer
│  └─ package.json
│
├─ llama.cpp/                     # llama.cpp (Git submodule)
│  └─ build/bin/llama-server      # GGUF server executable
│
├─ mlx/                           # MLX server (Python-based)
│  ├─ server-python-direct.py    # MLX HTTP/WebSocket server (port 8081, FastAPI)
│  ├─ requirements.txt            # Python dependencies
│  ├─ venv/                       # Python virtual environment
│  └─ models/                      # MLX model directory
│
├─ native/                        # Metal VRAM monitoring native module
│  ├─ src/
│  │  └─ vram_monitor.cpp         # macOS Metal VRAM monitoring
│  └─ binding.gyp                 # node-gyp build configuration
│
├─ auth-server.js                 # Authentication server (port 8082)
├─ start-client-server.js          # Client server manager (port 8083)
├─ mlx-verify-proxy.js            # MLX model verification proxy (port 8084)
│
├─ config.json                     # Client model configuration (localStorage sync)
├─ models-config.json              # Server-side model load options (router mode)
├─ .auth.json                      # Super admin password hash (PBKDF2, gitignore)
│
├─ main.js / preload.js            # (Optional) Electron wrapper
└─ package.json                    # Run/build scripts

Server Components

1. GGUF Server (Port 8080)

  • Executable: llama.cpp/build/bin/llama-server
  • Native Library: llama.cpp (C++ implementation)
  • Functionality: GGUF format model loading and inference
  • Startup: Auto-started by start-client-server.js or manually executed

2. MLX Server (Port 8081)

  • Server File: mlx/server-python-direct.py (FastAPI-based Python HTTP/WebSocket server)
  • Framework: FastAPI + Uvicorn
  • Functionality: MLX format model loading and inference with real-time WebSocket streaming
  • Startup: Auto-started by start-client-server.js (uses venv Python) or manually executed
  • Note: The Python FastAPI-based server uses the mlx_lm library to reliably load models and perform inference, supporting real-time streaming via WebSocket.

3. Authentication Server (Port 8082)

  • Server File: auth-server.js
  • Functionality: User authentication (login/logout/setup), super admin account management
  • Independence: Runs independently from model servers (login works even if model servers are down)

4. Client Server Manager (Port 8083)

  • Server File: start-client-server.js

  • Role: Central process that automatically starts and supervises the model servers in client-only mode (frontend running in a browser without Electron)

  • Key Features:

    1. Automatic Server Startup/Management

      • Reads config.json on initial load and automatically starts both GGUF and MLX servers
      • Monitors config.json file changes (fs.watchFile)
      • Starts appropriate server based on model format (GGUF → port 8080, MLX → port 8081)
    2. Frontend Configuration Synchronization

      • Handles model configuration save requests from frontend via /api/save-config endpoint
      • Saves received configuration to config.json file
      • Automatically starts servers if needed after saving configuration
    3. Server Process Management

      • GGUF Server: Spawns and manages llama.cpp/build/bin/llama-server process
      • MLX Server: Spawns and manages mlx/server-python-direct.py Python process (uses venv)
      • Handles server process termination and restart
    4. Client Mode Support

      • Required when running only the frontend in a browser, without Electron
      • The frontend cannot start servers directly, so a separate Node.js process manages them
      • Acts as a bridge between frontend and servers
  • Use Cases:

    • Development: Auto-starts with frontend when running npm run client:all
    • Production: Run as a separate process to allow frontend to control servers

5. Frontend (Port 5173)

  • Framework: React + Vite
  • Functionality:
    • Model configuration and management UI
    • Chat interface
    • Real-time performance monitoring
    • Server log streaming

Native Modules

llama.cpp

  • Location: llama.cpp/ (Git submodule)
  • Purpose: GGUF model loading and inference
  • Build: Uses CMake to generate llama-server executable

MLX Python Library

  • Installation: pip install mlx-lm mlx
  • Purpose: MLX model loading and inference (Apple Silicon optimized)
  • Integration: Used directly in mlx/server-python-direct.py (FastAPI server)

Native VRAM Monitor

  • Location: native/src/vram_monitor.cpp
  • Purpose: macOS Metal VRAM usage monitoring
  • Build: Uses node-gyp to build as Node.js native module

Key Features

  • Dual Model Format Support (GGUF/MLX)
    • GGUF Format: Uses llama.cpp's llama-server for GGUF models (Port 8080)
    • MLX Format: Uses MLX Python API (mlx_lm) for MLX models (Port 8081, Apple Silicon optimized)
    • Dual Server Architecture: Both servers run simultaneously; frontend automatically selects the correct port based on model format
    • No server restart required when switching models - frontend simply changes the endpoint
  • Login (Super Admin)
    • Separate authentication server (Port 8082) - login works even if model servers are down
    • Create a super-admin account on first run → then log in on subsequent runs
    • Password is stored in .auth.json as a PBKDF2 hash (no plaintext storage)
  • Model Configuration & Management
    • Add/edit/delete multiple models by Model ID (= model name in router mode)
    • Model format selection (GGUF/MLX) with automatic path validation
    • When saving settings, the UI sends model load settings to the server and applies them via router unload/load
  • GGUF Metadata (Quantization) Display
    • Reads GGUF via POST /gguf-info and shows a summary of quantization / tensor types / QKV types
    • When loading settings, if a model ID exists, metadata is fetched automatically and the summary is shown
  • MLX Python Implementation
    • Python mlx_lm library integration via FastAPI server
    • Automatic model loading, sharding, and weight merging
    • Built-in tokenization support
    • Sampling strategies (Temperature, Top-P, Repetition Penalty)
    • Multiple streaming options:
      • HTTP POST with SSE streaming (/chat, /completion)
      • WebSocket real-time streaming (/chat/ws)
    • Real-time metrics streaming via WebSocket (/metrics/stream)
    • Real-time log streaming via WebSocket (/logs/stream)
    • Model loading progress tracking with memory usage monitoring
    • llama.cpp API compatibility (/completion endpoint)
  • Real-time Performance Metrics (Push-based)
    • Updates VRAM / memory / CPU / token speed via GET /metrics/stream (SSE) without polling
  • GPU “Utilization (Compute Busy %)”
    • The current GPU gauge in the UI shows VRAM occupancy (%), not actual GPU compute utilization.
    • llama.cpp’s default metrics do not expose a cross-platform “GPU busy %”, so additional implementation is required for Linux production.
    • (Future) On Linux, these approaches are practical (see the command sketch after this list):
      • NVIDIA: NVML (e.g., nvmlDeviceGetUtilizationRates, nvmlDeviceGetMemoryInfo) to collect gpuUtil%/vramUsed/vramTotal
      • AMD: rocm-smi / libdrm / sysfs-based collection
      • Intel: intel_gpu_top / sysfs-based collection
    • Recommended implementation location: integrate into server-side metrics collection (server task) and expose via /metrics and /metrics/stream.
  • Guide Page
    • Provides example cards for Curl / JS / React / Python / Java / C# / C++ (collapsed by default)
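
Referring to the GPU utilization item above: on an NVIDIA Linux host, the counters a future collector would gather (compute busy %, VRAM used/total) can be previewed on the command line with nvidia-smi, which wraps NVML. This is only a reference for what a server-side collector should report, not part of the current implementation:

# Prints GPU compute utilization (%) and used/total VRAM (MiB), refreshed every second
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
  --format=csv,noheader,nounits -l 1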

Server Architecture

The application uses a multi-server architecture with separate ports for different services:

Default Ports

  • GGUF Server: Port 8080 (default)
    • Uses llama.cpp's llama-server for GGUF models
    • Started via start-client-server.js with --port 8080 flag
  • MLX Server: Port 8081 (default)
    • Uses MLX Python API (mlx_lm) server for MLX models
    • Configured in mlx/server-python-direct.py (default: 8081, via PORT env var)
  • Authentication Server: Port 8082 (default)
    • Handles login/logout/setup independently
    • Runs even when model servers are down
  • Client Server Manager: Port 8083 (default)
    • Manages model servers in client-only mode
    • Receives config updates from frontend via /api/save-config

Port Configuration

  • GGUF Server Port: Can be changed via LLAMA_PORT environment variable (default: 8080)
  • MLX Server Port: Configured via PORT environment variable in mlx/server-python-direct.py (default: 8081)
  • Authentication Server Port: Hardcoded in auth-server.js (default: 8082)
  • Client Server Manager Port: Hardcoded in start-client-server.js (default: 8083)

Both GGUF and MLX servers run simultaneously. The frontend automatically selects the correct port based on the selected model's format.

Server API Summary

llama.cpp Server (GGUF Models) - Port 8080

  • Health: GET /health
  • Models (Router Mode): GET /models
  • Model control (Router Mode)
    • POST /models/load / POST /models/unload
    • GET /models/config / POST /models/config (stored in server-side models-config.json)
  • Completion: POST /completion (streaming SSE)
  • Metrics
    • GET /metrics (Prometheus text)
    • GET /metrics/stream (SSE, for the real-time panel)
  • GGUF info: POST /gguf-info (server reads a GGUF file and returns metadata JSON)
  • Tokenization: POST /tokenize (text to tokens)
  • Logs: GET /logs/stream (SSE, server logs)
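
These endpoints can be exercised with curl against a llama-server running on port 8080. The /completion and /tokenize bodies below follow the upstream llama.cpp request shape ("stream": true selects SSE) and are illustrative rather than exhaustive:

curl -s http://localhost:8080/health
curl -s http://localhost:8080/models
curl -s -N -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world", "n_predict": 16, "stream": true}'
curl -s -X POST http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello, world"}'
curl -s -N http://localhost:8080/metrics/stream   # SSE feed used by the real-time panel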

MLX Server (MLX Models) - Port 8081

  • Health: GET /health - Server health status
  • Models: GET /models - List available models
  • Chat Completion:
    • POST /chat - HTTP POST with SSE streaming response
    • WebSocket /chat/ws - WebSocket-based real-time chat with token-by-token streaming
  • Completion (llama.cpp compatible): POST /completion - SSE streaming (compatible with llama.cpp API)
  • Metrics:
    • GET /metrics - Current VRAM and performance metrics (JSON)
    • WebSocket /metrics/stream - Real-time metrics streaming via WebSocket (for the real-time panel)
  • Tokenization: POST /tokenize - Text to tokens conversion
  • Logs: WebSocket /logs/stream - Real-time server logs streaming via WebSocket
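
A similar smoke test for the MLX server with curl; the /completion body reuses the llama.cpp-compatible shape, while the WebSocket routes (/chat/ws, /metrics/stream, /logs/stream) require a WebSocket client rather than curl:

curl -s http://localhost:8081/health
curl -s http://localhost:8081/models
curl -s http://localhost:8081/metrics             # current VRAM / performance snapshot (JSON)
curl -s -N -X POST http://localhost:8081/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello from MLX", "n_predict": 16}'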

Authentication Server - Port 8082

  • Status: GET /auth/status
  • Setup: POST /auth/setup (create super-admin account)
  • Login: POST /auth/login
  • Logout: POST /auth/logout
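
A minimal first-run flow against the authentication server. The request bodies are not documented here, so the JSON field names below (username/password) are assumptions; adjust them to the actual payloads expected by auth-server.js:

curl -s http://localhost:8082/auth/status
curl -s -X POST http://localhost:8082/auth/setup \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "change-me"}'   # hypothetical field names
curl -s -X POST http://localhost:8082/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "change-me"}'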

Client Server Manager - Port 8083

  • Save Config: POST /api/save-config
    • Model configuration save request from frontend
    • Request body: { models: [...], activeModelId: "..." }
    • Action: Saves to config.json file and automatically starts servers if needed
    • Response: { success: true }

Client Server Manager Role:

  • Automatically manages model servers in client-only mode (frontend running in a browser without Electron)
  • Acts as a bridge between frontend and servers, allowing frontend to control servers without Electron
  • Monitors config.json file and automatically starts/manages servers when model configuration changes
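
Saving a configuration by hand looks roughly like this. The top-level body shape ({ models, activeModelId }) is the documented one; the fields inside each model entry (id, format, path) are illustrative placeholders:

curl -s -X POST http://localhost:8083/api/save-config \
  -H "Content-Type: application/json" \
  -d '{
        "models": [
          { "id": "my-gguf-model", "format": "GGUF", "path": "./llama.cpp/models/my-gguf-model" }
        ],
        "activeModelId": "my-gguf-model"
      }'
# Expected response: { "success": true }; the manager writes config.json and starts servers as needed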

Installation

Common Requirements

  • Node.js: v18+
  • CMake + a working C/C++ toolchain

macOS

  • Xcode Command Line Tools:
xcode-select --install

Ubuntu/Debian

sudo apt update
sudo apt install -y build-essential cmake

Build

1) Install dependencies

npm install
npm install --prefix frontend

2) Build llama.cpp (llama-server)

cd llama.cpp
mkdir -p build && cd build
cmake ..
cmake --build . --config Release -j 8
cd ../..

3) MLX Server Setup (for MLX models)

The MLX server is Python FastAPI-based and supports real-time streaming via WebSocket.

# Create Python virtual environment and install dependencies
cd mlx
python3 -m venv venv
source venv/bin/activate  # macOS/Linux
# or venv\Scripts\activate  # Windows
pip install -r requirements.txt

# Run server (after activating venv)
python3 server-python-direct.py
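
# (Optional) sanity-check that the MLX packages installed before starting the server
pip show mlx mlx-lm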

Advantages:

  • ✅ mlx_lm automatically handles sharding, structure, and weight merging
  • ✅ Model structure updates via pip install -U mlx-lm only
  • ✅ Enhanced stability and maintainability
  • ✅ Full support for complex models like DeepSeek-MoE
  • ✅ WebSocket-based real-time streaming (tokens, metrics, logs)

Environment Variables:

# Specify model path (required)
export MLX_MODEL_PATH="./models/deepseek-moe-16b-chat-mlx-q4_0"
# Specify port (optional, default: 8081)
export PORT=8081
python3 server-python-direct.py

4) Build Native Modules

npm run build:native

This command builds:

  • native/: Metal VRAM monitor (macOS)

Run (Development)

Quick Start

The easiest way to run the entire project:

npm run client:all

This single command starts all required services:

  • Client Server Manager (port 8083): Automatically manages GGUF and MLX model servers
  • Auth Server (port 8082): Handles authentication (login/logout/setup)
  • Frontend (port 5173): React web UI

After starting, open your browser and navigate to: http://localhost:5173

The Client Server Manager will automatically start the GGUF server (port 8080) and MLX server (port 8081) based on your model configuration.

Detailed Run Options

Client Mode (Recommended - Standalone Frontend)

In client mode, start-client-server.js automatically manages both model servers (GGUF and MLX):

npm run client:all  # Run all services: frontend + client server manager + auth server

What gets started:

  • Client Server Manager (port 8083): Manages GGUF and MLX model servers
  • Auth Server (port 8082): Handles authentication (login/logout/setup)
  • Frontend (port 5173): React web UI

Individual service commands:

npm run client        # Run frontend only (port 5173)
npm run client:server # Run client server manager only (port 8083)
npm run client:auth   # Run authentication server only (port 8082)

Client Server Manager behavior:

  • On initial load, checks all models in config.json and starts both GGUF and MLX servers
  • On model change, saves config only without restarting servers (frontend automatically requests correct port)
  • Monitors config.json file changes and automatically manages servers

Note: When running npm run client:all, all three services (Client Server Manager, Auth Server, and Frontend) start together. The Auth Server is required for login functionality.

GGUF Server Only (Manual)

To run only the GGUF server manually:

npm run server

Default options (root package.json):

  • --port ${LLAMA_PORT:-8080}
  • --metrics
  • --models-dir "./llama.cpp/models"
  • --models-config "./models-config.json"
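
For reference, this is roughly the direct invocation that the npm script wraps (assuming llama-server was built in the step above):

./llama.cpp/build/bin/llama-server \
  --port "${LLAMA_PORT:-8080}" \
  --metrics \
  --models-dir "./llama.cpp/models" \
  --models-config "./models-config.json"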

Electron Mode (Optional)

For desktop application mode:

npm run desktop

This runs the Electron wrapper with the frontend bundled.


Configuration / Data Locations

Client configuration

  • In client-only mode (browser/Vite), model list and active model are stored in localStorage:
    • llmServerClientConfig
    • modelConfig (chat request parameters)

Server configuration

  • config.json: client-side model configuration (active model, model list, settings)
  • models-config.json: server-side per-model load options (e.g., contextSize, gpuLayers, modelFormat)
  • .auth.json: super-admin password hash file (PBKDF2, gitignored)

Model Directories

  • GGUF Models: llama.cpp/models/ (Model ID = directory name)
  • MLX Models: mlx/models/ (Model ID = directory name, must contain config.json and model weights)

Deployment (Packaging)

Electron packaging (optional):

npm run build

  • Builds the frontend production bundle (frontend/dist)
  • Packages via electron-builder
  • Includes the llama-server binary via extraResources

MLX Implementation Details

MLX Python API Integration

The MLX server uses Python's mlx_lm library:

  • Model Loading: Uses mlx_lm.load() which automatically handles model loading, sharding, and weight merging
  • Transformer Implementation: Handled by mlx_lm library
    • Multi-Head Attention
    • Feed Forward Network
    • Layer Normalization
  • Tokenization: Uses mlx_lm's built-in tokenizer
  • Sampling: Supports Temperature, Top-P, Repetition Penalty
  • Streaming:
    • Async token generation with multiple streaming options
    • HTTP POST with SSE streaming (/chat, /completion)
    • WebSocket real-time streaming (/chat/ws)
    • Real-time metrics and logs via WebSocket (/metrics/stream, /logs/stream)
  • API Compatibility:
    • Frontend UI integration via /chat endpoint
    • llama.cpp compatibility via /completion endpoint
    • WebSocket support for real-time applications

MLX Model Requirements

MLX model directory must contain:

  • config.json: Model configuration file
  • model.safetensors or *.gguf: Model weight files
  • tokenizer.json: Tokenization configuration (optional)
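
An illustrative layout that satisfies these requirements (directory and file names are placeholders); the directory name doubles as the Model ID used in the UI:

mlx/models/
└─ my-mlx-model/            # Model ID = "my-mlx-model"
   ├─ config.json           # required
   ├─ model.safetensors     # model weights
   └─ tokenizer.json        # optional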

MLX vs GGUF

Feature             GGUF (llama.cpp)      MLX
Platform            Cross-platform        macOS (Apple Silicon)
GPU Acceleration    CUDA/Metal/OpenCL     Metal (optimized)
Model Format        GGUF                  Safetensors (via mlx_lm)
Tokenization        Built-in              mlx_lm built-in tokenizer
Performance         High                  Very High (Apple Silicon)
Implementation      C++                   Python (mlx_lm)

Operations Tips

Reset super-admin (delete account)

To remove the super-admin account and re-create it, delete .auth.json:

rm -f ./.auth.json
# Restart authentication server (or restart the application)

The authentication server runs independently on port 8082, so login/setup works even if model servers are down.

Switching Between Model Formats

Dual Server Architecture: Both GGUF and MLX servers run simultaneously on different ports (8080 and 8081).

When you change the model in the dropdown:

  1. No server restart required - both servers are already running
  2. Frontend automatically selects the correct port based on model format:
    • GGUF models → Port 8080
    • MLX models → Port 8081
  3. All API calls (health check, chat, metrics) automatically use the correct endpoint
  4. PerformancePanel automatically reconnects to the correct metrics stream

Initial Load: On first load or page refresh, start-client-server.js automatically starts both servers if models are configured.

Client Server Manager Workflow:

  1. Frontend Start → Browser accesses http://localhost:5173
  2. Config Load → Frontend loads config from localStorage or calls /api/save-config
  3. Automatic Server Start → Client Server Manager reads config.json and starts required servers
    • If GGUF model exists → Start GGUF server on port 8080
    • If MLX model exists → Start MLX server on port 8081
  4. Model Switch → Frontend calls /api/save-config when model is selected
  5. Automatic Port Selection → Frontend automatically requests API to correct port based on selected model format
    • GGUF model selected → Use http://localhost:8080
    • MLX model selected → Use http://localhost:8081
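
In practice the request shape stays the same and only the port changes; for example, the llama.cpp-compatible /completion endpoint can be called on either server:

# GGUF model active → port 8080
curl -s -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" -d '{"prompt": "Hi", "n_predict": 8}'
# MLX model active → same request against port 8081
curl -s -X POST http://localhost:8081/completion \
  -H "Content-Type: application/json" -d '{"prompt": "Hi", "n_predict": 8}'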

Environment variables

  • Client
    • VITE_LLAMACPP_BASE_URL: API base URL (default http://localhost:8080)
  • Server
    • LLAMA_PORT: server port (default 8080)
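
As an example, pointing the stack at a non-default GGUF port could look like this (values are illustrative; assumes Vite's standard .env.local handling inside frontend/):

# Shell 1: GGUF server on a non-default port
export LLAMA_PORT=8090
npm run server
# Shell 2: tell the frontend where to find it, then start the UI
echo 'VITE_LLAMACPP_BASE_URL=http://localhost:8090' >> frontend/.env.local
npm run client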

Testing

MLX Server Testing

Automated test scripts are provided to test the MLX server's inference functionality.

Manual Server Execution

To run the MLX server manually:

cd mlx
source venv/bin/activate  # Activate virtual environment
python3 server-python-direct.py

The server runs on port 8081 by default.

Running Inference Tests

To run inference tests:

cd mlx
node test-inference.js

Test Script Behavior:

  1. Automatic Server Start: The test script automatically starts the MLX server if it's not already running.
  2. Health Check: Retries up to 3 times with 1-second intervals until the server is ready.
  3. Metrics Collection: Collects server performance metrics.
  4. Inference Request: Sends an inference request with a test prompt.
  5. Result Verification: Verifies inference results and records success/failure.

Test Configuration:

  • Server Port: 8081 (default)
  • Test Prompt: Configurable in test script
  • Health Check Retries: Maximum 3 attempts
  • Server Start Timeout: 30 seconds
  • Inference Timeout: 60 seconds

Test Results:

Upon completion, the test outputs:

  • Number of successful tests
  • Number of failed tests
  • Detailed logs for each test

Notes:

  • Ensure MLX models are properly configured in the mlx/models/ directory before running tests.
  • If the server is already running, the test script uses the existing server.
  • If the server crashes during testing, automatic restart is attempted.

GGUF Server Testing

GGUF server testing can be performed by sending requests directly to the running llama-server, for example with curl:

cd llama.cpp
# With llama-server running
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world", "n_predict": 10}'
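
A couple of additional quick checks against the same server (both endpoints are listed in the API summary above):

curl -s http://localhost:8080/health
curl -s http://localhost:8080/metrics | head      # Prometheus-format metrics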

Security Notes

The current UI login is a lightweight implementation intended for local development / single-user usage. For remote or multi-user production environments, you should additionally introduce TLS, proper authentication, durable session storage, and authorization controls.


(Future Work) Security Hardening Checklist

  • Transport security (TLS)
    • Expose externally via HTTPS only (reverse proxy/Ingress); consider mTLS for internal traffic
  • Authentication & session hardening
    • Strong random session tokens (e.g., 256-bit) + expiry/refresh (sliding/absolute) + server-side revocation on logout
    • Limit concurrent sessions; add delay/blocking on repeated auth failures (brute-force defense); audit logging
    • For browser deployment: consider cookie-based sessions (HttpOnly/Secure/SameSite) and CSRF protection
  • Authorization (RBAC) separation
    • Apply role-based policies for sensitive endpoints such as model management (/models/*, /models/config), log streaming (/logs/stream), and system metrics (/metrics*)
  • Password storage upgrade
    • Replace custom hashing with a standard KDF (recommended: Argon2id; alternatives: bcrypt/scrypt) + parameter upgrade strategy
  • Secrets & key management
    • Harden user_pw.json permissions (e.g., 600) and storage path; define backup/recovery procedures
    • Never log tokens/passwords/API keys; mask secrets in logs/errors
  • Rate limiting & resource limits
    • Apply per-IP/per-account rate limiting to /completion and streaming endpoints (/metrics/stream, /logs/stream)
    • Limit request body size; cap concurrent streams/requests (DoS mitigation)
  • Input validation & path safety
    • Restrict model paths/IDs to an allowlisted directory; prevent .. and symlink escape
    • Apply file size limits and timeouts when reading/parsing GGUF metadata
  • Operational hardening
    • Default bind to 127.0.0.1; expose externally only via a controlled proxy layer
    • Run as least-privileged user; use container isolation (AppArmor/SELinux); operate security updates and SCA scanning
