This project serves LLMs via llama.cpp's llama-server (GGUF format) and MLX Python API (MLX format), providing a React UI for configuration, chat, and monitoring.
- Client (UI): `frontend/` (React + Vite)
- Server (Inference/Router):
  - `llama.cpp/build/bin/llama-server` (GGUF models, Router Mode recommended)
  - `mlx/server-python-direct.py` (MLX models, Python FastAPI + mlx_lm library)
- Desktop (optional): Electron wrapper (`npm run desktop`)
This project consists of 4 independent servers and a frontend client:
- GGUF Server (Port 8080): Uses `llama.cpp`'s `llama-server` to serve GGUF-format models
- MLX Server (Port 8081): Uses the MLX Python API (mlx_lm) to serve MLX-format models
- Authentication Server (Port 8082): Handles login/logout/setup; runs independently of the model servers
- Client Server Manager (Port 8083): Automatically manages model servers in client-only mode
- Frontend (Port 5173): React + Vite based web UI
llm-server/
├─ frontend/ # React (Vite) client
│ ├─ src/
│ │ ├─ components/ # React components
│ │ ├─ pages/ # Page components
│ │ ├─ contexts/ # React Context (Auth, etc.)
│ │ └─ services/ # API service layer
│ └─ package.json
│
├─ llama.cpp/ # llama.cpp (Git submodule)
│ └─ build/bin/llama-server # GGUF server executable
│
├─ mlx/ # MLX server (Python-based)
│ ├─ server-python-direct.py # MLX HTTP/WebSocket server (port 8081, FastAPI)
│ ├─ requirements.txt # Python dependencies
│ ├─ venv/ # Python virtual environment
│ └─ models/ # MLX model directory
│
├─ native/ # Metal VRAM monitoring native module
│ ├─ src/
│ │ └─ vram_monitor.cpp # macOS Metal VRAM monitoring
│ └─ binding.gyp # node-gyp build configuration
│
├─ auth-server.js # Authentication server (port 8082)
├─ start-client-server.js # Client server manager (port 8083)
├─ mlx-verify-proxy.js # MLX model verification proxy (port 8084)
│
├─ config.json # Client model configuration (localStorage sync)
├─ models-config.json # Server-side model load options (router mode)
├─ .auth.json # Super admin password hash (PBKDF2, gitignored)
│
├─ main.js / preload.js # (Optional) Electron wrapper
└─ package.json # Run/build scripts
- Executable: `llama.cpp/build/bin/llama-server`
- Native Library: `llama.cpp` (C++ implementation)
- Functionality: GGUF-format model loading and inference
- Startup: Auto-started by `start-client-server.js`, or run manually
- Server File: `mlx/server-python-direct.py` (FastAPI-based Python HTTP/WebSocket server)
- Framework: FastAPI + Uvicorn
- Functionality: MLX-format model loading and inference with real-time WebSocket streaming
- Startup: Auto-started by `start-client-server.js` (uses the venv Python), or run manually
- Note: The server uses the mlx_lm library to load models and run inference reliably, and supports real-time streaming via WebSocket.
- Server File: `auth-server.js`
- Functionality: User authentication (login/logout/setup) and super-admin account management
- Independence: Runs independently from the model servers (login works even if they are down)
- Server File: `start-client-server.js`
- Role: Central manager that automatically runs the model servers in client-only mode (frontend running in a browser only)
- Key Features (a conceptual sketch follows this list):
  - Automatic Server Startup/Management
    - Reads `config.json` on initial load and automatically starts both the GGUF and MLX servers
    - Monitors `config.json` for changes (`fs.watchFile`)
    - Starts the appropriate server based on model format (GGUF → port 8080, MLX → port 8081)
  - Frontend Configuration Synchronization
    - Handles model-configuration save requests from the frontend via the `/api/save-config` endpoint
    - Saves the received configuration to the `config.json` file
    - Automatically starts servers if needed after saving the configuration
  - Server Process Management
    - GGUF Server: Spawns and manages the `llama.cpp/build/bin/llama-server` process
    - MLX Server: Spawns and manages the `mlx/server-python-direct.py` Python process (uses the venv)
    - Handles server process termination and restart
  - Client Mode Support
    - Required when running the frontend only in a browser, without Electron
    - The frontend cannot start servers directly, so a separate Node.js process manages them
    - Acts as a bridge between the frontend and the servers
- Use Cases:
  - Development: Auto-starts with the frontend when running `npm run client:all`
  - Production: Run as a separate process so the frontend can control the servers
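The actual manager is `start-client-server.js` (Node.js). The following is only a conceptual Python sketch of the watch-and-spawn pattern described above; the polling loop, command lines, and config fields (`models`, `modelFormat`) are illustrative assumptions rather than the real implementation.

```python
# Conceptual sketch only -- the real manager is start-client-server.js (Node.js).
import json
import subprocess
import time
from pathlib import Path

CONFIG = Path("config.json")
SERVERS: dict[str, subprocess.Popen] = {}   # model format -> running process

def start_server(fmt: str) -> None:
    """Spawn the server that matches a model format, if it is not already running."""
    if fmt in SERVERS and SERVERS[fmt].poll() is None:
        return  # already running
    if fmt == "gguf":
        cmd = ["./llama.cpp/build/bin/llama-server", "--port", "8080", "--metrics"]
    else:  # assume "mlx"
        cmd = ["./mlx/venv/bin/python3", "./mlx/server-python-direct.py"]
    SERVERS[fmt] = subprocess.Popen(cmd)

def sync_from_config() -> None:
    """Ensure a server is running for every model format listed in config.json."""
    cfg = json.loads(CONFIG.read_text())
    for model in cfg.get("models", []):
        start_server(model.get("modelFormat", "GGUF").lower())

last_mtime = 0.0
while True:   # fs.watchFile plays this polling role in the Node.js manager
    mtime = CONFIG.stat().st_mtime
    if mtime != last_mtime:
        last_mtime = mtime
        sync_from_config()
    time.sleep(2)
```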
- Framework: React + Vite
- Functionality:
  - Model configuration and management UI
  - Chat interface
  - Real-time performance monitoring
  - Server log streaming
- Location: `llama.cpp/` (Git submodule)
- Purpose: GGUF model loading and inference
- Build: Uses CMake to produce the `llama-server` executable
- Installation: `pip install mlx-lm mlx`
- Purpose: MLX model loading and inference (Apple Silicon optimized)
- Integration: Used directly in `mlx/server-python-direct.py` (FastAPI server)
- Location: `native/src/vram_monitor.cpp`
- Purpose: macOS Metal VRAM usage monitoring
- Build: Uses `node-gyp` to build a Node.js native module
- Dual Model Format Support (GGUF/MLX)
  - GGUF Format: Uses `llama.cpp`'s `llama-server` for GGUF models (port 8080)
  - MLX Format: Uses the MLX Python API (mlx_lm) for MLX models (port 8081, Apple Silicon optimized)
  - Dual Server Architecture: Both servers run simultaneously; the frontend automatically selects the correct port based on model format
  - No server restart is required when switching models; the frontend simply changes the endpoint
- Login (Super Admin)
  - Separate authentication server (port 8082); login works even if the model servers are down
  - Create a super-admin account on first run, then log in on subsequent runs
  - The password is stored in `.auth.json` as a PBKDF2 hash (no plaintext storage)
- Model Configuration & Management
  - Add/edit/delete multiple models by Model ID (= model name in router mode)
  - Model format selection (GGUF/MLX) with automatic path validation
  - When saving settings, the UI sends the model load settings to the server and applies them via router unload/load
- GGUF Metadata (Quantization) Display
  - Reads GGUF files via `POST /gguf-info` and shows a summary of quantization / tensor types / QKV types
  - When loading settings, if a model ID exists, its metadata is fetched automatically and the summary is shown
- MLX Python Implementation
  - Python mlx_lm library integration via a FastAPI server
  - Automatic model loading, sharding, and weight merging
  - Built-in tokenization support
  - Sampling strategies (Temperature, Top-P, Repetition Penalty)
  - Multiple streaming options:
    - HTTP POST with SSE streaming (`/chat`, `/completion`)
    - WebSocket real-time streaming (`/chat/ws`)
  - Real-time metrics streaming via WebSocket (`/metrics/stream`)
  - Real-time log streaming via WebSocket (`/logs/stream`)
  - Model-loading progress tracking with memory-usage monitoring
  - llama.cpp API compatibility (`/completion` endpoint)
- Real-time Performance Metrics (Push-based)
  - Updates VRAM / memory / CPU / token speed via `GET /metrics/stream` (SSE), without polling
- GPU "Utilization (Compute Busy %)"
  - The current GPU gauge in the UI shows VRAM occupancy (%), not actual GPU compute utilization. llama.cpp's default metrics do not expose a cross-platform "GPU busy %", so additional implementation is required for Linux production.
  - (Future) On Linux, the following approaches are practical (see the NVML sketch after this feature list):
    - NVIDIA: NVML (e.g., `nvmlDeviceGetUtilizationRates`, `nvmlDeviceGetMemoryInfo`) to collect `gpuUtil%` / `vramUsed` / `vramTotal`
    - AMD: `rocm-smi` / `libdrm` / sysfs-based collection
    - Intel: `intel_gpu_top` / sysfs-based collection
  - Recommended implementation location: integrate into server-side metrics collection and expose via `/metrics` and `/metrics/stream`.
- Guide Page
  - Provides example cards for Curl / JS / React / Python / Java / C# / C++ (collapsed by default)
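As referenced in the GPU utilization item above, here is a minimal sketch of the proposed NVIDIA/NVML collection path. It assumes the optional `pynvml` (nvidia-ml-py) package and an NVIDIA driver; the `gpuUtil` / `vramUsed` / `vramTotal` keys mirror the field names mentioned in that item and would feed the server-side `/metrics` collection.

```python
# Linux/NVIDIA sketch only; AMD and Intel would use rocm-smi / sysfs instead.
import pynvml

def read_gpu_metrics(index: int = 0) -> dict:
    """Return compute utilization and VRAM usage for one GPU via NVML."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu = "compute busy %"
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
        return {
            "gpuUtil": util.gpu,                    # actual compute utilization (%)
            "vramUsed": mem.used // (1024 ** 2),    # MiB
            "vramTotal": mem.total // (1024 ** 2),  # MiB
        }
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    print(read_gpu_metrics())
```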
The application uses a multi-server architecture with separate ports for different services:
- GGUF Server: Port 8080 (default)
  - Uses `llama.cpp`'s `llama-server` for GGUF models
  - Started via `start-client-server.js` with the `--port 8080` flag
- MLX Server: Port 8081 (default)
  - Uses the MLX Python API (mlx_lm) server for MLX models
  - Configured in `mlx/server-python-direct.py` (default: 8081, via the `PORT` env var)
- Authentication Server: Port 8082 (default)
  - Handles login/logout/setup independently
  - Runs even when the model servers are down
- Client Server Manager: Port 8083 (default)
  - Manages the model servers in client-only mode
  - Receives config updates from the frontend via `/api/save-config`
- GGUF Server Port: Can be changed via the `LLAMA_PORT` environment variable (default: 8080)
- MLX Server Port: Configured via the `PORT` environment variable in `mlx/server-python-direct.py` (default: 8081)
- Authentication Server Port: Hardcoded in `auth-server.js` (default: 8082)
- Client Server Manager Port: Hardcoded in `start-client-server.js` (default: 8083)
Both GGUF and MLX servers run simultaneously. The frontend automatically selects the correct port based on the selected model's format.
- Health: `GET /health`
- Models (Router Mode): `GET /models`
- Model control (Router Mode): `POST /models/load` / `POST /models/unload`, `GET /models/config` / `POST /models/config` (stored in server-side `models-config.json`)
- Completion: `POST /completion` (streaming SSE)
- Metrics: `GET /metrics` (Prometheus text), `GET /metrics/stream` (SSE, for the real-time panel)
- GGUF info: `POST /gguf-info` (server reads a GGUF file and returns metadata JSON)
- Tokenization: `POST /tokenize` (text to tokens)
- Logs: `GET /logs/stream` (SSE, server logs)
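For reference, a small Python client sketch for the streaming `/completion` endpoint. It assumes the default llama-server SSE framing (`data: {...}` lines with `content` and `stop` fields) and port 8080; adjust the base URL if `LLAMA_PORT` was changed.

```python
import json
import requests

def stream_completion(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Send a streaming /completion request and concatenate the streamed tokens."""
    payload = {"prompt": prompt, "n_predict": 64, "stream": True}
    pieces = []
    with requests.post(f"{base_url}/completion", json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue  # skip keep-alives and non-data SSE lines
            chunk = json.loads(line[len("data: "):])
            pieces.append(chunk.get("content", ""))
            if chunk.get("stop"):
                break
    return "".join(pieces)

if __name__ == "__main__":
    print(stream_completion("Hello, world"))
```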
- Health: `GET /health` - server health status
- Models: `GET /models` - list available models
- Chat Completion:
  - `POST /chat` - HTTP POST with SSE streaming response
  - `WebSocket /chat/ws` - WebSocket-based real-time chat with token-by-token streaming
- Completion (llama.cpp compatible): `POST /completion` - SSE streaming (compatible with the llama.cpp API)
- Metrics:
  - `GET /metrics` - current VRAM and performance metrics (JSON)
  - `WebSocket /metrics/stream` - real-time metrics streaming via WebSocket (for the real-time panel)
- Tokenization: `POST /tokenize` - text-to-tokens conversion
- Logs: `WebSocket /logs/stream` - real-time server log streaming via WebSocket
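A minimal WebSocket client sketch for `/chat/ws`, using the `websockets` package. Only the endpoint itself is documented above; the `prompt` / `max_tokens` request fields and the `token` / `done` response fields below are illustrative assumptions about the message schema.

```python
import asyncio
import json
import websockets

async def chat(prompt: str, url: str = "ws://localhost:8081/chat/ws") -> None:
    """Send one chat request and print tokens as the server streams them back."""
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps({"prompt": prompt, "max_tokens": 128}))
        async for raw in ws:                      # one message per streamed token (assumed)
            msg = json.loads(raw)
            print(msg.get("token", ""), end="", flush=True)
            if msg.get("done"):
                break

if __name__ == "__main__":
    asyncio.run(chat("Explain MLX in one sentence."))
```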
- Status: `GET /auth/status`
- Setup: `POST /auth/setup` (create the super-admin account)
- Login: `POST /auth/login`
- Logout: `POST /auth/logout`
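A rough sketch of the setup/login flow against port 8082. The endpoints are listed above, but the request bodies (the `password` field) and the response shapes are assumptions about `auth-server.js`.

```python
import requests

BASE = "http://localhost:8082"

# Check whether a super-admin account already exists (response shape assumed).
print("status:", requests.get(f"{BASE}/auth/status", timeout=5).json())

# First run only: create the super-admin account; afterwards, just log in.
requests.post(f"{BASE}/auth/setup", json={"password": "change-me"}, timeout=5)

resp = requests.post(f"{BASE}/auth/login", json={"password": "change-me"}, timeout=5)
print(resp.status_code, resp.json())
```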
- Save Config: `POST /api/save-config` - model configuration save request from the frontend
  - Request body: `{ models: [...], activeModelId: "..." }`
  - Action: Saves to the `config.json` file and automatically starts servers if needed
  - Response: `{ success: true }`
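A sketch of calling `/api/save-config` from a script. The top-level `{ models, activeModelId }` shape matches the request body documented above; the per-model fields (`id`, `modelFormat`, `path`) are illustrative assumptions about what the manager expects.

```python
import requests

config = {
    "models": [
        {"id": "my-gguf-model", "modelFormat": "GGUF", "path": "./llama.cpp/models/my-gguf-model"},
        {"id": "my-mlx-model", "modelFormat": "MLX", "path": "./mlx/models/my-mlx-model"},
    ],
    "activeModelId": "my-gguf-model",
}

# The manager saves this to config.json and starts any servers that are needed.
resp = requests.post("http://localhost:8083/api/save-config", json=config, timeout=10)
print(resp.json())  # expected: { "success": true }
```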
Client Server Manager Role:
- Automatically manages the model servers in client-only mode (frontend running in a browser only)
- Acts as a bridge between the frontend and the servers, allowing the frontend to control them without Electron
- Monitors the `config.json` file and automatically starts/manages servers when the model configuration changes
- Node.js: v18+
- CMake + a working C/C++ toolchain
- Xcode Command Line Tools (macOS): `xcode-select --install`

```bash
# Linux (Debian/Ubuntu) build prerequisites
sudo apt update
sudo apt install -y build-essential cmake
```

```bash
# Install Node.js dependencies (root + frontend)
npm install
npm install --prefix frontend
```

```bash
# Build llama-server from the llama.cpp submodule
cd llama.cpp
mkdir -p build && cd build
cmake ..
cmake --build . --config Release -j 8
cd ../..
```

The MLX server is Python FastAPI-based and supports real-time streaming via WebSocket.
```bash
# Create a Python virtual environment and install dependencies
cd mlx
python3 -m venv venv
source venv/bin/activate    # macOS/Linux
# or venv\Scripts\activate  # Windows
pip install -r requirements.txt

# Run the server (after activating the venv)
python3 server-python-direct.py
```

Advantages:
- ✅ mlx_lm automatically handles sharding, structure, and weight merging
- ✅ Model structure updates require only `pip install -U mlx-lm`
- ✅ Enhanced stability and maintainability
- ✅ Full support for complex models like DeepSeek-MoE
- ✅ WebSocket-based real-time streaming (tokens, metrics, logs)
Environment Variables:
```bash
# Specify the model path (required)
export MLX_MODEL_PATH="./models/deepseek-moe-16b-chat-mlx-q4_0"

# Specify the port (optional, default: 8081)
export PORT=8081

python3 server-python-direct.py
```

To build the native module:

```bash
npm run build:native
```

This command builds:
- `native/`: Metal VRAM monitor (macOS)
The easiest way to run the entire project:
```bash
npm run client:all
```

This single command starts all required services:
- Client Server Manager (port 8083): Automatically manages GGUF and MLX model servers
- Auth Server (port 8082): Handles authentication (login/logout/setup)
- Frontend (port 5173): React web UI
After starting, open your browser and navigate to: http://localhost:5173
The Client Server Manager will automatically start the GGUF server (port 8080) and MLX server (port 8081) based on your model configuration.
In client mode, start-client-server.js automatically manages both model servers (GGUF and MLX):
```bash
npm run client:all   # Run all services: frontend + client server manager + auth server
```

What gets started:
- Client Server Manager (port 8083): Manages GGUF and MLX model servers
- Auth Server (port 8082): Handles authentication (login/logout/setup)
- Frontend (port 5173): React web UI
Individual service commands:
```bash
npm run client          # Run frontend only (port 5173)
npm run client:server   # Run client server manager only (port 8083)
npm run client:auth     # Run authentication server only (port 8082)
```

Client Server Manager behavior:
- On initial load, checks all models in `config.json` and starts both the GGUF and MLX servers
- On model change, saves the config only, without restarting servers (the frontend automatically requests the correct port)
- Monitors `config.json` for changes and manages servers automatically
Note: When running `npm run client:all`, all three services (Client Server Manager, Auth Server, and Frontend) start together. The Auth Server is required for login functionality.
To run only the GGUF server manually:
```bash
npm run server
```

Default options (root `package.json`):
- `--port ${LLAMA_PORT:-8080}`
- `--metrics`
- `--models-dir "./llama.cpp/models"`
- `--models-config "./models-config.json"`
For desktop application mode:
```bash
npm run desktop
```

This runs the Electron wrapper with the frontend bundled.
- In client-only mode (browser/Vite), the model list and active model are stored in localStorage:
  - `llmServerClientConfig`
  - `modelConfig` (chat request parameters)
- `config.json`: client-side model configuration (active model, model list, settings)
- `models-config.json`: server-side per-model load options (e.g., `contextSize`, `gpuLayers`, `modelFormat`)
- `.auth.json`: super-admin password hash file (PBKDF2, gitignored)
- GGUF Models: `llama.cpp/models/` (Model ID = directory name)
- MLX Models: `mlx/models/` (Model ID = directory name; must contain `config.json` and model weights)
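For orientation, a hypothetical `models-config.json` entry written from Python, using only the per-model option names listed above (`contextSize`, `gpuLayers`, `modelFormat`). The keyed-by-Model-ID layout and the values are assumptions, not the file's documented schema.

```python
import json

# Hypothetical per-model load options; field names follow the list above.
models_config = {
    "my-gguf-model": {
        "modelFormat": "GGUF",
        "contextSize": 4096,
        "gpuLayers": 99,
    }
}

with open("models-config.json", "w") as f:
    json.dump(models_config, f, indent=2)
```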
Electron packaging (optional):
```bash
npm run build
```

- Builds the `frontend` production bundle (`frontend/dist`)
- Packages via `electron-builder`
- Includes the `llama-server` binary via `extraResources`
The MLX server uses Python's mlx_lm library:
- Model Loading: Uses `mlx_lm.load()`, which automatically handles model loading, sharding, and weight merging
- Transformer Implementation: Handled by the mlx_lm library
  - Multi-Head Attention
  - Feed Forward Network
  - Layer Normalization
- Tokenization: Uses mlx_lm's built-in tokenizer
- Sampling: Supports Temperature, Top-P, Repetition Penalty
- Streaming:
  - Async token generation with multiple streaming options
  - HTTP POST with SSE streaming (`/chat`, `/completion`)
  - WebSocket real-time streaming (`/chat/ws`)
  - Real-time metrics and logs via WebSocket (`/metrics/stream`, `/logs/stream`)
- API Compatibility:
  - Frontend UI integration via the `/chat` endpoint
  - llama.cpp compatibility via the `/completion` endpoint
  - WebSocket support for real-time applications
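A minimal sketch of the mlx_lm calls the server builds on: `load()` returns the model and tokenizer (handling sharding and weight merging), and `generate()` produces text. The model path is an example, and sampler options vary between mlx_lm versions, so this is not the server's exact code.

```python
from mlx_lm import load, generate

# load() resolves shards and merges weights automatically.
model, tokenizer = load("./models/deepseek-moe-16b-chat-mlx-q4_0")

prompt = "Write a haiku about Apple Silicon."
text = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(text)
```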
MLX model directory must contain:
- `config.json`: model configuration file
- `model.safetensors` or `*.gguf`: model weight files
- `tokenizer.json`: tokenization configuration (optional)
| Feature | GGUF (llama.cpp) | MLX |
|---|---|---|
| Platform | Cross-platform | macOS (Apple Silicon) |
| GPU Acceleration | CUDA/Metal/OpenCL | Metal (optimized) |
| Model Format | GGUF | Safetensors (via mlx_lm) |
| Tokenization | Built-in | mlx_lm built-in tokenizer |
| Performance | High | Very High (Apple Silicon) |
| Implementation | C++ | Python (mlx_lm) |
To remove the super-admin account and re-create it, delete `.auth.json`:

```bash
rm -f ./.auth.json
# Restart the authentication server (or restart the application)
```

The authentication server runs independently on port 8082, so login/setup works even if the model servers are down.
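For context on what `.auth.json` holds, here is a conceptual PBKDF2 hashing/verification sketch in Python. `auth-server.js` is Node.js, and its actual salt size, iteration count, digest, and file layout are not documented here, so the parameters below are illustrative.

```python
import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 210_000) -> dict:
    """Derive a PBKDF2-HMAC-SHA256 hash with a random salt (parameters illustrative)."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return {"salt": salt.hex(), "iterations": iterations, "hash": digest.hex()}

def verify_password(password: str, record: dict) -> bool:
    """Re-derive the hash and compare in constant time."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), bytes.fromhex(record["salt"]), record["iterations"]
    )
    return hmac.compare_digest(digest.hex(), record["hash"])

record = hash_password("change-me")
assert verify_password("change-me", record)
```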
Dual Server Architecture: Both GGUF and MLX servers run simultaneously on different ports (8080 and 8081).
When you change the model in the dropdown:
- No server restart required - both servers are already running
- Frontend automatically selects the correct port based on model format:
  - GGUF models → port 8080
  - MLX models → port 8081
- All API calls (health check, chat, metrics) automatically use the correct endpoint
- `PerformancePanel` automatically reconnects to the correct metrics stream
Initial Load: On first load or page refresh, start-client-server.js automatically starts both servers if models are configured.
Client Server Manager Workflow:
- Frontend Start → Browser accesses `http://localhost:5173`
- Config Load → Frontend loads the config from `localStorage` or calls `/api/save-config`
- Automatic Server Start → Client Server Manager reads `config.json` and starts the required servers
  - If a GGUF model exists → start the GGUF server on port 8080
  - If an MLX model exists → start the MLX server on port 8081
- Model Switch → Frontend calls `/api/save-config` when a model is selected
- Automatic Port Selection → Frontend automatically sends API requests to the correct port based on the selected model's format
  - GGUF model selected → use `http://localhost:8080`
  - MLX model selected → use `http://localhost:8081`
- Client
  - `VITE_LLAMACPP_BASE_URL`: API base URL (default `http://localhost:8080`)
- Server
  - `LLAMA_PORT`: server port (default 8080)
Automated test scripts are provided to test the MLX server's inference functionality.
To run the MLX server manually:
```bash
cd mlx
source venv/bin/activate   # Activate the virtual environment
python3 server-python-direct.py
```

The server runs on port 8081 by default.
To run inference tests:
```bash
cd mlx
node test-inference.js
```

Test Script Behavior:
- Automatic Server Start: The test script automatically starts the MLX server if it's not already running.
- Health Check: Retries up to 3 times with 1-second intervals until the server is ready.
- Metrics Collection: Collects server performance metrics.
- Inference Request: Sends an inference request with a test prompt.
- Result Verification: Verifies inference results and records success/failure.
Test Configuration:
- Server Port: 8081 (default)
- Test Prompt: Configurable in test script
- Health Check Retries: Maximum 3 attempts
- Server Start Timeout: 30 seconds
- Inference Timeout: 60 seconds
Test Results:
Upon completion, the test outputs:
- Number of successful tests
- Number of failed tests
- Detailed logs for each test
Notes:
- Ensure MLX models are properly configured in the `mlx/models/` directory before running tests.
- If the server is already running, the test script uses the existing server.
- If the server crashes during testing, automatic restart is attempted.
GGUF server testing can be performed using llama.cpp's default testing tools:
```bash
cd llama.cpp

# With llama-server running:
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world", "n_predict": 10}'
```

The current UI login is a lightweight implementation intended for local development and single-user use. For remote or multi-user production environments, you should additionally introduce TLS, proper authentication, durable session storage, and authorization controls.
- Transport security (TLS)
  - Expose externally via HTTPS only (reverse proxy/Ingress); consider mTLS for internal traffic
- Authentication & session hardening
  - Strong random session tokens (e.g., 256-bit) + expiry/refresh (sliding/absolute) + server-side revocation on logout
  - Limit concurrent sessions; add delay/blocking on repeated auth failures (brute-force defense); audit logging
  - For browser deployment: consider cookie-based sessions (`HttpOnly` / `Secure` / `SameSite`) and CSRF protection
- Authorization (RBAC) separation
  - Apply role-based policies to sensitive endpoints such as model management (`/models/*`, `/models/config`), log streaming (`/logs/stream`), and system metrics (`/metrics*`)
- Password storage upgrade
  - Replace custom hashing with a standard KDF (recommended: Argon2id; alternatives: bcrypt/scrypt) plus a parameter upgrade strategy
- Secrets & key management
  - Harden `user_pw.json` permissions (e.g., 600) and storage path; define backup/recovery procedures
  - Never log tokens/passwords/API keys; mask secrets in logs/errors
- Rate limiting & resource limits
  - Apply per-IP/per-account rate limiting to `/completion` and the streaming endpoints (`/metrics/stream`, `/logs/stream`)
  - Limit request body size; cap concurrent streams/requests (DoS mitigation)
- Input validation & path safety (see the path-check sketch after this checklist)
  - Restrict model paths/IDs to an allowlisted directory; prevent `..` and symlink escape
  - Apply file size limits and timeouts when reading/parsing GGUF metadata
- Operational hardening
  - Bind to `127.0.0.1` by default; expose externally only via a controlled proxy layer
  - Run as a least-privileged user; use container isolation (AppArmor/SELinux); keep up security updates and SCA scanning