MLX-Knife 2.0

Current Version: 2.0.4 (Stable)

Release Notes: See CHANGELOG.md for detailed changes, fixes, and migration guides.

Features

What's New in 2.0.4

Audio Transcription (STT) - Whisper speech-to-text (--audio flag)
Vision Models with EXIF Metadata - Image analysis + automatic GPS/date/camera extraction visible to the model
Unix Pipe Integration - Chain models without temp files (vision → text workflows)
Local Development Workflow - Clone → Repair → Test models without HuggingFace round-trips
Community Model Repair Tool - Fix broken mlx-vlm models with --repair-index
Resumable Downloads - Interrupted clone/pull operations continue automatically
Safe Vision Batch Processing - Automatic chunking prevents Metal OOM crashes
Workspace Path Support - Run/show/server/list commands work with local directories
Updated SECURITY.md - Clarifies mlx-knife vs. upstream library behavior; important for offline/air-gapped use

Core Functionality

List & Manage Models: Browse your HuggingFace cache with MLX-specific filtering
Model Information: Detailed model metadata including quantization info
Download Models: Pull models from HuggingFace with progress tracking
Run Models: Native MLX execution with streaming and chat modes
Health Checks: Verify model integrity and MLX runtime compatibility
Cache Management: Clean up and organize your model storage
Privacy & Network: No background network or telemetry; only explicit Hugging Face interactions when you run pull or the experimental push.

Unix Pipe Integration (Beta, 2.0.4)

Chain models with standard Unix pipes - no temp files needed:

export MLXK2_ENABLE_PIPES=1

# Model chaining
cat article.txt | mlx-run translator_model - | mlx-run summarizer_model - "3 bullets"

# Works with Unix tools
mlx-run chat_model "explain quicksort" | tee explanation.txt | head -20

Robust handling of SIGPIPE and early pipe termination (| head, | grep -m1).

Requirements

macOS with Apple Silicon
Python 3.10-3.12 (see Python Compatibility below)
8GB+ RAM recommended + RAM to run LLM

⚖️ Model Usage and Licenses

mlx-knife is a tooling layer for running ML models (e.g. from Hugging Face) locally. The project does not distribute any model weights and does not decide which models you use or how you use them.

Please note:

Each model (weights, tokenizer, configuration, etc.) is governed by its own license.
When mlx-knife downloads a model from a third-party service (e.g. Hugging Face), it does so on your behalf.
You are responsible for:
- reading and understanding the license of each model you use,
- complying with any restrictions (e.g. Non-Commercial, Research Only, RAIL, etc.),
- ensuring that your use of a given model (private, research, commercial, on-prem services, etc.) is legally permitted.

The mlx-knife source code itself is provided under the open-source license specified in this repository. This license applies only to the mlx-knife code and does not extend to any external models.

This is not legal advice. Always refer to the original model license text and, if necessary, seek professional legal counsel.

Python Compatibility

✅ Python 3.10 - 3.12 - Full support (Text + Vision + Audio) ❌ Python 3.9 - Use version 2.0.3 (text + cache management only) ❌ Python 3.13+ - Not supported (miniaudio lacks pre-built wheels)

Recommended: Python 3.10 or 3.11 for best compatibility.

Installation

1. PyPI (Recommended)

pip install mlx-knife
mlxk --version  # → mlxk 2.0.4

Requirements: macOS Apple Silicon, Python 3.10-3.12 Includes: Text, Vision, Audio (Whisper STT), EXIF metadata, Unix pipes

2. Developer Installation

git clone https://github.com/mzau/mlx-knife.git
cd mlx-knife
pip install -e ".[dev,test]"

mlxk --version  # → mlxk 2.0.4
pytest -v

Requirements: macOS Apple Silicon, Python 3.10-3.12

Migrating from 1.x

If you're upgrading from MLX Knife 1.x, see MIGRATION.md for important information about the license change (MIT → Apache 2.0) and behavior changes.

Quick Start

# List models (human-readable)
mlxk list
mlxk list --health
mlxk list --verbose --health

# Check cache health
mlxk health

# Show model details
mlxk show "mlx-community/Phi-3-mini-4k-instruct-4bit"

# Pull a model
mlxk pull "mlx-community/Llama-3.2-3B-Instruct-4bit"

# Resume interrupted download (skip prompt)
mlxk pull "model-name" --force-resume

# Run interactive chat
mlxk run "Phi-3-mini" -c

# Start OpenAI-compatible server
mlxk serve --port 8080

Commands

Command	Description
`list`	Model discovery with JSON output; supports cache and workspace paths
`show`	Detailed model information with --files, --config
`health`	Corruption detection and cache analysis
`pull`	HuggingFace model downloads with corruption detection
`rm`	Model deletion with lock cleanup and fuzzy matching
`run`	Interactive and single-shot model execution with streaming/batch modes
`server`/`serve`	OpenAI-compatible API server; SIGINT-robust (Supervisor); SSE streaming
`clone`	Model workspace cloning - create local editable copy from cache
`push`	Upload to HuggingFace Hub (requires `--private` flag for safety)
🔒 `convert`	Experimental - Workspace transformations; requires `MLXK2_ENABLE_ALPHA_FEATURES=1`
🔒 `pipe mode`	Beta feature - Unix pipes with `mlxk run <model> - ...`; requires `MLXK2_ENABLE_PIPES=1`

Model References

MLX-Knife supports multiple ways to reference models:

HuggingFace Models (Cache)

Format	Example	Description
Full name	`mlx-community/Phi-4-4bit`	Exact HuggingFace repo ID
Short name	`Phi-4`	Fuzzy match against cache
With hash	`Phi-4@e96f3b2`	Specific commit/version

mlxk run "mlx-community/Phi-4-4bit" "Hello"
mlxk run "Phi-4" "Hello"                    # Fuzzy match
mlxk show "Qwen3@e96" --json                # Specific version

Local Paths (2.0.4-beta.6+)

Format	Example
Relative	`./my-workspace`
Absolute	`/Volumes/External/model`
Prefix match	`./gemma-` (all workspaces starting with "gemma-")
Directory	`.` (all workspaces in current directory)

# List workspaces
mlxk list .                        # All workspaces in current directory
mlxk list ./gemma-                 # Prefix match: gemma-3n-4bit, gemma-3n-FIXED-4bit, ...
mlxk list $PWD/models              # Absolute path → absolute output

# Clone → Run
mlxk clone org/model ./workspace
mlxk run ./workspace "Hello"

# Convert → Run
mlxk convert ./broken ./fixed --repair-index
mlxk run ./fixed "Test"

Output format: List output mirrors input format - relative patterns produce relative names (like ls), absolute paths produce absolute names.

Disambiguating paths vs cache names: When a local directory exists with the same name as a cached model, use ./ prefix to force workspace resolution. Otherwise, cache lookup is attempted first.

Workspace Development Workflow (2.0.4-beta.6+)

Complete local development cycle for model experimentation, repair, and testing without HuggingFace round-trips:

# Clone → Repair → Test → Publish (optional)
mlxk clone "model" ./workspace
MLXK2_ENABLE_ALPHA_FEATURES=1 mlxk convert ./workspace ./fixed --repair-index
mlxk list .                                        # See all local workspaces
mlxk run ./fixed "test prompt"                     # Local inference
mlxk server --model ./fixed                        # Dev server
mlxk push ./fixed "your-org/model"                 # Optional publish

Key capabilities:

Model repair: Fix index/shard mismatches from mlx-vlm #624
Local testing: Run/server/show without pushing to HuggingFace
Side-by-side comparison: Multiple workspaces with parallel servers
Rapid iteration: Clone → Modify → Test loop

Workspace Commands Reference

Command	Workspace Support	Example
`run`	✅ Yes	`mlxk run ./workspace "prompt"`
`show`	✅ Yes	`mlxk show ./workspace --files`
`health`	✅ Yes	`mlxk health ./workspace`
`server`	✅ Yes	`mlxk server --model ./workspace`
`clone`	✅ Creates	`mlxk clone model ./workspace`
`convert`	✅ Yes	`mlxk convert ./in ./out --repair-index`
`push`	✅ Yes	`mlxk push ./workspace "org/name"`
`list`	✅ Yes	`mlxk list .` or `mlxk list ./gemma-`
`pull`	❌ Cache only	Downloads to HuggingFace cache
`rm`	❌ Cache only	Use `rm -rf ./workspace` for local directories

Web Interface

For a web-based chat UI, use nChat - a lightweight web interface for the BROKE ecosystem:

# Clone once (local setup):
git clone https://github.com/mzau/broke-nchat.git
cd broke-nchat

# Start mlx-knife server:
mlxk serve

# Open web UI:
open webui/index.html

On-Prem: Pure HTML/CSS/JS - runs entirely locally, zero dependencies.

Note: nChat is a separate project designed for the entire BROKE ecosystem (MLX Knife + BROKE Cluster). See nChat README for CORS configuration.

Multi-Modal Support

MLX Knife supports multiple input modalities beyond text. All multi-modal features share a common output pattern: model responses are followed by collapsible metadata tables for transparency and traceability.

Vision (Beta)

Image analysis via the --image flag (CLI and server). Requires Python 3.10+.

Requirements

Python 3.10+ (mlx-vlm dependency)
Backend: mlx-vlm 0.3.10 (included in base install)

Usage

# Image analysis with custom prompt
mlxk run "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit" \
  --image photo.jpg "Describe what you see in detail"

# Multiple images (space-separated or glob)
mlxk run vision-model --image img1.jpg img2.jpg img3.jpg "Compare these images"
mlxk run vision-model --image photos/*.jpg "Which images show outdoor scenes?"

# Auto-prompt (default: "Describe the image.")
mlxk run vision-model --image cat.jpg

# Text-only on vision model (no --image flag)
mlxk run "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit" "What is 2+2?"

Batch Processing

Terminology Note: mlx-knife uses "batch" in the traditional computing sense (sequential job processing in groups), not ML inference batching (parallel batch_size > 1 in a single forward pass). Images are processed sequentially in groups for memory safety, not performance parallelization.

Breaking Change (2.0.4-beta.6): Vision processing now defaults to processing one image at a time for maximum stability on all systems. Use --chunk N to process multiple images per batch when your system can handle it.

# Default: one image at a time (most robust, automatic chunking)
mlxk run pixtral "Describe image" --image photos/*.jpg

# Faster: 5 images per batch (requires more RAM, may trigger model-specific issues)
mlxk run pixtral "Describe images" --chunk 5 --image photos/*.jpg

# Alternative: Use --prompt flag (useful when experimenting with different prompts)
mlxk run pixtral --chunk 5 --image photos/*.jpg --prompt "Describe images"

# Set default chunk size via environment variable
export MLXK2_VISION_CHUNK_SIZE=3
mlxk run pixtral "Describe images" --image photos/*.jpg

Why chunking?

Safety: Prevents Metal OOM crashes by limiting images per processing group (--chunk N)
Isolation: Fresh inference session per chunk (KV cache cleared, conversation context reset)
Trade-off: ~2-3s model load overhead per chunk vs guaranteed isolation

Reliability (2.0.4-beta.7): Vision models can sometimes describe details they didn't actually see. MLX Knife prevents this automatically:

Default (chunk=1): Most reliable - each image processed independently
Larger chunks: Still safe, but models may occasionally confuse details between images in the same batch

For maximum accuracy, use the default chunk=1 (no configuration needed).

Server API:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "pixtral", "chunk": 3, "messages": [...]}'

Note: chunk is an mlx-knife extension parameter. See SERVER-HANDBOOK.md for details.

Metadata Output Format

When processing images, MLX Knife automatically prepends metadata in a collapsible table (collapsed by default) before the model output:

<details>
<summary>📸 Chunk 1/3: Images 1-4</summary>

| Image | Filename | Original | Location | Date | Camera |
|-------|----------|----------|----------|------|--------|
| 1 | image_abc123.jpeg | beach.jpg | 📍 32.7900°N, 16.9200°W | 📅 2023-12-06 12:19 | 📷 Apple iPhone SE |
| 2 | image_def456.jpeg | mountain.jpg | 📍 32.8700°N, 17.1700°W | 📅 2023-12-10 15:42 | 📷 Apple iPhone SE |
| 3 | image_xyz789.jpeg | sunset.jpg | 📍 32.8200°N, 17.0500°W | 📅 2023-12-08 18:30 | 📷 Apple iPhone SE |
| 4 | image_uvw456.jpeg | forest.jpg | 📍 32.8800°N, 17.1200°W | 📅 2023-12-09 10:15 | 📷 Apple iPhone SE |

</details>

A beach with palm trees and clear blue water. A mountain landscape with snow-capped peaks...

Chunk information in summary:

Shows current chunk and total chunks (e.g., "Chunk 1/3")
Shows image range in current chunk (e.g., "Images 1-4")
Helps track progress in WebUI and prevents confusion about which images are being described

Why metadata comes first:

The model sees GPS, date, and camera info when analyzing images (enables location/time-aware descriptions)
The markdown table shows you exactly what the model knows about each image
Helps verify which description belongs to which file

Metadata includes:

Image ID → Filename mapping (identify which description belongs to which file)
GPS coordinates (latitude/longitude, if available in EXIF)
- Precision: 4 decimal places (~11m accuracy) for street-level context
Capture date/time (ISO 8601 format)
Camera model (device info)

Privacy control:

EXIF extraction is enabled by default. To disable (e.g., for privacy-sensitive images):

export MLXK2_EXIF_METADATA=0
mlxk run vision-model --image photo.jpg "describe"

Output is the same for CLI and server - metadata tables work in terminals, web UIs (nChat), and can be parsed programmatically.

Limitations

Image limits: Model-dependent due to Metal / unified-memory constraints and peak activation usage
- pixtral-12b-8bit: Up to 5 images tested on M2 Max 64GB (multi-image capable)
- Llama-3.2-11B / Other models: Single-image only
- Larger models (24B+): Limited to 1-2 images on 64GB RAM
- Default server guardrails: 20 MB per image, 50 MB total (configurable). Base64 encoding adds ~33% overhead.

Server API

Vision models work with OpenAI-compatible /v1/chat/completions endpoint using base64-encoded images:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama-vision",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
    ]
  }]
}'

Model Compatibility

⚠️ Important: Vision support relies on mlx-vlm (upstream), which has known stability issues. While mlxk health verifies file integrity, runtime failures may occur with certain model architectures due to upstream bugs.

✅ Tested & Working Models (mlx-knife v2.0.4):

Model	Size	Notes
`mlx-community/pixtral-12b-8bit`	~13.5GB	⭐ Recommended: Excellent text recognition, multi-image support (5+ images on M2 Max 64GB); beta.4+ for text-only
`mlx-community/Llama-3.2-11B-Vision-Instruct-4bit`	~6.5GB	Reliable, good quality; single-image only
`Devstral-Small-2-24B-Instruct-2512-6bit`	~14GB	Single-image only on 64GB RAM; requires `--repair-index` (mlx-vlm #624); better for Mac Studio/Ultra

❌ Known Issues (as of 2026-01-03):

Model	Issue	Workaround
`Mistral-Small-3.1-24B-Instruct-2503-4bit`	Vision feature mismatch (5476 positions ≠ 1369 features). Note: This is a different bug than mlx-vlm #624 - `--repair-index` cannot fix it	Use alternative models (pixtral-12b-8bit, Llama-3.2-11B)
`MiMo-VL-7B-RL-bf16`	NoneType iteration error	mlx-vlm processor bug, no workaround
`DeepSeek-OCR-8bit`	Runs but hallucinates details	Quality issue, not recommended

Models affected by mlx-vlm #624 (index/shard mismatch) can be repaired:

mlxk convert <model> <output> --repair-index

Recommendation: Test models with real images before production use. The vision ecosystem is evolving rapidly - this list will be updated as mlx-vlm matures.

Reporting Issues: If you encounter vision model failures, please report with model name and error message to help improve compatibility tracking.

Audio Transcription (Speech-to-Text)

🎙️ Audio Transcription: Speech-to-text via Whisper models (mlx-audio backend). Works out-of-the-box with PyPI install. Backward compatible with Gemma-3n multimodal audio (mlx-vlm).

Requirements:

Python 3.10+ (mlx-audio dependency, included in base install)
No system dependencies: MP3/WAV decoding via embedded libsndfile (no ffmpeg or Homebrew required)

✅ Recommended Models (mlx-knife v2.0.4):

Model	Backend	Size	Duration	Notes
`whisper-large-v3-turbo-4bit`	mlx-audio	~464MB	>10 min	Recommended - Best accuracy/speed balance
`whisper-tiny`	mlx-audio	~74MB	>10 min	Fast, lower accuracy
`gemma-3n-E2B-it-4bit`	mlx-vlm	~2.1GB	~30s	Multimodal (vision+audio), requires workspace repair

🔧 Backend Architecture:

mlx-knife automatically routes audio models to the optimal backend:

Whisper/Voxtral → mlx-audio (dedicated STT, >10min duration, best accuracy)
Gemma-3n → mlx-vlm (multimodal audio, ~30s limit, backward compatible)

⚙️ Audio Defaults:

Setting	Audio	Text/Vision	Reason
Temperature	0.0	0.7	Greedy decoding (STT best practice)
Default Prompt	"Transcribe this audio."	-	Minimal prompt for pure transcription

💡 Quick Start:

# Pull a Whisper model (one-time setup)
mlxk pull mlx-community/whisper-large-v3-turbo-4bit

# Transcribe audio (WAV, MP3, M4A - native on macOS)
mlxk run whisper-large --audio speech.mp3
# → Automatic greedy decoding (temp=0.0)

# With language hint for better accuracy
mlxk run whisper-large --audio speech.mp3 --language en

# Longer audio (>10 minutes supported)
mlxk run whisper-large --audio podcast.wav

⚠️ Known Limitations:

Limitation	Whisper Models	Gemma-3n (Multimodal)	Workaround
Duration	>10 minutes ✅	~30 seconds (token limit)	Use Whisper for long audio
File size	50MB max	50MB max	Split larger files
Formats	WAV, MP3, M4A (macOS native, Linux needs ffmpeg)	WAV	M4A uses Core Audio on macOS
Legacy models	Some use old `weights.npz` format	-	Use models with `.safetensors`

🎯 Advanced Usage:

# Explicit temperature control (0.0 = greedy, deterministic)
mlxk run whisper-large --audio speech.wav --temperature 0.0

# Force specific language (improves accuracy)
mlxk run whisper-large --audio german.mp3 --language de

# Segment metadata (MLXK2_AUDIO_SEGMENTS=1 for timestamps)
MLXK2_AUDIO_SEGMENTS=1 mlxk run whisper-large --audio meeting.wav

🔄 Gemma-3n Multimodal (Backward Compatibility):

Note: Gemma-3n requires workspace repair due to mlx-vlm #624. Use Whisper for production STT.

# One-time setup (if using Gemma-3n)
MLXK2_ENABLE_ALPHA_FEATURES=1
mlxk clone mlx-community/gemma-3n-E2B-it-4bit ./gemma-3n-audio
mlxk convert ./gemma-3n-audio ./gemma-3n-audio-FIXED --repair-index

# Run (30s limit, multimodal audio)
mlxk run ./gemma-3n-audio-FIXED --audio short-clip.wav

JSON API

📋 Complete API Specification: See JSON API Specification for comprehensive schema, error codes, and examples.

All commands support both human-readable and JSON output (--json flag) for automation and scripting, enabling seamless integration with CI/CD pipelines and cluster management systems.

Command Structure

All commands support JSON output via --json flag:

mlxk list --json | jq '.data.models[].name'
mlxk health --json | jq '.data.summary'
mlxk show "Phi-3-mini" --json | jq '.data.model'

Response Format:

{
    "status": "success|error",
    "command": "list|health|show|pull|rm|clone|version|push|run|server",
    "data": { /* command-specific data */ },
    "error": null | { "type": "...", "message": "..." }
}

Examples

List Models

mlxk list --json
# Output:
{
  "status": "success",
  "command": "list",
  "data": {
    "models": [
      {
        "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
        "hash": "a5339a41b2e3abcdef1234567890ab12345678ef",
        "size_bytes": 4613734656,
        "last_modified": "2024-10-15T08:23:41Z",
        "framework": "MLX",
        "model_type": "chat",
        "capabilities": ["text-generation", "chat"],
        "health": "healthy",
        "runtime_compatible": true,
        "reason": null,
        "cached": true
      }
    ],
    "count": 1
  },
  "error": null
}

Health Check

mlxk health --json
# Output:
{
  "status": "success",
  "command": "health",
  "data": {
    "healthy": [
      {
        "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
        "status": "healthy",
        "reason": "Model is healthy"
      }
    ],
    "unhealthy": [],
    "summary": { "total": 1, "healthy_count": 1, "unhealthy_count": 0 }
  },
  "error": null
}

Show Model Details

mlxk show "Phi-3-mini" --json --files
# Output (simplified):
{
  "status": "success",
  "command": "show",
  "data": {
    "model": {
      "name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
      "hash": "a5339a41b2e3abcdefgh1234567890ab12345678",
      "size_bytes": 4613734656,
      "framework": "MLX",
      "model_type": "chat",
      "capabilities": ["text-generation", "chat"],
      "last_modified": "2024-10-15T08:23:41Z",
      "health": "healthy",
      "runtime_compatible": true,
      "reason": null,
      "cached": true
    },
    "files": [
      {"name": "config.json", "size": "1.2KB", "type": "config"},
      {"name": "model.safetensors", "size": "2.3GB", "type": "weights"}
    ],
    "metadata": null
  },
  "error": null
}

Integration Examples

Broke-Cluster Integration

# Get available model names for scheduling
MODELS=$(mlxk list --json | jq -r '.data.models[].name')

# Check cache health before deployment
HEALTH=$(mlxk health --json | jq '.data.summary.healthy_count')
if [ "$HEALTH" -eq 0 ]; then
    echo "No healthy models available"
    exit 1
fi

# Download required models
mlxk pull "mlx-community/Phi-3-mini-4k-instruct-4bit" --json

CI/CD Pipeline Usage

# Verify model integrity in CI
mlxk health --json | jq -e '.data.summary.unhealthy_count == 0'

# Clean up CI artifacts
mlxk rm "test-model-*" --json --force

# Pre-warm cache for deployment
mlxk pull "production-model" --json

Model Management Automation

# Find models by pattern
LARGE_MODELS=$(mlxk list --json | jq -r '.data.models[] | select(.name | contains("30B")) | .name')

# Show detailed info for analysis
for model in $LARGE_MODELS; do
    mlxk show "$model" --json --config | jq '.data.model_config'
done

Human Output

MLX Knife provides rich human-readable output by default (without --json flag).

Error Handling (2.0.3+): Errors print to stderr for clean pipe workflows:

mlxk show badmodel | grep ...      # Errors don't contaminate stdout
mlxk pull badmodel > log 2> err    # Capture errors separately

Basic Usage

mlxk list
mlxk list --health
mlxk health
mlxk show "mlx-community/Phi-3-mini-4k-instruct-4bit"
mlxk pull "mlx-community/Llama-3.2-3B-Instruct-4bit"

Pull Command

Download models from HuggingFace:

mlxk pull "mlx-community/Phi-3-mini-4k-instruct-4bit"

Interrupted downloads (2.0.4-beta.5+): If a download fails (network issue, Ctrl-C), mlxk pull will detect this and prompt to resume:

$ mlxk pull "model-name"
Model 'model-name' has partial download:
  No model weights found. Use --force-resume to attempt resume or 'mlxk rm' to delete.
Resume download? [Y/n]: y

Automation/scripting: Use --force-resume to skip the prompt:

mlxk pull "model-name" --force-resume

List Filters

list: Shows MLX chat models only (compact names, safe default)
list --verbose: Shows all MLX models (chat + base) with full org/names and Framework column
list --all: Shows all frameworks (MLX, GGUF, PyTorch)
Flags are combinable: --all --verbose, --all --health, --verbose --health

Health Status Display (--health flag)

The --health flag adds health status information to the output:

Compact mode (default, --all):

Shows single "Health" column with values:
- healthy - File integrity OK and MLX runtime compatible
- healthy* - File integrity OK but not MLX runtime compatible (use --verbose for details)
- unhealthy - File integrity failed or unknown format

Verbose mode (--verbose --health):

Splits into "Integrity" and "Runtime" columns:
- Integrity: healthy / unhealthy
- Runtime: yes / no / - (dash = gate blocked by failed integrity)
- Reason: Explanation when problems detected (wrapped at 26 chars for readability)

Examples:

# Compact health view
mlxk list --health
# Output:
# Name                    | Hash    | Size   | Modified | Type | Health
# Llama-3.2-3B-Instruct   | a1b2c3d | 2.1GB  | 2d ago   | chat | healthy
# Qwen2-7B-Instruct       | 1a2b3c4 | 4.8GB  | 3d ago   | chat | healthy*

# Verbose health view with details
mlxk list --verbose --health
# Output:
# Name                    | Hash    | Size   | Modified | Framework | Type | Integrity | Runtime | Reason
# Llama-3.2-3B-Instruct   | a1b2c3d | 2.1GB  | 2d ago   | MLX       | chat | healthy   | yes     | -
# Qwen2-7B-Instruct       | 1a2b3c4 | 4.8GB  | 3d ago   | PyTorch   | chat | healthy   | no      | Incompatible: PyTorch

# All frameworks with health status
mlxk list --all --health
# Output:
# Name                    | Hash    | Size   | Modified | Framework | Type    | Health
# Llama-3.2-3B-Instruct   | a1b2c3d | 2.1GB  | 2d ago   | MLX       | chat    | healthy
# llama-3.2-gguf-q4       | b2c3d4e | 1.8GB  | 3d ago   | GGUF      | unknown | healthy*
# broken-download         | -       | 500MB  | 1h ago   | Unknown   | unknown | unhealthy

Design Philosophy:

unhealthy is a catch-all for anything not understood/supported (broken downloads, unknown formats, creative HuggingFace structures)
healthy guarantees the model will work with mlxk2 run
healthy* means files are intact but MLX runtime can't execute them (e.g., GGUF/PyTorch models, incompatible model_type, or mlx-lm version too old)

Note: JSON output is unaffected by these human-only filters and always includes full health/runtime data.

Logging & Debugging

MLX Knife 2.0 provides structured logging with configurable output formats and levels.

Log Levels

Control verbosity with --log-level (server mode):

# Default: Show startup, model loading, and errors
mlxk serve --log-level info

# Quiet: Only warnings and errors
mlxk serve --log-level warning

# Silent: Only errors
mlxk serve --log-level error

# Verbose: All logs including HTTP requests
mlxk serve --log-level debug

Log Level Behavior:

debug: All logs + Uvicorn HTTP access logs (GET /v1/models, etc.)
info: Application logs (startup, model switching, errors) + HTTP access logs
warning: Only warnings and errors (no startup messages, no HTTP access logs)
error: Only error messages

JSON Logs (Machine-Readable)

Enable structured JSON output for log aggregation tools:

# JSON logs (recommended - CLI flag)
mlxk serve --log-json

# JSON logs (alternative - environment variable)
MLXK2_LOG_JSON=1 mlxk serve

Note: --log-json also formats Uvicorn access logs as JSON for consistent output.

JSON Format:

{"ts": 1760830072.96, "level": "INFO", "msg": "MLX Knife Server 2.0 starting up..."}
{"ts": 1760830073.14, "level": "INFO", "msg": "Switching to model: mlx-community/...", "model": "..."}
{"ts": 1760830074.52, "level": "ERROR", "msg": "Model type bert not supported.", "logger": "root"}

Fields:

ts: Unix timestamp
level: Log level (INFO, WARN, ERROR, DEBUG)
msg: Log message (HF tokens and user paths automatically redacted)
logger: Source logger (mlxk2 = application, root = external libraries like mlx-lm)
Additional fields: model, request_id, detail, duration_ms (context-dependent)

Security: Automatic Redaction

Sensitive data is automatically removed from logs:

HuggingFace tokens (hf_...) → [REDACTED_TOKEN]
User home paths (/Users/john/...) → ~/...

Example:

# Original (unsafe):
Using token hf_AbCdEfGhIjKlMnOpQrStUvWxYz123456 from /Users/john/models

# Logged (safe):
Using token [REDACTED_TOKEN] from ~/models

Configuration Reference

MLX Knife supports comprehensive runtime configuration via environment variables. All settings can be controlled without code changes.

Feature Gates

Enable experimental and alpha features:

Variable	Description	Default	Since
`MLXK2_ENABLE_ALPHA_FEATURES`	Enable alpha commands (`convert`)	`0` (disabled)	2.0.0
`MLXK2_ENABLE_PIPES`	Enable Unix pipe integration (`mlxk run <model> -`)	`0` (disabled)	2.0.4
`MLXK2_EXIF_METADATA`	Extract EXIF metadata from images (Vision models)	`1` (enabled)	2.0.4

Examples:

# Enable pipe mode for stdin processing
export MLXK2_ENABLE_PIPES=1
echo "Hello" | mlxk run model - "translate to Spanish"

# Disable EXIF extraction for privacy (enabled by default)
export MLXK2_EXIF_METADATA=0
mlxk run vision-model --image photo.jpg "describe this"

# Enable alpha features for convert command
export MLXK2_ENABLE_ALPHA_FEATURES=1
mlxk convert ./broken ./fixed --repair-index

Server Configuration

Control server behavior without command-line flags:

Variable	Description	Default	Since
`MLXK2_HOST`	Server bind address	`127.0.0.1`	2.0.0
`MLXK2_PORT`	Server port	`8000`	2.0.0
`MLXK2_PRELOAD_MODEL`	Model to load at startup (set by `--model` flag)	(none)	2.0.0-beta
`MLXK2_MAX_TOKENS`	Override default max_tokens for all requests	(auto)	2.0.4
`MLXK2_RELOAD`	Enable Uvicorn auto-reload (development only)	`0` (disabled)	2.0.0

Vision Processing

Control vision model behavior (Python 3.10+, beta):

Variable	Description	Default	Since
`MLXK2_VISION_CHUNK_SIZE`	Default chunk size for vision image processing	`1`	2.0.4-beta.7

Examples:

# Process 3 images per chunk instead of 1 (faster but requires more RAM)
export MLXK2_VISION_CHUNK_SIZE=3
mlxk run pixtral --image photos/*.jpg "Describe images"

# CLI flag overrides environment variable
mlxk run pixtral --chunk 5 --image photos/*.jpg "Describe images"  # Uses 5, not 3

Server Configuration Examples

# Custom host/port binding
MLXK2_HOST=0.0.0.0 MLXK2_PORT=9000 mlxk serve

# Preload model for faster first request
MLXK2_PRELOAD_MODEL="mlx-community/Qwen2.5-3B-Instruct-4bit" mlxk serve

# Override max_tokens for all requests
MLXK2_MAX_TOKENS=4096 mlxk serve

# Development mode with auto-reload
MLXK2_RELOAD=1 mlxk serve

Logging Configuration

Control log output format and verbosity:

Variable	Description	Default	Since
`MLXK2_LOG_JSON`	Enable JSON log format	`0` (text)	2.0.0
`MLXK2_LOG_LEVEL`	Log level (`debug`, `info`, `warning`, `error`)	`info`	2.0.0

Examples:

# JSON logs for log aggregation tools
MLXK2_LOG_JSON=1 mlxk serve

# Quiet mode (warnings and errors only)
MLXK2_LOG_LEVEL=warning mlxk serve

# Verbose debug output
MLXK2_LOG_LEVEL=debug mlxk serve

Note: CLI flags (--log-json, --log-level) take precedence over environment variables.

HuggingFace Integration

Control HuggingFace Hub authentication and cache:

Variable	Description	Default	Since
`HF_HOME`	HuggingFace cache directory	`~/.cache/huggingface`	N/A
`HF_TOKEN`	HuggingFace API token (for private models, `push`)	(none)	N/A
`HUGGINGFACE_HUB_TOKEN`	Alternative token variable (fallback)	(none)	N/A

Examples:

# Custom cache location
HF_HOME=/data/models mlxk list

# Authentication for private models
HF_TOKEN=hf_... mlxk pull org/private-model

# Upload to HuggingFace Hub (requires MLXK2_ENABLE_ALPHA_FEATURES=1)
HF_TOKEN=hf_... mlxk push ./workspace org/model --private

Configuration Priority

When multiple sources define the same setting, precedence order is:

CLI flags (highest priority) - e.g., --log-json, --port
Environment variables - e.g., MLXK2_LOG_JSON=1
Defaults (lowest priority) - documented above

Example:

# CLI flag wins over environment variable
MLXK2_PORT=9000 mlxk serve --port 8080  # Uses port 8080, not 9000

HuggingFace Cache Safety

MLX-Knife 2.0 respects standard HuggingFace cache structure and practices:

Best Practices for Shared Environments

Read operations (list, health, show) always safe with concurrent processes
Write operations (pull, rm) coordinate during maintenance windows
Lock cleanup automatic but avoid during active downloads
Your responsibility: Coordinate with team, use good timing

Example Safe Workflow

# Check what's in cache (always safe)
mlxk list --json | jq '.data.count'

# Maintenance window - coordinate with team
mlxk rm "corrupted-model" --json --force
mlxk pull "replacement-model" --json

# Back to normal operations
mlxk health --json | jq '.data.summary'

Workspace Features: `clone`, `push`, `convert`

Workspace Structure

A workspace is a self-contained directory containing model files in a flat structure (not the HuggingFace cache format). Workspaces are portable, editable, and can be health-checked standalone.

Structure:

workspace/
├── config.json              # Model configuration
├── tokenizer.json           # Tokenizer definition
├── tokenizer_config.json    # Tokenizer settings
├── model.safetensors        # Weights (single file)
├── (or model-*.safetensors) # Weights (multi-shard)
└── README.md                # Optional documentation

Key characteristics:

Aspect	Workspace	HuggingFace Cache
Structure	Flat, self-contained	Nested (hub/models--org--repo/snapshots/...)
Models	Exactly one model per workspace	Many models (models--org--repo1, models--org--repo2, ...)
Purpose	Portable working directory	Download cache (managed)
Health Check	Standalone (no cache needed)	Requires cache structure
Portability	Goal: USB stick, SMB share, any volume	Fixed location (HF_HOME)
Ownership	User owns files	Managed by HuggingFace Hub
Operations	`clone` (creates), `push` (uploads from)	`pull` (downloads to)

Portability (Phase 1 limitation):

Current: Same APFS volume as cache (CoW optimization)
Community Goal: Any location (USB stick, SMB share, different volumes)
Future: Cross-volume support planned

Typical workflow:

mlxk pull org/model → Downloads to cache
mlxk clone org/model workspace/ → Creates editable workspace copy
Edit files in workspace/ (modify config, quantize, etc.)
mlxk push workspace/ org/new-model → Upload modified version
(Optional) Copy workspace to USB stick for sharing

`clone` - Model Workspace Creation

mlxk clone creates a local workspace from a cached model for modification and development.

Creates isolated workspace from cached models
Supports APFS copy-on-write optimization on same-volume scenarios
Includes health check integration for workspace validation
Resumable: Interrupted pulls resume automatically
Use case: Fork-modify-push workflows

Example:

# Clone model to workspace
mlxk clone org/model ./workspace

`push` - Upload to Hub

mlxk push uploads a local folder to a Hugging Face model repository using huggingface_hub/upload_folder.

Requires HF_TOKEN (write-enabled).
Default branch: main (explicitly override with --branch).
Safety: --private is required to avoid accidental public uploads.
No validation or manifests. Basic hard excludes are applied by default: .git/**, .DS_Store, __pycache__/, common virtualenv folders (.venv/, venv/), and *.pyc.
.hfignore (gitignore-like) in the workspace is supported and merged with the defaults.
Repo creation: use --create if the target repo does not exist; harmless on existing repos. Missing branches are created during upload.
JSON output: includes commit_sha, commit_url, no_changes, uploaded_files_count (when available), local_files_count (approx), change_summary and a short message.
Quiet JSON by default: with --json (without --verbose) progress bars/console logs are suppressed; hub logs are still captured in data.hf_logs.
Human output: derived from JSON; add --verbose to include extras such as the commit URL or a short message variant. JSON schema is unchanged.
Local workspace check: use --check-only to validate a workspace without uploading. Produces workspace_health in JSON (no token/network required).
Dry-run planning: use --dry-run to compute a plan vs remote without uploading. Returns dry_run: true, dry_run_summary {added, modified:null, deleted}, and sample added_files/deleted_files.
Testing: see TESTING.md ("Push Testing (2.0)") for offline tests and opt-in live checks with markers/env.
Carefully review the result on the Hub after pushing.
Responsibility: You are responsible for complying with Hugging Face Hub policies and applicable laws (e.g., copyright/licensing) for any uploaded content.

Example:

# Upload to private repo
mlxk push --private ./workspace org/model --create --commit "init"

`convert` - Workspace Transformations (Experimental)

mlxk convert is experimental and requires MLXK2_ENABLE_ALPHA_FEATURES=1. Currently implements --repair-index for fixing safetensors index mismatches (mlx-vlm #624). Future modes like --quantize are planned but not yet implemented.

Use case: Repair models affected by mlx-vlm #624 conversion bug (7+ mlx-community Vision models).

Workflow:

# Enable alpha features (required for clone)
export MLXK2_ENABLE_ALPHA_FEATURES=1

# Clone affected model to workspace
mlxk clone mlx-community/Qwen2.5-VL-7B-Instruct-4bit ./ws-qwen

# Repair safetensors index (no weights changed)
mlxk convert ./ws-qwen ./ws-qwen-fixed --repair-index

# Verify health
mlxk health ./ws-qwen-fixed  # Should report healthy

Affected models (mlx-vlm #624):

Qwen2.5-VL-7B-Instruct-4bit
gemma-3-27b-it-4bit
Mistral-Small-3.1-24B-Instruct-2503-4bit
DeepSeek-OCR-4bit
Devstral-Small-2-24B-Instruct-2512-6bit
(7+ models total)

Key features:

Cache sanctity: Hard blocks writes to HF cache (workspaces only)
Workspace-to-workspace: Source can be managed or unmanaged, output always managed
Health check integration: Automatic validation (skip with --skip-health)
APFS CoW: Instant, space-efficient cloning via cp -c

Future modes: --quantize <bits> (text models), --dequantize (planned).

`pipe mode` - stdin for `run` (beta, `mlx-run` shorthand)

Pipe mode is beta (feature complete) and requires MLXK2_ENABLE_PIPES=1. It lets mlxk run (and mlx-run) read stdin when you pass - as the prompt.

Status: Beta (feature complete), API stable (syntax will not change)
Gate: MLXK2_ENABLE_PIPES=1 (will become default in a future stable release)
Auto-batch: When stdout is a pipe (non-TTY), streaming is disabled automatically for clean output
Robust: Handles SIGPIPE and BrokenPipeError gracefully (| head, | grep -m1 work correctly)
Scope: Applies to mlxk run and mlx-run; other commands unchanged
Usage examples (replace <model> with a cached MLX chat model):

# stdin + trailing text (batch when piped)
MLXK2_ENABLE_PIPES=1 echo "from stdin" | mlxk run "<model>" - "append extra context"

# list → run summarization
MLXK2_ENABLE_PIPES=1 mlxk list --json \
  | MLXK2_ENABLE_PIPES=1 mlxk run "<model>" - "Summarize the model list as a concise table." >my-hf-table.md

# Wrapper shorthand
MLXK2_ENABLE_PIPES=1 mlx-run "<model>" - "translate into german" < README.md

# Vision → Text chain: Photo tour review
MLXK2_ENABLE_PIPES=1 mlxk run pixtral --image photos/*.jpg "Describe each picture" \
  | MLXK2_ENABLE_PIPES=1 mlxk run qwen3 - \
    "Write a tour review. Create a table with picture names, metadata, and descriptions." \
  > tour-review.md

Testing

The 2.0 test suite runs by default (pytest discovery points to tests_2.0/):

# Run 2.0 tests (default)
pytest -v

# Explicitly run legacy 1.x tests (not maintained on this branch)
pytest tests/ -v

# Test categories (2.0 example):
# - ADR-002 edge cases
# - Integration scenarios
# - Model naming logic
# - Robustness testing

# Current status: all current 2.0 tests pass (some optional schema tests may be skipped without extras)

Test Architecture:

Isolated Cache System - Zero risk to user data
Atomic Context Switching - Production/test cache separation
Mock Models - Realistic test scenarios
Edge Case Coverage - All documented failure modes tested

Compatibility Notes

Streaming note: Some UIs buffer SSE; verify real-time with curl -N. Server sends clear interrupt markers on abort.

Contributing

This branch follows the established MLX-Knife development patterns:

# Run quality checks
python test-multi-python.sh  # Tests across Python 3.9-3.14
./run_linting.sh             # Code quality validation

# Key files:
mlxk2/                       # 2.0.0 implementation
tests_2.0/                   # 2.0 test suite
docs/ADR/                    # Architecture decision records

See CONTRIBUTING.md for detailed guidelines.

Support & Feedback

Issues: GitHub Issues
Discussions: GitHub Discussions
API Specification: JSON API Specification
Documentation: See docs/ directory for technical details
Security Policy: See SECURITY.md

License

Apache License 2.0 — see LICENSE (root) and mlxk2/NOTICE.

Acknowledgments

Built for Apple Silicon using the MLX framework
Models hosted by the MLX Community on HuggingFace
Inspired by ollama's user experience

Made with ❤️ by The BROKE team
Version 2.0.4 | February 2026
💬 Web UI: nChat - lightweight chat interface • 🔮 Multi-node: BROKE Cluster

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

MLX-Knife 2.0

Features

What's New in 2.0.4

Core Functionality

Unix Pipe Integration (Beta, 2.0.4)

Requirements

⚖️ Model Usage and Licenses

Python Compatibility

Installation

1. PyPI (Recommended)

2. Developer Installation

Migrating from 1.x

Quick Start

Commands

Model References

HuggingFace Models (Cache)

Local Paths (2.0.4-beta.6+)

Workspace Development Workflow (2.0.4-beta.6+)

Workspace Commands Reference

Web Interface

Multi-Modal Support

Vision (Beta)

Requirements

Usage

Batch Processing

Metadata Output Format

Limitations

Server API

Model Compatibility

Audio Transcription (Speech-to-Text)

JSON API

Command Structure

Examples

List Models

Health Check

Show Model Details

Integration Examples

Broke-Cluster Integration

CI/CD Pipeline Usage

Model Management Automation

Human Output

Basic Usage

Pull Command

List Filters

Health Status Display (--health flag)

Logging & Debugging

Log Levels

JSON Logs (Machine-Readable)

Security: Automatic Redaction

Configuration Reference

Feature Gates

Server Configuration

Vision Processing

Server Configuration Examples

Logging Configuration

HuggingFace Integration

Configuration Priority

HuggingFace Cache Safety

Best Practices for Shared Environments

Example Safe Workflow

Workspace Features: clone, push, convert

Workspace Structure

clone - Model Workspace Creation

push - Upload to Hub

convert - Workspace Transformations (Experimental)

pipe mode - stdin for run (beta, mlx-run shorthand)

Testing

Compatibility Notes

Contributing

Support & Feedback

License

Acknowledgments

Workspace Features: `clone`, `push`, `convert`

`clone` - Model Workspace Creation

`push` - Upload to Hub

`convert` - Workspace Transformations (Experimental)

`pipe mode` - stdin for `run` (beta, `mlx-run` shorthand)