MLX Vision

MLX Vision is a Swift library for running computer vision models on Apple Silicon. The library provides a flexible API for various tasks: image classification, object detection, image segmentation, zero-shot classification, zero-shot segmentation, and embedding extraction.

Supported Tasks and Models

Task	Models
Image Classification	ResNet, EfficientNet
Object Detection	DETR, RF-DETR, LW-DETR, RT-DETR
Instance Segmentation	DETR, RF-DETR
Zero-shot Classification	CLIP, SigLIP
Zero-shot Segmentation	SAM 3
Embeddings Extraction	CLIP, SigLIP

Installation

Add the following to your Package.swift file:

dependencies: [
    .package(url: "https://github.com/petrukha-ivan/mlx-swift-vision", from: "0.0.1")
]

Then add the library as a dependency for your targets:

dependencies: [
    .product(name: "MLXVision", package: "mlx-swift-vision")
]

Usage

Load a Model

Model loading is unified. Pass a model id and task type. The factory resolves the model architecture and returns a typed pipeline with input and output types bound to the task.

import MLXVision

let factory = ModelFactory.shared
let model = try await factory.load("microsoft/resnet-50", for: ImageClassificationTask.self)

Model Compression

By default, all models are loaded in bfloat16 format. You can override the default model configuration, for example to reduce the input size and quantize the weights. With the following configuration, it is possible to run the SAM 3 model even on an iPhone.

let inputSize = CGSize(width: 336.0, height: 336.0)
let overrides = ModelOverrides(inputSize: inputSize, quantizeBits: 4)

let factory = ModelFactory.shared
let model = try await factory.load("facebook/sam3", for: ZeroShotSegmentationTask.self, overrides: overrides)
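
To give a rough sense of why quantization helps on memory-constrained devices, the sketch below estimates weight storage at 16 bits versus 4 bits per parameter. It assumes that quantizeBits: 4 stores each quantized weight in 4 bits. The helper estimatedWeightBytes and the parameter count are hypothetical and are not part of the library or taken from any real model. The estimate ignores quantization scales and biases, as well as any layers left unquantized.

```swift
import Foundation

/// Approximate number of bytes needed to store `parameterCount` weights
/// at `bitsPerWeight` bits each, rounded up to a whole byte.
/// Hypothetical helper: ignores quantization metadata and unquantized layers.
func estimatedWeightBytes(parameterCount: Int, bitsPerWeight: Int) -> Int {
    (parameterCount * bitsPerWeight + 7) / 8
}

// Hypothetical parameter count, used only for illustration.
let parameterCount = 100_000_000

let bf16Bytes = estimatedWeightBytes(parameterCount: parameterCount, bitsPerWeight: 16)
let int4Bytes = estimatedWeightBytes(parameterCount: parameterCount, bitsPerWeight: 4)

print("bfloat16: \(bf16Bytes / 1_000_000) MB, 4-bit: \(int4Bytes / 1_000_000) MB")
```

Under these assumptions the 4-bit weights take about a quarter of the bfloat16 footprint. Real savings are usually somewhat smaller because of the extra quantization metadata.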

Run a Model

Image Classification

let request = ImageClassificationRequest(image: image)
let results = try model(request).top(5) // Top 5 items sorted by score
let summary = results.map { "\($0.label) - \($0.score)" } // Labels with scores

Object Detection

let request = ObjectDetectionRequest(image: image)
let result = try model(request).top(1)[0] // Top detection result
let bbox = result.bbox // Normalized CGRect
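
Because the bounding box is normalized, it usually has to be scaled to the image size before drawing or cropping. The following is a minimal sketch of that conversion. The helpers denormalize and flippedVertically are hypothetical and not part of the library. The README does not state which corner the normalized origin refers to, so the vertical flip is included in case the target coordinate space uses the opposite convention.

```swift
import Foundation

/// Scales a normalized rect (components in 0...1) to pixel coordinates.
/// Assumes the normalized rect and the image share the same origin corner.
func denormalize(_ bbox: CGRect, to imageSize: CGSize) -> CGRect {
    CGRect(
        x: bbox.origin.x * imageSize.width,
        y: bbox.origin.y * imageSize.height,
        width: bbox.size.width * imageSize.width,
        height: bbox.size.height * imageSize.height
    )
}

/// Flips a normalized rect vertically, converting between
/// top-left and bottom-left origin conventions.
func flippedVertically(_ bbox: CGRect) -> CGRect {
    CGRect(
        x: bbox.origin.x,
        y: 1.0 - bbox.origin.y - bbox.size.height,
        width: bbox.size.width,
        height: bbox.size.height
    )
}

// Example values chosen for illustration only.
let normalizedBox = CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25)
let pixelBox = denormalize(normalizedBox, to: CGSize(width: 1920, height: 1080))
print(pixelBox)
```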

Instance Segmentation

let request = InstanceSegmentationRequest(image: image)
let result = try model(request).top(1)[0] // Top segmentation result
let mask = result.mask // Grayscale CIImage mask

Zero-shot Classification

let labels = ["cat", "dog", "car"]
let request = ZeroShotClassificationRequest(image: image, labels: labels)
let results = try model(request) // 3 items with scores for each provided label

Zero-shot Segmentation

let prompt = "orange cat"
let request = ZeroShotSegmentationRequest(image: image, prompt: prompt)
let results = try model(request) // N items with masks matching prompt description

Image Annotation

Models return raw results: classification scores, normalized bounding boxes, and segmentation masks. You can process these results manually, or use image annotators:

let request = ObjectDetectionRequest(image: image)
let results = try model(request)

let annotator = BoxAnnotator(lineWidth: 8.0) // See also LabelAnnotator, MaskAnnotator
let annotatedImage = annotator.annotate(image: image, detections: results)
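
If you process results manually, a common first step is to drop low-confidence detections before drawing them. The sketch below shows one way to do that. The Detection struct is a stand-in defined here for illustration and is not the library's result type. Only bbox is shown on detection results in this README; the label and score properties are assumed by analogy with the classification results.

```swift
import Foundation

// Stand-in for a detection result, defined only for this example.
struct Detection {
    let label: String
    let score: Float
    let bbox: CGRect // Normalized rect
}

/// Keeps detections whose score is at least `minimumScore`,
/// sorted from highest to lowest score.
func confidentDetections(_ detections: [Detection], minimumScore: Float) -> [Detection] {
    detections
        .filter { $0.score >= minimumScore }
        .sorted { $0.score > $1.score }
}

// Sample data made up for illustration.
let sampleDetections = [
    Detection(label: "cat", score: 0.62, bbox: CGRect(x: 0.1, y: 0.1, width: 0.3, height: 0.4)),
    Detection(label: "dog", score: 0.91, bbox: CGRect(x: 0.5, y: 0.2, width: 0.4, height: 0.5)),
    Detection(label: "car", score: 0.18, bbox: CGRect(x: 0.0, y: 0.7, width: 0.2, height: 0.2)),
]

let kept = confidentDetections(sampleDetections, minimumScore: 0.5)
print(kept.map { $0.label })
```

The threshold of 0.5 is arbitrary. A suitable value depends on the model and on how costly false positives are in your application.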

Examples

This repository includes a fully featured iOS/macOS app. You can find more usage examples inside. It includes photo library processing and live-camera processing to test real-time model performance. Build the project in Release configuration to ensure the best performance.

Performance

This is not a comprehensive or rigorous benchmark; the numbers below are approximate live-camera inference measurements. Models were tested with the bfloat16 dtype and without quantization.

Metrics on M3 Max:

Model	Input Size	Average Processing Time	Frames per Second
roboflow/rf-detr-nano	384x384	10 ms	102
roboflow/rf-detr-seg-nano	384x384	21 ms	48
facebook/detr-resnet-50	448x448	12 ms	82
facebook/detr-resnet-50-panoptic	448x448	48 ms	20
facebook/sam3	336x336	120 ms	8

Metrics on iPhone 16 Pro Max are lower, but still practical for interactive use:

Model	Input Size	Average Processing Time	Frames per Second
roboflow/rf-detr-nano	384x384	32 ms	31
roboflow/rf-detr-seg-nano	384x384	85 ms	11
facebook/detr-resnet-50	448x448	40 ms	25
facebook/detr-resnet-50-panoptic	448x448	230 ms	4
facebook/sam3	336x336	500 ms	2
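
The two measured columns in the tables above are roughly reciprocal: frames per second is about 1000 divided by the average processing time in milliseconds. The sketch below shows that arithmetic. The helper framesPerSecond is hypothetical, and the assumption that small differences from the table come from rounding of the reported times is not confirmed by the measurements themselves.

```swift
import Foundation

/// Converts an average per-frame processing time in milliseconds
/// to an approximate throughput in frames per second.
func framesPerSecond(averageMilliseconds: Double) -> Double {
    1000.0 / averageMilliseconds
}

// Average processing times taken from the iPhone 16 Pro Max table above.
let timings: [(model: String, milliseconds: Double)] = [
    ("roboflow/rf-detr-nano", 32),
    ("facebook/detr-resnet-50", 40),
    ("facebook/sam3", 500),
]

for timing in timings {
    let fps = framesPerSecond(averageMilliseconds: timing.milliseconds)
    print("\(timing.model): \(fps.rounded()) fps")
}
```

The same relation is useful for budgeting: a 30 fps camera feed leaves about 33 ms per frame, so models with a longer average processing time will need to skip frames.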

Legal Notes

This library does not redistribute model resources such as weights or tokenizers. You must obtain an access token if model access is limited. While the library has a permissive license, you still have to comply with each model-specific license.

Troubleshooting

This library is in an early stage of development. If you encounter a problem, please create an issue or open a pull request. Contributions are welcome!
