MLX Vision is a Swift library for running computer vision models on Apple Silicon. The library provides a flexible API for various tasks: image classification, object detection, image segmentation, zero-shot classification, zero-shot segmentation, and embedding extraction.
| Task | Models |
|---|---|
| Image Classification | ResNet, EfficientNet |
| Object Detection | DETR, RF-DETR, LW-DETR, RT-DETR |
| Instance Segmentation | DETR, RF-DETR |
| Zero-shot Classification | CLIP, SigLIP |
| Zero-shot Segmentation | SAM 3 |
| Embeddings Extraction | CLIP, SigLIP |
Add the following to your `Package.swift` file:

```swift
dependencies: [
    .package(url: "https://github.com/petrukha-ivan/mlx-swift-vision", from: "0.0.1")
]
```

Then add the library as a dependency for your targets:

```swift
dependencies: [
    .product(name: "MLXVision", package: "mlx-swift-vision")
]
```

Model loading is unified: pass a model ID and a task type. The factory resolves the model architecture and returns a typed pipeline with input and output types bound to the task.
```swift
import MLXVision

let factory = ModelFactory.shared
let model = try await factory.load("microsoft/resnet-50", for: ImageClassificationTask.self)
```

You can override the default model configuration. By default, all models are loaded in bfloat16. For example, with the following configuration it is possible to run the SAM 3 model even on an iPhone:
```swift
let inputSize = CGSize(width: 336.0, height: 336.0)
let overrides = ModelOverrides(inputSize: inputSize, quantizeBits: 4)
let factory = ModelFactory.shared
let model = try await factory.load("facebook/sam3", for: ZeroShotSegmentationTask.self, overrides: overrides)
```

Image classification:

```swift
let request = ImageClassificationRequest(image: image)
let results = try model(request).top(5) // Top 5 items sorted by score
let summary = results.map { "\($0.label) - \($0.score)" } // Labels with scores
```

Object detection:

```swift
let request = ObjectDetectionRequest(image: image)
let result = try model(request).top(1)[0] // Top detection result
let bbox = result.bbox // Normalized CGRect
```

Instance segmentation:

```swift
let request = InstanceSegmentationRequest(image: image)
let result = try model(request).top(1)[0] // Top segmentation result
let mask = result.mask // Grayscale CIImage mask
```

Zero-shot classification:

```swift
let labels = ["cat", "dog", "car"]
let request = ZeroShotClassificationRequest(image: image, labels: labels)
let results = try model(request) // 3 items with scores for each provided label
```

Zero-shot segmentation:

```swift
let prompt = "orange cat"
let request = ZeroShotSegmentationRequest(image: image, prompt: prompt)
let results = try model(request) // N items with masks matching the prompt description
```

Models return raw results: classification scores, normalized bounding boxes, and segmentation masks. You can process these results manually, or use the image annotators:

```swift
let request = ObjectDetectionRequest(image: image)
let results = try model(request)
let annotator = BoxAnnotator(lineWidth: 8.0) // See also LabelAnnotator, MaskAnnotator
let annotatedImage = annotator.annotate(image: image, detections: results)
```

This repository includes a fully featured iOS/macOS example app with more usage examples, including photo-library and live-camera processing to test real-time model performance. Build the project in the Release configuration for the best performance.
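If you process detection results yourself rather than using an annotator, remember that the returned boxes are normalized to the range [0, 1]. A minimal sketch of scaling a normalized box to pixel coordinates; the `denormalize` helper is ours for illustration, not part of the library API:

```swift
import Foundation

// Convert a normalized bounding box (origin and size in [0, 1]) into
// pixel coordinates for an image of the given size. Hypothetical helper;
// the library returns normalized CGRects and leaves scaling to you.
func denormalize(_ bbox: CGRect, to imageSize: CGSize) -> CGRect {
    CGRect(
        x: bbox.origin.x * imageSize.width,
        y: bbox.origin.y * imageSize.height,
        width: bbox.size.width * imageSize.width,
        height: bbox.size.height * imageSize.height
    )
}

// Example: a box covering the center quarter of a 640x480 image.
let normalized = CGRect(x: 0.25, y: 0.25, width: 0.5, height: 0.5)
let pixels = denormalize(normalized, to: CGSize(width: 640, height: 480))
// pixels == CGRect(x: 160, y: 120, width: 320, height: 240)
```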
These are not rigorous benchmarks; the numbers below are approximate live-camera inference measurements. All models were tested in bfloat16 without quantization.
Metrics on M3 Max:
| Model | Input Size | Average Processing Time | Frames per Second |
|---|---|---|---|
| roboflow/rf-detr-nano | 384x384 | 10 ms | 102 |
| roboflow/rf-detr-seg-nano | 384x384 | 21 ms | 48 |
| facebook/detr-resnet-50 | 448x448 | 12 ms | 82 |
| facebook/detr-resnet-50-panoptic | 448x448 | 48 ms | 20 |
| facebook/sam3 | 336x336 | 120 ms | 8 |
Metrics on an iPhone 16 Pro Max are lower, but still practical even for interactive use:
| Model | Input Size | Average Processing Time | Frames per Second |
|---|---|---|---|
| roboflow/rf-detr-nano | 384x384 | 32 ms | 31 |
| roboflow/rf-detr-seg-nano | 384x384 | 85 ms | 11 |
| facebook/detr-resnet-50 | 448x448 | 40 ms | 25 |
| facebook/detr-resnet-50-panoptic | 448x448 | 230 ms | 4 |
| facebook/sam3 | 336x336 | 500 ms | 2 |
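As a sanity check on the tables above, the FPS column is roughly the reciprocal of the average processing time; small differences come from measurement overhead. A quick back-of-the-envelope conversion:

```swift
// Approximate frames per second from an average per-frame latency in milliseconds.
func approxFPS(latencyMs: Double) -> Double {
    1000.0 / latencyMs
}

// facebook/detr-resnet-50-panoptic on M3 Max: 48 ms per frame.
let fps = approxFPS(latencyMs: 48.0)
// ≈ 20.8, in line with the ~20 FPS measured above
```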
This library does not redistribute model resources such as weights or tokenizers. If access to a model is gated, you must obtain an access token. While the library itself has a permissive license, you still have to comply with each model's own license.
This library is in an early stage of development. If you encounter a problem, please create an issue or open a pull request. Contributions are welcome!