RobotFlow-Labs/curobo-mlx

cuRobo-MLX

GPU-accelerated robot motion planning on Apple Silicon.

A port of NVIDIA cuRobo from CUDA to MLX, enabling real-time collision-free trajectory generation on M-series Macs.

Built by AIFLOW LABS / RobotFlow Labs


Why cuRobo-MLX

  • No NVIDIA GPU required -- run production-grade motion planning on any Apple Silicon Mac
  • Adapter architecture, not a fork -- upstream cuRobo stays read-only as a git submodule; zero merge conflicts on updates
  • Roughly 12,000 lines of CUDA replaced by 8 pure MLX kernels -- same algorithms, native Metal acceleration

Install

Requires macOS with Apple Silicon (M1/M2/M3/M4) and Python 3.10+.

# Clone with upstream submodule
git clone --recursive https://github.com/RobotFlow-Labs/curobo-mlx.git
cd curobo-mlx

# Install with uv (recommended)
uv sync

# Or with pip
pip install -e .

Quick Start

Forward Kinematics

import mlx.core as mx
from curobo_mlx.kernels.kinematics import forward_kinematics_batched

# Compute link poses for 100 joint configurations
q = mx.random.uniform(-3.14, 3.14, (100, 7))
link_poses, link_quats, spheres = forward_kinematics_batched(q, ...)
# link_poses: [100, n_links, 3]
# spheres:    [100, n_spheres, 4]
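
The kernel's real arguments are elided above, so the following is only a dependency-free sketch of what batched forward kinematics computes. It uses a hypothetical planar arm, and the names `planar_fk` and `planar_fk_batched` are illustrative; they are not part of `curobo_mlx`.

```python
import math

def planar_fk(q, link_lengths):
    """Return the (x, y) tip position of every link for one joint configuration."""
    x = y = theta = 0.0
    tips = []
    for angle, length in zip(q, link_lengths):
        theta += angle                  # joint angles accumulate along the chain
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        tips.append((x, y))
    return tips

def planar_fk_batched(qs, link_lengths):
    """Apply planar_fk to each configuration in a batch (a list of joint vectors)."""
    return [planar_fk(q, link_lengths) for q in qs]

# Three configurations of a 2-link arm with link lengths 1.0 and 0.5.
batch = [[0.0, 0.0], [math.pi / 2, 0.0], [math.pi / 2, -math.pi / 2]]
poses = planar_fk_batched(batch, link_lengths=[1.0, 0.5])
```

A GPU kernel would evaluate the whole batch in parallel rather than looping, and would chain full 3D transforms instead of planar angles.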

IK Solving

from curobo_mlx.api import IKSolver
from curobo_mlx.adapters.types import MLXPose
import mlx.core as mx

solver = IKSolver.from_robot_name("franka", num_seeds=32)
goal = MLXPose(
    position=mx.array([0.4, 0.0, 0.5]),
    quaternion=mx.array([1.0, 0.0, 0.0, 0.0]),
)
result = solver.solve(goal)
if result.success:
    print(f"Solution: {result.solution}")
    print(f"Error: {result.position_error * 1000:.1f}mm")
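
To illustrate why a solver might take a `num_seeds` argument, here is a minimal sketch of multi-seed IK on a hypothetical planar 2-link arm. It uses plain gradient descent, which is an assumption made for brevity and not the solver's actual optimizer; every name below is illustrative.

```python
import math
import random

L1, L2 = 1.0, 1.0  # link lengths of a hypothetical planar 2-link arm

def fk(q):
    """End-effector position for joint angles q = [q0, q1]."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return x, y

def solve_from_seed(seed, goal, steps=500, lr=0.2):
    """Gradient descent on 0.5 * ||fk(q) - goal||^2 starting from one seed."""
    q = list(seed)
    for _ in range(steps):
        x, y = fk(q)
        ex, ey = x - goal[0], y - goal[1]
        s1, c1 = math.sin(q[0]), math.cos(q[0])
        s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
        # Gradient is J^T e, with J the 2x2 position Jacobian.
        g0 = (-L1 * s1 - L2 * s12) * ex + (L1 * c1 + L2 * c12) * ey
        g1 = (-L2 * s12) * ex + (L2 * c12) * ey
        q[0] -= lr * g0
        q[1] -= lr * g1
    x, y = fk(q)
    return q, math.hypot(x - goal[0], y - goal[1])

def solve_ik(goal, num_seeds=8, rng_seed=0):
    """Run the local solver from several random seeds and keep the best result."""
    rng = random.Random(rng_seed)
    seeds = [[rng.uniform(-math.pi, math.pi) for _ in range(2)]
             for _ in range(num_seeds)]
    return min((solve_from_seed(s, goal) for s in seeds), key=lambda r: r[1])

best_q, best_err = solve_ik(goal=(1.2, 0.5))
```

Each seed is an independent local optimization, so the seeds can be batched on the GPU; keeping the lowest-error result makes the solve less sensitive to a poor starting point.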

Motion Planning

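from curobo_mlx.adapters.types import MLXPose
from curobo_mlx.api import MotionGen
import mlx.core as mx

planner = MotionGen.from_robot_name("franka")
# Illustrative inputs; the expected start_config shape is an assumption
start_config = mx.array([0.0, -0.785, 0.0, -2.356, 0.0, 1.571, 0.785])
goal_pose = MLXPose(position=mx.array([0.4, 0.0, 0.5]), quaternion=mx.array([1.0, 0.0, 0.0, 0.0]))
result = planner.plan(start_config, goal_pose)
print(f"Trajectory: {result.trajectory.shape}")  # [T, 7]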

Architecture

cuRobo-MLX wraps upstream cuRobo through an adapter layer. The upstream repo is a read-only git submodule -- configs, URDFs, and robot definitions are reused directly while all CUDA kernels are replaced with native MLX implementations.
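
One way to picture the adapter idea is a registry that maps a kernel name to whichever backend implements it, so calling code never depends on a specific backend. The sketch below is purely illustrative: `register_kernel`, `call_kernel`, and the toy `pose_distance` backend are hypothetical names and do not describe how `curobo_mlx` is actually wired.

```python
from typing import Callable, Dict

# Hypothetical registry: kernel name -> replacement implementation.
_KERNELS: Dict[str, Callable] = {}

def register_kernel(name: str):
    """Decorator that registers a backend implementation under a kernel name."""
    def decorator(fn: Callable) -> Callable:
        _KERNELS[name] = fn
        return fn
    return decorator

@register_kernel("pose_distance")
def pose_distance_backend(a, b):
    # Toy stand-in: Euclidean distance between two positions.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def call_kernel(name: str, *args):
    """Dispatch to the registered backend, failing loudly if none exists."""
    try:
        kernel = _KERNELS[name]
    except KeyError:
        raise NotImplementedError(f"no backend registered for kernel {name!r}") from None
    return kernel(*args)

distance = call_kernel("pose_distance", (0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
```

With this kind of indirection the upstream code and configs can stay untouched while only the registered implementations change.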

graph TB
    subgraph API["User API"]
        IK["IKSolver"]
        TO["TrajOptSolver"]
        MG["MotionGen"]
    end

    subgraph AD["Adapters"]
        RM["Robot Model"]
        CO["Cost Functions"]
        DY["Dynamics"]
        OP["Optimizers"]
    end

    subgraph KN["MLX Kernels"]
        FK["Forward Kinematics"]
        CD["Collision Detection"]
        SC["Self-Collision"]
        PD["Pose Distance"]
        LB["L-BFGS"]
        LS["Line Search"]
        TS["Tensor Step"]
        UB["Update Best"]
    end

    subgraph UP["Upstream cuRobo"]
        YC["YAML Configs"]
        UR["URDF Assets"]
        RB["Robot Models"]
    end

    IK --> RM
    IK --> CO
    TO --> DY
    TO --> OP
    MG --> IK
    MG --> TO

    RM --> FK
    CO --> PD
    CO --> CD
    CO --> SC
    OP --> LB
    OP --> LS
    OP --> TS
    OP --> UB

    FK --> YC
    CD --> UR
    RM --> RB

    style API fill:#2563eb,stroke:#1e40af,color:#fff
    style AD fill:#16a34a,stroke:#15803d,color:#fff
    style KN fill:#ea580c,stroke:#c2410c,color:#fff
    style UP fill:#6b7280,stroke:#4b5563,color:#fff

Motion Planning Pipeline

flowchart LR
    A["Start Config"] --> B["IK Solver"]
    B --> C{"IK Success?"}
    C -->|Yes| D["TrajOpt"]
    C -->|No| E["Return Error"]
    D --> F{"Collision Free?"}
    F -->|Yes| G["Smooth Trajectory"]
    F -->|No| H["Re-optimize"]
    H --> D

    style A fill:#2563eb,color:#fff
    style G fill:#16a34a,color:#fff
    style E fill:#dc2626,color:#fff
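
The control flow in the flowchart can be sketched in a few lines. The three callables are hypothetical stand-ins rather than `curobo_mlx` functions, and the `max_attempts` bound on the re-optimize loop is an assumption; the diagram itself does not state a limit.

```python
def plan_motion(start, goal, solve_ik, optimize_trajectory, is_collision_free,
                max_attempts=3):
    """IK first, then trajectory optimization with bounded retries.

    Returns a trajectory, or None when IK fails or no attempt is collision free.
    """
    q_goal = solve_ik(goal)
    if q_goal is None:                      # "IK Success?" -> No
        return None
    trajectory = None
    for _ in range(max_attempts):           # bounded "Re-optimize" loop
        trajectory = optimize_trajectory(start, q_goal, trajectory)
        if is_collision_free(trajectory):   # "Collision Free?" -> Yes
            return trajectory
    return None

# Toy stand-ins so the control flow can be exercised end to end.
def toy_ik(goal):
    return list(goal)

def toy_trajopt(start, q_goal, previous, steps=5):
    # Straight-line interpolation in joint space; ignores the warm start.
    return [[s + (g - s) * t / (steps - 1) for s, g in zip(start, q_goal)]
            for t in range(steps)]

def toy_collision_check(trajectory):
    return all(abs(q) <= 3.14 for point in trajectory for q in point)

path = plan_motion([0.0, 0.0], [1.0, -1.0], toy_ik, toy_trajopt, toy_collision_check)
```

Passing the previous trajectory back into the optimizer leaves room for warm-starting each retry, which is one plausible reading of the loop from "Re-optimize" back to "TrajOpt".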

Performance

Benchmarked on Apple M-series (unified memory architecture):

| Operation | B=1 | B=100 | B=1000 |
|---|---|---|---|
| Forward Kinematics (7-DOF) | 1.3 ms | 2.3 ms | 6.0 ms |
| Collision Check (52 spheres x 20 obstacles) | 0.8 ms | 5.9 ms | -- |
| L-BFGS Iteration | 0.2 ms | -- | -- |
| MPPI Iteration (128 particles) | 0.3 ms | -- | -- |

All timings include MLX graph compilation. Latency grows sub-linearly with batch size because the Metal GPU processes batch elements in parallel.
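
For reproducing numbers like these, a small timing harness is usually enough. The sketch below is generic Python, not the project's benchmark code. Note that MLX evaluates lazily, so a timed function should end with `mx.eval(...)` on its outputs; with the default `warmup=0` the first-call costs are included, as in the table.

```python
import statistics
import time

def benchmark(fn, *args, warmup=0, repeats=20):
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)              # optional warm-up to exclude one-time costs
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)              # for MLX, fn should force evaluation via mx.eval
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Per-configuration cost implied by the forward-kinematics row of the table.
fk_ms = {1: 1.3, 100: 2.3, 1000: 6.0}
per_config_us = {batch: ms * 1e3 / batch for batch, ms in fk_ms.items()}

elapsed_ms = benchmark(sum, range(10_000))
```

The per-configuration cost falls from roughly 1300 microseconds at B=1 to about 23 at B=100 and about 6 at B=1000, which is the sub-linear scaling in concrete terms.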

How MLX Uses the GPU

MLX runs tensor operations on Apple Silicon's Metal GPU by default. The key advantage for robotics is unified memory -- CPU and GPU share the same physical RAM, which removes the host-to-device transfers that can dominate real-time control loops on discrete-GPU systems (where CUDA data typically has to cross PCIe between CPU and GPU memory).

┌──────────────────────────────────────────────┐
│            Apple Silicon (M1/M2/M3/M4)       │
│                                              │
│   ┌──────────┐    ┌──────────────────────┐   │
│   │ CPU Cores│    │   Metal GPU Cores    │   │
│   │ (P+E)    │    │   (up to 40 cores)   │   │
│   └────┬─────┘    └──────────┬───────────┘   │
│        │                     │               │
│        └──────────┬──────────┘               │
│                   │                          │
│        ┌──────────▼──────────┐               │
│        │   Unified Memory    │ ◄── zero-copy │
│        │   (shared RAM)      │               │
│        └─────────────────────┘               │
└──────────────────────────────────────────────┘

GPU speedup at scale (matmul benchmark):

| Matrix Size | GPU | CPU | Speedup |
|---|---|---|---|
| 1000 x 1000 | 3.9 ms | 1.3 ms | 0.3x (overhead dominates) |
| 3000 x 3000 | 10.2 ms | 35.2 ms | 3.4x |
| 5000 x 5000 | 43.9 ms | 155.0 ms | 3.5x |

cuRobo-MLX benefits from GPU acceleration in batched FK (100+ configs), collision checking (52 spheres x 20 obstacles), and MPPI sampling (128+ particles).
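
The speedup column in the matmul table is the CPU time divided by the GPU time. The short sketch below recomputes it from the table values to show where the GPU starts to pay off; the variable names are illustrative.

```python
# (matrix size, GPU ms, CPU ms) taken from the matmul table above.
matmul_ms = [
    (1000, 3.9, 1.3),
    (3000, 10.2, 35.2),
    (5000, 43.9, 155.0),
]

# Speedup > 1 means the GPU is faster; < 1 means launch overhead dominates.
speedups = {size: cpu / gpu for size, gpu, cpu in matmul_ms}

# Sizes in the table where dispatching to the GPU is a net win.
gpu_wins = [size for size, s in speedups.items() if s > 1.0]
```

The ratios come out to roughly 0.33, 3.45, and 3.53, so in this benchmark the GPU only wins once the problem is large enough to amortize dispatch overhead. The same reasoning is why the batched workloads listed above benefit most.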


CUDA to MLX Kernel Port

Eight CUDA kernel files (approximately 12,000 lines) were replaced by pure MLX implementations:

gantt
    title CUDA Kernel Port -- 12K Lines to 8 MLX Modules
    dateFormat X
    axisFormat %s

    section Geometry
    Forward Kinematics     :done, fk, 0, 2200
    Collision Detection    :done, cd, 0, 2800
    Self-Collision         :done, sc, 0, 1500
    Pose Distance          :done, pd, 0, 1200

    section Optimization
    L-BFGS                 :done, lb, 0, 1800
    Line Search            :done, ls, 0, 1200
    Tensor Step            :done, ts, 0, 800
    Update Best            :done, ub, 0, 500

| MLX Kernel | Replaces (CUDA) | Approximate CUDA Lines |
|---|---|---|
| kinematics.py | kinematics_fused_cu.cpp | ~2,200 |
| collision.py | geom_cu.cpp (sphere-OBB) | ~2,800 |
| self_collision.py | self_collision_cu.cpp | ~1,500 |
| pose_distance.py | pose_distance_cu.cpp | ~1,200 |
| lbfgs.py | lbfgs_cu.cpp | ~1,800 |
| line_search.py | line_search_cu.cpp | ~1,200 |
| tensor_step.py | tensor_step_cu.cpp | ~800 |
| update_best.py | update_best_cu.cpp | ~500 |
| Total | | ~12,000 |
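
As an illustration of what a sphere-OBB collision kernel evaluates, here is a minimal sketch of the standard signed distance between a sphere and an oriented bounding box. This is the textbook formula, written in plain Python under the assumption that the box rotation is a 3x3 row-major matrix whose columns are the box axes in world coordinates; it is not the code in `collision.py`.

```python
import math

def sphere_obb_distance(center, radius, box_center, box_rot, half_extents):
    """Signed distance from a sphere's surface to an oriented box.

    Negative values mean the sphere penetrates the box.
    """
    # Express the sphere center in the box frame: p_local = R^T (p - c).
    d = [center[i] - box_center[i] for i in range(3)]
    local = [sum(box_rot[r][c] * d[r] for r in range(3)) for c in range(3)]
    # Per-axis offset from the box faces; positive components lie outside.
    q = [abs(local[i]) - half_extents[i] for i in range(3)]
    outside = math.sqrt(sum(max(v, 0.0) ** 2 for v in q))
    inside = min(max(q), 0.0)
    return outside + inside - radius

def min_clearance(spheres, boxes):
    """Smallest signed distance over every sphere/box pair."""
    return min(
        sphere_obb_distance(c, r, bc, rot, he)
        for (c, r) in spheres
        for (bc, rot, he) in boxes
    )

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

spheres = [((2.0, 0.0, 0.0), 0.5), ((0.0, 3.0, 0.0), 0.25)]
boxes = [((0.0, 0.0, 0.0), IDENTITY, (1.0, 1.0, 1.0))]
clearance = min_clearance(spheres, boxes)
```

A robot approximated by 52 spheres in a scene of 20 boxes needs 1,040 such pair evaluations per configuration, all independent of each other, which is why this workload maps well to a GPU.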

Supported Robots

Any robot with a URDF file is supported. Pre-configured robots from upstream cuRobo:

| Robot | DOF | Config |
|---|---|---|
| Franka Emika Panda | 7 | franka.yml |
| Universal Robots UR5e | 6 | ur5e.yml |
| Universal Robots UR10e | 6 | ur10e.yml |
| Kinova Gen3 | 7 | kinova_gen3.yml |
| KUKA iiwa | 7 | iiwa.yml |

Additional robots available in repositories/curobo-upstream/src/curobo/content/configs/robot/.


Examples

| File | Description |
|---|---|
| 00_quickstart.py | System check -- verify everything works |
| 01_forward_kinematics.py | Batch FK with timing and sweep |
| 02_collision_checking.py | Sphere-OBB collision detection |
| 03_ik_solver.py | Inverse kinematics with 32 seeds |
| 04_self_collision.py | Self-collision detection |
| 05_trajectory_optimization.py | Trajectory optimization pipeline |
| 06_motion_planning.py | Full IK + TrajOpt pipeline |
| 07_benchmark_quick.py | Performance benchmark on your machine |

# Start here
uv run python examples/00_quickstart.py

Development

# Install dev dependencies
uv sync --extra dev

# Run tests (343 tests)
uv run pytest tests/ -q

# Run benchmarks
uv run python benchmarks/run_all.py

# Sync upstream cuRobo
git submodule update --remote repositories/curobo-upstream

Project Structure

curobo-mlx/
  src/curobo_mlx/
    api/                 # High-level solvers: IKSolver, TrajOpt, MotionGen
    adapters/            # Robot model, cost functions, dynamics, optimizers
      costs/             # Pose, collision, self-collision, bound, stop costs
      optimizers/        # L-BFGS optimizer, MPPI, solver base
    kernels/             # 8 MLX kernels replacing CUDA
    curobolib/           # Drop-in bridge matching upstream cuRobo API
    util/                # Config loading, profiling
  tests/                 # 343 unit and integration tests
  benchmarks/            # Performance benchmarks
  examples/              # Runnable usage examples
  repositories/
    curobo-upstream/     # Upstream cuRobo (read-only submodule)

Contributing

Fork, branch, test (uv run pytest tests/ -q), and open a PR. Code style: Ruff, line length 100.


Citation

If you use this work, please cite both cuRobo and cuRobo-MLX:

@misc{curobo_mlx2026,
    title={cuRobo-MLX: GPU-Accelerated Motion Planning on Apple Silicon},
    author={AIFLOW LABS},
    year={2026},
    url={https://github.com/RobotFlow-Labs/curobo-mlx}
}

@misc{curobo_report23,
    title={cuRobo: Parallelized Collision-Free Minimum-Jerk Robot Motion Generation},
    author={Sundaralingam, Balakumar and others},
    year={2023},
    eprint={2310.17274},
    archivePrefix={arXiv}
}

License

MIT -- see LICENSE.

cuRobo upstream is subject to NVIDIA's license.
