🎉 V2 UPDATE: Low-Latency Video/Audio/Simulation Streaming! Now stream video, audio, and MuJoCo simulations back to Vision Pro via WebRTC — alongside the original hand tracking stream, from any machine on the network. Update the app, run `pip install --upgrade avp_stream`, and you're ready!
This VisionOS app and Python library allow anyone to get Head + Wrist + Hand Tracking from Vision Pro and stream back stereo video / audio feeds from your robot to Vision Pro.

For a more detailed explanation, check out this short paper.
- Bilateral Data Streams:
  - AVP → Python: Wrist, Finger, and Head Tracking
  - Python → AVP: Low-latency streaming of video, audio, and MuJoCo environments, presented in AR with RealityKit
- Direct UVC Camera Connection: A UVC camera can be connected directly to Vision Pro via its Developer Strap, which is useful for:
  - an extremely low-latency video streaming experience
  - recording human manipulation videos with egocentric RGB video and accurate hand/head tracking data, bypassing the limited access to RGB camera feeds on AVP
- Recording to iCloud Drive: The app can record the incoming/outgoing data streams (video + hand tracking) to your personal iCloud Drive for easy data syncing and sharing.
- Wired Network Connection: The app also supports a wired network connection (via the Developer Strap) to the computer running our Python client, for low-latency communication.
The best way to learn about this app is to go through the examples folder. It contains 13 examples covering hand tracking, video/audio streaming, and a sample teleoperation script that uses the AR simulation streaming feature.
Step 1. Install the app on Vision Pro
This app is officially on the VisionOS App Store! Search for Tracking Streamer on the App Store and install the app.

If you want to play around with the app, you can also build and install it yourself. To learn how, take a look at this documentation. This requires (a) an Apple Developer Account, (b) a Vision Pro Developer Strap, and (c) a Mac with Xcode installed.

After installation, open the app on Vision Pro. It should show something like this. Click START with Video Streaming if you want to stream videos back from your robot over WebRTC. Otherwise, click the left button. That's it!
Step 2. Install Python Library
The following Python package lets you receive the data stream on any device connected to the same WiFi network. Install the package:

```bash
pip install --upgrade avp_stream
```
Add this code snippet to any of your projects:

```python
from avp_stream import VisionProStreamer

avp_ip = "10.31.181.201"   # example IP
s = VisionProStreamer(ip=avp_ip)

while True:
    r = s.latest
    print(r['head'], r['right_wrist'], r['right_fingers'])
```

Stream your robot's video feed, audio, and even MuJoCo simulations back to Vision Pro via WebRTC. Make sure you upgrade both the Python library and the VisionOS app.
```python
from avp_stream import VisionProStreamer
from time import sleep

avp_ip = "10.31.181.201"   # Vision Pro IP (shown in the app)
s = VisionProStreamer(ip=avp_ip)

# Configure and start video streaming
s.configure_video(device="/dev/video0", format="v4l2", size="640x480", fps=30)
s.start_webrtc()   # Start streaming to Vision Pro

while True:
    r = s.latest
    print(r['head'], r['right_wrist'], r['right_fingers'])
    sleep(0.01)   # avoid blocking the loop and let frames be sent
```

Example 1: Camera with custom processing:
```python
s = VisionProStreamer(ip=avp_ip)
s.register_frame_callback(my_image_processor)   # Your processing function
s.configure_video(device="/dev/video0", format="v4l2", size="640x480", fps=30)
s.start_webrtc()
```

Example 2: Stereo camera (side-by-side):
```python
s = VisionProStreamer(ip=avp_ip)
s.configure_video(device="/dev/video0", format="v4l2", size="1920x1080", fps=30, stereo=True)
s.start_webrtc()
```

Example 3: Synthetic video:
```python
s = VisionProStreamer(ip=avp_ip)
s.register_frame_callback(render_simulation_frame)   # Generate frames programmatically
s.configure_video(size="1280x720", fps=60)   # No device = synthetic mode
s.start_webrtc()
```
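The callbacks named in these examples (`my_image_processor`, `render_simulation_frame`, and the visualizer/audio callbacks below) are user-supplied functions; see the examples folder for the exact signatures `register_frame_callback` expects. As a rough sketch of the synthetic-video case, assuming the callback only needs to produce an RGB frame of the configured size:

```python
import numpy as np

def render_simulation_frame(*_args):
    """Hypothetical synthetic-frame generator: returns a 720p RGB image.
    The real callback signature may differ; check the examples folder."""
    frame = np.zeros((720, 1280, 3), dtype=np.uint8)
    frame[:, :, 1] = 128   # flat green placeholder instead of a real render
    return frame
```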
Example 4: Audio streaming:

```python
s = VisionProStreamer(ip=avp_ip)
s.register_frame_callback(visualizer)
s.register_audio_callback(audio_generator)   # Generate or process audio
s.configure_video(size="1280x720", fps=60)
s.configure_audio(sample_rate=48000, stereo=True)
s.start_webrtc()
```

Stream your MuJoCo physics simulations directly into AR! The simulation scene is converted to USD and rendered natively on Vision Pro using RealityKit, with real-time pose updates streamed via WebRTC.
```python
import mujoco

model = mujoco.MjModel.from_xml_path("robot.xml")
data = mujoco.MjData(model)

s = VisionProStreamer(ip=avp_ip)
s.configure_sim("robot.xml", model, data, relative_to=[0, 1, 0.5, -90])
s.start_webrtc()

while True:
    mujoco.mj_step(model, data)
    s.update_sim()   # Stream updated poses to Vision Pro
```

Since AR blends your simulation with the real world, you need to decide where the simulation's world frame should be placed in your physical space. Use the `relative_to` parameter:
- 4-dim: `[x, y, z, yaw°]` — translation + rotation around the z-axis (degrees)
- 7-dim: `[x, y, z, qw, qx, qy, qz]` — translation + full quaternion orientation
```python
# Place world frame 0.8m above ground, rotated 90° around z-axis
s.configure_sim("robot.xml", model, data, relative_to=[0, 0, 0.8, 90])
```

Default Behavior: VisionOS automatically detects the physical ground and places the origin there (below your feet if standing, below your chair if sitting).
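If you need the 7-dim form, a yaw-only placement can be converted into an explicit quaternion. A minimal sketch (the `yaw_placement` helper is defined here for illustration and is not part of `avp_stream`):

```python
import math

def yaw_placement(x, y, z, yaw_deg):
    """Convert an [x, y, z, yaw°] placement into the equivalent
    7-dim [x, y, z, qw, qx, qy, qz] form (rotation about the z-axis only)."""
    half = math.radians(yaw_deg) / 2.0
    return [x, y, z, math.cos(half), 0.0, 0.0, math.sin(half)]

# Same placement as above: 0.8 m above ground, rotated 90° around z
s.configure_sim("robot.xml", model, data, relative_to=yaw_placement(0, 0, 0.8, 90))
```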
| Examples from MuJoCo Menagerie | Unitree G1 XML | Google Robot XML | ALOHA 2 XML |
|---|---|---|---|
| Visualization of world frame | World frame is attached to the "ground". | World frame is attached to the "ground". | World frame is attached to the "table". |
| Recommended `attach_to` | Default setting | Default setting | Offset in z-axis to bring the table surface up to a reasonable height in your real world. |
When using MuJoCo simulation streaming, you often want hand tracking data in the simulation's coordinate frame (not Vision Pro's native frame). By default, calling `configure_sim()` automatically sets `origin="sim"`, so hand positions are relative to your scene's world frame.
```python
s = VisionProStreamer(ip=avp_ip)
s.configure_sim("robot.xml", model, data, relative_to=[0, 0, 0.8, 90])
# origin is now "sim" — hand tracking is in simulation coordinates

# You can also switch manually:
s.set_origin("avp")   # Vision Pro's native coordinate frame
s.set_origin("sim")   # Simulation's coordinate frame (relative to attach_to)
```

| Origin | Hand Tracking Frame | Use Case |
|---|---|---|
| `"avp"` | Vision Pro ground frame | Default, general hand tracking |
| `"sim"` | Simulation world frame | Teleoperation, robot control in sim |
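Because the poses arrive already expressed in the simulation's world frame, they can feed a controller directly. A sketch of a teleoperation loop, continuing from the MuJoCo example above; `ik_and_gripper()` is a hypothetical stand-in for your own IK / control routine, and the 2 cm pinch threshold is an arbitrary choice:

```python
import mujoco
from avp_stream import VisionProStreamer

s = VisionProStreamer(ip=avp_ip)
s.configure_sim("scene.xml", model, data, relative_to=[0, 0, 0.8, 90])
s.start_webrtc()

while True:
    r = s.latest
    wrist = r['right_wrist'][0]              # (4, 4) pose in the simulation world frame
    target_pos = wrist[:3, 3]                # desired end-effector position
    target_rot = wrist[:3, :3]               # desired end-effector orientation
    close_gripper = r['right_pinch_distance'] < 0.02   # ~2 cm pinch threshold

    # ik_and_gripper() is a placeholder for your own IK / control stack
    data.ctrl[:] = ik_and_gripper(model, data, target_pos, target_rot, close_gripper)

    mujoco.mj_step(model, data)
    s.update_sim()
```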
Video Configuration (`configure_video`):

| Parameter | Description | Example |
|---|---|---|
| `device` | Camera device. `None` for synthetic frames | `"/dev/video0"`, `"0:none"` (macOS), `None` |
| `format` | Video format | `"v4l2"` (Linux), `"avfoundation"` (macOS) |
| `size` | Resolution as `"WxH"` | `"640x480"`, `"1280x720"`, `"1920x1080"` |
| `fps` | Frame rate | `30`, `60` |
| `stereo` | Side-by-side stereo video | `True`, `False` |
Audio Configuration (`configure_audio`):

| Parameter | Description | Example |
|---|---|---|
| `device` | Audio device. `None` for synthetic audio | `":0"` (default mic on macOS), `None` |
| `sample_rate` | Sample rate in Hz | `48000` |
| `stereo` | Stereo or mono audio | `True`, `False` |
Simulation Configuration (`configure_sim`):

| Parameter | Description | Example |
|---|---|---|
| `xml_path` | Path to MuJoCo XML scene | `"scene.xml"` |
| `model` | MuJoCo model object | `mujoco.MjModel` |
| `data` | MuJoCo data object | `mujoco.MjData` |
| `relative_to` | Scene placement `[x, y, z, yaw°]` or `[x, y, z, qw, qx, qy, qz]` | `[0, 1, 0.5, -90]` |
Note: Finding the right combination of `device`, `format`, `size`, and `fps` can be tricky since cameras only support certain combinations. Use this script to find valid configurations:

```bash
python test_video_devices.py --live \
    --device /dev/video0 --format v4l2 --size 640x480 --fps 30
```

`r = s.latest`

`r` is a dictionary containing the following data streamed from AVP:
```python
r['head']: np.ndarray              # shape (1,4,4) / measured from ground frame
r['right_wrist']: np.ndarray       # shape (1,4,4) / measured from ground frame
r['left_wrist']: np.ndarray        # shape (1,4,4) / measured from ground frame
r['right_fingers']: np.ndarray     # shape (25,4,4) / measured from right wrist frame
r['left_fingers']: np.ndarray      # shape (25,4,4) / measured from left wrist frame
r['right_pinch_distance']: float   # distance between right index tip and thumb tip
r['left_pinch_distance']: float    # distance between left index tip and thumb tip
r['right_wrist_roll']: float       # rotation angle of your right wrist around your arm axis
r['left_wrist_roll']: float        # rotation angle of your left wrist around your arm axis
```

Refer to the image below to see how the axes are defined for your head, wrist, and fingers.
Refer to the image above to see the order in which the joints are represented in each hand's skeleton.
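Since finger poses are reported relative to the wrist, and the wrist relative to the ground, composing the two homogeneous transforms gives finger poses in the ground frame. A small sketch (the joint index is a placeholder; look up the actual ordering in the skeleton image, and the 8 cm pinch span is an arbitrary choice):

```python
import numpy as np

r = s.latest

right_wrist = r['right_wrist'][0]                           # (4, 4), ground frame
right_fingers_in_ground = right_wrist @ r['right_fingers']  # (25, 4, 4) via broadcasting

JOINT = 0   # placeholder index; see the skeleton image for the actual joint order
joint_position = right_fingers_in_ground[JOINT, :3, 3]      # xyz of that joint in the ground frame

# Pinch distance makes a simple continuous gripper signal
grip = float(np.clip(1.0 - r['right_pinch_distance'] / 0.08, 0.0, 1.0))
```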
We performed comprehensive round-trip latency measurements to benchmark our video streaming system. The measurement captures the full cycle (sketched below):
- Python encodes a timestamp into a video frame as a marker
- WebRTC transmission happens over the network
- Vision Pro decodes the image and reads the marker ID
- Vision Pro sends the marker ID back to Python
- Python calculates the latency
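On the Python side, the bookkeeping boils down to remembering when each marker was sent and matching it against the echo. A schematic sketch; `stamp_frame` and `on_marker_echo` are hypothetical hooks standing in for how the real benchmark encodes markers and receives the echoed IDs (see the Benchmark Documentation for the actual implementation):

```python
import time

send_times = {}   # marker ID -> monotonic time when the frame carrying it was sent
latencies = []

def stamp_frame(frame, marker_id):
    """Encode marker_id into the frame (e.g. as a fiducial) and record the send time."""
    send_times[marker_id] = time.monotonic()
    # ... draw/encode marker_id into the pixels here ...
    return frame

def on_marker_echo(marker_id):
    """Called when Vision Pro echoes a decoded marker ID back to Python."""
    rtt = time.monotonic() - send_times.pop(marker_id)
    latencies.append(rtt)
    print(f"round-trip latency: {rtt * 1000:.1f} ms")
```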
This provides a conservative upper bound on user-experienced latency. In our own testing, the system consistently stays under 100 ms in both wired and wireless modes for resolutions under 720p. When wired (which requires the Developer Strap), you can get a stable 50 ms latency even for stereo 4K streaming.
For detailed methodology, test configurations, and complete results, see the Benchmark Documentation.
If you use this repository in your work, consider citing:
```bibtex
@software{park2024avp,
  title  = {Using Apple Vision Pro to Train and Control Robots},
  author = {Park, Younghyo and Agrawal, Pulkit},
  year   = {2024},
  url    = {https://github.com/Improbable-AI/VisionProTeleop},
}
```
We acknowledge support from Hyundai Motor Company and ARO MURI grant number W911NF-23-1-0277.







