🎉 V2 UPDATE: Low-Latency Video/Audio/Simulation Streaming! Now stream video, audio, and MuJoCo simulations back to Vision Pro via WebRTC — alongside the original hand tracking stream, from any machine on the network. Update the app, run `pip install --upgrade avp_stream`, and you're ready!
This VisionOS app and Python library allow anyone to get Head + Wrist + Hand Tracking from Vision Pro and stream back stereo video / audio feeds from your robot to Vision Pro.

For a more detailed explanation, check out this short paper.
- Bilateral Data Streams:
  - AVP → Python: Wrist, Finger, and Head Tracking
  - Python → AVP: Low-latency streaming of video, audio, and MuJoCo environments, presented in AR with RealityKit
- Direct UVC Camera Connection: A UVC camera can be connected directly to Vision Pro via its Developer Strap, which is useful for:
  - an extremely low-latency video streaming experience
  - recording human manipulation videos with egocentric RGB video and accurate hand/head tracking data, bypassing the limited access to RGB camera feeds on AVP
- Recording to iCloud Drive: The app can record the incoming/outgoing data streams (video + hand tracking) to your personal iCloud Drive for easy data syncing and sharing.
- Wired Network Connection: The app also supports a wired network connection (via the Developer Strap) to the computer running our Python client, for low-latency communication.
The best way to learn about this app is to go through the examples folder. It contains 13 examples covering hand tracking, video/audio streaming, and a sample teleoperation script that uses the AR simulation streaming feature.
Step 1. Install the app on Vision Pro
This app is officially on the VisionOS App Store! Search for Tracking Streamer on the App Store and install the app.

If you want to play around with the app, you can also build and install it yourself. To learn how, take a look at this documentation. This requires (a) an Apple Developer Account, (b) a Vision Pro Developer Strap, and (c) a Mac with Xcode installed.

After installation, open the app on Vision Pro. It should show something like this. Click START with Video Streaming if you want to stream videos back from your robot over WebRTC. Otherwise, click the left button. That's it!
Step 2. Install Python Library
The following Python package lets you receive the data stream on any device connected to the same WiFi network. Install the package:

```bash
pip install --upgrade avp_stream
```
Add this code snippet to any of your projects:

```python
from avp_stream import VisionProStreamer

avp_ip = "10.31.181.201"   # example IP
s = VisionProStreamer(ip=avp_ip)

while True:
    r = s.latest
    print(r['head'], r['right_wrist'], r['right_fingers'])
```

Stream your robot's video feed, audio, and even MuJoCo simulations back to Vision Pro via WebRTC. Make sure you upgrade both the Python library and the VisionOS app.
```python
from avp_stream import VisionProStreamer
from time import sleep

avp_ip = "10.31.181.201"   # Vision Pro IP (shown in the app)
s = VisionProStreamer(ip=avp_ip)

# Configure and start video streaming
s.configure_video(device="/dev/video0", format="v4l2", size="640x480", fps=30)
s.start_webrtc()   # Start streaming to Vision Pro

while True:
    r = s.latest
    print(r['head'], r['right_wrist'], r['right_fingers'])
    sleep(0.01)   # avoid blocking the loop and let frames be sent
```

Example 1: Camera with custom processing:
```python
s = VisionProStreamer(ip=avp_ip)
s.register_frame_callback(my_image_processor)   # Your processing function
s.configure_video(device="/dev/video0", format="v4l2", size="640x480", fps=30)
s.start_webrtc()
```

Example 2: Stereo camera (side-by-side):
```python
s = VisionProStreamer(ip=avp_ip)
s.configure_video(device="/dev/video0", format="v4l2", size="1920x1080", fps=30, stereo=True)
s.start_webrtc()
```

Example 3: Synthetic video:
```python
s = VisionProStreamer(ip=avp_ip)
s.register_frame_callback(render_simulation_frame)   # Generate frames programmatically
s.configure_video(size="1280x720", fps=60)   # No device = synthetic mode
s.start_webrtc()
```
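The callbacks named in these examples (`my_image_processor`, `render_simulation_frame`, and the visualizer/audio callbacks below) are user-supplied functions; see the examples folder for the exact signatures `register_frame_callback` expects. As a rough sketch of the synthetic-video case, assuming the callback only needs to produce an RGB frame of the configured size:

```python
import numpy as np

def render_simulation_frame(*_args):
    """Hypothetical synthetic-frame generator: returns a 720p RGB image.
    The real callback signature may differ; check the examples folder."""
    frame = np.zeros((720, 1280, 3), dtype=np.uint8)
    frame[:, :, 1] = 128   # flat green placeholder instead of a real render
    return frame
```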
Example 4: Audio streaming:

```python
s = VisionProStreamer(ip=avp_ip)
s.register_frame_callback(visualizer)
s.register_audio_callback(audio_generator)   # Generate or process audio
s.configure_video(size="1280x720", fps=60)
s.configure_audio(sample_rate=48000, stereo=True)
s.start_webrtc()
```

Stream your MuJoCo physics simulations directly into AR! The simulation scene is converted to USD and rendered natively on Vision Pro using RealityKit, with real-time pose updates streamed via WebRTC.
```python
import mujoco

model = mujoco.MjModel.from_xml_path("robot.xml")
data = mujoco.MjData(model)

s = VisionProStreamer(ip=avp_ip)
s.configure_sim("robot.xml", model, data, relative_to=[0, 1, 0.5, -90])
s.start_webrtc()

while True:
    mujoco.mj_step(model, data)
    s.update_sim()   # Stream updated poses to Vision Pro
```

Since AR blends your simulation with the real world, you need to decide where the simulation's world frame should be placed in your physical space. Use the `relative_to` parameter:
- 4-dim: `[x, y, z, yaw°]` — translation + rotation around the z-axis (degrees)
- 7-dim: `[x, y, z, qw, qx, qy, qz]` — translation + full quaternion orientation
```python
# Place world frame 0.8m above ground, rotated 90° around z-axis
s.configure_sim("robot.xml", model, data, relative_to=[0, 0, 0.8, 90])
```

Default Behavior: VisionOS automatically detects the physical ground and places the origin there (below your feet if standing, below your chair if sitting).
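If you need the 7-dim form, a yaw-only placement can be converted into an explicit quaternion. A minimal sketch (the `yaw_placement` helper is defined here for illustration and is not part of `avp_stream`):

```python
import math

def yaw_placement(x, y, z, yaw_deg):
    """Convert an [x, y, z, yaw°] placement into the equivalent
    7-dim [x, y, z, qw, qx, qy, qz] form (rotation about the z-axis only)."""
    half = math.radians(yaw_deg) / 2.0
    return [x, y, z, math.cos(half), 0.0, 0.0, math.sin(half)]

# Same placement as above: 0.8 m above ground, rotated 90° around z
s.configure_sim("robot.xml", model, data, relative_to=yaw_placement(0, 0, 0.8, 90))
```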
| Examples from MuJoCo Menagerie | Unitree G1 XML | Google Robot XML | ALOHA 2 XML |
|---|---|---|---|
| Visualization of world frame | World frame is attached to the "ground". | World frame is attached to the "ground". | World frame is attached to the "table". |
| Recommended `attach_to` | Default setting | Default setting | Offset in z-axis to bring the table surface up to a reasonable height in your real world. |
When using MuJoCo simulation streaming, you often want hand tracking data in the simulation's coordinate frame (not Vision Pro's native frame). By default, calling `configure_sim()` automatically sets `origin="sim"`, so hand positions are relative to your scene's world frame.
```python
s = VisionProStreamer(ip=avp_ip)
s.configure_sim("robot.xml", model, data, relative_to=[0, 0, 0.8, 90])
# origin is now "sim" — hand tracking is in simulation coordinates

# You can also switch manually:
s.set_origin("avp")   # Vision Pro's native coordinate frame
s.set_origin("sim")   # Simulation's coordinate frame (relative to attach_to)
```

| Origin | Hand Tracking Frame | Use Case |
|---|---|---|
| `"avp"` | Vision Pro ground frame | Default, general hand tracking |
| `"sim"` | Simulation world frame | Teleoperation, robot control in sim |
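Because the poses arrive already expressed in the simulation's world frame, they can feed a controller directly. A sketch of a teleoperation loop, continuing from the MuJoCo example above; `ik_and_gripper()` is a hypothetical stand-in for your own IK / control routine, and the 2 cm pinch threshold is an arbitrary choice:

```python
import mujoco
from avp_stream import VisionProStreamer

s = VisionProStreamer(ip=avp_ip)
s.configure_sim("scene.xml", model, data, relative_to=[0, 0, 0.8, 90])
s.start_webrtc()

while True:
    r = s.latest
    wrist = r['right_wrist'][0]              # (4, 4) pose in the simulation world frame
    target_pos = wrist[:3, 3]                # desired end-effector position
    target_rot = wrist[:3, :3]               # desired end-effector orientation
    close_gripper = r['right_pinch_distance'] < 0.02   # ~2 cm pinch threshold

    # ik_and_gripper() is a placeholder for your own IK / control stack
    data.ctrl[:] = ik_and_gripper(model, data, target_pos, target_rot, close_gripper)

    mujoco.mj_step(model, data)
    s.update_sim()
```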
Video Configuration (`configure_video`):

| Parameter | Description | Example |
|---|---|---|
| `device` | Camera device. `None` for synthetic frames | `"/dev/video0"`, `"0:none"` (macOS), `None` |
| `format` | Video format | `"v4l2"` (Linux), `"avfoundation"` (macOS) |
| `size` | Resolution as `"WxH"` | `"640x480"`, `"1280x720"`, `"1920x1080"` |
| `fps` | Frame rate | `30`, `60` |
| `stereo` | Side-by-side stereo video | `True`, `False` |
Audio Configuration (`configure_audio`):

| Parameter | Description | Example |
|---|---|---|
| `device` | Audio device. `None` for synthetic audio | `":0"` (default mic on macOS), `None` |
| `sample_rate` | Sample rate in Hz | `48000` |
| `stereo` | Stereo or mono audio | `True`, `False` |
Simulation Configuration (`configure_sim`):

| Parameter | Description | Example |
|---|---|---|
| `xml_path` | Path to MuJoCo XML scene | `"scene.xml"` |
| `model` | MuJoCo model object | `mujoco.MjModel` |
| `data` | MuJoCo data object | `mujoco.MjData` |
| `relative_to` | Scene placement `[x, y, z, yaw°]` or `[x, y, z, qw, qx, qy, qz]` | `[0, 1, 0.5, -90]` |
Note: Finding the right combination of `device`, `format`, `size`, and `fps` can be tricky since cameras only support certain combinations. Use this script to find valid configurations:

```bash
python test_video_devices.py --live \
    --device /dev/video0 --format v4l2 --size 640x480 --fps 30
```

`r = s.latest`

`r` is a dictionary containing the following data streamed from AVP:
```python
r['head']: np.ndarray              # shape (1,4,4) / measured from ground frame
r['right_wrist']: np.ndarray       # shape (1,4,4) / measured from ground frame
r['left_wrist']: np.ndarray        # shape (1,4,4) / measured from ground frame
r['right_fingers']: np.ndarray     # shape (25,4,4) / measured from right wrist frame
r['left_fingers']: np.ndarray      # shape (25,4,4) / measured from left wrist frame
r['right_pinch_distance']: float   # distance between right index tip and thumb tip
r['left_pinch_distance']: float    # distance between left index tip and thumb tip
r['right_wrist_roll']: float       # rotation angle of your right wrist around your arm axis
r['left_wrist_roll']: float        # rotation angle of your left wrist around your arm axis
```

Refer to the image below to see how the axes are defined for your head, wrist, and fingers.
Refer to the image above to see the order in which the joints are represented in each hand's skeleton.
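Since finger poses are reported relative to the wrist, and the wrist relative to the ground, composing the two homogeneous transforms gives finger poses in the ground frame. A small sketch (the joint index is a placeholder; look up the actual ordering in the skeleton image, and the 8 cm pinch span is an arbitrary choice):

```python
import numpy as np

r = s.latest

right_wrist = r['right_wrist'][0]                           # (4, 4), ground frame
right_fingers_in_ground = right_wrist @ r['right_fingers']  # (25, 4, 4) via broadcasting

JOINT = 0   # placeholder index; see the skeleton image for the actual joint order
joint_position = right_fingers_in_ground[JOINT, :3, 3]      # xyz of that joint in the ground frame

# Pinch distance makes a simple continuous gripper signal
grip = float(np.clip(1.0 - r['right_pinch_distance'] / 0.08, 0.0, 1.0))
```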
We performed comprehensive round-trip latency measurements to benchmark our video streaming system. The measurement captures the full cycle (sketched below):
- Python encodes a timestamp into a video frame as a marker
- WebRTC transmission happens over the network
- Vision Pro decodes the image and reads the marker ID
- Vision Pro sends the marker ID back to Python
- Python calculates the latency
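On the Python side, the bookkeeping boils down to remembering when each marker was sent and matching it against the echo. A schematic sketch; `stamp_frame` and `on_marker_echo` are hypothetical hooks standing in for how the real benchmark encodes markers and receives the echoed IDs (see the Benchmark Documentation for the actual implementation):

```python
import time

send_times = {}   # marker ID -> monotonic time when the frame carrying it was sent
latencies = []

def stamp_frame(frame, marker_id):
    """Encode marker_id into the frame (e.g. as a fiducial) and record the send time."""
    send_times[marker_id] = time.monotonic()
    # ... draw/encode marker_id into the pixels here ...
    return frame

def on_marker_echo(marker_id):
    """Called when Vision Pro echoes a decoded marker ID back to Python."""
    rtt = time.monotonic() - send_times.pop(marker_id)
    latencies.append(rtt)
    print(f"round-trip latency: {rtt * 1000:.1f} ms")
```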
This provides a conservative upper bound on user-experienced latency. In our own testing, the system consistently stays under 100 ms in both wired and wireless modes for resolutions under 720p. When wired (which requires the Developer Strap), you can get a stable 50 ms latency even for stereo 4K streaming.
For detailed methodology, test configurations, and complete results, see the Benchmark Documentation.
If you use this repository in your work, consider citing:
```bibtex
@software{park2024avp,
  title  = {Using Apple Vision Pro to Train and Control Robots},
  author = {Park, Younghyo and Agrawal, Pulkit},
  year   = {2024},
  url    = {https://github.com/Improbable-AI/VisionProTeleop},
}
```
We acknowledge support from Hyundai Motor Company and ARO MURI grant number W911NF-23-1-0277.







