VisionProTeleop

🎉 V2 UPDATE: Low-Latency Video/Audio/Simulation Streaming! Now stream video, audio, and MuJoCo simulations back to Vision Pro via WebRTC, alongside the original hand tracking stream, from any machine on the network. Update the app, run pip install --upgrade avp_stream, and you're ready!


This visionOS app and Python library let anyone get head + wrist + hand tracking from Vision Pro, and stream stereo video/audio feeds from a robot back to Vision Pro.

For a more detailed explanation, check out this short paper.

Table of Contents

Key Features

  1. Bilateral Data Streams:
    • AVP → Python: wrist, finger, and head tracking
    • Python → AVP: low-latency streaming of video, audio, and MuJoCo environments presented in AR with RealityKit
  2. Direct UVC Camera Connection: The app supports directly connecting a UVC camera to Vision Pro through its Developer Strap, which is useful for:
    • an extremely low-latency video streaming experience
    • recording human manipulation videos with egocentric RGB video and accurate hand/head tracking data, bypassing the limited access to RGB camera feeds on AVP
  3. Recording to iCloud Drive: The app can record the incoming/outgoing data streams (video + hand tracking) to your personal iCloud Drive for easy data syncing and sharing.
  4. Wired Network Connection: The app also supports a wired network connection, via the Developer Strap, to the computer running our Python client for low-latency communication.

Examples

The best way to learn about this app is to go through the examples folder. It contains 13 examples covering hand tracking, video/audio streaming, and a sample teleoperation script that uses the AR simulation streaming feature.

Getting Started

Installation

Step 1. Install the app on Vision Pro

This app is officially on the visionOS App Store! You can search for Tracking Streamer on the App Store and install the app.

If you want to play around with the app, you can also build and install it yourself. To learn how, take a look at this documentation. This requires (a) an Apple Developer Account, (b) a Vision Pro Developer Strap, and (c) a Mac with Xcode installed.

After installation, open the app on Vision Pro. It should show something like this. Click START with Video Streaming if you want to stream video from your robot back over WebRTC. Otherwise, click the left button. That's it!

Step 2. Install Python Library

The following Python package lets you receive the data stream on any device connected to the same WiFi network. Install the package:

pip install --upgrade avp_stream 

Basic Usage

Add this code snippet to any of your projects:

from avp_stream import VisionProStreamer
avp_ip = "10.31.181.201"   # example IP 
s = VisionProStreamer(ip = avp_ip)

while True:
    r = s.latest
    print(r['head'], r['right_wrist'], r['right_fingers'])
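
Each tracking entry is a stack of 4x4 homogeneous transforms (see the API Reference below). Here is a minimal sketch, assuming numpy, of pulling positions and rotations out of the latest sample:

import numpy as np

r = s.latest
head = r['head'][0]             # (4, 4) homogeneous transform
head_pos = head[:3, 3]          # translation (x, y, z)
head_rot = head[:3, :3]         # 3x3 rotation matrix

# right_fingers stacks 25 joint transforms, expressed in the right wrist frame
finger = r['right_fingers'][0]  # joint index 0 is illustrative; see Hand Skeleton
finger_pos = finger[:3, 3]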

Streaming Guide

Stream your robot's video feed, audio, and even MuJoCo simulations back to Vision Pro via WebRTC. Make sure you upgrade both the Python library and the visionOS app.

Video Streaming

from avp_stream import VisionProStreamer
from time import sleep
avp_ip = "10.31.181.201"   # Vision Pro IP (shown in the app)
s = VisionProStreamer(ip = avp_ip)

# Configure and start video streaming
s.configure_video(device="/dev/video0", format="v4l2", size="640x480", fps=30)
s.start_webrtc()  # Start streaming to Vision Pro

while True:
    r = s.latest
    print(r['head'], r['right_wrist'], r['right_fingers'])
    sleep(0.01)  # avoid blocking the loop and let frames be sent

Example 1: Camera with custom processing:

s = VisionProStreamer(ip=avp_ip)
s.register_frame_callback(my_image_processor)  # Your processing function
s.configure_video(device="/dev/video0", format="v4l2", size="640x480", fps=30)
s.start_webrtc()
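
What my_image_processor looks like depends on the callback contract defined by the library; as a hedged sketch, assuming the callback receives and returns a numpy image array (the examples folder has the exact signature):

import cv2
import numpy as np

def my_image_processor(frame: np.ndarray) -> np.ndarray:
    # Hypothetical callback: annotate the frame before it is streamed
    annotated = frame.copy()
    cv2.putText(annotated, "robot cam", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    return annotated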

Example 2: Stereo camera (side-by-side):

s = VisionProStreamer(ip=avp_ip)
s.configure_video(device="/dev/video0", format="v4l2", size="1920x1080", fps=30, stereo=True)
s.start_webrtc()

Example 3: Synthetic video:

s = VisionProStreamer(ip=avp_ip)
s.register_frame_callback(render_simulation_frame)  # Generate frames programmatically
s.configure_video(size="1280x720", fps=60)  # No device = synthetic mode
s.start_webrtc()
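
In synthetic mode the registered callback produces every frame. One plausible shape for render_simulation_frame, assuming it receives a numpy buffer to fill and returns it (check the examples folder for the exact contract):

import numpy as np

def render_simulation_frame(frame: np.ndarray) -> np.ndarray:
    # Hypothetical generator: scroll a horizontal gradient across the frame
    w = frame.shape[1]
    t = getattr(render_simulation_frame, "t", 0) + 1
    render_simulation_frame.t = t
    gradient = np.linspace(0, 255, w, dtype=np.uint8)
    frame[:] = np.roll(gradient, t)[None, :, None]  # broadcast over rows/channels
    return frame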

Audio Streaming

s = VisionProStreamer(ip=avp_ip)
s.register_frame_callback(visualizer)
s.register_audio_callback(audio_generator)  # Generate or process audio
s.configure_video(size="1280x720", fps=60)
s.configure_audio(sample_rate=48000, stereo=True)
s.start_webrtc()
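
The audio_generator contract is likewise defined by the library; the sketch below assumes it is asked for a block of stereo int16 PCM samples and returns a (num_frames, 2) array (the examples folder has the real signature):

import numpy as np

SAMPLE_RATE = 48000
_phase = 0

def audio_generator(num_frames: int) -> np.ndarray:
    # Hypothetical: synthesize a 440 Hz sine tone as stereo int16 PCM
    global _phase
    t = (np.arange(num_frames) + _phase) / SAMPLE_RATE
    _phase += num_frames
    tone = (0.2 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
    return np.stack([tone, tone], axis=1)  # shape (num_frames, 2)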

MuJoCo Simulation Streaming

Stream your MuJoCo physics simulations directly into AR! The simulation scene is converted to USD and rendered natively on Vision Pro using RealityKit, with real-time pose updates streamed via WebRTC.

(Demo video: arviewer-main-muted.mp4)

import mujoco
model = mujoco.MjModel.from_xml_path("robot.xml")
data = mujoco.MjData(model)

s = VisionProStreamer(ip=avp_ip)
s.configure_sim("robot.xml", model, data, relative_to=[0, 1, 0.5, -90])
s.start_webrtc()

while True:
    mujoco.mj_step(model, data)
    s.update_sim()  # Stream updated poses to Vision Pro
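
The loop above steps as fast as the CPU allows, which can make the AR scene play back faster than real time. A simple pacing sketch using MuJoCo's physics step size (model.opt.timestep):

import time

while True:
    step_start = time.time()
    mujoco.mj_step(model, data)
    s.update_sim()
    # Sleep off whatever is left of this physics step to stay real-time
    leftover = model.opt.timestep - (time.time() - step_start)
    if leftover > 0:
        time.sleep(leftover)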

Positioning Your Simulation in AR (relative_to)

Since AR blends your simulation with the real world, you need to decide where the simulation's world frame should be placed in your physical space. Use the relative_to parameter:

  • 4-dim: [x, y, z, yaw°] — translation + rotation around z-axis (degrees)
  • 7-dim: [x, y, z, qw, qx, qy, qz] — full quaternion orientation

# Place the world frame 0.8 m above the ground, rotated 90° around the z-axis
s.configure_sim("robot.xml", model, data, relative_to=[0, 0, 0.8, 90])

Default Behavior: VisionOS automatically detects the physical ground and places the origin there (below your feet if standing, below your chair if sitting).

Examples from MuJoCo Menagerie:

| | Unitree G1 (XML) | Google Robot (XML) | ALOHA 2 (XML) |
| --- | --- | --- | --- |
| World frame visualization | world frame attached to the ground | world frame attached to the ground | world frame attached to a table |
| Recommended relative_to | default setting | default setting | offset in the z-axis to bring the table surface up to a reasonable real-world height |

Hand Tracking Coordinate Frame (origin)

When using MuJoCo simulation streaming, you often want hand tracking data in the simulation's coordinate frame (not Vision Pro's native frame). By default, calling configure_sim() automatically sets origin="sim", so hand positions are relative to your scene's world frame.

s = VisionProStreamer(ip=avp_ip)
s.configure_sim("robot.xml", model, data, relative_to=[0, 0, 0.8, 90])
# origin is now "sim" — hand tracking is in simulation coordinates

# You can also switch manually:
s.set_origin("avp")  # Vision Pro's native coordinate frame
s.set_origin("sim")  # Simulation's coordinate frame (relative to attach_to)

| Origin | Hand Tracking Frame | Use Case |
| --- | --- | --- |
| "avp" | Vision Pro ground frame | Default; general hand tracking |
| "sim" | Simulation world frame | Teleoperation, robot control in sim |

Configuration Reference

Video Configuration (configure_video):

| Parameter | Description | Example |
| --- | --- | --- |
| device | Camera device; None for synthetic frames | "/dev/video0", "0:none" (macOS), None |
| format | Video format | "v4l2" (Linux), "avfoundation" (macOS) |
| size | Resolution as "WxH" | "640x480", "1280x720", "1920x1080" |
| fps | Frame rate | 30, 60 |
| stereo | Side-by-side stereo video | True, False |

Audio Configuration (configure_audio):

| Parameter | Description | Example |
| --- | --- | --- |
| device | Audio device; None for synthetic audio | ":0" (default mic on macOS), None |
| sample_rate | Sample rate in Hz | 48000 |
| stereo | Stereo or mono audio | True, False |

Simulation Configuration (configure_sim):

| Parameter | Description | Example |
| --- | --- | --- |
| xml_path | Path to the MuJoCo XML scene | "scene.xml" |
| model | MuJoCo model object | mujoco.MjModel |
| data | MuJoCo data object | mujoco.MjData |
| relative_to | Scene placement: [x, y, z, yaw°] or [x, y, z, qw, qx, qy, qz] | [0, 1, 0.5, -90] |

Note: Finding the right combination of device, format, size, and fps can be tricky since cameras only support certain combinations. Use this script to find valid configurations:

python test_video_devices.py --live \
  --device /dev/video0 --format v4l2 --size 640x480 --fps 30

API Reference

Available Data

r = s.latest

r is a dictionary containing the following data streamed from AVP:

r['head']: np.ndarray  
  # shape (1,4,4) / measured from ground frame
r['right_wrist']: np.ndarray 
  # shape (1,4,4) / measured from ground frame
r['left_wrist']: np.ndarray 
  # shape (1,4,4) / measured from ground frame
r['right_fingers']: np.ndarray 
  # shape (25,4,4) / measured from right wrist frame 
r['left_fingers']: np.ndarray 
  # shape (25,4,4) / measured from left wrist frame 
r['right_pinch_distance']: float  
  # distance between right index tip and thumb tip 
r['left_pinch_distance']: float  
  # distance between left index tip and thumb tip 
r['right_wrist_roll']: float
  # rotation angle of your right wrist around your arm axis
r['left_wrist_roll']: float
  # rotation angle of your left wrist around your arm axis
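
The pinch distances make a simple gripper trigger. A quick sketch, assuming the distance is reported in meters (close_gripper/open_gripper are placeholders for your robot's API):

PINCH_THRESHOLD = 0.02   # assumed meters; tune to your hand

while True:
    r = s.latest
    if r['right_pinch_distance'] < PINCH_THRESHOLD:
        close_gripper()   # placeholder: your robot's close command
    else:
        open_gripper()    # placeholder: your robot's open command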

Axis Convention

Refer to the image below to see how the axes are defined for your head, wrist, and fingers.

Hand Skeleton

Refer to the image above to see the order in which the joints are represented in each hand's skeleton.

Performance

Latency Benchmark Results

We performed comprehensive round-trip latency measurements to benchmark our video streaming system. The measurement captures the full cycle:

  1. Python encodes a timestamp into a video frame as a marker.
  2. The frame is transmitted to Vision Pro over the network via WebRTC.
  3. Vision Pro decodes the frame and reads the marker ID.
  4. Vision Pro sends the marker ID back to Python.
  5. Python computes the round-trip latency.

This provides a conservative upper bound on user-experienced latency. In our own testing, the system consistently stays under 100 ms, in both wired and wireless mode, at resolutions below 720p. When wired (requires the Developer Strap), you can get a stable 50 ms latency even for stereo 4K streaming.

For detailed methodology, test configurations, and complete results, see the Benchmark Documentation.

Appendix

Citation

If you use this repository in your work, consider citing:

@software{park2024avp,
    title={Using Apple Vision Pro to Train and Control Robots},
    author={Park, Younghyo and Agrawal, Pulkit},
    year={2024},
    url = {https://github.com/Improbable-AI/VisionProTeleop},
}

Acknowledgements

We acknowledge support from Hyundai Motor Company and ARO MURI grant number W911NF-23-1-0277.
