OpenSim biomechanics extension of Fast SAM 3D Body
Takes the Fast-SAM-3D-Body inference pipeline and exports every frame of a video directly to OpenSim-ready files: TRC marker trajectories, MOT joint angles, and animated GLB files for Blender / three.js.
| Feature | Fast-SAM-3D-Body | This repo |
|---|---|---|
| 3D body mesh inference | ✓ | ✓ |
| Annotated video output | ✓ | ✓ |
| OpenSim TRC marker file | — | ✓ |
| OpenSim MOT joint angle file | — | ✓ |
| Animated skeleton GLB (Blender / three.js) | — | ✓ |
| Animated full-body mesh GLB | — | ✓ (opt-in) |
| `--hands` flag to toggle hand tracking | — | ✓ |
| Full setup guides for Linux + Windows | — | ✓ |
| Per-GPU expected FPS table | — | ✓ |
| TRT engine build instructions | partial | ✓ |
For each input video, the pipeline writes:
```
output_opensim/
├── <video_name>_skeleton.mp4   — annotated video with 2D skeleton overlay
├── <video_name>.trc            — OpenSim marker file (24 body landmarks, metres, Y-up)
├── <video_name>.mot            — OpenSim motion file (15 joint angles, degrees)
├── <video_name>_skeleton.glb   — animated skeleton (~340 KB for 1136 frames)
└── <video_name>_mesh.glb       — animated full-body mesh (opt-in, --mesh_glb)
```
TRC markers (24): nose · l/r shoulder · l/r elbow · l/r wrist · l/r hip · l/r knee · l/r ankle · l/r big toe · l/r small toe · l/r heel · l/r olecranon · l/r acromion · neck
MOT columns (15): pelvis translation (tx/ty/tz) · l/r hip flexion · l/r hip adduction · l/r knee flexion · l/r ankle dorsiflexion · l/r elbow flexion · trunk flexion · trunk lateral lean
The MOT file uses geometric estimation from 3D landmark positions — no SMPL model needed. For production biomechanics, run OpenSim Inverse Kinematics on the TRC file instead; that output will be more accurate.
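To illustrate the geometric approach, here is a minimal sketch of estimating a flexion angle from three 3D landmarks. The helper and the sample coordinates are hypothetical, not the repo's actual implementation:

```python
import numpy as np

def flexion_angle(proximal, joint, distal):
    """Angle (degrees) between the two segments meeting at `joint`,
    reported as deviation from a straight (180 deg) limb."""
    u = proximal - joint
    v = distal - joint
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 180.0 - np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# e.g. right knee flexion from hip/knee/ankle markers (illustrative coordinates)
r_hip = np.array([0.10, 0.9, 2.00])
r_knee = np.array([0.12, 0.5, 2.00])
r_ankle = np.array([0.11, 0.1, 2.05])
print(f"right knee flexion: {flexion_angle(r_hip, r_knee, r_ankle):.1f} deg")
```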
Expected throughput:

| Mode | FPS |
|---|---|
| Body only — `--inference_type body` (default) | 14.7 |
| Body + hands — `--hands` | 5.2 |
See COMPROMISES.md for a full breakdown of every trade-off made to reach these numbers.
Setup guides:

- Linux: SETUP.md
- Windows: WINDOWS_SETUP.md
```bash
conda activate fast_sam_3d_body

SKIP_KEYPOINT_PROMPT=1 FOV_TRT=1 FOV_FAST=1 FOV_MODEL=s FOV_LEVEL=0 \
USE_TRT_BACKBONE=1 USE_COMPILE=1 DECODER_COMPILE=1 COMPILE_MODE=reduce-overhead \
MHR_NO_CORRECTIVES=1 GPU_HAND_PREP=1 BODY_INTERM_PRED_LAYERS=0,2 \
DEBUG_NAN=0 PARALLEL_DECODERS=0 COMPILE_WARMUP_BATCH_SIZES=1 \
python demo_video_opensim.py \
    --video_path ./videos/my_video.mp4 \
    --detector_model checkpoints/yolo/yolo11m-pose.engine \
    --inference_type body \
    --fx 1371
```

Replace `--fx 1371` with your camera focal length in pixels (see HOW_TO_RUN.md for how to compute it). If unknown, omit `--fx` and the pipeline will estimate it from the image.
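If you know the camera's horizontal field of view instead, the standard pinhole relation gives fx directly; a quick sketch (the 1920 px / 70° example is hypothetical):

```python
import math

def fx_from_hfov(image_width_px: float, hfov_deg: float) -> float:
    """Pinhole camera relation: fx = (W / 2) / tan(hFOV / 2)."""
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

# e.g. a 1920-px-wide video shot with a ~70 degree horizontal FOV
print(round(fx_from_hfov(1920, 70.0)))  # -> 1371, the example value above
```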
To include hands (5.2 fps instead of 14.7 fps):

```bash
# add --hands to the command above
python demo_video_opensim.py ... --hands
```

In the OpenSim GUI:

- Scale: Tools → Scale → load your `.osim` model using a static standing TRC
- Inverse Kinematics: Tools → Inverse Kinematics → load the TRC → outputs a solved MOT

In Blender: File → Import → glTF 2.0 → select `_skeleton.glb` or `_mesh.glb`.
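The IK step can also be scripted; a minimal sketch, assuming the OpenSim 4.x Python bindings and a pre-authored IK setup XML (the file name below is hypothetical):

```python
import opensim as osim

# Run Inverse Kinematics from a setup XML that points at the scaled
# .osim model, the exported .trc file, and the output .mot path.
# "ik_setup.xml" is a hypothetical file name.
ik_tool = osim.InverseKinematicsTool("ik_setup.xml")
ik_tool.run()
```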
| Flag | Default | Description |
|---|---|---|
| `--video_path` | — | Input video |
| `--fx` | auto (MoGe) | Camera focal length in pixels |
| `--inference_type` | `body` | `body` = 14.7 fps · `full` (adds hands) = 5.2 fps |
| `--hands` | off | Shorthand for `--inference_type full` |
| `--mesh_glb` | off | Also write an animated full-body mesh GLB |
| `--target_fps` | 0 | Downsample input to this FPS (0 = every frame) |
| `--max_frames` | 0 | Stop after N frames (0 = full video) |
| `--output_dir` | `./output_opensim` | Output directory |
All 3D outputs use the OpenSim Y-up convention:

- X = camera X (lateral, rightward)
- Y = −camera Y (vertical, upward)
- Z = camera Z (depth, forward into the scene)
- Units: metres
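A minimal sketch of that axis mapping, assuming an (N, 3) array of camera-frame points:

```python
import numpy as np

def camera_to_opensim(points_cam: np.ndarray) -> np.ndarray:
    """Keep camera X and Z, negate camera Y to get OpenSim Y-up (metres)."""
    out = points_cam.copy()
    out[:, 1] *= -1.0  # Y = -camera Y (vertical, upward)
    return out
```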
run_publisher.py streams pose to OpenSim live via ZMQ at 50 Hz using the mhr2smpl pipeline. It requires two additional data files that are not included in this repo:

- `mhr2smpl/data/SMPL_NEUTRAL.pkl` — from https://smpl-x.is.tue.mpg.de/ (free academic registration)
- `mhr2smpl/data/mhr2smpl_mapping.npz` — from the MHR repo, at `tools/mhr_smpl_conversion/assets/`

Without these files, the offline export pipeline still works fully.
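For a quick connectivity check, here is a minimal ZMQ subscriber sketch; the endpoint and payload layout are assumptions for illustration, since run_publisher.py defines the actual wire format:

```python
import zmq

# Minimal subscriber sketch. The endpoint and message layout here are
# assumptions; see run_publisher.py for the actual 50 Hz stream format.
ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://localhost:5555")       # hypothetical endpoint
sock.setsockopt_string(zmq.SUBSCRIBE, "")  # receive all messages
while True:
    msg = sock.recv()                      # one pose packet per frame
    print(f"received {len(msg)} bytes")
```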
Pipeline architecture:

```
Input video
  │
  ▼
YOLO v11 pose detector
  │  bounding boxes + 2D keypoints
  ▼
MoGe depth / FOV estimator (TRT, model=s, level=0)
  │  camera intrinsics + depth conditioning
  ▼
DINOv3-ViT/H backbone (TRT, 512×512, FP16)
  │  image tokens [B, 1280, 32, 32]
  ▼
MHR body decoder (torch.compile, 2 intermediate layers)
  │  pred_keypoints_3d [70, 3]
  │  pred_cam_t [3]
  │  pred_vertices [18439, 3]
  ▼
OpenSim exporter (sam_3d_body/export/opensim_exporter.py)
  ├── write_trc()          → .trc
  ├── write_mot()          → .mot
  ├── write_skeleton_glb() → _skeleton.glb  (skeletal skinning, O(N × 24))
  └── write_mesh_glb()     → _mesh.glb      (morph targets, opt-in)
```
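A hypothetical usage sketch of the exporter: the function names come from the diagram above, but the argument lists are assumptions, so check `opensim_exporter.py` for the real signatures:

```python
import numpy as np
from sam_3d_body.export import opensim_exporter as exporter

# frames x markers x xyz, metres, OpenSim Y-up (shapes per the pipeline above)
keypoints_3d = np.zeros((1136, 24, 3))

# Argument lists below are assumptions, not the repo's verified API.
exporter.write_trc("output_opensim/my_video.trc", keypoints_3d, fps=30.0)
exporter.write_mot("output_opensim/my_video.mot", keypoints_3d, fps=30.0)
exporter.write_skeleton_glb("output_opensim/my_video_skeleton.glb", keypoints_3d, fps=30.0)
```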
| File | Contents |
|---|---|
| HOW_TO_RUN.md | Run commands, all flags, focal length calculation, OpenSim workflow |
| SETUP.md | Linux install, all GPU variants (5090/5070Ti/5070/4090/A3000…), TRT build |
| WINDOWS_SETUP.md | Windows install, cmd/PowerShell commands, Windows-specific issues |
| SETTINGS.md | Every environment variable, TRT engine specs, benchmark table |
| COMPROMISES.md | Every accuracy trade-off made to reach 14.7 fps, with measured numbers |
If you use this OpenSim extension, please also cite the upstream Fast SAM 3D Body paper:

```bibtex
@article{yang2026fastsam3dbody,
  title={Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery},
  author={Yang, Timing and He, Sicheng and Jing, Hongyi and Yang, Jiawei and Liu, Zhijian and Zou, Chuhang and Wang, Yue},
  journal={arXiv preprint arXiv:2603.15603},
  year={2026}
}
```

The remainder of this README reproduces the upstream Fast SAM 3D Body project page for reference.

Timing Yang¹, Sicheng He¹, Hongyi Jing¹, Jiawei Yang¹, Zhijian Liu²·³, Chuhang Zou⁴†, Yue Wang¹·³†

¹USC Physical Superintelligence (PSI) Lab · ²University of California, San Diego · ³NVIDIA · ⁴Meta Reality Labs
† Joint corresponding authors
Speed-accuracy overview of Fast SAM 3D Body. Top left: Qualitative results on in-the-wild images show our framework preserves high-fidelity reconstruction. Top right: Our method achieves up to a 10.25x end-to-end speedup over SAM 3D Body and replaces the iterative MHR-to-SMPL bottleneck with a 10,000x faster neural mapping. Bottom: Our system enables real-time humanoid robot control from a single RGB stream at ~65 ms per frame on an NVIDIA RTX 5090.
SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.
Qualitative comparison. The original SAM 3D Body (left) and our Fast variant (right) yield visually comparable mesh reconstructions across diverse poses and multi-person scenes on 3DPW and EMDB.
Please refer to SAM 3D Body for environment setup, or use our setup script:

```bash
bash setup_env.sh
conda activate fast_sam_3d_body
```

Expected checkpoint layout:

```
checkpoints/
├── sam-3d-body-dinov3/         # Auto-downloaded from HuggingFace on first run
│   ├── model.ckpt
│   └── assets/
│       └── mhr_model.pt
├── yolo/                       # Place YOLO-Pose weights here
│   ├── yolo11m-pose.pt
│   └── yolo11m-pose.engine     # Generated by convert_yolo_pose_trt.py (optional)
└── moge_trt/                   # Generated by build_tensorrt.sh (optional)
    └── moge_dinov2_encoder_fp16.engine
```
To run the demo:

```bash
# Optimized (torch.compile + TensorRT)
bash run_demo.sh
```

To build the TensorRT engines:

```bash
# Convert all models (YOLO-Pose + MoGe encoder + DINOv3 backbone)
bash build_tensorrt.sh

# Or convert individually
python convert_yolo_pose_trt.py --model yolo11m-pose.pt --imgsz 640 --half
python convert_moge_encoder_trt.py --all
python convert_backbone_tensorrt.py --all
```

All generated engines are stored under `./checkpoints/`.
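For the YOLO step, ultralytics' built-in exporter is the rough equivalent of `convert_yolo_pose_trt.py`; a sketch, with the repo script remaining authoritative:

```python
from ultralytics import YOLO

# Export YOLO11m-pose to a TensorRT engine with the same imgsz/half
# settings as the command above (writes a .engine file next to the .pt).
YOLO("checkpoints/yolo/yolo11m-pose.pt").export(format="engine", imgsz=640, half=True)
```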
For instructions on running the publisher, see docs/realworld_deployment.md.
We demonstrate a real-time, vision-only teleoperation system for the Unitree G1 humanoid robot using a single RGB camera, operating at ~65 ms end-to-end latency on an NVIDIA RTX 5090.
Humanoid teleoperation. The system tracks diverse whole-body motions including upper-body gestures (a), body rotations (b-e), walking (f), wide stance (g), single-leg standing (h), squatting (i), and kneeling (j).
Humanoid policy rollout. The robot grasps a box on the table with both hands, squats down, and steps to the right. The policy achieves an 80% task success rate with 40 demonstrations collected via our system.
Single-View vs Multi-View. Multi-view fusion resolves depth ambiguities inherent in single-view reconstruction, producing more accurate SMPL body estimates.
```bibtex
@article{yang2026fastsam3dbody,
  title={Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery},
  author={Yang, Timing and He, Sicheng and Jing, Hongyi and Yang, Jiawei and Liu, Zhijian and Zou, Chuhang and Wang, Yue},
  journal={arXiv preprint arXiv:2603.15603},
  year={2026}
}
```

This project builds upon SAM 3D Body (3DB) and Multi-HMR (MHR). We thank the original authors for releasing their models and codebases, which served as the foundation for our acceleration framework.




