OpenSim biomechanics extension of Fast SAM 3D Body
Takes the Fast-SAM-3D-Body inference pipeline and exports every frame of a video directly to OpenSim-ready files: TRC marker trajectories, IK-solved MOT joint angles, body model, animated body mesh GLB, and anatomical bone GLB. Matches the output format of SAM3D-OpenSim.
| Feature | Fast-SAM-3D-Body | This repo |
|---|---|---|
| 3D body mesh inference | ✓ | ✓ |
| Annotated video output | ✓ | ✓ |
| OpenSim TRC marker file (39–79 markers, mm) | — | ✓ |
| OpenSim IK-solved MOT (40 DOF, via OpenSim 4.5) | — | ✓ |
| Pose2Sim Wholebody body model | — | ✓ |
| Animated full-body mesh GLB (morph targets) | — | ✓ (opt-in) |
| Anatomical bone GLB (OpenSim .vtp meshes + IK) | — | ✓ |
| Multi-person tracking (BoT-SORT) | — | ✓ |
| Per-person TRC / GLB + combined scene GLB | — | ✓ |
| Floor lean correction (MoGe depth + spine) | — | ✓ |
| Timestamped output folders | — | ✓ |
| Full setup guides for Linux + Windows + Docker | — | ✓ |
| TRT engine build instructions | partial | ✓ |
Each run creates a timestamped folder output_YYYYMMDD_HHMMSS_<videoname>/, e.g. output_20260320_173750_myvideo/, containing:
- markers_<name>_skeleton.mp4 — annotated video with 2D skeleton overlay
- markers_<name>.trc — OpenSim markers in mm, Y-up (39 body / 79 full mode)
- markers_<name>_ik.mot — IK-solved joint angles, 40 DOF, degrees
- markers_<name>_model.osim — Pose2Sim Wholebody body model
- markers_<name>_mesh.glb — animated full-body mesh + skeleton overlay (skip with --no_mesh_glb)
- markers_<name>_anatomical.glb — OpenSim anatomical bones animated by IK
- _ik_marker_errors.sto — IK marker tracking residuals per frame
- inference_meta.json — video metadata
- video_outputs.json — per-frame raw 3D keypoints
- processing_report.json — pipeline summary: timings, IK/GLB status
With --multi_person, additional per-person files are generated:
- markers_<name>_person01.trc, markers_<name>_person02.trc, … — per-person TRC
- markers_<name>_person01_mesh.glb, markers_<name>_person02_mesh.glb, … — per-person mesh GLB
- markers_<name>_combined.trc — all persons in one TRC (world-space offsets)
- markers_<name>_combined_mesh.glb — all persons in one GLB scene (distinct colors)
Body (30): Nose · LEye · REye · LEar · REar · LShoulder · RShoulder · LElbow · RElbow · LHip · RHip · LKnee · RKnee · LAnkle · RAnkle · LBigToe · LSmallToe · LHeel · RBigToe · RSmallToe · RHeel · RWrist · LWrist · LOlecranon · ROlecranon · LCubitalFossa · RCubitalFossa · LAcromion · RAcromion · Neck
Derived (3): PelvisCenter · Thorax · SpineMid
Spine joints (6): c_spine0 · c_spine1 · c_spine2 · c_spine3 · c_neck · c_head (real MHR 127-joint armature positions, not geometric interpolation)
Hands (40): full finger tracking (20 per hand) — only present with --inference_type full
Total: 39 markers in body mode, 79 markers in full mode.
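The TRC is a plain tab-separated OpenSim marker file, so it can be inspected without OpenSim. A minimal sketch, assuming the conventional TRC layout (three metadata rows, marker names on row 4, X/Y/Z labels on row 5, then data); the file path is illustrative:

```python
import numpy as np

# Sketch: inspect an exported TRC. Assumes the standard OpenSim TRC layout;
# the path and marker name below are illustrative.
trc_path = "output_20260320_173750_myvideo/markers_myvideo.trc"

with open(trc_path) as f:
    lines = f.read().splitlines()

# Row 4 (index 3): Frame#, Time, then one name per marker (blank cells pad X/Y/Z)
marker_names = [m for m in lines[3].split("\t")[2:] if m.strip()]
print(len(marker_names), "markers:", marker_names[:5], "...")

# Data rows: frame number, time, then X/Y/Z per marker (millimetres, Y-up)
rows = [[float(v) for v in line.split("\t") if v.strip()]
        for line in lines[5:] if line.strip()]
data = np.array(rows)
coords = data[:, 2:].reshape(len(rows), -1, 3)       # (frames, markers, XYZ)

heel_y = coords[:, marker_names.index("RHeel"), 1]   # vertical (Y) track of RHeel
print("frames:", coords.shape[0], "| RHeel Y min/max (mm):", heel_y.min(), heel_y.max())
```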
The MOT is IK-solved with OpenSim's InverseKinematicsTool using the Pose2Sim Wholebody model.
Columns: pelvis tx/ty/tz/tilt/list/rotation · l/r hip flexion/adduction/rotation · l/r knee angle · l/r ankle angle · lumbar extension/bending/rotation · arm flexion/adduction/rotation · elbow flexion · pronation/supination · wrist flexion/deviation (both sides).
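To inspect these angles outside the OpenSim GUI, the MOT can be read like any OpenSim storage file. A minimal sketch with pandas, assuming the standard storage layout (free-text header ending in an "endheader" line, then tab-separated columns in degrees); the path is illustrative and column names depend on the Pose2Sim model:

```python
import pandas as pd

# Sketch: load the IK MOT and summarise the knee coordinates.
mot_path = "output_20260320_173750_myvideo/markers_myvideo_ik.mot"

with open(mot_path) as f:
    header_end = next(i for i, line in enumerate(f) if line.strip() == "endheader")

mot = pd.read_csv(mot_path, sep="\t", skiprows=header_end + 1)
print(mot.columns.tolist())                      # time + the 40 coordinate columns

knee_cols = [c for c in mot.columns if "knee" in c.lower()]
print(mot[["time", *knee_cols]].describe())      # quick range check on knee angles
```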
| Mode | Inference FPS | Total time (19.5 s video) |
|---|---|---|
| body — no hands | ~14 fps | ~50 s |
| full — body + hands (IK-ready) | ~5.3 fps | ~115 s |
Total time includes inference, post-processing, OpenSim IK, and GLB export.
See COMPROMISES.md for a breakdown of every trade-off.
- Linux: SETUP.md
- Windows: WINDOWS_SETUP.md
Additional dependency for IK:
# OpenSim 4.5 (IK solver — optional, TRC is always written)
conda create -n opensim python=3.10
conda install -n opensim -c opensim-org opensim
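The pipeline drives OpenSim in this env automatically as a subprocess, and the TRC is always written even without it. If you want to re-run IK by hand on an exported TRC, a minimal sketch with the OpenSim 4.x Python bindings (paths are illustrative, and the subject-height Scale step performed by the pipeline is skipped here):

```python
# Run inside the `opensim` conda env. A minimal sketch, assuming the standard
# OpenSim 4.x Python bindings; not the pipeline's own IK driver.
import opensim as osim

out = "output_20260320_173750_myvideo"
ik = osim.InverseKinematicsTool()
ik.set_model_file(f"{out}/markers_myvideo_model.osim")
ik.set_marker_file(f"{out}/markers_myvideo.trc")
ik.set_output_motion_file(f"{out}/markers_myvideo_rerun_ik.mot")
ik.run()
```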
conda activate fast_sam_3d_body

SKIP_KEYPOINT_PROMPT=1 FOV_TRT=1 FOV_FAST=1 FOV_MODEL=s FOV_LEVEL=0 \
USE_TRT_BACKBONE=1 USE_COMPILE=1 DECODER_COMPILE=1 COMPILE_MODE=reduce-overhead \
MHR_NO_CORRECTIVES=1 GPU_HAND_PREP=1 BODY_INTERM_PRED_LAYERS=0,2 \
DEBUG_NAN=0 PARALLEL_DECODERS=0 COMPILE_WARMUP_BATCH_SIZES=1 \
python demo_video_opensim.py \
--video_path ./videos/my_video.mp4 \
--detector_model checkpoints/yolo/yolo11m-pose.engine \
--inference_type full \
    --fx 1371

Replace --fx 1371 with your camera focal length in pixels (see HOW_TO_RUN.md for how to compute it). If unknown, omit --fx and the pipeline will estimate it from the image.
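For a quick estimate of --fx from camera specs, the pinhole relations below are enough. A minimal sketch; the FOV and 35 mm-equivalent numbers are illustrative (e.g. a 1920 px wide video from a ~70° HFOV camera gives fx ≈ 1371, the value in the example above):

```python
import math

# Sketch: estimate the focal length in pixels (--fx) from camera specs.
width_px = 1920

# From a horizontal field of view (illustrative value):
hfov_deg = 70.0
fx_from_fov = (width_px / 2) / math.tan(math.radians(hfov_deg) / 2)

# Or from a 35 mm-equivalent focal length (36 mm sensor width in that convention):
focal_35mm = 26.0
fx_from_35mm = width_px * focal_35mm / 36.0

print(f"fx ~ {fx_from_fov:.0f} px (from FOV), {fx_from_35mm:.0f} px (from 35 mm equiv.)")
```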
The IK MOT is written automatically. Load it directly in the OpenSim GUI without re-running IK:
- File → Open Model → select markers_<name>_model.osim
- File → Load Motion → select markers_<name>_ik.mot
- File → Open Motion Capture Data → select markers_<name>.trc to inspect markers

In Blender: File → Import → glTF 2.0 → select markers_<name>_mesh.glb (body mesh + skeleton) or markers_<name>_anatomical.glb (OpenSim bone meshes)
| Flag | Default | Description |
|---|---|---|
| --video_path | — | Input video |
| --fx | auto (MoGe) | Camera focal length in pixels |
| --inference_type | body | body = faster, fewer markers · full = body + hands (73 markers) |
| --person_height | 1.75 | Known subject height in metres — scales 3D output |
| --no_mesh_glb | off | Skip full-body mesh GLB export (saves ~125 MB) |
| --target_fps | 30 | Downsample input to this FPS (0 = every frame) |
| --max_frames | 0 | Stop after N frames (0 = full video) |
| --output_dir | auto | Output directory (default: output_YYYYMMDD_HHMMSS_<name>/) |
| Flag | Default | Description |
|---|---|---|
| --multi_person | off | Enable multi-person tracking and per-person export |
| --tracker | botsort | Tracker: botsort (re-ID), bytetrack (lighter), none (centroid) |
| --person_heights | — | Comma-separated heights left-to-right, e.g. 1.69,1.82 |
| --max_persons | 6 | Max detections per frame |
| --run_ik_per_person | off | Run OpenSim Scale + IK for each person (slow) |
| Flag | Default | Description |
|---|---|---|
| --floor_moge | off | Estimate floor plane from MoGe depth (frame 0) to correct camera pitch |
| --lean_ref_frame | — | Frame index where person stands upright (corrects residual lean) |
| --lean_angle | — | Manual lean correction in degrees (overrides auto-detection) |
| --no_lean_fix | off | Disable all automatic lean correction |
See HOW_TO_RUN.md for the complete flag reference.
All 3D outputs use the OpenSim Y-up convention:
- X = forward (anterior)
- Y = up (superior)
- Z = right (lateral)
- Units: millimetres (mm)
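If a downstream tool expects different axes or units, the conversion from this convention is a fixed rotation and scale. A minimal sketch mapping to a Z-up, metre-based frame (the target convention is only an example):

```python
import numpy as np

# Sketch: convert exported OpenSim-convention points (X forward, Y up, Z right, mm)
# into a Z-up frame in metres. The target convention is illustrative.
def yup_mm_to_zup_m(points_mm: np.ndarray) -> np.ndarray:
    x, y, z = points_mm[..., 0], points_mm[..., 1], points_mm[..., 2]
    # rotate +90 deg about X: the old Y axis (up) becomes the new Z axis (up)
    return np.stack([x, -z, y], axis=-1) / 1000.0

p = np.array([0.0, 1700.0, 0.0])     # a point 1.7 m above the origin
print(yup_mm_to_zup_m(p))            # -> [0.  0.  1.7]
```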
Camera-space keypoints (N, 70, 3) + joint coords (N, 127, 3)
│
▼ PostProcessor
│ ├─ interpolate missing frames
│ └─ Butterworth low-pass filter (6 Hz, order 4)
▼ CoordinateTransformer
│ ├─ rotate camera → OpenSim Y-up
│ ├─ scale to subject height (c_head as top reference)
│ ├─ apply global XZ translation from camera trajectory
│ ├─ align feet to ground (Y=0) per frame
│ ├─ MoGe floor-plane correction (optional, --floor_moge)
│ └─ spine-based forward-lean correction (auto or --lean_ref_frame)
▼ KeypointConverter (MHR70 → 73 OpenSim markers + spine joints)
▼ TRCExporter → markers_<name>.trc (mm)
▼ OpenSim Scale + IK → markers_<name>_ik.mot (subprocess → opensim env)
▼ write_mesh_glb() → markers_<name>_mesh.glb (morph targets + skeleton overlay)
▼ write_anatomical_glb() → markers_<name>_anatomical.glb (OpenSim bone meshes + IK)
In --multi_person mode, the pipeline runs per-person through PostProcessor → IK,
then generates per-person TRC/GLB files and a combined scene.
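The PostProcessor's smoothing step above (Butterworth low-pass, 6 Hz, order 4) corresponds to a standard zero-phase low-pass on the marker trajectories. A minimal SciPy sketch of the same operation, not the repo's exact implementation (padding and phase handling may differ):

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Sketch: order-4 Butterworth low-pass at 6 Hz, applied forward and backward
# (zero phase) along the time axis of a (frames, markers, xyz) array.
def lowpass_6hz(traj: np.ndarray, fps: float = 30.0) -> np.ndarray:
    b, a = butter(N=4, Wn=6.0 / (fps / 2.0), btype="low")
    return filtfilt(b, a, traj, axis=0)

noisy = np.cumsum(np.random.randn(300, 39, 3) * 2.0, axis=0)  # fake marker jitter
smooth = lowpass_6hz(noisy, fps=30.0)
print(noisy.shape, smooth.shape)
```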
run_publisher.py streams pose to OpenSim live via ZMQ at 50 Hz using the mhr2smpl pipeline.
Requires two additional data files not included in this repo:
- mhr2smpl/data/SMPL_NEUTRAL.pkl — from https://smpl-x.is.tue.mpg.de/ (free academic registration)
- mhr2smpl/data/mhr2smpl_mapping.npz — from the MHR repo at tools/mhr_smpl_conversion/assets/

Without these files, the offline export pipeline still works fully.
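The transport itself is plain ZMQ, so any process can consume the 50 Hz stream. A hypothetical subscriber sketch; the endpoint, socket pattern, and payload format here are assumptions, check run_publisher.py for the actual interface:

```python
import zmq

# Hypothetical receiver for the 50 Hz pose stream (endpoint and payload assumed).
ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://127.0.0.1:5555")        # assumed endpoint
sock.setsockopt_string(zmq.SUBSCRIBE, "")   # subscribe to all topics

for _ in range(10):
    msg = sock.recv()                       # one pose packet per frame
    print(f"received {len(msg)} bytes")
```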
| File | Contents |
|---|---|
| HOW_TO_RUN.md | Run commands, all flags, focal length calculation, OpenSim workflow, dependency setup |
| SETUP.md | Linux install, all GPU variants (5090/5070Ti/5070/4090/A3000…), TRT build |
| WINDOWS_SETUP.md | Windows install, cmd/PowerShell commands, Windows-specific issues |
| SETTINGS.md | Every environment variable, TRT engine specs, benchmark table |
| COMPROMISES.md | Every accuracy trade-off made to reach these speeds, with measured numbers |
If you use this OpenSim extension, please also cite the upstream Fast-SAM-3D-Body paper:
@article{yang2026fastsam3dbody,
title={Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery},
author={Yang, Timing and He, Sicheng and Jing, Hongyi and Yang, Jiawei and Liu, Zhijian
and Zou, Chuhang and Wang, Yue},
journal={arXiv preprint arXiv:2603.15603},
year={2026}
}

Timing Yang1, Sicheng He1, Hongyi Jing1, Jiawei Yang1, Zhijian Liu2,3, Chuhang Zou4†, Yue Wang1,3†
1USC Physical Superintelligence (PSI) Lab · 2University of California, San Diego · 3NVIDIA · 4Meta Reality Labs
† Joint corresponding authors
Speed-accuracy overview of Fast SAM 3D Body. Top left: Qualitative results on in-the-wild images show our framework preserves high-fidelity reconstruction. Top right: Our method achieves up to a 10.25x end-to-end speedup over SAM 3D Body and replaces the iterative MHR-to-SMPL bottleneck with a 10,000x faster neural mapping. Bottom: Our system enables real-time humanoid robot control from a single RGB stream at ~65 ms per frame on an NVIDIA RTX 5090.
SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.
Qualitative comparison. The original SAM 3D Body (left) and our Fast variant (right) yield visually comparable mesh reconstructions across diverse poses and multi-person scenes on 3DPW and EMDB.
Please refer to SAM 3D Body for environment setup, or use our setup script:
bash setup_env.sh
conda activate fast_sam_3d_body

checkpoints/
├── sam-3d-body-dinov3/ # Auto-downloaded from HuggingFace on first run
│ ├── model.ckpt
│ └── assets/
│ └── mhr_model.pt
├── yolo/ # Place YOLO-Pose weights here
│ ├── yolo11m-pose.pt
│ └── yolo11m-pose.engine # Generated by convert_yolo_pose_trt.py (optional)
└── moge_trt/ # Generated by build_tensorrt.sh (optional)
└── moge_dinov2_encoder_fp16.engine
# Optimized (torch.compile + TensorRT)
bash run_demo.sh

# Convert all models (YOLO-Pose + MoGe encoder + DINOv3 backbone)
bash build_tensorrt.sh
# Or convert individually
python convert_yolo_pose_trt.py --model yolo11m-pose.pt --imgsz 640 --half
python convert_moge_encoder_trt.py --all
python convert_backbone_tensorrt.py --all

All generated engines are stored under ./checkpoints/.
For instructions on running the publisher, see docs/realworld_deployment.md.
We demonstrate a real-time, vision-only teleoperation system for the Unitree G1 humanoid robot using a single RGB camera, operating at ~65 ms end-to-end latency on an NVIDIA RTX 5090.
Humanoid teleoperation. The system tracks diverse whole-body motions including upper-body gestures (a), body rotations (b-e), walking (f), wide stance (g), single-leg standing (h), squatting (i), and kneeling (j).
Humanoid policy rollout. The robot grasps a box on the table with both hands, squats down, and steps to the right. Achieving 80% task success rate with 40 demonstrations collected via our system.
Single-View vs Multi-View. Multi-view fusion resolves depth ambiguities inherent in single-view reconstruction, producing more accurate SMPL body estimates.
@article{yang2026fastsam3dbody,
title={Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery},
author={Yang, Timing and He, Sicheng and Jing, Hongyi and Yang, Jiawei and Liu, Zhijian and Zou, Chuhang and Wang, Yue},
journal={arXiv preprint arXiv:2603.15603},
year={2026}
}

This project builds upon SAM 3D Body (3DB) and Multi-HMR (MHR). We thank the original authors for releasing their models and codebases, which served as the foundation for our acceleration framework.




