
FastSAM3DToOpenSim

OpenSim biomechanics extension of Fast SAM 3D Body

Takes the Fast-SAM-3D-Body inference pipeline and exports every frame of a video directly to OpenSim-ready files: TRC marker trajectories, IK-solved MOT joint angles, body model, animated body mesh GLB, and anatomical bone GLB. Matches the output format of SAM3D-OpenSim.


What this adds on top of Fast-SAM-3D-Body

Feature                                            Fast-SAM-3D-Body   This repo
3D body mesh inference                             ✓                  ✓
Annotated video output                             ✓                  ✓
OpenSim TRC marker file (39–79 markers, mm)        ✗                  ✓
OpenSim IK-solved MOT (40 DOF, via OpenSim 4.5)    ✗                  ✓
Pose2Sim Wholebody body model                      ✗                  ✓
Animated full-body mesh GLB (morph targets)        ✗                  ✓ (opt-in)
Anatomical bone GLB (OpenSim .vtp meshes + IK)     ✗                  ✓
Multi-person tracking (BoT-SORT)                   ✗                  ✓
Per-person TRC / GLB + combined scene GLB          ✗                  ✓
Floor lean correction (MoGe depth + spine)         ✗                  ✓
Timestamped output folders                         ✗                  ✓
Full setup guides for Linux + Windows + Docker     ✗                  ✓
TRT engine build instructions                      partial            ✓

Output files

Each run creates a timestamped folder: output_YYYYMMDD_HHMMSS_<videoname>/

output_20260320_173750_myvideo/
  markers_<name>_skeleton.mp4      — annotated video with 2D skeleton overlay
  markers_<name>.trc               — OpenSim markers in mm, Y-up (39 body / 79 full mode)
  markers_<name>_ik.mot            — IK-solved joint angles, 40 DOF, degrees
  markers_<name>_model.osim        — Pose2Sim Wholebody body model
  markers_<name>_mesh.glb          — animated full-body mesh + skeleton overlay (skip with --no_mesh_glb)
  markers_<name>_anatomical.glb    — OpenSim anatomical bones animated by IK
  _ik_marker_errors.sto            — IK marker tracking residuals per frame
  inference_meta.json              — video metadata
  video_outputs.json               — per-frame raw 3D keypoints
  processing_report.json           — pipeline summary: timings, IK/GLB status

With --multi_person, additional per-person files are generated:

  markers_<name>_person01.trc          — per-person TRC
  markers_<name>_person02.trc
  markers_<name>_person01_mesh.glb     — per-person mesh GLB
  markers_<name>_person02_mesh.glb
  markers_<name>_combined.trc          — all persons in one TRC (world-space offsets)
  markers_<name>_combined_mesh.glb     — all persons in one GLB scene (distinct colors)

TRC marker set

Body (30): Nose · LEye · REye · LEar · REar · LShoulder · RShoulder · LElbow · RElbow · LHip · RHip · LKnee · RKnee · LAnkle · RAnkle · LBigToe · LSmallToe · LHeel · RBigToe · RSmallToe · RHeel · RWrist · LWrist · LOlecranon · ROlecranon · LCubitalFossa · RCubitalFossa · LAcromion · RAcromion · Neck

Derived (3): PelvisCenter · Thorax · SpineMid

Spine joints (6): c_spine0 · c_spine1 · c_spine2 · c_spine3 · c_neck · c_head (real MHR 127-joint armature positions, not geometric interpolation)

Hands (40): full finger tracking (20 per hand) — only present with --inference_type full

Total: 39 markers in body mode, 79 markers in full mode.
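
For quick checks outside OpenSim, the TRC file is a tab-separated table behind the standard five-line OpenSim TRC header. A minimal NumPy loader might look like this (a sketch, not repo code; the file name is illustrative):

import numpy as np

def read_trc(path):
    """Minimal OpenSim TRC reader (sketch; assumes the standard 5-line header)."""
    with open(path) as f:
        lines = f.read().splitlines()
    # Header line 4 holds the column names: Frame#, Time, then one name per marker.
    names = [n for n in lines[3].split("\t") if n][2:]
    # Data rows follow the header; skip the blank separator line if present.
    rows = [l.split() for l in lines[5:] if l.strip()]
    data = np.array(rows, dtype=float)
    times = data[:, 1]
    xyz = data[:, 2:].reshape(len(rows), -1, 3)   # (frames, markers, XYZ), in mm
    return names, times, xyz

names, t, xyz = read_trc("markers_myvideo.trc")
print(len(names), xyz.shape)   # 39 markers in body mode, 79 in full mode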

MOT joint angles — 40 DOF

OpenSim IK-solved via InverseKinematicsTool using the Pose2Sim Wholebody model. Columns: pelvis tx/ty/tz/tilt/list/rotation · l/r hip flexion/adduction/rotation · l/r knee angle · l/r ankle angle · lumbar extension/bending/rotation · arm flex/add/rot · elbow flex · pro/sup · wrist flex/dev (both sides).
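
The pipeline solves IK in a subprocess inside the opensim conda env. For reference, an equivalent standalone run through the OpenSim 4.x Python bindings would look roughly like this (a sketch; file names are illustrative and match the output listing above):

import opensim as osim

# Re-run IK by hand against the exported model and markers (the pipeline
# normally does this automatically; sketch only).
model = osim.Model("markers_myvideo_model.osim")
ik = osim.InverseKinematicsTool()
ik.setModel(model)
ik.setMarkerDataFileName("markers_myvideo.trc")
ik.setOutputMotionFileName("markers_myvideo_ik.mot")
ik.run()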


Performance (measured, RTX 5090 Laptop, Linux, 848×480)

Mode                             Inference FPS   Total time (19.5 s video)
body — no hands                  ~14 fps         ~50 s
full — body + hands (IK-ready)   ~5.3 fps        ~115 s

Total time includes inference, post-processing, OpenSim IK, and GLB export.

See COMPROMISES.md for a breakdown of every trade-off.


Quick start

1. Install

Additional dependency for IK:

# OpenSim 4.5 (IK solver — optional, TRC is always written)
conda create -n opensim python=3.10
conda install -n opensim -c opensim-org opensim
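
To confirm the solver environment resolves, a quick version check through the bindings (a sketch; run it inside the opensim env):

# e.g. conda run -n opensim python -c "import opensim; print(opensim.GetVersion())"
import opensim
print(opensim.GetVersion())   # expect 4.5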

2. Run

conda activate fast_sam_3d_body

SKIP_KEYPOINT_PROMPT=1 FOV_TRT=1 FOV_FAST=1 FOV_MODEL=s FOV_LEVEL=0 \
USE_TRT_BACKBONE=1 USE_COMPILE=1 DECODER_COMPILE=1 COMPILE_MODE=reduce-overhead \
MHR_NO_CORRECTIVES=1 GPU_HAND_PREP=1 BODY_INTERM_PRED_LAYERS=0,2 \
DEBUG_NAN=0 PARALLEL_DECODERS=0 COMPILE_WARMUP_BATCH_SIZES=1 \
python demo_video_opensim.py \
    --video_path ./videos/my_video.mp4 \
    --detector_model checkpoints/yolo/yolo11m-pose.engine \
    --inference_type full \
    --fx 1371

Replace --fx 1371 with your camera focal length in pixels (see HOW_TO_RUN.md for how to compute it). If unknown, omit --fx and the pipeline will estimate it from the image.
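
If you know the camera's horizontal field of view rather than its focal length, the standard pinhole relation converts one to the other (generic formula, not repo code):

import math

def fx_from_hfov(width_px: int, hfov_deg: float) -> float:
    """Pinhole camera: fx = width / (2 * tan(hfov / 2))."""
    return width_px / (2 * math.tan(math.radians(hfov_deg) / 2))

print(round(fx_from_hfov(848, 60)))   # 848-px-wide frame at 60° HFOV → ~734 px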

3. Open in OpenSim

The IK MOT is written automatically, so you can load it in the OpenSim GUI without re-running IK:

  1. File → Open Model → select markers_<name>_model.osim
  2. File → Load Motion → select markers_<name>_ik.mot
  3. File → Open Motion Capture Data → select markers_<name>.trc to inspect markers
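
To inspect the joint angles without the GUI: a MOT file is a short text header terminated by an endheader line, then a tab-separated table whose first column is time (the standard OpenSim storage layout). A minimal reader, as a sketch:

import numpy as np

def read_mot(path):
    """Minimal OpenSim MOT reader (sketch; assumes the standard endheader layout)."""
    with open(path) as f:
        lines = f.read().splitlines()
    start = next(i for i, l in enumerate(lines) if l.strip().lower() == "endheader")
    cols = lines[start + 1].split()   # 'time' plus one column per DOF
    data = np.array([l.split() for l in lines[start + 2:] if l.strip()], dtype=float)
    return cols, data

cols, data = read_mot("markers_myvideo_ik.mot")
print(len(cols) - 1, "DOF,", len(data), "frames")   # 40 DOF per the section above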

4. Open in Blender / 3D viewer

File → Import → glTF 2.0 → select markers_<name>_mesh.glb (body mesh + skeleton) or markers_<name>_anatomical.glb (OpenSim bone meshes)


Key flags

Flag               Default       Description
--video_path       (required)    Input video
--fx               auto (MoGe)   Camera focal length in pixels
--inference_type   body          body = faster, fewer markers · full = body + hands (79 markers)
--person_height    1.75          Known subject height in metres — scales 3D output
--no_mesh_glb      off           Skip full-body mesh GLB export (saves ~125 MB)
--target_fps       30            Downsample input to this FPS (0 = every frame)
--max_frames       0             Stop after N frames (0 = full video)
--output_dir       auto          Output directory (default: output_YYYYMMDD_HHMMSS_<name>/)

Multi-person

Flag                  Default   Description
--multi_person        off       Enable multi-person tracking and per-person export
--tracker             botsort   Tracker: botsort (re-ID), bytetrack (lighter), none (centroid)
--person_heights                Comma-separated heights left-to-right, e.g. 1.69,1.82
--max_persons         6         Max detections per frame
--run_ik_per_person   off       Run OpenSim Scale + IK for each person (slow)

Lean / floor correction

Flag               Default   Description
--floor_moge       off       Estimate floor plane from MoGe depth (frame 0) to correct camera pitch
--lean_ref_frame             Frame index where person stands upright (corrects residual lean)
--lean_angle                 Manual lean correction in degrees (overrides auto-detection)
--no_lean_fix      off       Disable all automatic lean correction

See HOW_TO_RUN.md for the complete flag reference.


Coordinate system (TRC)

All 3D outputs use OpenSim Y-up convention:

X = forward (anterior)
Y = up (superior)
Z = right (lateral)
Units: millimetres (mm)
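
Purely as an illustration (the real rotation lives in CoordinateTransformer and is not reproduced here), mapping a typical computer-vision camera frame (x right, y down, z forward) into this Y-up convention is an axis permutation plus the metre-to-millimetre scale:

import numpy as np

# Hypothetical remap, assuming a CV camera frame (x right, y down, z forward).
R = np.array([[0,  0, 1],    # X_opensim =  z_cam  (forward)
              [0, -1, 0],    # Y_opensim = -y_cam  (up)
              [1,  0, 0]])   # Z_opensim =  x_cam  (right)

p_cam = np.array([0.1, -1.6, 3.0])   # metres, camera frame
p_osim = (R @ p_cam) * 1000.0        # millimetres, OpenSim frame
print(p_osim)                        # [3000. 1600.  100.]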

Post-processing pipeline

Camera-space keypoints  (N, 70, 3)  +  joint coords  (N, 127, 3)
    │
    ▼  PostProcessor
    │    ├─ interpolate missing frames
    │    └─ Butterworth low-pass filter  (6 Hz, order 4)
    ▼  CoordinateTransformer
    │    ├─ rotate camera → OpenSim Y-up
    │    ├─ scale to subject height (c_head as top reference)
    │    ├─ apply global XZ translation from camera trajectory
    │    ├─ align feet to ground (Y=0) per frame
    │    ├─ MoGe floor-plane correction (optional, --floor_moge)
    │    └─ spine-based forward-lean correction (auto or --lean_ref_frame)
    ▼  KeypointConverter   (MHR70 → 73 OpenSim markers + spine joints)
    ▼  TRCExporter         → markers_<name>.trc  (mm)
    ▼  OpenSim Scale + IK  → markers_<name>_ik.mot  (subprocess → opensim env)
    ▼  write_mesh_glb()    → markers_<name>_mesh.glb  (morph targets + skeleton overlay)
    ▼  write_anatomical_glb() → markers_<name>_anatomical.glb  (OpenSim bone meshes + IK)

In --multi_person mode, the pipeline runs per-person through PostProcessor → IK, then generates per-person TRC/GLB files and a combined scene.
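
The Butterworth step above (6 Hz, order 4) corresponds to a standard zero-phase low-pass; a minimal SciPy equivalent (a sketch, not the repo's PostProcessor):

import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(traj, fs, cutoff=6.0, order=4):
    """Zero-phase Butterworth low-pass along the time axis of (N, joints, 3) keypoints."""
    b, a = butter(order, cutoff / (fs / 2))   # cutoff normalized to Nyquist
    return filtfilt(b, a, traj, axis=0)       # forward-backward pass, no phase lag

kps = np.cumsum(np.random.randn(300, 70, 3), axis=0)   # fake noisy (N, 70, 3) track
smooth = lowpass(kps, fs=30.0)                         # 30 fps input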


Real-time ZMQ streaming

run_publisher.py streams pose to OpenSim live via ZMQ at 50 Hz using the mhr2smpl pipeline. Requires two additional data files not included in this repo:

  • mhr2smpl/data/SMPL_NEUTRAL.pkl — from https://smpl-x.is.tue.mpg.de/ (free academic registration)
  • mhr2smpl/data/mhr2smpl_mapping.npz — from the MHR repo at tools/mhr_smpl_conversion/assets/

Without these files, the offline export pipeline still works fully.
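
The wire format and endpoint are defined by run_publisher.py; purely to illustrate the receiving side, a generic PyZMQ subscriber loop looks like this (hypothetical address and payload handling; check the publisher source for the real ones):

import zmq

# Hypothetical subscriber; the actual endpoint and serialization are set
# by run_publisher.py.
ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://127.0.0.1:5555")       # placeholder address
sock.setsockopt_string(zmq.SUBSCRIBE, "")  # subscribe to every topic
while True:
    msg = sock.recv()                      # one pose packet per 20 ms at 50 Hz
    print(len(msg), "bytes")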


Documentation index

File               Contents
HOW_TO_RUN.md      Run commands, all flags, focal length calculation, OpenSim workflow, dependency setup
SETUP.md           Linux install, all GPU variants (5090/5070Ti/5070/4090/A3000…), TRT build
WINDOWS_SETUP.md   Windows install, cmd/PowerShell commands, Windows-specific issues
SETTINGS.md        Every environment variable, TRT engine specs, benchmark table
COMPROMISES.md     Every accuracy trade-off made to reach these speeds, with measured numbers

Citation

If you use this OpenSim extension, please also cite the upstream Fast-SAM-3D-Body paper:

@article{yang2026fastsam3dbody,
  title={Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery},
  author={Yang, Timing and He, Sicheng and Jing, Hongyi and Yang, Jiawei and Liu, Zhijian
          and Zou, Chuhang and Wang, Yue},
  journal={arXiv preprint arXiv:2603.15603},
  year={2026}
}


Fast SAM 3D Body (upstream)

Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery

Timing Yang¹, Sicheng He¹, Hongyi Jing¹, Jiawei Yang¹, Zhijian Liu²,³, Chuhang Zou⁴, Yue Wang¹,³

¹USC Physical Superintelligence (PSI) Lab   ²University of California, San Diego   ³NVIDIA   ⁴Meta Reality Labs

Joint corresponding authors

Paper   Project Page

Speed-accuracy overview of Fast SAM 3D Body. Top left: Qualitative results on in-the-wild images show our framework preserves high-fidelity reconstruction. Top right: Our method achieves up to a 10.25x end-to-end speedup over SAM 3D Body and replaces the iterative MHR-to-SMPL bottleneck with a 10,000x faster neural mapping. Bottom: Our system enables real-time humanoid robot control from a single RGB stream at ~65 ms per frame on an NVIDIA RTX 5090.

Abstract

SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.

Qualitative comparison. The original SAM 3D Body (left) and our Fast variant (right) yield visually comparable mesh reconstructions across diverse poses and multi-person scenes on 3DPW and EMDB.

Getting Started

Environment

Please refer to SAM 3D Body for environment setup, or use our setup script:

bash setup_env.sh
conda activate fast_sam_3d_body

Checkpoints

checkpoints/
├── sam-3d-body-dinov3/       # Auto-downloaded from HuggingFace on first run
│   ├── model.ckpt
│   └── assets/
│       └── mhr_model.pt
├── yolo/                     # Place YOLO-Pose weights here
│   ├── yolo11m-pose.pt
│   └── yolo11m-pose.engine   # Generated by convert_yolo_pose_trt.py (optional)
└── moge_trt/                 # Generated by build_tensorrt.sh (optional)
    └── moge_dinov2_encoder_fp16.engine

Run

# Optimized (torch.compile + TensorRT)
bash run_demo.sh

TensorRT Acceleration (Optional)

# Convert all models (YOLO-Pose + MoGe encoder + DINOv3 backbone)
bash build_tensorrt.sh

# Or convert individually
python convert_yolo_pose_trt.py --model yolo11m-pose.pt --imgsz 640 --half
python convert_moge_encoder_trt.py --all
python convert_backbone_tensorrt.py --all

All generated engines are stored under ./checkpoints/.

Real-World Deployment

For instructions on running the publisher, see docs/realworld_deployment.md.

We demonstrate a real-time, vision-only teleoperation system for the Unitree G1 humanoid robot using a single RGB camera, operating at ~65 ms end-to-end latency on an NVIDIA RTX 5090.

Humanoid teleoperation. The system tracks diverse whole-body motions including upper-body gestures (a), body rotations (b-e), walking (f), wide stance (g), single-leg standing (h), squatting (i), and kneeling (j).

Humanoid policy rollout. The robot grasps a box on the table with both hands, squats down, and steps to the right, achieving an 80% task success rate with 40 demonstrations collected via our system.

Single-View vs Multi-View. Multi-view fusion resolves depth ambiguities inherent in single-view reconstruction, producing more accurate SMPL body estimates.

Citation

@article{yang2026fastsam3dbody,
  title={Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery},
  author={Yang, Timing and He, Sicheng and Jing, Hongyi and Yang, Jiawei and Liu, Zhijian and Zou, Chuhang and Wang, Yue},
  journal={arXiv preprint arXiv:2603.15603},
  year={2026}
}

Acknowledgements

This project builds upon SAM 3D Body (3DB) and Multi-HMR (MHR). We thank the original authors for releasing their models and codebases, which served as the foundation for our acceleration framework.
