Find the full report with visualizations here.
Human pose estimation is a fundamental computer vision task with applications in sports analytics, robotics, and augmented reality. This project compares three distinct methodologies, analyzing the tradeoffs between accuracy and computational efficiency.
This repository contains the implementation of three different approaches for multi-person 2D pose estimation:
- **Custom Top-Down Architecture** - Using Swin Transformer [1] backbones and YOLOv8 for person detection
  - Implemented by @npragin in `top_down/`
- **Bottom-Up Architecture** - Inspired by OpenPose with Part Affinity Fields [2]
  - Implemented by @AandBstudent in `bottom_up/`
- **End-to-End Transformer-based Architecture** - Based on PETR (Pose Estimation with TRansformers) [3]
  - Adapted for MPII by @BryanZChen in `e2e/`
**Custom Top-Down Architecture**

- Background:
  - Top-down approaches detect individual people in the image and feed each detected person into a single-person pose estimator
- Architecture:
  - Two-stage pipeline using YOLOv8 for person detection
  - Swin Transformer backbone (Swin-S for the 50M model, Swin-B for the 100M model)
  - Two MLP heads for keypoint visibility classification and coordinate regression
- Training:
  - Loss function combining binary cross-entropy (visibility) and smooth L1 (coordinates), as sketched below
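A minimal sketch of this combined objective, assuming a PyTorch implementation; the tensor shapes, the visibility masking, and the `vis_weight` balance term are illustrative assumptions rather than the repository's exact code:

```python
import torch
import torch.nn.functional as F

def top_down_loss(vis_logits, coord_preds, vis_targets, coord_targets, vis_weight=1.0):
    """Binary cross-entropy on keypoint visibility + smooth L1 on coordinates.

    vis_logits:    (B, K)    raw visibility scores per keypoint
    coord_preds:   (B, K, 2) predicted (x, y) in normalized image coordinates
    vis_targets:   (B, K)    1 if the keypoint is annotated as visible, else 0
    coord_targets: (B, K, 2) ground-truth (x, y)
    """
    vis_loss = F.binary_cross_entropy_with_logits(vis_logits, vis_targets.float())
    # Only regress coordinates for keypoints that are actually visible.
    mask = vis_targets.unsqueeze(-1).float()
    coord_loss = F.smooth_l1_loss(coord_preds * mask, coord_targets * mask)
    return vis_weight * vis_loss + coord_loss
```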
**Bottom-Up Architecture**

- Background:
  - Bottom-up approaches identify all keypoints present in the full image, then group them into individual people
- Architecture:
  - ResNet-50 backbone followed by iterative refinement stages
  - Predicts heatmaps for joint locations and Part Affinity Fields (PAFs) for limb connections
  - Multi-stage refinement to enhance prediction accuracy
- Training:
  - Optimizes an MSE loss between predicted and ground-truth heatmaps/PAFs (see the sketch below)
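A minimal sketch of this objective, assuming a PyTorch implementation that supervises every refinement stage; the per-stage summation and tensor shapes are assumptions following the OpenPose-style design cited above:

```python
import torch
import torch.nn.functional as F

def bottom_up_loss(stage_heatmaps, stage_pafs, gt_heatmaps, gt_pafs):
    """MSE between predictions and ground truth, accumulated over refinement stages.

    stage_heatmaps: list of (B, K, H, W) joint-confidence maps, one per stage
    stage_pafs:     list of (B, 2*L, H, W) part affinity fields, one per stage
    gt_heatmaps:    (B, K, H, W) ground-truth heatmaps
    gt_pafs:        (B, 2*L, H, W) ground-truth PAFs
    """
    loss = torch.zeros((), device=gt_heatmaps.device)
    # Intermediate supervision: every refinement stage is pushed toward the same targets.
    for heatmap, paf in zip(stage_heatmaps, stage_pafs):
        loss = loss + F.mse_loss(heatmap, gt_heatmaps) + F.mse_loss(paf, gt_pafs)
    return loss
```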
**End-to-End Transformer-based Architecture**

- Background:
  - End-to-end approaches process the input image in a single unified network, without explicitly separating person detection from keypoint localization
- Architecture:
  - Based on Pose Estimation with TRansformers (PETR)
  - Uses a ResNet-50 or Swin-B backbone
  - Directly predicts human keypoints from multi-scale feature maps
- Training:
  - Direct optimization of an OKS-based loss, L = 1 - OKS (see the OKS definition and sketch below)
Object Keypoint Similarity (OKS) is the primary metric for evaluating human pose estimation performance. OKS is analogous to IoU (Intersection over Union) in object detection but designed explicitly for keypoint-based tasks.
The OKS between a predicted pose and a ground truth pose is calculated as:

$$\mathrm{OKS} = \frac{\sum_i \exp\!\left(-\dfrac{d_i^2}{2 s^2 k_i^2}\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
Where:
- $d_i$ is the Euclidean distance between the predicted keypoint and ground truth keypoint
- $s$ is the object scale (square root of the person bounding box area)
- $k_i$ is a per-keypoint constant that controls relative importance (different joints have different importance)
- $v_i$ is the visibility flag of the ground truth keypoint
- $\delta(v_i > 0)$ is 1 if the keypoint is visible, 0 otherwise
Higher OKS values indicate better alignment between predicted and ground truth poses, with 1.0 representing a perfect match.
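For concreteness, here is a small NumPy sketch of this metric for a single pose pair. The function name and the per-keypoint constants `k` are assumptions (the repository may use MPII-specific values); the end-to-end model's training loss described above is simply one minus this quantity:

```python
import numpy as np

def oks(pred, gt, vis, k, scale):
    """Object Keypoint Similarity between one predicted and one ground-truth pose.

    pred, gt: (K, 2) arrays of (x, y) keypoint coordinates
    vis:      (K,) visibility flags of the ground-truth keypoints
    k:        (K,) per-keypoint constants controlling the falloff per joint
    scale:    object scale s (sqrt of the person bounding-box area)
    """
    d2 = np.sum((pred - gt) ** 2, axis=-1)           # squared distances d_i^2
    e = d2 / (2.0 * (scale ** 2) * (k ** 2) + 1e-9)  # normalized error per keypoint
    visible = vis > 0                                 # delta(v_i > 0)
    if not visible.any():
        return 0.0
    # Mean of exp(-e) over visible keypoints == sum(exp(-e) * delta) / sum(delta)
    return float(np.exp(-e)[visible].mean())

# During training of the end-to-end model, the loss is 1 - OKS, computed with
# differentiable tensor ops instead of NumPy.
```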
To ensure a fair comparison across all three architectural approaches, we implemented strict experimental controls:
- **Parameter Normalization**: All models were constrained to parameter counts within ±3% of each other to normalize computational capacity. We created variants at two complexity levels, 50M and 100M parameters.
- **Consistent Dataset Handling**: All models were trained on identical subsets of the MPII human pose dataset, with seven-eighths used for training and one-eighth reserved for evaluation.
- **Standardized Data Augmentation**: We applied the same augmentation techniques across all experiments (see the sketch after this list):
  - Color jittering with uniform hyperparameters
  - Standardized image sub-sampling to predetermined dimensions
  - Normalization using ImageNet statistics
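A minimal sketch of such a shared preprocessing pipeline, assuming torchvision transforms; the jitter strengths and the 256x256 target size are illustrative placeholders, not the values used in the experiments:

```python
from torchvision import transforms

# ImageNet channel statistics used for normalization.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

shared_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.Resize((256, 256)),   # standardized sub-sampling to a fixed size
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
# Note: when the image is resized, keypoint coordinates must be rescaled consistently.
```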
| Model | Train OKS | Test OKS |
|---|---|---|
| Custom Top-down 50M | 0.670 | 0.622 |
| Custom Top-down 100M | 0.732 | 0.668 |
| SOTA E2E 50M | 0.917 | 0.896 |
| SOTA E2E 100M | 0.903 | 0.897 |
| SOTA Bottom-up 50M | 0.470 | 0.237 |
| SOTA Bottom-up 100M | 0.441 | 0.324 |
The end-to-end detection model (PETR) significantly outperformed both top-down and bottom-up approaches in our experiments.
- End-to-end models perform best for multi-person pose estimation tasks
- Parameter scaling effects vary across architectures
  - Top-down: +7.4% OKS improvement
  - End-to-end: minimal impact (+0.1% OKS)
  - Bottom-up: +36.7% OKS improvement, but from a lower baseline
- Multi-scale feature maps and positionally invariant decoders are critical for accurate pose estimation
  - Their absence likely contributes to our custom top-down model's lower performance
- Top-down approaches offer good accuracy, but inference computation scales linearly with the number of people in the image
- Bottom-up approaches suffer in accuracy but maintain computational efficiency regardless of person count
- Practical deployment considerations:
  - For high-accuracy requirements (medical, sports analytics): end-to-end models are recommended
  - For real-time applications with moderate accuracy needs in high-density environments: bottom-up models provide the highest throughput
  - For balanced performance: end-to-end models with the 50M parameter configuration offer the best tradeoff
- Implement rotation and translation data augmentation to encourage rotational and positional invariance and equivariance
- Implement multi-scale feature pyramid networks to better handle various human scales
- Develop positionally and rotationally invariant/equivariant decoders to improve generalization
- Implement dynamic bounding box computation to replace the fixed scaling factor
[1] Z. Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows," ICCV 2021.
[2] Z. Cao et al., "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields," CVPR 2017.
[3] D. Shi et al., "End-to-End Multi-Person Pose Estimation with Transformers," CVPR 2022.
This project is released under the GPLv3 License.
