State-of-the-art action recognition for both images and videos. From pose-based image classification to temporal 3D CNNs for video understanding.
🎯 Two complete pipelines: Single-frame pose classification (90 FPS) + Video temporal modeling (87% on UCF-101)
- 87.05% accuracy on UCF-101 with MC3-18
- 3D CNN models (R3D-18, MC3-18) for temporal understanding
- Published on HuggingFace: MC3-18 | R3D-18
- Train your own with complete training pipeline
- 88.5% accuracy on Stanford40 with ResNet50
- 90 FPS real-time inference with MediaPipe + PyTorch
- Pose-aware classification with geometric reasoning
- Pure Python, zero C++ compilation
| Model | Accuracy | Params | FPS | Dataset | Download |
|---|---|---|---|---|---|
| MC3-18 | 87.05% | 11.7M | 30 | UCF-101 (101 classes) | |
| R3D-18 | 83.80% | 33.2M | 40 | UCF-101 (101 classes) | |
Input: 16-frame clips @ 112×112
Use case: Action classification in video clips (sports, activities, human-object interaction)
📊 Comparison with Published Baselines
| Method | Published | This Repo | Improvement |
|---|---|---|---|
| R3D-18 | 82.8% | 83.8% | +1.0% ✅ |
| MC3-18 | 85.0% | 87.05% | +2.05% ✅ |
Our models match or exceed the accuracies reported in the original papers.
| Model | Accuracy | Speed | Dataset | Download |
|---|---|---|---|---|
| ResNet50 | 88.5% | 6ms | Stanford40 (40 classes) | |
| ResNet34 | 86.4% | 5ms | Stanford40 | |
| MobileNetV3-Large | 82.1% | 3ms | Stanford40 | |
| ResNet18 | 82.3% | 4ms | Stanford40 | |
Input: Single RGB image @ 224×224
Use case: Real-time single-frame action classification (fitness, sports, daily activities)
# Core library
pip install -e .
# With demo interface
pip install -e ".[demo]"
# Full installation (training + demo)
pip install -e ".[dev,demo,train]"import torch
from torchvision.transforms import Compose, Resize, CenterCrop, Normalize, ToTensor
# Load video model
model = torch.hub.load('dronefreak/mc3-18-ucf101', 'model', pretrained=True)
model.eval()
# Per-frame preprocessing (Kinetics-400 mean/std), applied to each of the 16 frames
transform = Compose([
    Resize((128, 171)),
    CenterCrop(112),
    ToTensor(),
    Normalize(mean=[0.43216, 0.394666, 0.37645],
              std=[0.22803, 0.22145, 0.216989])
])
# Inference on a preprocessed clip batch of shape (1, C, T, H, W)
with torch.no_grad():
    output = model(video_tensor)  # (1, 101) class logits
    prediction = output.argmax(dim=1)
print(f"Action: {ucf101_classes[prediction.item()]}")
🎬 Video Classification CLI

# Classify video clip
python -m hac.video.inference.predict \
--video clip.mp4 \
--model dronefreak/mc3-18-ucf101 \
--num_frames 16
# Real-time webcam
python -m hac.video.inference.predict \
--webcam \
--model dronefreak/mc3-18-ucf101

from hac import ActionPredictor
# Initialize with pose estimation
predictor = ActionPredictor(
    model_path=None,  # defaults to the pretrained ResNet50
    device='cuda',
    use_pose_estimation=True
)
# Predict from image
result = predictor.predict_image('person.jpg')
print(f"Pose: {result['pose']['class']}")
print(f"Action: {result['action']['top_class']}")
print(f"Confidence: {result['action']['top_confidence']:.2%}")π₯οΈ Web Demo & CLI
# Launch interactive web demo
hac-demo
# Command line inference
hac-infer --image photo.jpg --model weights/best.pth
# Real-time webcam
hac-infer --webcam --model weights/best.pth

# 1. Download UCF-101
wget https://www.crcv.ucf.edu/data/UCF101/UCF101.rar
unrar x UCF101.rar
# 2. Download official splits
wget https://www.crcv.ucf.edu/data/UCF101/UCF101TrainTestSplits-RecognitionTask.zip
unzip UCF101TrainTestSplits-RecognitionTask.zip
# 3. Organize dataset
python scripts/split_ucf101.py \
--source UCF-101/ \
--output UCF-101-organized/ \
--splits ucfTrainTestlist/ \
--split_num 1
# 4. Train MC3-18
python -m hac.video.training.train \
--data_dir UCF-101-organized/ \
--model mc3_18 \
--pretrained \
--batch_size 32 \
--epochs 200 \
--output_dir outputs/mc3-18

Expected results:
- MC3-18: 87% accuracy (200 epochs, 6-7 hours on RTX 4070 Super)
- R3D-18: 84% accuracy (100 epochs, 3-4 hours)
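The organized dataset from step 3 (consumed via `--data_dir` in step 4) presumably follows a class-per-folder train/val layout along these lines (exact directory names are an assumption):

```
UCF-101-organized/
├── train/
│   ├── ApplyEyeMakeup/
│   │   ├── v_ApplyEyeMakeup_g08_c01.avi
│   │   └── ...
│   └── ... (one folder per class, 101 total)
└── val/
    └── ... (same class folders)
```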
🚀 Advanced Video Training
# With augmentations
python -m hac.video.training.train \
--data_dir UCF-101-organized/ \
--model mc3_18 \
--pretrained \
--batch_size 32 \
--epochs 200 \
--lr 0.001 \
--weight_decay 1e-4 \
--mixup_alpha 0.2 \
--cutmix_alpha 0.5 \
--label_smoothing 0.1
# Try different models
--model r3d_18 # 83.8% accuracy
--model r2plus1d_18 # Alternative architecture
--model mc3_18 # 87% accuracy (best)
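The `--mixup_alpha`, `--cutmix_alpha`, and `--label_smoothing` flags refer to standard regularizers. As a reference for what mixup does to a batch, here is a minimal sketch (illustrative only, not necessarily this repo's exact implementation):

```python
import torch
import torch.nn.functional as F

def mixup(clips, targets, alpha=0.2, num_classes=101):
    """Blend random pairs of clips and their one-hot labels (Zhang et al., 2018)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    mixed_clips = lam * clips + (1 - lam) * clips[perm]
    onehot = F.one_hot(targets, num_classes).float()
    mixed_targets = lam * onehot + (1 - lam) * onehot[perm]
    return mixed_clips, mixed_targets
```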
# 1. Download Stanford40 dataset
# From: http://vision.stanford.edu/Datasets/40actions.html
# 2. Train
python -m hac.image.training.train \
--data_dir data/ \
--model_name resnet50 \
--num_classes 40 \
--epochs 50 \
--batch_size 32

Video pipeline:

Video Clip (16 frames)
↓
Frame Preprocessing (112×112)
↓
3D CNN (MC3-18 / R3D-18)
↓
Temporal Convolutions
↓
Action Classification (101 classes)
Key features:
- Spatiotemporal convolutions
- Temporal modeling across 16 frames
- Pretrained on Kinetics-400
- Fine-tuned on UCF-101
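A minimal sketch of that transfer-learning recipe with stock torchvision (assumes torchvision ≥ 0.13 for the `weights` argument):

```python
import torch.nn as nn
from torchvision.models.video import mc3_18

# Kinetics-400 pretrained backbone; swap the head for UCF-101's 101 classes
model = mc3_18(weights="KINETICS400_V1")
model.fc = nn.Linear(model.fc.in_features, 101)
```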
Image pipeline:

Image
↓
MediaPipe Pose Detection → 33 keypoints
↓
Pose Classifier → sitting/standing/lying
↓
2D CNN (ResNet50) → Action features
↓
Action Classification (40 classes)
Key features:
- Dual-stream: pose + appearance
- Real-time 90 FPS inference
- Geometric pose reasoning
- Any timm model backbone
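A toy sketch of the dual-stream idea: MediaPipe supplies normalized keypoints for a geometric pose rule, while a timm backbone handles appearance. The threshold-based rule below is purely illustrative, not the repo's actual classifier:

```python
import timm
import mediapipe as mp

mp_pose = mp.solutions.pose

def classify_pose(landmarks):
    """Toy geometric rule on normalized keypoints (y increases downward)."""
    shoulder = landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER].y
    hip = landmarks[mp_pose.PoseLandmark.LEFT_HIP].y
    knee = landmarks[mp_pose.PoseLandmark.LEFT_KNEE].y
    if abs(shoulder - hip) < 0.10:  # torso roughly horizontal
        return "lying"
    return "sitting" if abs(hip - knee) < 0.15 else "standing"

# Appearance stream: any timm backbone with a 40-class head
backbone = timm.create_model("resnet50", pretrained=True, num_classes=40)
```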
| Model | Inference Time | FPS | Batch Size |
|---|---|---|---|
| MC3-18 | 33ms | 30 | 1 |
| R3D-18 | 25ms | 40 | 1 |
| Pipeline Stage | Time | FPS |
|---|---|---|
| MediaPipe Pose | 5ms | - |
| ResNet50 CNN | 6ms | - |
| Total | 11ms | 90 |
Comparison: v1.0 (TF 1.13 + OpenPose) = 1400ms → v2.0 (PyTorch + MediaPipe) = 11ms (127× faster)
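To reproduce rough latency figures on your own hardware, a minimal timing sketch with a synthetic clip (numbers vary with GPU and batch size):

```python
import time
import torch
from torchvision.models.video import mc3_18

model = mc3_18().eval().cuda()
clip = torch.randn(1, 3, 16, 112, 112, device="cuda")

with torch.no_grad():
    for _ in range(10):  # warmup
        model(clip)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(clip)
    torch.cuda.synchronize()
latency = (time.perf_counter() - start) / 100
print(f"{latency * 1e3:.1f} ms/clip ≈ {1 / latency:.0f} FPS")
```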
- Sports analysis - Classify basketball, soccer, swimming actions
- Surveillance - Detect abnormal behavior in videos
- Fitness tracking - Recognize workout exercises
- Content moderation - Auto-tag video content
- Fitness coaching - Analyze workout form
- Healthcare - Fall detection, mobility monitoring
- Autonomous vehicles - Pedestrian intent prediction
- Gaming/VR - Body-based game controls
| Dataset | Classes | Videos | Use Case | Models Available |
|---|---|---|---|---|
| UCF-101 | 101 | 13,320 | Video temporal | MC3-18, R3D-18 ✅ |
| Stanford40 | 40 | 9,532 | Image pose-based | ResNet50, MobileNet ✅ |
| Kinetics-400 | 400 | 306K | Pretraining | - |
101 human actions including:
- Sports: Basketball, Soccer, Swimming, Tennis, Volleyball
- Music: Playing Drums, Guitar, Piano, Violin
- Activities: Cooking, Gardening, Typing, Writing
- Body motion: Walking, Running, Jumping, Lunging
40 common human activities:
- applauding, climbing, cooking, cutting_trees, drinking
- fishing, gardening, playing_guitar, pouring_liquid, etc.
- Model Zoo - All available models
- Deployment - Deployment documentation
- Contributing - How to contribute
- Image classification with pose estimation
- Video classification with 3D CNNs
- Published models on HuggingFace
- Training pipelines for both modalities
- CLI and Python API
- Web demo with Gradio
- Two-stream fusion (spatial + temporal)
- Real-time video demo
- HuggingFace Spaces deployment
- ONNX export for production
- Mobile deployment guides
- TensorRT optimization
- Additional datasets (Kinetics, AVA)
- Multi-person action detection
- Action localization in videos
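On the ONNX roadmap item: exporting the 3D CNNs should work with stock PyTorch, since Conv3d ops are ONNX-supported. A sketch (untested against this repo's checkpoints):

```python
import torch
from torchvision.models.video import mc3_18

model = mc3_18().eval()
dummy = torch.randn(1, 3, 16, 112, 112)  # (N, C, T, H, W) sample input
torch.onnx.export(
    model, dummy, "mc3_18.onnx",
    opset_version=13,
    input_names=["clip"], output_names=["logits"],
)
```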
We welcome contributions! High-impact areas:
- 🎥 Video demos - Create GIFs/videos showing real-time inference
- 📱 Mobile deployment - iOS/Android guides
- 🚀 Model improvements - Train on Kinetics, optimize architectures
- 📚 Documentation - Tutorials, examples, notebooks
- 🐛 Bug fixes - Always appreciated!
See CONTRIBUTING.md for setup and guidelines.
If you use this work, please cite:
@software{saksena2025hac,
  author = {Saksena, Saumya Kumaar},
  title = {Human Action Classification: Image and Video Understanding},
  year = {2025},
  url = {https://github.com/dronefreak/human-action-classification}
}

Video Models (MC3-18, R3D-18):
@inproceedings{tran2018closer,
  title={A closer look at spatiotemporal convolutions for action recognition},
  author={Tran, Du and Wang, Heng and Torresani, Lorenzo and Ray, Jamie and LeCun, Yann and Paluri, Manohar},
  booktitle={CVPR},
  year={2018}
}

Datasets:

@article{soomro2012ucf101,
  title={UCF101: A dataset of 101 human actions classes from videos in the wild},
  author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
  journal={arXiv preprint arXiv:1212.0402},
  year={2012}
}

@inproceedings{yao2011human,
  title={Human action recognition by learning bases of action attributes and parts},
  author={Yao, Bangpeng and Jiang, Xiaoye and Khosla, Aditya and Lin, Andy L. and Guibas, Leonidas and Fei-Fei, Li},
  booktitle={ICCV},
  year={2011}
}
- MediaPipe - Google's pose estimation framework
- timm - Ross Wightman's model library
- PyTorch - Deep learning framework
- UCF-101 & Stanford40 - Dataset creators
- Original repo contributors - 233+ stars!
Author: Saumya Kumaar Saksena
GitHub: @dronefreak
Models: HuggingFace
Apache License 2.0 - See LICENSE for details.