This repository provides a hands-on example of classifying hand gestures based on hand pose data extracted from video frames. It was originally developed as part of the ESSLLI 2025 lecture: 👉 https://github.com/aluecking/ESSLLI2025
The example focuses on human–object interaction gestures, demonstrating a lightweight, end-to-end pipeline — from data preparation to live action recognition.
A small subset of the Moments in Time dataset is used, limited to the following action classes:
- cycling
- running
- drinking
- eating
The extracted hand pose data (computed via MMPose) is included in the repository under:
`data/results_hands`
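The pose data can, in principle, be regenerated with MMPose's high-level inference API. The following is a rough sketch only; the `'hand'` model alias, the frame path, and the output directory layout are assumptions for illustration, not the exact commands used to produce the bundled results.

```python
# Rough sketch: extracting 2D hand keypoints with MMPose's high-level API.
# The 'hand' model alias, the frame path, and pred_out_dir are assumptions
# made for illustration; the bundled results may have been produced differently.
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer('hand')  # loads a default hand keypoint model

result_generator = inferencer(
    'data/frames/drinking/frame_0001.jpg',  # hypothetical extracted video frame
    pred_out_dir='data/results_hands',      # keypoint JSON files are written here
)
result = next(result_generator)
# result['predictions'] contains the detected hand instances with their
# 21 (x, y) keypoints; the exact nesting depends on the MMPose version.
print(result['predictions'])
```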
We compute a set of interpretable, low-dimensional features from hand pose data (see HandPoseFeatureGenerator.py).
These features are designed to capture finger extension, pinch behavior, and hand configuration, which help distinguish between actions such as eating and drinking.
Note: The `HandPoseFeatureGenerator` class was initially generated using Claude.ai for demonstration purposes.
| Feature Name | Description |
|---|---|
| `thumb_extension` | Normalized distance from thumb base to tip (0 = curled, 1 = fully extended) |
| `index_extension` | Normalized distance from index base to tip |
| `middle_extension` | Normalized distance from middle finger base to tip |
| `ring_extension` | Normalized distance from ring finger base to tip |
| `pinky_extension` | Normalized distance from pinky base to tip |
| `fingers_extended_count` | Number of fingers considered “extended” (extension > 0.5) |
| `avg_finger_extension` | Average of all finger extension ratios |
| `pinch_distance` | Euclidean distance between thumb and index fingertips (in pixels) |
| `is_pinching` | Binary indicator (1 if pinch distance < 30 px, else 0) |
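To make the table concrete, the sketch below computes two of these features from a single 21-keypoint hand detection (0 = wrist, four joints per finger, the layout used by MMPose hand models). It is an illustration only; the keypoint indices and the normalization choice are assumptions, and the repository's `HandPoseFeatureGenerator` may differ.

```python
import numpy as np

# Keypoint indices in the standard 21-point hand layout (an assumption about
# the data format): 0 = wrist, 4 = thumb tip, 5 = index base (MCP), 8 = index tip.
WRIST, THUMB_TIP, INDEX_MCP, INDEX_TIP = 0, 4, 5, 8

def index_extension(kpts: np.ndarray) -> float:
    """Base-to-tip distance of the index finger, normalized by hand size.

    `kpts` is a (21, 2) array of pixel coordinates. Normalizing by the
    wrist-to-index-base distance is an illustrative choice, not necessarily
    the one used in HandPoseFeatureGenerator.py.
    """
    hand_size = np.linalg.norm(kpts[INDEX_MCP] - kpts[WRIST]) + 1e-6
    tip_to_base = np.linalg.norm(kpts[INDEX_TIP] - kpts[INDEX_MCP])
    return float(np.clip(tip_to_base / hand_size, 0.0, 1.0))

def is_pinching(kpts: np.ndarray, threshold_px: float = 30.0) -> int:
    """1 if the thumb and index fingertips are closer than `threshold_px` pixels."""
    pinch_distance = np.linalg.norm(kpts[THUMB_TIP] - kpts[INDEX_TIP])
    return int(pinch_distance < threshold_px)
```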
This experiment uses a simple neural classification pipeline based on scikit-learn:
StandardScaler → MLPClassifier
- `StandardScaler`: normalizes features to zero mean and unit variance, ensuring equal contribution from all features.
- `MLPClassifier`: a lightweight feedforward neural network that learns nonlinear relationships between hand pose features and gesture labels.
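A minimal sketch of this pipeline on synthetic data is shown below; the hidden-layer sizes and other hyperparameters are illustrative assumptions, not the settings used in the repository.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# X: one row of hand pose features per frame, y: gesture label per frame.
# Random data stands in for the real features here.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 9))  # 9 features, as in the table above
y = rng.choice(["cycling", "running", "drinking", "eating"], size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(
    StandardScaler(),                                                     # zero mean, unit variance
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```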
```
conda create -n ubtt python=3.8
conda activate ubtt
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 cpuonly -c pytorch
pip install -U openmim
mim install mmengine
pip install "mmcv==2.1.0"
mim install "mmdet==3.2.0"
mim install "mmpose==1.3.2"
pip install -r requirements.txt
```

Main entry point for the hands-on example. Runs the full pipeline:
- Prepare the dataset
- Extract hand pose data
- Compute pose-based features
- Train and evaluate a gesture classifier
`HandPoseFeatureGenerator.py` defines the feature extraction class that converts raw hand pose keypoints into numerical descriptors for classification.
Implements a real-time gesture recognizer using webcam input and the trained model.
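A simplified outline of such a loop is sketched below, assuming the trained pipeline was saved with joblib and using hypothetical `detect_hand_keypoints` / `compute_pose_features` helpers in place of the repository's actual pose-estimation and feature code.

```python
import cv2
import joblib  # assumption: the trained scikit-learn pipeline was persisted with joblib

# Hypothetical model file and helper functions; the repository's webcam script
# may load the model and extract keypoints differently.
clf = joblib.load("gesture_classifier.joblib")

cap = cv2.VideoCapture(0)  # open the default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    keypoints = detect_hand_keypoints(frame)         # hypothetical pose-estimation call
    if keypoints is not None:
        features = compute_pose_features(keypoints)  # hypothetical feature computation
        label = clf.predict([features])[0]
        cv2.putText(frame, label, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)

    cv2.imshow("Gesture recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):            # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()
```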
TODO