A PyTorch-based Convolutional Neural Network for real-time hand detection from webcam feeds. The model classifies images as either containing a hand or not containing a hand.
This project trains a custom CNN to distinguish between images with hands and images without hands, then deploys the model for real-time inference using a webcam.
- Custom CNN architecture with 4 convolutional blocks
- Data augmentation for improved generalization
- Real-time webcam inference
- Batch normalization and dropout for regularization
- Performance metrics including precision, recall, and confusion matrix
torch
torchvision
numpy
seaborn
matplotlib
pandas
opencv-python (cv2)
pillow
scikit-learn
Install dependencies:
pip install torch torchvision numpy seaborn matplotlib pandas opencv-python pillow scikit-learnThe code expects the following directory structure:
hands/
├── dataset1/
│ ├── train/
│ │ └── images/
│ │ ├── *.jpg
│ │ └── *.png
│ └── test/
│ └── images/
│ ├── *.jpg
│ └── *.png
├── dataset2/
│ └── ...
└── ...
Negatives/
├── dataset1/
│ └── images/
│ ├── *.jpg
│ └── *.png
│ └── NoObject/
│ └── *.png
└── ...
hands/: Contains datasets with hand imagesNegatives/: Contains datasets without hand images
This project uses a combination of real and synthetic hand image datasets to train and evaluate a binary hand detection model (hand vs no-hand).
A general-purpose hand detection dataset with bounding box annotations across different poses, lighting conditions, and backgrounds. Used as the primary dataset for training the detector.
https://universe.roboflow.com/yolov4tiny-wzb2k/hand-bo3ot
Contains annotated images of classical Indian hand gestures (mudras). Used to add diversity in hand shapes, orientations, and fine-grained poses.
https://universe.roboflow.com/mudras-avdrb/bharatanatyam-mudras-fg9qo-gcruc
A collection of hand gesture images used for additional data exploration and optional augmentation.
https://universe.roboflow.com/horyzn-qhfq4/hand-gesture-gizg2
A ready-to-use dataset in VOC/YOLO format for standard object detection training and benchmarking.
https://www.kaggle.com/datasets/nomihsa965/hand-detection-dataset-vocyolo-format/data
A synthetic dataset used to improve generalization and robustness, especially for rare poses and edge cases.
https://www.kaggle.com/datasets/zeyadkhalid/hand-detection
These datasets together provide a mix of real-world variability and synthetic augmentation, helping the model generalize better across different environments, poses, and lighting conditions.
HandOrNotCNN:
- 4 convolutional blocks (3→32→64→128→128 channels)
- Each block: Conv2d → BatchNorm → ReLU → MaxPool
- Fully connected layers: 128×7×7 → 64 → 32 → 2
- Dropout (0.5) for regularization
Input: 224×224 RGB images
Output: Binary classification (hand/no_hand)
The model uses:
- Loss: CrossEntropyLoss
- Optimizer: Adam (lr=0.001)
- Data Augmentation: Random horizontal flip, rotation, color jitter
- Normalization: ImageNet mean/std
- Train/Dev Split: 90/10 for negative samples
Run training:
python train.py # or run the notebook cellsTraining outputs:
- Epoch-wise train/dev loss and accuracy
- Loss history plot
- Saved model:
hand_model.pth
The model is evaluated on:
- Accuracy: Overall classification accuracy
- Precision: Correct hand predictions / total hand predictions
- Recall: Correct hand predictions / actual hands
- Confusion Matrix: Visualization of true/false positives/negatives
Run the webcam classifier:
python webcam_inference.py # or run the relevant notebook cellsControls:
- Press
qto quit
The script:
- Loads the trained model (
better_hand_class.pth) - Opens the default webcam
- Processes each frame through the model
- Displays prediction and confidence on screen
Edit these variables in the inference section:
MODEL_PATH = "better_hand_class.pth" # Path to saved model
CLASS_NAMES = ["no_hand", "hand"] # Class labels
INPUT_SIZE = 224 # Input image size
USE_GPU = torch.mps.is_available() # Enable GPU (MPS for Mac, CUDA for others)Metrics:
- Training Accuracy: ~96%
- Dev Accuracy: ~98%
- Precision: ~97%
- Recall: ~99%
Possible improvements and extensions for this project include:
-
Temporal modeling for video streams
Add temporal smoothing or sequence models (e.g., moving average, LSTM, or 3D CNN) to improve stability in video inference. -
Domain adaptation and robustness
Improve generalization to:- Different skin tones
- Gloves, accessories, and occlusions
- Extreme lighting or low-quality cameras
-
Active learning loop
Allow users to correct predictions and continuously improve the model with new labeled data. -
Explainability and debugging tools
Add Grad-CAM or feature visualization to better understand what the model focuses on during prediction. -
Benchmark against pretrained models
Compare performance with pretrained backbones like MobileNet, ResNet, or EfficientNet.
These enhancements would evolve the project from a learning-focused prototype into a more robust, scalable, and production-ready vision system.
hand_model.pth/better_hand_class.pth: Trained model weightsconfusion_matrix.png: Confusion matrix visualization
- The code uses MPS (Metal Performance Shaders) for Mac GPU acceleration
- For NVIDIA GPUs, change device detection to use CUDA
- Adjust batch size based on available memory
- The 90/10 train/dev split is applied only to negative samples (dev ratio = 0.1)
Camera not opening:
- Check if another application is using the webcam
- Try changing
cv2.VideoCapture(0)tocv2.VideoCapture(1)for external cameras
Out of memory errors:
- Reduce batch size (default: 32)
- Reduce image size (default: 224×224)
Low accuracy:
- Increase training epochs
- Add more diverse training data
- Adjust learning rate or augmentation parameters
This project is intended for educational purposes. A license can be added later if the repository is extended or shared for reuse.
Abhi Learning-focused Machine Learning & Python projects