This project is an Image Action Recognition system built on the Stanford40 action dataset. Users interact with the system through a Graphical User Interface (GUI) to recognize actions performed in images; a Vision Transformer (ViT) serves as the base model. The Stanford40 dataset contains images of 40 different human action classes, such as "applauding", "fishing", and "holding an umbrella".
- Features
- Requirements
- Installation
- Demo Images
- Usage
- Training the Model
- Accuracy
- Model
- Project Structure
- Dataset
- License
- Acknowledgments
- Contributing
- Contact
- Recognize actions in images using pre-trained models.
- Interact with the system through a user-friendly GUI.
- Display the recognized action class along with the input image.
- Python 3.x
- Tkinter library for GUI (usually comes pre-installed with Python)
- Clone the repository to your local machine:
$ git clone https://github.com/ARHPA/Image-Action-Recognition.git
- Navigate to the project directory:
$ cd Image-Action-Recognition
- Install the required dependencies:
$ pip install -r requirements.txt
1. Run the GUI application:
$ python GUI.py
2. The GUI will open, allowing you to interact with the system.
3. To analyze an image, click the "Select Image" button and choose an image from your local system.
4. Click the "Analyze image" button to initiate the recognition process.
5. The system will display the recognized action class below the image.
6. Repeat steps 3 to 5 to analyze more images.
If you want to train this model on your own, follow these instructions (after Installation):
- Download the pre-trained model weights:
$ wget https://storage.googleapis.com/vit_models/imagenet21k+imagenet2012/ViT-B_16.npz
- Download and unzip the dataset:
$ wget http://vision.stanford.edu/Datasets/Stanford40.zip
$ unzip Stanford40.zip
- Adjust config.json as needed, then run train.py:
$ python train.py
Alternatively, you can simply run the train.ipynb notebook.
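Training behavior is controlled through config.json. The sketch below follows the general pytorch-template convention; the field names and values are illustrative assumptions, not the project's actual file:

```json
{
  "name": "Stanford40_ViT",
  "n_gpu": 1,
  "arch": { "type": "VisionTransformer", "args": { "num_classes": 40 } },
  "data_loader": {
    "type": "Stanford40DataLoader",
    "args": {
      "data_dir": "Stanford40/",
      "batch_size": 32,
      "shuffle": true,
      "validation_split": 0.1,
      "num_workers": 2
    }
  },
  "optimizer": { "type": "SGD", "args": { "lr": 0.003, "momentum": 0.9 } },
  "metrics": ["accuracy", "top_k_acc"],
  "trainer": { "epochs": 20, "save_dir": "saved/", "monitor": "max val_accuracy" }
}
```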
After training, the image action recognition model achieved the following accuracy on the Stanford40 test set:
- Test Accuracy: 82%
- Top 2 Accuracy: 90%
These accuracy metrics demonstrate the effectiveness of the trained model in recognizing human action classes in images from the Stanford40 dataset.
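Top-2 accuracy counts a prediction as correct when the true label appears among the model's two highest-scoring classes. A minimal, framework-free sketch of the metric (the scores and labels below are made-up toy data):

```python
def top_k_accuracy(scores, labels, k=2):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists (one row per sample)
    labels: list of true class indices
    """
    correct = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores in this row
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        correct += label in top_k
    return correct / len(labels)

# toy example: 3 samples, 4 classes
scores = [
    [0.10, 0.70, 0.15, 0.05],  # top-2 = classes {1, 2}
    [0.30, 0.40, 0.20, 0.10],  # top-2 = classes {1, 0}
    [0.25, 0.20, 0.45, 0.10],  # top-2 = classes {2, 0}
]
labels = [1, 0, 3]

print(top_k_accuracy(scores, labels, k=1))  # 1/3: only the first sample's argmax matches
print(top_k_accuracy(scores, labels, k=2))  # 2/3: the first two labels fall in the top 2
```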
The image action recognition model used in this project is based on the Vision Transformer (ViT) architecture, a state-of-the-art deep learning model for image classification tasks.
The ViT implementation used in this project is based on the ViT-pytorch repository by jeonsworld. The pre-trained ViT model from that repository is fine-tuned on the Stanford40 dataset for action recognition.
For more details on the architecture and implementation of the ViT model, please refer to the ViT-pytorch repository.
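To illustrate the ViT input pipeline: the ViT-B/16 variant splits a 224×224 image into non-overlapping 16×16 patches, flattens each patch, and linearly projects it into a 768-dimensional token, prepending one [CLS] token for classification. The sequence length works out as follows:

```python
image_size = 224     # ViT-B/16 default input resolution
patch_size = 16      # the "16" in ViT-B_16 refers to 16x16 patches
channels = 3         # RGB
hidden_dim = 768     # ViT-Base embedding dimension

num_patches = (image_size // patch_size) ** 2           # 14 * 14 = 196
patch_vector_len = patch_size * patch_size * channels   # 16 * 16 * 3 = 768
seq_len = num_patches + 1                               # +1 for the [CLS] token

print(num_patches, patch_vector_len, seq_len)  # 196 768 197
```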
The project follows the pytorch-template layout, a well-organized and scalable project structure for deep learning projects. Below is a brief overview:
pytorch-template/
│
├── train.py - main script to start training
├── test.py - evaluation of trained model
├── GUI.py - Graphical User Interface of project
│
├── config.json - holds configuration for training
├── parse_config.py - class to handle config file and cli options
│
├── train.ipynb - train notebook
├── model_best.pth - best model state dicts
│
├── base/ - abstract base classes
│ ├── base_data_loader.py
│ ├── base_model.py
│ └── base_trainer.py
│
├── data_loader/ - create dataset and data loader
│ └── data_loaders.py
│
├── model/ - models, losses, and metrics
│ ├── model.py
│ ├── metric.py
│ └── loss.py
│
├── trainer/ - trainers
│ └── trainer.py
│
├── logger/ - module for tensorboard visualization and logging
│ ├── visualization.py
│ ├── logger.py
│ └── logger_config.json
│
└── utils/ - small utility functions
├── util.py
└── ...
The Stanford40 dataset used in this project contains images of human action classes and their corresponding annotations. You can find more information and access the dataset from the official Stanford40 website: Stanford40 Dataset.
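Stanford40 image filenames encode the action label, e.g. applauding_001.jpg, and the ImageSplits/ folder lists the train/test filenames. A small sketch (a hypothetical helper, not part of this repo) that recovers the label from a filename:

```python
def label_from_filename(filename):
    """Strip the trailing _NNN.jpg index to recover the action class name."""
    stem = filename.rsplit(".", 1)[0]   # drop the extension
    return stem.rsplit("_", 1)[0]       # drop the numeric suffix

print(label_from_filename("applauding_001.jpg"))           # applauding
print(label_from_filename("holding_an_umbrella_123.jpg"))  # holding_an_umbrella
```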
This project is licensed under the MIT License.
- The Stanford40 dataset was created and made publicly available by the Stanford Vision Lab.
- The ViT implementation used in this project was developed by jeonsworld.
Contributions are welcome! If you find any issues or want to enhance the project, feel free to submit a pull request.
For any inquiries or feedback, please contact ARHPA00@gmail.com.