This project is an Image Action Recognition system built on the Stanford40 action dataset. Users interact with the system through a Graphical User Interface (GUI) to recognize actions performed in images; a Vision Transformer (ViT) serves as the base model. The Stanford40 dataset contains images of 40 different human action classes, such as "applauding", "fishing", and "holding an umbrella".
- Features
- Requirements
- Installation
- Demo Images
- Usage
- Training the Model
- Accuracy
- Model
- Project Structure
- Dataset
- License
- Acknowledgments
- Contributing
- Contact
- Recognize actions in images using pre-trained models.
- Interact with the system through a user-friendly GUI.
- Display the recognized action class along with the input image.
- Python 3.x
- Tkinter library for GUI (usually comes pre-installed with Python)
- Clone the repository to your local machine:
$ git clone https://github.com/ARHPA/Image-Action-Recognition.git
- Navigate to the project directory:
$ cd Image-Action-Recognition
- Install the required dependencies:
$ pip install -r requirements.txt
1. Run the GUI application:
$ python GUI.py
2. The GUI will open, allowing you to interact with the system.
3. To analyze an image, click the "Select Image" button and choose an image from your local system.
4. Click the "Analyze image" button to initiate the recognition process.
5. The system will display the recognized action class below the image.
6. Repeat steps 3 to 5 to analyze more images.
If you want to train this model on your own, follow these instructions (after Installation):
- Download the pre-trained model weights:
$ wget https://storage.googleapis.com/vit_models/imagenet21k+imagenet2012/ViT-B_16.npz
- Download and unzip the dataset:
$ wget http://vision.stanford.edu/Datasets/Stanford40.zip
$ unzip Stanford40.zip
- Adjust config.json as needed, then run train.py:
$ python train.py
Alternatively, you can simply run the train.ipynb notebook.
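Training behavior is controlled through config.json. The sketch below follows the general pytorch-template convention; the field names and values are illustrative assumptions, not the project's actual file:

```json
{
  "name": "Stanford40_ViT",
  "n_gpu": 1,
  "arch": { "type": "VisionTransformer", "args": { "num_classes": 40 } },
  "data_loader": {
    "type": "Stanford40DataLoader",
    "args": {
      "data_dir": "Stanford40/",
      "batch_size": 32,
      "shuffle": true,
      "validation_split": 0.1,
      "num_workers": 2
    }
  },
  "optimizer": { "type": "SGD", "args": { "lr": 0.003, "momentum": 0.9 } },
  "metrics": ["accuracy", "top_k_acc"],
  "trainer": { "epochs": 20, "save_dir": "saved/", "monitor": "max val_accuracy" }
}
```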
After training, the image action recognition model achieved the following accuracy on the Stanford40 test set:
- Test Accuracy: 82%
- Top 2 Accuracy: 90%
These accuracy metrics demonstrate the effectiveness of the trained model in recognizing human action classes in images from the Stanford40 dataset.
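Top-2 accuracy counts a prediction as correct when the true label appears among the model's two highest-scoring classes. A minimal, framework-free sketch of the metric (the scores and labels below are made-up toy data):

```python
def top_k_accuracy(scores, labels, k=2):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists (one row per sample)
    labels: list of true class indices
    """
    correct = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores in this row
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        correct += label in top_k
    return correct / len(labels)

# toy example: 3 samples, 4 classes
scores = [
    [0.10, 0.70, 0.15, 0.05],  # top-2 = classes {1, 2}
    [0.30, 0.40, 0.20, 0.10],  # top-2 = classes {1, 0}
    [0.25, 0.20, 0.45, 0.10],  # top-2 = classes {2, 0}
]
labels = [1, 0, 3]

print(top_k_accuracy(scores, labels, k=1))  # 1/3: only the first sample's argmax matches
print(top_k_accuracy(scores, labels, k=2))  # 2/3: the first two labels fall in the top 2
```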
The image action recognition model used in this project is based on the Vision Transformer (ViT) architecture, a state-of-the-art deep learning model for image classification tasks.
The ViT implementation used in this project is based on the ViT-pytorch repository by jeonsworld. The pre-trained ViT model from that repository is fine-tuned on the Stanford40 dataset for action recognition.
For more details on the architecture and implementation of the ViT model, please refer to the ViT-pytorch repository.
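To illustrate the ViT input pipeline: the ViT-B/16 variant splits a 224×224 image into non-overlapping 16×16 patches, flattens each patch, and linearly projects it into a 768-dimensional token, prepending one [CLS] token for classification. The sequence length works out as follows:

```python
image_size = 224     # ViT-B/16 default input resolution
patch_size = 16      # the "16" in ViT-B_16 refers to 16x16 patches
channels = 3         # RGB
hidden_dim = 768     # ViT-Base embedding dimension

num_patches = (image_size // patch_size) ** 2           # 14 * 14 = 196
patch_vector_len = patch_size * patch_size * channels   # 16 * 16 * 3 = 768
seq_len = num_patches + 1                               # +1 for the [CLS] token

print(num_patches, patch_vector_len, seq_len)  # 196 768 197
```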
The project follows the pytorch-template layout, a well-organized and scalable project structure for deep learning projects. Below is a brief overview:
pytorch-template/
│
├── train.py - main script to start training
├── test.py - evaluation of trained model
├── GUI.py - Graphical User Interface of project
│
├── config.json - holds configuration for training
├── parse_config.py - class to handle config file and cli options
│
├── train.ipynb - train notebook
├── model_best.pth - best model state dicts
│
├── base/ - abstract base classes
│ ├── base_data_loader.py
│ ├── base_model.py
│ └── base_trainer.py
│
├── data_loader/ - create dataset and data loader
│ └── data_loaders.py
│
├── model/ - models, losses, and metrics
│ ├── model.py
│ ├── metric.py
│ └── loss.py
│
├── trainer/ - trainers
│ └── trainer.py
│
├── logger/ - module for tensorboard visualization and logging
│ ├── visualization.py
│ ├── logger.py
│ └── logger_config.json
│
└── utils/ - small utility functions
├── util.py
└── ...
The Stanford40 dataset used in this project contains images of human action classes and their corresponding annotations. You can find more information and access the dataset from the official Stanford40 website: Stanford40 Dataset.
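Stanford40 image filenames encode the action label, e.g. applauding_001.jpg, and the ImageSplits/ folder lists the train/test filenames. A small sketch (a hypothetical helper, not part of this repo) that recovers the label from a filename:

```python
def label_from_filename(filename):
    """Strip the trailing _NNN.jpg index to recover the action class name."""
    stem = filename.rsplit(".", 1)[0]   # drop the extension
    return stem.rsplit("_", 1)[0]       # drop the numeric suffix

print(label_from_filename("applauding_001.jpg"))           # applauding
print(label_from_filename("holding_an_umbrella_123.jpg"))  # holding_an_umbrella
```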
This project is licensed under the MIT License.
- The Stanford40 dataset was created and made publicly available by the Stanford Vision Lab.
- The ViT implementation used in this project was developed by jeonsworld.
Contributions are welcome! If you find any issues or want to enhance the project, feel free to submit a pull request.
For any inquiries or feedback, please contact ARHPA00@gmail.com.