✨ This repository contains the project for the EEEM068 module ✨
This project explores action recognition in videos using Vision Transformers, with a focus on the TimeSformer architecture. By evaluating various frame sampling strategies, augmentation techniques, and model configurations on the HMDB_simp dataset, the study achieves a top accuracy of 90.3% and demonstrates the effectiveness of transformer-based approaches for capturing spatiotemporal patterns in video data.
- Python 3.10
- Jupyter Notebook
The `data_exploration` folder contains data-analysis notebooks and visualisation tools, and the `src` folder contains the model, data-loading, sampling, and augmentation code.
```
├── data_exploration
│   ├── clean_data.ipynb
│   ├── frame-filtering.ipynb
│   ├── GradCAM2.ipynb
│   ├── Statistics.ipynb
│   └── ConfusionMatr.ipynb
├── environment.yml
├── README.md
├── src
│   ├── augmentations.py
│   ├── data.py
│   ├── model.py
│   └── sampling.py
└── train.py
```
- Clone this repository: `git clone https://github.com/Elisa-tea/EEEM068.git`
- Install dependencies:

```shell
pip install \
    torch torchvision \
    albumentations albucore \
    scikit-learn matplotlib pandas tqdm ipykernel \
    fastapi uvicorn \
    transformers datasets evaluate \
    gradio wandb accelerate torchmetrics \
    simsimd stringzilla tf-keras
```
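To verify the installation before training, a quick sanity check can be run. This is a minimal sketch (not part of the repository) that reports which of the core packages are importable:

```python
import importlib.util

# Core packages the training and notebook code depends on.
required = ["torch", "torchvision", "transformers", "albumentations", "wandb"]

# find_spec returns None when a package is not installed.
missing = [name for name in required if importlib.util.find_spec(name) is None]

print("All set!" if not missing else f"Missing packages: {missing}")
```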
- For `GradCAM2.ipynb` (and optionally for `ConfusionMatr.ipynb`), download the trained model from https://drive.google.com/file/d/1fIcNd6_-NC39UQeRq2SRSY-Iqh-B_Fp2/view?usp=sharing and extract the files into the notebook's working directory (the file panel on the left in Jupyter/Colab).
Run `train.py` to train the model. For example, for fixed-step sampling and a clip length of 8, run the following command in the terminal:

```shell
python train.py --sampler fixed_step --frame_step 8 --clip_length 8 \
    --train_batch_size 4 --lr 0.00001 --weight_decay 0.095 \
    --use_augmentations \
    --train_dataset_path /path_to/HMDB_simp_clean \
    --val_dataset_path /path_to/HMDB_simp_clean
```

The `--use_augmentations` flag is optional. Results and logs will show on wandb.
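The fixed-step sampler selects `clip_length` frames spaced `frame_step` frames apart. A minimal sketch of the index selection (the actual implementation in `src/sampling.py` may differ, e.g. in how it handles short videos):

```python
def fixed_step_indices(num_frames, clip_length=8, frame_step=8):
    """Pick `clip_length` frame indices spaced `frame_step` apart,
    starting at frame 0. Indices are clamped to the last frame when
    the video is shorter than clip_length * frame_step."""
    return [min(i * frame_step, num_frames - 1) for i in range(clip_length)]
```

For a 100-frame video with the defaults, this yields indices `[0, 8, 16, 24, 32, 40, 48, 56]`; videos shorter than the clip span repeat the final frame.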
The HMDB_simp dataset includes 1,250 videos - 50 videos in each of the 25 categories. Each subfolder of the dataset corresponds to a different action category. The dataset used in this project is HMDB_simp_clean, which is a cleaned version of HMDB_simp with the duplicated frames removed. To get this dataset:
- Open the `data_exploration/clean_data.ipynb` file.
- Run the first "Clean Data" section in the file. This creates a cleaned dataset called `HMDB_simp_clean` with the duplicate frames removed.
- The rest of the notebook contains checks and visuals comparing the raw and cleaned datasets.
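Conceptually, the duplicate-frame removal can be sketched as hashing each frame's raw bytes and keeping only the first occurrence. This is an illustrative sketch, not the notebook's actual code, which may detect duplicates differently (e.g. by pixel-wise comparison):

```python
import hashlib

def drop_duplicate_frames(frames):
    """Drop frames that are byte-for-byte identical to an earlier frame.
    `frames` is a list of raw frame bytes (e.g. encoded image data)."""
    seen = set()
    kept = []
    for frame in frames:
        digest = hashlib.md5(frame).hexdigest()
        if digest not in seen:     # first time we see this exact frame
            seen.add(digest)
            kept.append(frame)
    return kept
```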
This project is licensed under the MIT License - see the LICENSE file for details.
We would like to thank all group members for their contributions to this project: