
Image Captioning Project

This project implements an image captioning model that pairs a CNN encoder (ResNet-50) with a Transformer-based decoder. It is built with PyTorch and designed for flexibility and ease of use, supporting training, evaluation, and caption prediction for new images.

Table of Contents

  • Project Overview
  • Features
  • Workflow
  • Setup
  • Configuration
  • Usage
  • Web Application (Optional)
  • Dataset
  • Model Architecture
  • Logging
  • Dependencies
  • Author

Project Overview

The core goal of this project is to generate descriptive captions for input images. It leverages a pre-trained ResNet-50 to extract visual features from images and a Transformer decoder to generate textual descriptions based on these features. The entire pipeline, from data loading to model inference, is configurable through a central JSON file.

Features

  • Encoder-Decoder Architecture: Uses ResNet-50 for image encoding and a Transformer for caption decoding.
  • Pre-trained Encoder: Option to use pre-trained ResNet-50 weights and fine-tune or freeze the encoder.
  • Configurable Parameters: Easily manage hyperparameters, paths, and settings via config.json.
  • Vocabulary Management: Builds and saves/loads a vocabulary from training captions.
  • Training Pipeline: Supports training from scratch or resuming from checkpoints, with validation and early stopping.
  • Evaluation: Implements standard COCO evaluation metrics (BLEU, METEOR, ROUGE-L, CIDEr, SPICE) using pycocoevalcap.
  • Prediction: Allows caption generation for single or multiple custom images, or examples from the validation set.
  • Logging: Comprehensive logging for training, evaluation, and prediction processes.
  • Dependency Management: Uses PDM for managing Python dependencies.

Workflow

  1. Configuration: Define paths, hyperparameters, and other settings in config.json.
  2. Data Preparation: The dataset.py script handles loading images and COCO-style captions. A vocabulary is built using vocabulary.py.
  3. Training: Run train.py to train the model. Progress is logged, and checkpoints (including the best model) are saved.
  4. Evaluation: Use evaluate.py to assess the trained model's performance on a test/validation set using COCO metrics. Results are saved.
  5. Prediction: Employ predict.py to generate captions for new images.

Setup

Prerequisites

  • Python 3 with pip (used to install PDM below)
  • Git (to clone the repository)

Installation

  1. Clone the repository (if you haven't already):

    git clone git@github.com:turgaybulut/image-captioning.git
    cd image-captioning
  2. Install PDM (if not already installed):

    pip install pdm
  3. Install project dependencies: This command installs all dependencies, including development tools.

    pdm install -G:all

    If you only need runtime dependencies:

    pdm install

Configuration

All project settings are managed through the config.json file (an illustrative example follows the list below). This includes:

  • Paths: Locations for datasets, vocabulary, model checkpoints, and evaluation results.
  • Dataset Parameters: Vocabulary frequency threshold, subset sizes for quick runs.
  • Dataloader Settings: Batch size, number of workers, pin memory.
  • Model Hyperparameters: Embedding size, decoder layers, heads, feed-forward dimensions, dropout, CNN training flag.
  • Training Settings: Learning rate, number of epochs, model loading flag, gradient clipping, early stopping patience.
  • Prediction Settings: Maximum caption length.
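
For orientation, a configuration of roughly this shape is sketched below. Except for best_model_checkpoint and vocab_file, which this README refers to by name, every key and value here is an assumption; the config.json shipped with the repository is the authoritative schema.

    {
      "paths": {
        "train_image_dir": "data/train/images",
        "train_caption_file": "data/train/captions.json",
        "val_image_dir": "data/val/images",
        "val_caption_file": "data/val/captions.json",
        "vocab_file": "checkpoints/vocab.pkl",
        "best_model_checkpoint": "checkpoints/best_model.pth",
        "results_dir": "results"
      },
      "dataset": { "vocab_freq_threshold": 5, "train_subset_size": null },
      "dataloader": { "batch_size": 64, "num_workers": 4, "pin_memory": true },
      "model": {
        "embed_size": 512,
        "num_decoder_layers": 6,
        "num_heads": 8,
        "ff_dim": 2048,
        "dropout": 0.1,
        "train_cnn": false
      },
      "training": {
        "learning_rate": 0.0003,
        "num_epochs": 30,
        "load_model": false,
        "grad_clip": 5.0,
        "early_stopping_patience": 5
      },
      "prediction": { "max_caption_length": 50 }
    }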

Usage

The project scripts are typically run using PDM.

Training

To train the model:

pdm run train
  • Ensure config.json points to your training images and COCO-style caption files.
  • The script will build/load vocabulary, initialize the model, and start the training loop.
  • Checkpoints and the best model will be saved according to config.json paths.

Evaluation

To evaluate a trained model:

pdm run evaluate
  • This uses the model specified by best_model_checkpoint in config.json.
  • It generates captions for the validation/test set and computes COCO metrics.
  • Evaluation results (generated captions and scores) are logged and saved. A sketch of the scoring step appears below.
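
The scoring step typically follows the standard pycocotools / pycocoevalcap pattern shown below. The file paths are placeholders, and evaluate.py wraps this logic with its own paths, caption generation, and logging.

    from pycocotools.coco import COCO
    from pycocoevalcap.eval import COCOEvalCap

    # Placeholder paths: ground-truth annotations and a COCO-format results file
    # produced by the captioning model.
    coco = COCO("annotations/captions_val.json")
    coco_res = coco.loadRes("results/generated_captions.json")

    coco_eval = COCOEvalCap(coco, coco_res)
    coco_eval.params["image_id"] = coco_res.getImgIds()
    coco_eval.evaluate()  # computes BLEU, METEOR, ROUGE-L, CIDEr, SPICE

    for metric, score in coco_eval.eval.items():
        print(f"{metric}: {score:.3f}")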

Prediction

To generate captions for new images via the command line:

pdm run predict --image_paths /path/to/your/image1.jpg /path/to/your/image2.png
  • If --image_paths is omitted, it will predict on a few random examples from the validation set specified in config.json.
  • The script loads the best_model_checkpoint and associated vocabulary.

Web Application (Optional)

The project also includes a Flask-based web application (app.py) for interactive caption generation.

Running the Web App:

  1. Ensure all dependencies are installed (as per the Installation section).
  2. Make sure your config.json is correctly set up, especially the paths to the best_model_checkpoint and vocab_file.
  3. Run the Flask application using PDM:
    pdm run app
  4. Open your web browser and navigate to http://localhost:8080 (or the port specified in app.py or your environment).

The web interface allows you to upload an image and view the generated caption. It also includes a feature to visualize attention maps if the model supports it and the generate_caption_with_word_attention method is implemented in model.py.

Dataset

This project is designed to work with datasets in the COCO format.

  • Images: A directory containing image files (e.g., .jpg, .png).
  • Captions: A JSON file in COCO annotation format, containing an "images" list and an "annotations" list. Each annotation should have an image_id and a caption.

Update the paths section in config.json to point to your dataset directories and caption files.
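
A minimal captions file in this format looks roughly like the following; the ids, file names, and captions are placeholders.

    {
      "images": [
        { "id": 1, "file_name": "000000000001.jpg" }
      ],
      "annotations": [
        { "id": 101, "image_id": 1, "caption": "A dog runs along the beach." },
        { "id": 102, "image_id": 1, "caption": "A brown dog playing in the sand." }
      ]
    }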

Model Architecture

The model (model.py) consists of the components listed below; a minimal, illustrative sketch follows the list.

  • EncoderCNN: A CNN encoder based on a pre-trained ResNet-50 (from torchvision.models) that extracts image features; the final classification layer is removed, and the backbone can be fine-tuned or frozen.
  • PositionalEncoding: Adds positional information to token embeddings, crucial for Transformers.
  • DecoderTransformer: A stack of Transformer decoder layers (torch.nn.TransformerDecoder) that generates captions word by word based on image features and previously generated words.
  • CaptioningModel: Encapsulates the encoder and decoder. Includes a generate_caption method for inference.
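
The sketch below shows how these components typically fit together using standard torch and torchvision APIs. The class names follow this README, but the constructor arguments, shapes, and internals are assumptions rather than the repository's exact implementation; see model.py for the authoritative code.

    # Illustrative sketch only; class names follow the README, internals are assumptions.
    import math
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class EncoderCNN(nn.Module):
        def __init__(self, embed_size: int, train_cnn: bool = False):
            super().__init__()
            resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
            self.fc = nn.Linear(resnet.fc.in_features, embed_size)
            for p in self.backbone.parameters():                          # freeze or fine-tune the CNN
                p.requires_grad = train_cnn

        def forward(self, images: torch.Tensor) -> torch.Tensor:          # (B, 3, H, W)
            features = self.backbone(images).flatten(1)                   # (B, 2048)
            return self.fc(features)                                      # (B, embed_size)

    class PositionalEncoding(nn.Module):
        def __init__(self, d_model: int, max_len: int = 5000):
            super().__init__()
            pe = torch.zeros(max_len, d_model)
            position = torch.arange(max_len).unsqueeze(1).float()
            div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
            pe[:, 0::2] = torch.sin(position * div_term)
            pe[:, 1::2] = torch.cos(position * div_term)
            self.register_buffer("pe", pe.unsqueeze(0))                    # (1, max_len, d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:                # (B, T, d_model)
            return x + self.pe[:, : x.size(1)]

    class DecoderTransformer(nn.Module):
        def __init__(self, vocab_size, embed_size, num_layers, num_heads, ff_dim, dropout):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_size)
            self.pos_enc = PositionalEncoding(embed_size)
            layer = nn.TransformerDecoderLayer(embed_size, num_heads, ff_dim, dropout, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers)
            self.out = nn.Linear(embed_size, vocab_size)

        def forward(self, tokens: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
            # tokens: (B, T) token ids; memory: (B, embed_size) image features
            tgt = self.pos_enc(self.embed(tokens))
            mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
            hidden = self.decoder(tgt, memory.unsqueeze(1), tgt_mask=mask)  # causal self-attention
            return self.out(hidden)                                         # (B, T, vocab_size)

At inference time, generate_caption feeds the encoded image features to the decoder and samples tokens autoregressively until an end token is produced or the configured maximum caption length is reached.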

Logging

  • The utils.py module sets up logging for the project (a sketch of such a setup appears after this list).
  • Logs are output to the console and saved to files within the logs/ directory:
    • train.log: For the training script.
    • evaluate.log: For the evaluation script.
    • predict.log: For the prediction script.
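
A setup of this kind usually attaches both a console handler and a file handler to a named logger. The function name and details below are assumptions for illustration, not the repository's exact utils.py.

    # Illustrative only; the repository's utils.py may differ.
    import logging
    from pathlib import Path

    def setup_logging(log_file: str, name: str = "image_captioning") -> logging.Logger:
        # Create the logs/ directory on demand and attach console + file handlers.
        Path(log_file).parent.mkdir(parents=True, exist_ok=True)
        logger = logging.getLogger(name)
        logger.setLevel(logging.INFO)
        formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
        return logger

    logger = setup_logging("logs/train.log")
    logger.info("Starting training")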

Dependencies

Key Python libraries used:

  • PyTorch (torch, torchvision)
  • pycocotools and pycocoevalcap for COCO dataset interaction and evaluation.
  • PIL (Pillow) for image manipulation.
  • tqdm for progress bars.

All dependencies are managed by PDM via pyproject.toml and pdm.lock.

Author

Turgay Bulut
