
OmniCaptioner: One Captioner to Rule Them All

💜 HomePage | 🤗 Hugging Face | 📑 Paper


We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.

Demo Visualization

📊 Quantitative Performance

Quantitative Results

💻 Finetuning Code

1. Create a conda environment and install PyTorch

conda create -n OmniCap python=3.9
conda activate OmniCap
# Install a PyTorch build that matches your CUDA version, for example:
pip install torch torchvision

2. Install dependencies

pip install -r requirements.txt

3. Install flash-attn

pip install flash-attn --no-build-isolation

4. Prepare data

Step 1: Pre-compute the image (visual) token count and text token count for each sample in advance, so that samples of roughly the same length can be grouped together (a sketch of this per-sample computation is given after the loop script below).

You can loop through all 32 splits (for example) using the following bash script: process_splits.sh

#!/bin/bash

# Define your arguments
OUTPUT_DIR="/path/to/output/directory"
DATA_FILE="/path/to/input/data.jsonl"
DATA_NAME="dataset_name"
TASKTYPE="task_name"
SPLIT_NUM=32

# Loop through all splits (from 0 to 31)
for SPLIT_INDEX in $(seq 0 31); do
    python pre_token_compute_add_tasktype.py \
      --split_index $SPLIT_INDEX \
      --split_num $SPLIT_NUM \
      --output_dir $OUTPUT_DIR \
      --data_file $DATA_FILE \
      --data_name $DATA_NAME \
      --tasktype $TASKTYPE
done
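
For reference, the per-sample counting works roughly as follows. This is a minimal sketch, assuming a Qwen2-VL-style Hugging Face processor; the backbone name is an assumption, and pre_token_compute_add_tasktype.py remains the authoritative implementation.

# Minimal sketch of the per-sample token counting (assumed backbone and logic;
# see pre_token_compute_add_tasktype.py for the actual implementation).
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")  # assumed backbone

def count_tokens(image_path: str, caption: str) -> tuple[int, int]:
    """Return (text token count, visual token count) for one sample."""
    # Text tokens: tokenize the caption alone.
    token_length = len(processor.tokenizer(caption)["input_ids"])
    # Visual tokens: patch-grid cells divided by the 2x2 spatial merge
    # applied by Qwen2-VL-style vision encoders.
    image = Image.open(image_path).convert("RGB")
    grid = processor.image_processor(images=[image], return_tensors="pt")["image_grid_thw"][0]
    image_token_num = int(grid.prod().item()) // 4
    return token_length, image_token_num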

Merge the processed JSONL files using the following bash script: merge_splits.sh

#!/bin/bash

# Define output file to merge into
FINAL_OUTPUT_FILE="/path/to/output/merged_data.jsonl"

# Concatenate all the splits into the final output file
cat /path/to/output/directory/*_split*.jsonl > $FINAL_OUTPUT_FILE

Step 2: Place the paths to your processed data files in `data/omni_caption_pretrain.yaml`.

The processed data in the output JSONL file will have the following format:

{
  "id": "film.jpg",
  "image": "images/film.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "Explain the key features of this picture."
    },
    {
      "from": "gpt",
      "value": "Close-up shot of a scratched symbol resembling an eye or fish, etched into a worn, dark wooden surface. The scratches are uneven and look hastily carved, occupying the center of the frame. Minimalist background with emphasis on the raw texture of aged wood, dimly lit with shadows toward the top right corner, suggesting a single light source above. The grain of the wood is visibly highlighted by soft lighting, creating an intimate and mysterious mood. Dark, moody, high contrast, shallow depth of field, film noir aesthetic, focus on texture and detail, rustic and enigmatic atmosphere."
    }
  ],
  "token_length": 127,
  "image_token_num": 549,
  "task_type": "detailed"
}
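
The token_length and image_token_num fields are what the length-based grouping relies on. Purely as an illustration (the training script handles this internally), samples could be bucketed by their combined length like so, where the bucket width of 512 is an arbitrary choice:

# Illustrative only: bucket samples by combined text + visual token count so
# that each batch can draw sequences of roughly the same length.
import json
from collections import defaultdict

def load_buckets(jsonl_path, bucket_width=512):
    buckets = defaultdict(list)
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            total = sample["token_length"] + sample["image_token_num"]
            buckets[total // bucket_width].append(sample)
    return buckets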

5. Start finetuning

bash scripts/finetune_caption_slurm.sh

🚀 Inference Code for Image Captioning

You can caption an image in AIGC style using the following command:

CUDA_VISIBLE_DEVICES=0 python src/inference_single_image.py \
    --model_path your_model_path \
    --image_path your_image_path  \
    --image_type aigc 

You can caption an image in OCR style using the following command:

CUDA_VISIBLE_DEVICES=0 python src/inference_single_image.py \
    --model_path U4R/OmniCaptioner \
    --image_path your_image_path  \
    --image_type ocr 
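
If you prefer to call the model from Python rather than through the wrapper script, the following is a minimal sketch using Hugging Face transformers. It assumes a recent transformers release, that the checkpoint loads via AutoModelForImageTextToText with a standard chat template, and a placeholder prompt; src/inference_single_image.py is the reference implementation.

# Minimal direct-inference sketch (assumptions: recent transformers, checkpoint
# loadable via AutoModelForImageTextToText, placeholder prompt; see
# src/inference_single_image.py for the reference implementation).
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "U4R/OmniCaptioner"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="cuda"
)

image_path = "your_image_path"
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Explain the key features of this picture."},  # placeholder prompt
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open(image_path).convert("RGB")
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)
caption = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)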

🚀 Inference Code for Video Captioning

You can caption video data using the following command:

CUDA_VISIBLE_DEVICES=0 python src/inference_single_video.py \
    --input_dir your_video_folder_path \
    --output_file caption_video.jsonl \
    --input_type  video

🚀 Evaluation Code with LLM

cd VLMEvalkit
conda create -n VLMEvalkit python=3.9
conda activate VLMEvalkit
pip install -e .

CUDA_VISIBLE_DEVICES=0  nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-qwen2-5-3B --verbose > output_omnicap_qwen2-5-3B_MMMU_DEV_VAL.log 2>&1 &
CUDA_VISIBLE_DEVICES=0,1  nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-qwen2-5-7B --verbose > output_omnicap_qwen2-5-7B_MMMU_DEV_VAL.log 2>&1 &
CUDA_VISIBLE_DEVICES=0,1,2,3  nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-qwen2-5-32B --verbose > output_omnicap_qwen2-5-32B_MMMU_DEV_VAL.log 2>&1 &

CUDA_VISIBLE_DEVICES=0  nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-deepseek-distill-7B --verbose > output_omnicap_deepseek_distill_7B_MMMU_DEV_VAL.log 2>&1 &
CUDA_VISIBLE_DEVICES=0,1  nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-deepseek-distill-32B --verbose > output_omnicap_deepseek_distill_32B_MMMU_DEV_VAL.log 2>&1 &
CUDA_VISIBLE_DEVICES=0,1,2,3  nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-deepseek-distill-70B --verbose > output_omnicap_deepseek_distill_70B_MMMU_DEV_VAL.log 2>&1 &

Citation

If you find the provided code or models useful for your research, please consider citing:

@article{omnicaptioner,
  title={OmniCaptioner: One Captioner to Rule Them All},
  author={Lu, Yiting and Yuan, Jiakang and Li, Zhen and Zhao, Shitian and Qin, Qi and Li, Xinyue and Zhuo, Le and Wen, Licheng and Liu, Dongyang and Cao, Yuewen and others},
  journal={arXiv preprint arXiv:2504.07089},
  year={2025}
}
