
INSPECT: Image–Text Network for Systematic Pest Classification

Multimodal few-shot pest & disease recognition with a CNN–Transformer image encoder, a Sentence‑T5 text encoder, and a bidirectional cross‑attention decoder.

(Figure: INSPECT architecture overview)



🌟 Highlights

  • Dual-Encoder + Decoder Architecture: Image and text are encoded separately then fused via bi‑directional cross‑attention for robust multimodal reasoning.
  • Image Encoder: A ConViT hybrid (CNN + Transformer) backbone with lightweight coordinate-style attention; outputs 1024‑D features for alignment and classification.
  • Two Supervisory Signals: ITC (image–text contrastive) for alignment + CLS (classification) for discrimination—optimized jointly.
  • Few‑Shot Friendly: Consistent gains in low‑data regimes with natural‑language descriptions as side information.

📦 Repository Structure

INSPECT/
├─ Inspect/
│  ├─ model.py                      # INSPECT core (image/text encoders + decoder)
│  ├─ model_resvit.py               # Variant using ResViT-Rice blocks
│  ├─ train.py                      # Supervised training on paired image–text data
│  ├─ pretrain.py                   # Contrastive pretraining (AMP supported)
│  ├─ ImageEncoder/
│  │  ├─ ConViT/                    # CNN + Transformer hybrid
│  │  └─ ResViT/                    # Residual + ViT blocks
│  ├─ TextEncoder/
│  │  └─ ST5.py                     # Sentence‑T5 encoder (frozen) + projection
│  └─ Decoder/                      # Multimodal decoder + MLP head
├─ classic_model/                   # Baselines & evaluation utilities
├─ multimodal_model/                # (e.g., CLIP adapters)
├─ data_augmentation/               # Optional augmentation pipelines
└─ LLM/                             # Caption generation utilities (e.g., GPT‑4V, Gemini)

Checkpoints are saved under Inspect/models/ and logs/metrics under exports/ by default.


🔧 Installation

Tested with Python ≥3.9 and PyTorch ≥2.0 (CUDA recommended).

# 1) Create an environment (recommended)
conda create -n inspect python=3.10 -y
conda activate inspect

# 2) Install PyTorch (choose the right CUDA build from pytorch.org)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# 3) Install repo dependencies
pip install sentence-transformers pandas pillow tqdm scikit-learn matplotlib

📚 Data Preparation

The training code expects paired image–text data in a simple class‑per‑folder layout. For every *.jpg there should be a sibling *.txt that contains a leakage‑safe natural‑language description of the image (no explicit species names).

DATASET_ROOT/
├─ class_A/
│  ├─ xxx.jpg
│  ├─ xxx.txt   # caption describing what’s visible in xxx.jpg
│  └─ ...
├─ class_B/
│  ├─ yyy.jpg
│  ├─ yyy.txt
│  └─ ...
└─ ...
  • Place train/test splits as separate roots, e.g., .../train/ and .../test/.
  • Each .txt should contain a single free‑form caption; 512 tokens or fewer is sufficient.

If you need to regenerate captions, see examples under LLM/ and ensure captions do not leak class labels.
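For orientation, the sketch below shows a minimal paired image/caption dataset for this layout. It is an assumption of how such a loader could look; PairedPestDataset, its arguments, and the pairing rule are illustrative, not the repo's actual implementation.

# Hypothetical loader for the class-per-folder layout above (not the repo's own).
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class PairedPestDataset(Dataset):
    def __init__(self, root, transform=None):
        self.transform = transform
        self.samples = []  # (image_path, caption_path, class_index) triples
        classes = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
        self.class_to_idx = {c: i for i, c in enumerate(classes)}
        for cls in classes:
            for img_path in sorted((Path(root) / cls).glob("*.jpg")):
                txt_path = img_path.with_suffix(".txt")
                if txt_path.exists():  # keep only properly paired samples
                    self.samples.append((img_path, txt_path, self.class_to_idx[cls]))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_path, txt_path, label = self.samples[idx]
        image = Image.open(img_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        caption = txt_path.read_text(encoding="utf-8").strip()
        return image, caption, label

A standard torchvision transform (resize, tensor conversion, normalization) would be passed as transform; captions stay raw strings so the text encoder can tokenize them itself.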


⚙️ Training

cd Inspect

# Example: 100 epochs, batch size 16
python train.py --batch-size 16 --epochs 100 --lr 1e-4 --device cuda --frozen False
  • Checkpoints: models/{run_name}_{epoch}.pth
  • Metrics: exports/{run_name}_loss_accuracy.csv
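To make the joint objective concrete, here is a hedged sketch of a single training step combining the CLS and ITC signals. The model interface (returning L2‑normalized image/text embeddings plus class logits), the temperature, and the itc_weight are assumptions, not values read from train.py.

import torch
import torch.nn.functional as F

def train_step(model, images, captions, labels, optimizer,
               itc_weight=1.0, temperature=0.07):
    # Assumed interface: normalized embeddings for each modality + class logits.
    img_emb, txt_emb, logits = model(images, captions)

    cls_loss = F.cross_entropy(logits, labels)              # CLS: discrimination

    sim = img_emb @ txt_emb.t() / temperature               # in-batch similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)  # matched pairs on the diagonal
    itc_loss = 0.5 * (F.cross_entropy(sim, targets)         # image -> text
                      + F.cross_entropy(sim.t(), targets))  # text -> image

    loss = cls_loss + itc_weight * itc_loss                 # joint objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()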

🔁 (Optional) Contrastive Pretraining

For large‑scale image–text alignment before supervised finetuning:

cd Inspect
python pretrain.py --batch-size 16 --epochs 50 --lr 1e-4 --amp True

This stage strengthens multimodal features via ITC and often improves few‑shot and generalization performance.


🧠 Model Architecture

  • Image Encoder (ConViT hybrid): local CNN tokens + global Transformer tokens, fused and projected to 1024‑D with lightweight positional/channel cues.
  • Text Encoder (Sentence‑T5): frozen encoder with a linear projection to 1024‑D.
  • Multimodal Decoder: a two‑way cross‑attention transformer refines image & text features jointly, followed by an MLP classification head.
  • Losses:
    • ITC — symmetric cross‑entropy (image→text & text→image).
    • CLS — cross‑entropy over class logits.

(Figure: bi‑directional cross‑attention in the multimodal decoder)
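As an illustration of the two‑way design, a minimal cross‑attention block might look like the sketch below; BiCrossAttentionBlock, its dimensions, and its normalization placement are assumptions, not the repo's Decoder code.

import torch.nn as nn

class BiCrossAttentionBlock(nn.Module):
    """Illustrative two-way cross-attention layer (hypothetical, for exposition)."""
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # Image tokens attend to text tokens and vice versa; each stream is
        # refined with a residual connection followed by layer normalization.
        img_attn, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        txt_attn, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        img_tokens = self.norm_img(img_tokens + img_attn)
        txt_tokens = self.norm_txt(txt_tokens + txt_attn)
        return img_tokens, txt_tokens

Stacking a few such blocks would let each modality iteratively refine the other before the fused image features reach the MLP classification head.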


🙌 Acknowledgements

This repository builds on ideas from CLIP‑style contrastive learning. Thanks to the open‑source community.

Made with ❤️ for robust, practical crop pest & disease recognition.
