Multimodal few-shot pest & disease recognition with a CNN–Transformer image encoder, a Sentence‑T5 text encoder, and a bidirectional cross‑attention decoder.
- Dual-Encoder + Decoder Architecture: Image and text are encoded separately, then fused via bidirectional cross‑attention for robust multimodal reasoning.
- Image Encoder: A ConViT hybrid (CNN + Transformer) backbone with lightweight coordinate-style attention; outputs 1024‑D features for alignment and classification.
- Two Supervisory Signals: ITC (image–text contrastive) for alignment + CLS (classification) for discrimination—optimized jointly.
- Few‑Shot Friendly: Consistent gains in low‑data regimes with natural‑language descriptions as side information.
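To make the fusion step concrete, here is a minimal PyTorch sketch of bidirectional cross-attention between 1024‑D image and text tokens. The class name, head count, and pooling choices are illustrative assumptions and do not mirror the code in `Decoder/`.

```python
# Minimal sketch of bidirectional cross-attention fusion (illustrative only;
# names and hyperparameters are assumptions, not the repo's Decoder/ code).
import torch
import torch.nn as nn

class BiCrossAttentionFusion(nn.Module):
    def __init__(self, dim=1024, num_heads=8, num_classes=10):
        super().__init__()
        # image tokens attend to text tokens, and vice versa
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_i = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, num_classes)
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, N_img, D), txt_feat: (B, N_txt, D), both projected to D=1024
        i, _ = self.img2txt(query=img_feat, key=txt_feat, value=txt_feat)
        t, _ = self.txt2img(query=txt_feat, key=img_feat, value=img_feat)
        i = self.norm_i(img_feat + i).mean(dim=1)   # residual + pool over tokens
        t = self.norm_t(txt_feat + t).mean(dim=1)
        return self.head(torch.cat([i, t], dim=-1))  # class logits

# toy usage: 49 image tokens, 1 pooled text token
logits = BiCrossAttentionFusion()(torch.randn(2, 49, 1024), torch.randn(2, 1, 1024))
```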
INSPECT/
├─ Inspect/
│ ├─ model.py # INSPECT core (image/text encoders + decoder)
│ ├─ model_resvit.py # Variant using ResViT-Rice blocks
│ ├─ train.py # Supervised training on paired image–text data
│ ├─ pretrain.py # Contrastive pretraining (AMP supported)
│ ├─ ImageEncoder/
│ │ ├─ ConViT/ # CNN + Transformer hybrid
│ │ └─ ResViT/ # Residual + ViT blocks
│ ├─ TextEncoder/
│ │ └─ ST5.py # Sentence‑T5 encoder (frozen) + projection
│ └─ Decoder/ # Multimodal decoder + MLP head
├─ classic_model/ # Baselines & evaluation utilities
├─ multimodal_model/ # Multimodal baselines (e.g., CLIP adapters)
├─ data_augmentation/ # Optional augmentation pipelines
└─ LLM/ # Caption generation utilities (e.g., GPT‑4V, Gemini)
Checkpoints are saved under Inspect/models/ and logs/metrics under exports/ by default.
Tested with Python ≥3.9 and PyTorch ≥2.0 (CUDA recommended).
# 1) Create an environment (recommended)
conda create -n inspect python=3.10 -y
conda activate inspect
# 2) Install PyTorch (choose the right CUDA build from pytorch.org)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# 3) Install repo dependencies
pip install sentence-transformers pandas pillow tqdm scikit-learn matplotlib

The training code expects paired image–text data in a simple class‑per‑folder layout. For every *.jpg there should be a sibling *.txt that contains a leakage‑safe natural‑language description of the image (no explicit species names).
DATASET_ROOT/
├─ class_A/
│ ├─ xxx.jpg
│ ├─ xxx.txt # caption describing what’s visible in xxx.jpg
│ └─ ...
├─ class_B/
│ ├─ yyy.jpg
│ ├─ yyy.txt
│ └─ ...
└─ ...
- Place train/test splits as separate roots, e.g., .../train/ and .../test/.
- Each .txt should contain a single free‑form caption (≤512 tokens is sufficient).
- If you need to regenerate captions, see the examples under LLM/ and ensure captions do not leak class labels.
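For reference, a paired image–caption dataset for this layout could be loaded roughly as sketched below. This is only an illustration; the repository's own data pipeline and transforms may differ.

```python
# Sketch of a paired image–caption dataset for the class-per-folder layout above
# (illustrative only; the repo's actual dataset handling may differ).
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class ImageCaptionFolder(Dataset):
    def __init__(self, root, transform=None):
        self.root = Path(root)
        self.classes = sorted(d.name for d in self.root.iterdir() if d.is_dir())
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}
        # every *.jpg is expected to have a sibling *.txt caption
        self.samples = [
            (p, p.with_suffix(".txt"), self.class_to_idx[p.parent.name])
            for p in sorted(self.root.rglob("*.jpg"))
        ]
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_path, txt_path, label = self.samples[idx]
        image = Image.open(img_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        caption = txt_path.read_text(encoding="utf-8").strip()
        return image, caption, label

# e.g. ds = ImageCaptionFolder("DATASET_ROOT/train")
```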
cd Inspect
# Example: 100 epochs, batch size 16
python train.py --batch-size 16 --epochs 100 --lr 1e-4 --device cuda --frozen False

Outputs:
- Checkpoints: models/{run_name}_{epoch}.pth
- Metrics: exports/{run_name}_loss_accuracy.csv
For large‑scale image–text alignment before supervised finetuning:
cd Inspect
python pretrain.py --batch-size 16 --epochs 50 --lr 1e-4 --amp True

This stage strengthens multimodal features via ITC and often improves few‑shot/generalization performance.
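The ITC objective follows the CLIP-style recipe of symmetric cross-entropy over cosine similarities between matched image–text pairs. The snippet below is a minimal sketch; how the temperature (logit_scale) is parameterized is an assumption.

```python
# Sketch of the ITC (image–text contrastive) objective: symmetric cross-entropy
# over cosine similarities, as in CLIP. The logit_scale handling is an assumption
# (e.g., the exp of a learnable scalar).
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, logit_scale):
    # img_emb, txt_emb: (B, D) projected features for matched image–text pairs
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits_per_image = logit_scale * img_emb @ txt_emb.t()   # (B, B) similarities
    logits_per_text = logits_per_image.t()
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits_per_image, targets)    # image -> text
    loss_t2i = F.cross_entropy(logits_per_text, targets)     # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```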
- Image Encoder (ConViT hybrid): local CNN tokens + global Transformer tokens, fused and projected to 1024‑D with lightweight positional/channel cues.
- Text Encoder (Sentence‑T5): frozen encoder with a linear projection to 1024‑D.
- Multimodal Decoder: a two‑way cross‑attention transformer refines image & text features jointly, followed by an MLP classification head.
- Losses:
- ITC — symmetric cross‑entropy (image→text & text→image).
- CLS — cross‑entropy over class logits.
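As an illustration of the text branch, a frozen Sentence‑T5 encoder with a trainable projection to 1024‑D could be wired up as sketched below. The model name and wrapper class are assumptions; the repository's actual implementation lives in TextEncoder/ST5.py.

```python
# Sketch of a frozen Sentence-T5 text encoder with a trainable 1024-D projection.
# Model name and structure are assumptions; see TextEncoder/ST5.py for the
# repository's actual implementation.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class FrozenST5TextEncoder(nn.Module):
    def __init__(self, model_name="sentence-transformers/sentence-t5-base", out_dim=1024):
        super().__init__()
        self.st5 = SentenceTransformer(model_name)
        for p in self.st5.parameters():              # keep the text encoder frozen
            p.requires_grad = False
        in_dim = self.st5.get_sentence_embedding_dimension()
        self.proj = nn.Linear(in_dim, out_dim)       # trainable projection to 1024-D

    @torch.no_grad()
    def encode(self, captions):
        # captions: list[str] -> (B, in_dim) sentence embeddings, no gradients
        return self.st5.encode(captions, convert_to_tensor=True)

    def forward(self, captions):
        emb = self.encode(captions)
        return self.proj(emb.float())                # (B, 1024) aligned features
```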
This repository builds on ideas from CLIP‑style contrastive learning. Thanks to the open‑source community.
Made with ❤️ for robust, practical crop pest & disease recognition.

