Multimodal few-shot pest & disease recognition with a CNN–Transformer image encoder, a Sentence‑T5 text encoder, and a bidirectional cross‑attention decoder.
- Dual-Encoder + Decoder Architecture: Image and text are encoded separately, then fused via bidirectional cross‑attention for robust multimodal reasoning.
- Image Encoder: A ConViT hybrid (CNN + Transformer) backbone with lightweight coordinate-style attention; outputs 1024‑D features for alignment and classification.
- Two Supervisory Signals: ITC (image–text contrastive) for alignment + CLS (classification) for discrimination—optimized jointly.
- Few‑Shot Friendly: Consistent gains in low‑data regimes with natural‑language descriptions as side information.
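To make the fusion step concrete, here is a minimal PyTorch sketch of bidirectional cross-attention between 1024‑D image and text tokens. The class name, head count, and pooling choices are illustrative assumptions and do not mirror the code in `Decoder/`.

```python
# Minimal sketch of bidirectional cross-attention fusion (illustrative only;
# names and hyperparameters are assumptions, not the repo's Decoder/ code).
import torch
import torch.nn as nn

class BiCrossAttentionFusion(nn.Module):
    def __init__(self, dim=1024, num_heads=8, num_classes=10):
        super().__init__()
        # image tokens attend to text tokens, and vice versa
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_i = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, num_classes)
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, N_img, D), txt_feat: (B, N_txt, D), both projected to D=1024
        i, _ = self.img2txt(query=img_feat, key=txt_feat, value=txt_feat)
        t, _ = self.txt2img(query=txt_feat, key=img_feat, value=img_feat)
        i = self.norm_i(img_feat + i).mean(dim=1)   # residual + pool over tokens
        t = self.norm_t(txt_feat + t).mean(dim=1)
        return self.head(torch.cat([i, t], dim=-1))  # class logits

# toy usage: 49 image tokens, 1 pooled text token
logits = BiCrossAttentionFusion()(torch.randn(2, 49, 1024), torch.randn(2, 1, 1024))
```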
INSPECT/
├─ Inspect/
│ ├─ model.py # INSPECT core (image/text encoders + decoder)
│ ├─ model_resvit.py # Variant using ResViT-Rice blocks
│ ├─ train.py # Supervised training on paired image–text data
│ ├─ pretrain.py # Contrastive pretraining (AMP supported)
│ ├─ ImageEncoder/
│ │ ├─ ConViT/ # CNN + Transformer hybrid
│ │ └─ ResViT/ # Residual + ViT blocks
│ ├─ TextEncoder/
│ │ └─ ST5.py # Sentence‑T5 encoder (frozen) + projection
│ └─ Decoder/ # Multimodal decoder + MLP head
├─ classic_model/ # Baselines & evaluation utilities
├─ multimodal_model/ # Multimodal baselines (e.g., CLIP adapters)
├─ data_augmentation/ # Optional augmentation pipelines
└─ LLM/ # Caption generation utilities (e.g., GPT‑4V, Gemini)
Checkpoints are saved under Inspect/models/ and logs/metrics under exports/ by default.
Tested with Python ≥3.9 and PyTorch ≥2.0 (CUDA recommended).
# 1) Create an environment (recommended)
conda create -n inspect python=3.10 -y
conda activate inspect
# 2) Install PyTorch (choose the right CUDA build from pytorch.org)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# 3) Install repo dependencies
pip install sentence-transformers pandas pillow tqdm scikit-learn matplotlib

The training code expects paired image–text data in a simple class‑per‑folder layout. For every *.jpg there should be a sibling *.txt that contains a leakage‑safe natural‑language description of the image (no explicit species names).
DATASET_ROOT/
├─ class_A/
│ ├─ xxx.jpg
│ ├─ xxx.txt # caption describing what’s visible in xxx.jpg
│ └─ ...
├─ class_B/
│ ├─ yyy.jpg
│ ├─ yyy.txt
│ └─ ...
└─ ...
- Place train/test splits as separate roots, e.g., .../train/ and .../test/.
- Each .txt should contain a single free‑form caption (≤512 tokens is sufficient).
- If you need to regenerate captions, see the examples under LLM/ and ensure captions do not leak class labels.
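For reference, a paired image–caption dataset for this layout could be loaded roughly as sketched below. This is only an illustration; the repository's own data pipeline and transforms may differ.

```python
# Sketch of a paired image–caption dataset for the class-per-folder layout above
# (illustrative only; the repo's actual dataset handling may differ).
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class ImageCaptionFolder(Dataset):
    def __init__(self, root, transform=None):
        self.root = Path(root)
        self.classes = sorted(d.name for d in self.root.iterdir() if d.is_dir())
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}
        # every *.jpg is expected to have a sibling *.txt caption
        self.samples = [
            (p, p.with_suffix(".txt"), self.class_to_idx[p.parent.name])
            for p in sorted(self.root.rglob("*.jpg"))
        ]
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_path, txt_path, label = self.samples[idx]
        image = Image.open(img_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        caption = txt_path.read_text(encoding="utf-8").strip()
        return image, caption, label

# e.g. ds = ImageCaptionFolder("DATASET_ROOT/train")
```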
cd Inspect
# Example: 100 epochs, batch size 16
python train.py --batch-size 16 --epochs 100 --lr 1e-4 --device cuda --frozen False

Outputs:
- Checkpoints: models/{run_name}_{epoch}.pth
- Metrics: exports/{run_name}_loss_accuracy.csv
For large‑scale image–text alignment before supervised finetuning:
cd Inspect
python pretrain.py --batch-size 16 --epochs 50 --lr 1e-4 --amp True

This stage strengthens multimodal features via ITC and often improves few‑shot/generalization performance.
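The ITC objective follows the CLIP-style recipe of symmetric cross-entropy over cosine similarities between matched image–text pairs. The snippet below is a minimal sketch; how the temperature (logit_scale) is parameterized is an assumption.

```python
# Sketch of the ITC (image–text contrastive) objective: symmetric cross-entropy
# over cosine similarities, as in CLIP. The logit_scale handling is an assumption
# (e.g., the exp of a learnable scalar).
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, logit_scale):
    # img_emb, txt_emb: (B, D) projected features for matched image–text pairs
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits_per_image = logit_scale * img_emb @ txt_emb.t()   # (B, B) similarities
    logits_per_text = logits_per_image.t()
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits_per_image, targets)    # image -> text
    loss_t2i = F.cross_entropy(logits_per_text, targets)     # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```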
- Image Encoder (ConViT hybrid): local CNN tokens + global Transformer tokens, fused and projected to 1024‑D with lightweight positional/channel cues.
- Text Encoder (Sentence‑T5): frozen encoder with a linear projection to 1024‑D.
- Multimodal Decoder: a two‑way cross‑attention transformer refines image & text features jointly, followed by an MLP classification head.
- Losses:
- ITC — symmetric cross‑entropy (image→text & text→image).
- CLS — cross‑entropy over class logits.
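As an illustration of the text branch, a frozen Sentence‑T5 encoder with a trainable projection to 1024‑D could be wired up as sketched below. The model name and wrapper class are assumptions; the repository's actual implementation lives in TextEncoder/ST5.py.

```python
# Sketch of a frozen Sentence-T5 text encoder with a trainable 1024-D projection.
# Model name and structure are assumptions; see TextEncoder/ST5.py for the
# repository's actual implementation.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class FrozenST5TextEncoder(nn.Module):
    def __init__(self, model_name="sentence-transformers/sentence-t5-base", out_dim=1024):
        super().__init__()
        self.st5 = SentenceTransformer(model_name)
        for p in self.st5.parameters():              # keep the text encoder frozen
            p.requires_grad = False
        in_dim = self.st5.get_sentence_embedding_dimension()
        self.proj = nn.Linear(in_dim, out_dim)       # trainable projection to 1024-D

    @torch.no_grad()
    def encode(self, captions):
        # captions: list[str] -> (B, in_dim) sentence embeddings, no gradients
        return self.st5.encode(captions, convert_to_tensor=True)

    def forward(self, captions):
        emb = self.encode(captions)
        return self.proj(emb.float())                # (B, 1024) aligned features
```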
This repository builds on ideas from CLIP‑style contrastive learning. Thanks to the open‑source community.
Made with ❤️ for robust, practical crop pest & disease recognition.

