This project implements a CLIP-style model for learning joint image-text embeddings on the MNIST dataset.
• Learns a joint embedding space over digit images and their labels for recognition and representation learning.
• Encodes images with a residual CNN and labels with a lightweight text encoder (a minimal sketch follows below).
• Supports both classification and image-similarity tasks.
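A minimal sketch of how these two encoders might look in PyTorch. The class names, layer sizes, and embedding dimension below are illustrative assumptions, not the repository's actual code:

```python
# Illustrative sketch only; names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convs with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))

class ImageEncoder(nn.Module):
    """Residual CNN mapping a 1x28x28 MNIST image to an embedding."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            ResidualBlock(32),
            nn.MaxPool2d(2),           # 28x28 -> 14x14
            ResidualBlock(32),
            nn.AdaptiveAvgPool2d(1),   # 32x1x1
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, x):
        return self.net(x)

class LabelEncoder(nn.Module):
    """Lightweight text encoder: one learned embedding per digit label."""
    def __init__(self, num_classes=10, embed_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(num_classes, embed_dim)

    def forward(self, labels):
        return self.embedding(labels)
```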
Install dependencies:
pip install -r requirements.txt
Train the model:
python train.py
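Training pairs each image with its digit label and pulls matching embeddings together with a CLIP-style symmetric contrastive loss. The step below is a sketch under that assumption; the function name, temperature value, and batch handling are illustrative, and train.py may differ:

```python
# Sketch of one CLIP-style contrastive training step (assumed, not
# the repository's actual train.py).
import torch
import torch.nn.functional as F

def contrastive_step(image_encoder, label_encoder, images, labels,
                     temperature=0.07):
    # L2-normalize both embeddings so dot products are cosine similarities.
    img_emb = F.normalize(image_encoder(images), dim=-1)
    txt_emb = F.normalize(label_encoder(labels), dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i to label j.
    logits = img_emb @ txt_emb.t() / temperature

    # Matching pairs lie on the diagonal; apply symmetric cross-entropy.
    targets = torch.arange(len(images), device=images.device)
    loss = (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

    # Note: with only 10 distinct labels, a batch will contain duplicates,
    # so some off-diagonal "negatives" share the positive's label; a real
    # implementation would need to handle these multiple positives.
    return loss
```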
Run inference:
python inference.py
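At inference time, a digit can be predicted by embedding the image and picking the nearest of the ten label embeddings in the joint space. A sketch assuming the encoders above; inference.py's actual interface may differ:

```python
# Sketch of similarity-based digit classification (assumed interface).
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(image_encoder, label_encoder, images):
    """Predict digits by nearest label embedding in the joint space."""
    img_emb = F.normalize(image_encoder(images), dim=-1)
    all_labels = torch.arange(10, device=images.device)
    txt_emb = F.normalize(label_encoder(all_labels), dim=-1)
    # Cosine similarity of each image against all 10 label embeddings.
    sims = img_emb @ txt_emb.t()
    return sims.argmax(dim=-1)  # index == predicted digit
```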
Possible future directions:
• Explore larger text embeddings for richer semantic representations.
• Incorporate data augmentation to improve generalization.
• Evaluate on more complex datasets beyond MNIST.