Official PyTorch implementation of "Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement", accepted at ICLR 2026.
GIRCSE is a novel framework that transforms decoder-only LLMs into powerful text encoders by leveraging their generative nature. By generating "soft refinement tokens," the model iteratively distills semantic information into a high-quality embedding representation.
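To give a feel for the idea, here is a toy sketch of iterative refinement pooling. This is **not** the GIRCSE implementation: the random projection standing in for the frozen LLM, the dimensions, and the `toy_refine_embed` helper are all illustrative assumptions; only the overall shape (generate soft tokens autoregressively, then mean-pool over them) mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_refine_embed(token_hidden_states: np.ndarray,
                     num_refine_steps: int = 4) -> np.ndarray:
    """Toy sketch of iterative refinement pooling (not the paper's model).

    Starting from the input-token hidden states, we repeatedly "generate" a
    soft refinement vector (here: a fixed random projection of the running
    mean, standing in for the LLM's next-step hidden state), feed it back,
    and finally mean-pool over the generated refinement vectors only.
    """
    d = token_hidden_states.shape[1]
    W = rng.standard_normal((d, d)) / np.sqrt(d)   # stand-in for the frozen LLM
    states = list(token_hidden_states)
    refinements = []
    for _ in range(num_refine_steps):
        context = np.mean(states, axis=0)          # summarize current sequence
        soft_token = np.tanh(W @ context)          # "generated" soft token
        states.append(soft_token)                  # append autoregressively
        refinements.append(soft_token)
    emb = np.mean(refinements, axis=0)             # pool over refinement tokens
    return emb / np.linalg.norm(emb)               # unit-normalize for cosine sim

hidden = rng.standard_normal((12, 64))             # 12 input tokens, dim 64
embedding = toy_refine_embed(hidden)
print(embedding.shape)                             # (64,)
```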
- [2026.02] Checkpoints for Mistral and Qwen models are now available on Hugging Face!
- [2026.01] GIRCSE has been accepted to ICLR 2026! 🎉
- [2025.09] Paper released on arXiv.
We provide pre-trained LoRA adapters for GIRCSE based on different LLM backbones. You can find them on Hugging Face:
| Model | Base LLM | Checkpoint (HF) |
|---|---|---|
| GIRCSE-Mistral7B | Mistral-7B-v0.1 | 🤗 Roytsai27/GIRCSE-Mistral7B |
| GIRCSE-Qwen7B | Qwen2.5-7B | 🤗 Roytsai27/GIRCSE-QWEN7B |
- Python 3.10
- Poetry for dependency management
- Install Poetry (if not already installed):

  ```bash
  curl -sSL https://install.python-poetry.org | python3 -
  ```

- Create and activate a Conda environment:

  ```bash
  conda create -n gircse python=3.10
  conda activate gircse
  ```

- Install dependencies:

  ```bash
  poetry install
  ```

- Install FlashAttention:

  ```bash
  pip install flash-attn==2.8.3 --no-build-isolation
  ```
To train a GIRCSE model, use the training script provided in scripts/train.sh:
```bash
bash scripts/train.sh
```

You can customize training by modifying the following parameters:

- `MODEL_NAME`: Base model to use (e.g., `Qwen/Qwen2.5-0.5B` or `mistralai/Mistral-7B-v0.1`)
- `CUDA_VISIBLE_DEVICES`: GPU device ID to use
- `--per_device_train_batch_size`: Batch size per device
- `--gradient_accumulation_steps`: Number of gradient accumulation steps
- `--max_new_tokens`: Maximum tokens to generate for embeddings
- `--wandb_project`: Weights & Biases project name for experiment tracking
- `--pooling_method`: Pooling method for embeddings (e.g., `generate_mean`)
- `--data_sampling_rate`: Fraction of data to use for training
- `--reg_weight`: Regularization weight
- `--output_dir`: Output directory for checkpoints
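As a rough illustration of what a contrastive objective with a regularization term looks like, here is a minimal in-batch InfoNCE sketch. The function name, the L2 penalty, and the temperature value are assumptions for illustration; GIRCSE's actual loss and regularizer are defined in the paper and training code, and may differ.

```python
import numpy as np

def info_nce_loss(query_emb: np.ndarray,
                  pos_emb: np.ndarray,
                  temperature: float = 0.05,
                  reg_weight: float = 0.1) -> float:
    """Toy InfoNCE loss with in-batch negatives plus an L2 penalty.

    For each query i, pos_emb[i] is the positive and all other rows of
    pos_emb act as negatives. The L2 term is only a stand-in for the role
    of a --reg_weight-style knob; the real regularizer may differ.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    logits = q @ p.T / temperature                 # cosine similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -np.mean(np.diag(log_probs))             # diagonal = positive pairs
    reg = np.mean(np.sum(query_emb ** 2, axis=1))  # toy L2 penalty
    return float(nll + reg_weight * reg)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 32))
loss = info_nce_loss(q, q + 0.01 * rng.standard_normal((8, 32)))
print(loss)
```

Matched query/positive pairs should yield a lower loss than random pairs, which is the behavior the contrastive term rewards.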
To evaluate a trained GIRCSE model using MTEB benchmarks, use the evaluation script in scripts/eval_mteb.sh:
```bash
bash scripts/eval_mteb.sh
```

If you find this work useful, please cite our paper:
```bibtex
@article{tsai2025let,
  title={Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement},
  author={Tsai, Yu-Che and Chen, Kuan-Yu and Li, Yuan-Chi and Chen, Yuan-Hao and Tsai, Ching-Yu and Lin, Shou-De},
  journal={arXiv preprint arXiv:2509.24291},
  year={2025}
}
```