GIRCSE


Official PyTorch implementation of "Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement", accepted at ICLR 2026.

GIRCSE is a novel framework that transforms decoder-only LLMs into powerful text encoders by leveraging their generative nature. By generating "soft refinement tokens," the model iteratively distills semantic information into a high-quality embedding representation.
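
To make this concrete, below is a rough, unofficial sketch of the refine-then-pool loop. It substitutes ordinary greedy token generation for the paper's soft refinement tokens and mean-pools the last-layer hidden state of each generated step (loosely mirroring the generate_mean pooling option listed in the Training section); the official implementation in this repo differs in detail:

# Unofficial conceptual sketch ONLY: hard greedy tokens stand in for GIRCSE's
# soft refinement tokens, and the pooling below only approximates the repo's
# `generate_mean` option. See the training/eval code for the real mechanism.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

backbone = "Qwen/Qwen2.5-0.5B"  # small backbone, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(backbone)
lm = AutoModelForCausalLM.from_pretrained(backbone)

def embed(text: str, num_refinement_tokens: int = 4) -> torch.Tensor:
    """Generate a few extra tokens and mean-pool their final hidden states."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = lm.generate(
            **inputs,
            max_new_tokens=num_refinement_tokens,
            do_sample=False,
            output_hidden_states=True,
            return_dict_in_generate=True,
        )
    # out.hidden_states: one tuple per generated step, each holding per-layer
    # states. Keep the last layer's state at the newest position of each step.
    step_states = [step[-1][:, -1, :] for step in out.hidden_states]
    return torch.stack(step_states, dim=1).mean(dim=1).squeeze(0)

print(embed("Let LLMs speak embedding languages.").shape)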


🚀 News

  • [2026.02] Checkpoints for Mistral and Qwen models are now available on Hugging Face!
  • [2026.01] GIRCSE has been accepted to ICLR 2026! 🎉
  • [2025.09] Paper released on arXiv.

📦 Model Zoo

We provide pre-trained LoRA adapters for GIRCSE based on different LLM backbones. You can find them on Hugging Face:

Model             Base LLM         Checkpoint (HF)
GIRCSE-Mistral7B  Mistral-7B-v0.1  🤗 Roytsai27/GIRCSE-Mistral7B
GIRCSE-Qwen7B     Qwen2.5-7B       🤗 Roytsai27/GIRCSE-QWEN7B
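
Because the released checkpoints are LoRA adapters, one straightforward way to load them is to attach the adapter to its base model with peft. This is a minimal sketch under that assumption, not the repository's official loading path; see the evaluation code for the exact setup:

# Minimal loading sketch (assumption: a plain transformers + peft load works;
# the repo's own scripts are the authoritative path).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "mistralai/Mistral-7B-v0.1"
adapter_id = "Roytsai27/GIRCSE-Mistral7B"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the GIRCSE adapter
model.eval()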

🛠️ Setup

Prerequisites

  • Python 3.10
  • Poetry for dependency management

Installation

  1. Install Poetry (if not already installed):

    curl -sSL https://install.python-poetry.org | python3 -
  2. Create and activate a Conda environment:

    conda create -n gircse python=3.10
    conda activate gircse
  3. Install dependencies:

    poetry install
  4. Install flash attention:

    pip install flash-attn==2.8.3 --no-build-isolation
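
After step 4, a quick import check confirms that PyTorch and flash-attn installed correctly in the active gircse environment:

# Quick environment sanity check (run inside the gircse environment).
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)  # expected: 2.8.3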

Training

To train a GIRCSE model, use the training script provided in scripts/train.sh:

bash scripts/train.sh

You can customize training by editing the following variables and flags in scripts/train.sh:

  • MODEL_NAME: Base model to use (e.g., Qwen/Qwen2.5-0.5B or mistralai/Mistral-7B-v0.1)
  • CUDA_VISIBLE_DEVICES: GPU device ID to use
  • --per_device_train_batch_size: Batch size per device
  • --gradient_accumulation_steps: Number of gradient accumulation steps
  • --max_new_tokens: Maximum number of refinement tokens to generate per embedding
  • --wandb_project: Weights & Biases project name for experiment tracking
  • --pooling_method: Pooling method for embeddings (e.g., generate_mean)
  • --data_sampling_rate: Fraction of data to use for training
  • --reg_weight: Regularization weight
  • --output_dir: Output directory for checkpoints
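
If you prefer launching from Python (e.g., for sweeps), something like the following can work, under the assumption, which you should verify against scripts/train.sh, that the script picks up MODEL_NAME and CUDA_VISIBLE_DEVICES from its environment; otherwise edit the script directly as described above:

# Hypothetical launcher. ASSUMPTION (verify in scripts/train.sh): the script
# reads MODEL_NAME and CUDA_VISIBLE_DEVICES from the environment.
import os
import subprocess

env = dict(os.environ)
env.update(MODEL_NAME="Qwen/Qwen2.5-0.5B", CUDA_VISIBLE_DEVICES="0")
subprocess.run(["bash", "scripts/train.sh"], env=env, check=True)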

Evaluation

To evaluate a trained GIRCSE model using MTEB benchmarks, use the evaluation script in scripts/eval_mteb.sh:

bash scripts/eval_mteb.sh
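
The script wraps the mteb package. For reference, the core of an MTEB run looks like the snippet below, shown with an off-the-shelf SentenceTransformer as a stand-in encoder; eval_mteb.sh is what wires in the GIRCSE encoder itself, and the exact mteb entry points may vary with the pinned version:

# Illustrative MTEB usage with a stand-in encoder. MTEB accepts any object
# exposing an encode(sentences) -> embeddings method, which is how a custom
# encoder such as GIRCSE plugs in (see scripts/eval_mteb.sh for specifics).
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in, not GIRCSE
tasks = mteb.get_tasks(tasks=["STS12", "Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/demo")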

Citation

If you find this work useful, please cite our paper:

@article{tsai2025let,
  title={Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement},
  author={Tsai, Yu-Che and Chen, Kuan-Yu and Li, Yuan-Chi and Chen, Yuan-Hao and Tsai, Ching-Yu and Lin, Shou-De},
  journal={arXiv preprint arXiv:2509.24291},
  year={2025}
}
