This project is designed to run on Ubuntu Linux with bash as the shell.
Run `which anaconda` to confirm that Anaconda is installed.
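If `anaconda` itself is not on your `PATH`, checking for the `conda` CLI as a fallback also works (the fallback check is a suggestion, not part of the original setup):

```bash
# Succeeds if either the anaconda launcher or the conda CLI is visible on PATH.
which anaconda || which conda || echo "Anaconda not found; install it first."
```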
Run the following command to create a new conda environment.
```bash
conda create --prefix /home/$USER/llm2rec-venv python=3.12 -y
```

Activate your conda environment.
```bash
source activate /home/$USER/llm2rec-venv
```

> [!NOTE]
> If you use bash, you can directly use the `pip` command after activation. If not, I suggest you use the full path `/home/$USER/llm2rec-venv/bin/pip` for `pip` and all following commands to make sure you are in the correct environment.
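A small convenience sketch for the full-path case (the `PIP` variable is my own shorthand, not something the project defines):

```bash
# Pin pip to the environment's interpreter so later installs
# cannot accidentally target the system Python.
PIP=/home/$USER/llm2rec-venv/bin/pip
$PIP --version   # should report the llm2rec-venv path
```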
Install PyTorch. Change the download link according to your CUDA version.
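To find out which CUDA version your driver supports, `nvidia-smi` prints it in its header (this check is a suggestion; any way of determining your CUDA version works):

```bash
# The "CUDA Version" field in the top-right of the output tells you
# which cuXXX index URL to use below (e.g. 12.6 -> cu126).
nvidia-smi
```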
```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```

Install the regular requirements.
```bash
pip install transformers==4.44.2 llm2vec==0.2.3 wandb fire ninja loguru
```

Install flash-attn.
Option 1: Build directly from source. This takes 4 to 5 hours.
```bash
pip install flash-attn==2.7.4post1
```

Option 2: Install from pre-built binaries. Download a wheel compatible with your CUDA version from here and `pip install` the downloaded wheel.
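For example, a wheel install might look like the following; the filename is hypothetical and yours will differ by CUDA, torch, and Python version:

```bash
# Hypothetical wheel name; use the one matching your CUDA/torch/Python setup.
pip install ./flash_attn-2.7.4.post1+cu126torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
```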
Apply for a Hugging Face token and a wandb token.
Log in to Hugging Face by running this command, then enter your token.
```bash
huggingface-cli login
```

You can also add this to your `.bashrc` file.
```bash
export HF_TOKEN=$your_token
```

wandb will only ask for your token after training starts.
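If you prefer to authenticate wandb up front instead of waiting for the first run, the wandb CLI supports logging in directly:

```bash
# Prompts for your wandb API key and caches it locally.
wandb login
```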
Log in to your Hugging Face account. Then run the following two commands to download the trained models to `./output` and the dataset to `./data`.
Trained Models
```bash
python -m hf --pull --local ./output --repoid YzHuangYanzhen/CSIT5210-output --repotype model
```

Datasets
```bash
python -m hf --pull --local ./data --repoid YzHuangYanzhen/CSIT5210-data --repotype dataset
```

The Qwen2-0.5B base model can be downloaded by running:
```bash
huggingface-cli download Qwen/Qwen2-0.5B \
    --local-dir /home/$USER/huggingface_data/hub/Qwen2-0.5B \
    --local-dir-use-symlinks False
```

Replace the Qwen2-0.5B model path (`base_model_path`) in the CSFT train arguments.
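As a purely hypothetical illustration (the actual file and format of the CSFT train arguments depend on your `configs`), the change amounts to pointing `base_model_path` at the download location:

```bash
# Hypothetical: set base_model_path to the freshly downloaded weights.
base_model_path=/home/$USER/huggingface_data/hub/Qwen2-0.5B
```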
Run the following commands to go through the whole process (a combined sketch follows the list).
- `sh sh_allocate_gpu.sh`: Allocates a GPU on the SuperPOD.
- `python -m data_process`: Processes the raw data into fine-grained data.
- `sh sh_train_csft.sh`: Runs CSFT training.
- `sh sh_run_iem_mntp.sh`: Runs MNTP training.
- `sh sh_run_iem_SimCSE.sh`: Runs SimCSE training.
- `sh sh_run_extract_embedding.sh`: Extracts embeddings for all categories.
- `sh sh_run_downstream.sh`: Trains and evaluates downstream models.
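A minimal sketch chaining the steps above in order, assuming you run it from the repo root and that each script returns when its step completes:

```bash
# Each step only runs if the previous one exited successfully.
sh sh_allocate_gpu.sh && \
python -m data_process && \
sh sh_train_csft.sh && \
sh sh_run_iem_mntp.sh && \
sh sh_run_iem_SimCSE.sh && \
sh sh_run_extract_embedding.sh && \
sh sh_run_downstream.sh
```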
- `configs`: Configurations for all training and evaluation.
- `run_LLM`: Downstream tasks.
  - `downstream_model_class`: Downstream model definitions.
    - `data_classes.py`: Training and model configuration protocol for downstream tasks.
    - `model_classes.py`: Definition of the `DownstreamModel` class, along with `GRU4Rec` and `SASRec`.
    - `modules.py`: NN modules for downstream model definition, including `FNN` and `TransformerBlock`.
  - `datasest.py`: (typo; originally intended to be `datasets.py`) The ID dataset wrapper of `Dataset` for downstream tasks.
  - `encoder.py`: Encoder to extract embeddings for downstream tasks.
  - `modules.py`: `DownstreamTrainSuite` class definition and a `Main` class for running the full cycle of downstream model training in one category.
- `train_LLM`: Upstream tasks.
  - `data_classes.py`: Training and model configuration protocol for upstream training tasks.
  - `modules.py`: `TrainSuite` class definition, along with a utility class for CSFT.
  - `train_csft.py`: Definition of the `CSFTTrainSuite` class and CSFT training logic.
  - `train_iem_mntp.py`: Definition of the `MNTPTrainSuite` class and MNTP training logic.
  - `train_iem_simcse.py`: Definition of the `SimCSETrainSuite` class and SimCSE training logic.
- `utils`: Utilities.
  - `data_process.py`: Dataset processing logic and `CategoryLoader` class definition.
  - `hf.py`: Hugging Face repo push and pull logic.
- `data`: All related data, including datasets and benchmark results.
- `output`: All related models.
LLM2Rec: *LLM2Rec: Large Language Models Are Powerful Embedding Models for Sequential Recommendation*
```bibtex
@inproceedings{he2025llm2rec,
  title={LLM2Rec: Large Language Models Are Powerful Embedding Models for Sequential Recommendation},
  author={He, Yingzhi and Liu, Xiaohao and Zhang, An and Ma, Yunshan and Chua, Tat-Seng},
  booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2},
  pages={896--907},
  year={2025}
}
```

Amazon Reviews 2023
```bibtex
@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}
```

SimCSE
```bibtex
@misc{gao2022simcsesimplecontrastivelearning,
  title={SimCSE: Simple Contrastive Learning of Sentence Embeddings},
  author={Tianyu Gao and Xingcheng Yao and Danqi Chen},
  year={2022},
  eprint={2104.08821},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2104.08821}
}
```