Ubuntu 18.04
python==3.8
torch==1.7.1
transformers==4.6.1
tqdm==4.64.0
numpy==1.22.3
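
If you want a quick sanity check of the environment, the following minimal sketch (an optional helper, not part of the repository) compares installed versions with the pins above:

# Optional sanity check: compare installed versions with the pins above.
import numpy
import torch
import tqdm
import transformers

expected = {"torch": "1.7.1", "transformers": "4.6.1", "tqdm": "4.64.0", "numpy": "1.22.3"}
installed = {"torch": torch.__version__, "transformers": transformers.__version__,
             "tqdm": tqdm.__version__, "numpy": numpy.__version__}
for name, want in expected.items():
    got = installed[name]
    print(f"{name}: {got}" + ("" if got.startswith(want) else f" (expected {want})"))
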
We fetch the JCSD and PCSD datasets from https://github.com/gingasan/sit3.
To get the top N action words of each dataset, you can run the code as follows:
python utils/split.py \
--dataset_name JCSD \
--aw_cls 40
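
Here the --aw_cls flag sets the number of action-word classes to keep (40). As a rough sketch of what this step computes (assuming the action word is the first token of a summary; utils/split.py may differ in detail):

# Sketch: collect the top-N action words, assumed to be the first token of each summary.
from collections import Counter

def top_action_words(summaries, n=40):
    counts = Counter(s.split()[0].lower() for s in summaries if s.strip())
    return [word for word, _ in counts.most_common(n)]
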
To get the deduplicated dataset, you can run the code as follows:

python dataset/build_JCSD_PCSD.py
python dataset/build_SiT.py
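
These two scripts build the deduplicated datasets. The underlying idea, sketched under the assumption of exact-match deduplication on the code field (the actual build scripts may use a different criterion):

# Sketch: drop examples whose (whitespace-normalized) code has already been seen.
def deduplicate(examples):
    seen, unique = set(), []
    for ex in examples:  # ex is assumed to look like {"code": ..., "summary": ...}
        key = " ".join(ex["code"].split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
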
To fine-tune the unified encoder, you can run the code as follows:

python encoder_finetune.py \
--output_dir outputdir/ESALE \
--dataset_name JCSD \
--model_name_or_path microsoft/unixcoder-base \
--with_test \
--with_mlm \
--with_ulm \
--with_awp \
--with_cuda \
--epochs 50

Since it takes too much time to generate summaries for the full test set, we randomly choose 10% of it as test_demo when training the decoder.
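
A minimal sketch of that sampling step (the file paths and line-per-example format are assumptions, not necessarily what the scripts use):

# Sketch: keep a random 10% of the test split as test_demo.
import random

def make_test_demo(src, dst, ratio=0.1, seed=42):
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    random.seed(seed)
    demo = random.sample(lines, max(1, int(len(lines) * ratio)))
    with open(dst, "w", encoding="utf-8") as f:
        f.writelines(demo)

# e.g. make_test_demo("dataset/JCSD/test.jsonl", "dataset/JCSD/test_demo.jsonl")  # hypothetical paths
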
To fine-tune the decoder, you can run the code as follows:

python decoder_finetune.py \
--output_dir outputdir/ESALE \
--dataset_name JCSD \
--model_name_or_path microsoft/unixcoder-base \
--unified_encoder_path outputdir/ESALE/unified_encoder_model/model.pth \
--do_train \
--do_eval \
--do_pred \
--with_cuda \
--eval_steps 5000 \
--train_steps 100000

To generate summaries with the trained model, you can run the code as follows:

python predict.py \
--output_dir outputdir/ESALE \
--dataset_name JCSD \
--model_name_or_path microsoft/unixcoder-base \
--unified_encoder_path outputdir/ESALE/unified_encoder_model/model.pth \
--load_model_path outputdir/ESALE/checkpoint-best-bleu/model.bin \
--with_cuda
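
To score the generated summaries against the references, one option is smoothed sentence-level BLEU (a sketch only; nltk is not among the pinned requirements above, and the repository's own evaluation may differ):

# Sketch: average smoothed sentence-level BLEU over (reference, hypothesis) pairs.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def avg_bleu(references, hypotheses):
    smooth = SmoothingFunction().method4
    scores = [sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
              for ref, hyp in zip(references, hypotheses)]
    return sum(scores) / len(scores) if scores else 0.0
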