SASAR is a flexible text-editing approach to generation, designed to derive maximum benefit from decoding with bi-directional contexts and self-supervised pretraining. We achieve this by decomposing the text-editing task into two sub-tasks: tagging, which decides which input tokens to keep and their order in the output, and insertion, which in-fills the output tokens not present in the input.
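To make the decomposition concrete, here is a toy illustration (not the actual SASAR implementation) that greedily aligns a source sentence to a paraphrase, tagging each source token KEEP or DELETE and recording which target tokens must be inserted at each kept position:

```python
# Toy tagging + insertion decomposition: an illustrative sketch only;
# the real model uses a learned tagger and insertion model.

def decompose(source, target):
    """Greedy left-to-right alignment of source tokens onto target tokens.

    Returns (tags, insertions): tags[i] labels source[i], and
    insertions[k] lists the target tokens inserted before the k-th kept
    token (the final slot holds trailing insertions).
    """
    tags = []
    insertions = [[]]  # insertions[0]: tokens inserted before the first KEEP
    t = 0
    for tok in source:
        if tok in target[t:]:
            j = target.index(tok, t)
            insertions[-1].extend(target[t:j])  # target tokens missing so far
            insertions.append([])
            tags.append("KEEP")
            t = j + 1
        else:
            tags.append("DELETE")
    insertions[-1].extend(target[t:])  # whatever remains of the target
    return tags, insertions

src = "the movie was really good".split()
tgt = "the film was very good".split()
tags, ins = decompose(src, tgt)
print(tags)  # ['KEEP', 'DELETE', 'KEEP', 'DELETE', 'KEEP']
print(ins)   # [[], ['film'], ['very'], []]
```

Note how most output tokens are reused from the input, and the insertion model only has to generate the genuinely new words ("film", "very") — exactly the property that makes this formulation attractive for paraphrasing.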
⚠️ Disclaimer
SASAR was designed with paraphrase generation in mind. In this task, a large portion of the output tokens are often identical to the input text, which makes a tagging + insertion formulation especially appealing.
In our experiments, however, SASAR did not consistently surpass strong baselines. Still, the decomposition into “reuse what’s already in the input” + “generate only what’s new” may inspire further work, and the code here can serve as a foundation for others exploring edit-based approaches to paraphrasing.
Running an experiment with SASAR consists of the following steps:
- Create the label map for the tagging model.
- Convert the data for the tagging/insertion models.
- Finetune the tagging/insertion models.
- Compute predictions.
But don't worry! Everything is readily available as bash scripts.
sh ./scripts/vocabulary_constructor.sh
This just creates a dictionary with the Tagger labels.
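Conceptually, the label map is just a dictionary from tag strings to integer ids. A minimal sketch of what such a file might contain — the label set and format here are hypothetical, and the actual script may differ:

```python
import json

# Hypothetical tagger label set; the real vocabulary_constructor.sh may
# emit different labels (e.g. composite KEEP|... tags) or another format.
labels = ["PAD", "KEEP", "DELETE", "KEEP|MASK", "DELETE|MASK"]
label_map = {label: idx for idx, label in enumerate(labels)}

with open("label_map.json", "w") as f:
    json.dump(label_map, f, indent=2)

print(label_map["KEEP"])  # 1
```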
We use two datasets: PAWS, available from the official repository or, for ease of use, through the HuggingFace Hub; and Quora Question Pairs, available from the official post or through the HuggingFace Hub (the pair subset).
# For Felix-type data preprocessing.
sh ./scripts/preprocess_data.sh
Or
# For SASAR-type data preprocessing.
sh ./scripts/preprocess_data_with_data.sh
For the latter, you may want to pre-extract the AMR graphs (although this is not required). You can do so by running:
sh ./scripts/run_amr_cache.sh
Additionally, you will need to fuse the tagging and insertion datasets with:
python join_tag_and_insert_data.py
Run it once for the training set and once for the validation set.
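The fusion step pairs each tagging example with its insertion counterpart. A hypothetical sketch of what such a join might look like — the record fields and logic of the actual `join_tag_and_insert_data.py` may differ:

```python
# Hypothetical fusion of tagging and insertion examples, keyed by example
# id; illustrative only, not the repository's actual implementation.

def join_examples(tag_examples, insert_examples):
    by_id = {ex["id"]: ex for ex in insert_examples}
    joined = []
    for tag_ex in tag_examples:
        ins_ex = by_id.get(tag_ex["id"])
        if ins_ex is None:
            continue  # drop examples without an insertion counterpart
        joined.append({
            "id": tag_ex["id"],
            "tags": tag_ex["tags"],
            "insertions": ins_ex["insertions"],
        })
    return joined

tag_data = [{"id": 0, "tags": ["KEEP", "DELETE"]}]
insert_data = [{"id": 0, "insertions": [[], ["film"]]}]
print(join_examples(tag_data, insert_data))
```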
There are several flavours of models to train; check the scripts directory for train_sasar_*.sh.
Since the tagger and inserter are separate models, they can be trained independently, so it may be quicker to train them in parallel rather than sequentially.
The same goes for prediction: check the scripts directory for predict_sasar_*.sh.
MIT; see LICENSE for details.