SASAR is a flexible text-editing approach to generation, designed to derive maximum benefit from decoding with bi-directional contexts and self-supervised pretraining. We achieve this by decomposing the text-editing task into two sub-tasks: tagging, which decides which input tokens to keep and their order in the output, and insertion, which in-fills the output tokens not present in the input.
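To make the decomposition concrete, here is a toy illustration (not the actual SASAR implementation) that greedily aligns a source sentence to a paraphrase, tagging each source token KEEP or DELETE and recording which target tokens must be inserted at each kept position:

```python
# Toy tagging + insertion decomposition: an illustrative sketch only;
# the real model uses a learned tagger and insertion model.

def decompose(source, target):
    """Greedy left-to-right alignment of source tokens onto target tokens.

    Returns (tags, insertions): tags[i] labels source[i], and
    insertions[k] lists the target tokens inserted before the k-th kept
    token (the final slot holds trailing insertions).
    """
    tags = []
    insertions = [[]]  # insertions[0]: tokens inserted before the first KEEP
    t = 0
    for tok in source:
        if tok in target[t:]:
            j = target.index(tok, t)
            insertions[-1].extend(target[t:j])  # target tokens missing so far
            insertions.append([])
            tags.append("KEEP")
            t = j + 1
        else:
            tags.append("DELETE")
    insertions[-1].extend(target[t:])  # whatever remains of the target
    return tags, insertions

src = "the movie was really good".split()
tgt = "the film was very good".split()
tags, ins = decompose(src, tgt)
print(tags)  # ['KEEP', 'DELETE', 'KEEP', 'DELETE', 'KEEP']
print(ins)   # [[], ['film'], ['very'], []]
```

Note how most output tokens are reused from the input, and the insertion model only has to generate the genuinely new words ("film", "very") — exactly the property that makes this formulation attractive for paraphrasing.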
⚠️ Disclaimer
SASAR was designed with paraphrase generation in mind. In this task, a large portion of the output tokens are often identical to the input text, which makes a tagging + insertion formulation especially appealing.
In our experiments, however, SASAR did not consistently surpass strong baselines. Still, the decomposition into “reuse what’s already in the input” + “generate only what’s new” may inspire further work, and the code here can serve as a foundation for others exploring edit-based approaches to paraphrasing.
Running an experiment with SASAR consists of the following steps:
- Create the label map for the tagging model.
- Convert the data for the tagging/insertion models.
- Finetune the tagging/insertion models.
- Compute predictions.
But don't worry! Everything is readily available as bash scripts.
sh ./scripts/vocabulary_constructor.sh
This just creates a dictionary with the Tagger labels.
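Conceptually, the label map is just a dictionary from tag strings to integer ids. A minimal sketch of what such a file might contain — the label set and format here are hypothetical, and the actual script may differ:

```python
import json

# Hypothetical tagger label set; the real vocabulary_constructor.sh may
# emit different labels (e.g. composite KEEP|... tags) or another format.
labels = ["PAD", "KEEP", "DELETE", "KEEP|MASK", "DELETE|MASK"]
label_map = {label: idx for idx, label in enumerate(labels)}

with open("label_map.json", "w") as f:
    json.dump(label_map, f, indent=2)

print(label_map["KEEP"])  # 1
```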
We use two datasets: PAWS, available from the official repository or, for ease of use, through the HuggingFace Hub; and Quora Question Pairs, available from the official post or through the HuggingFace Hub (the pair subset).
# For Felix-type data preprocessing.
sh ./scripts/preprocess_data.sh
Or
# For SASAR-type data preprocessing.
sh ./scripts/preprocess_data_with_data.sh
For the latter, you may want to pre-extract the AMR graphs (although this is not required). You can do so by running:
sh ./scripts/run_amr_cache.sh
Additionally, you will need to fuse the tagging and insertion datasets with:
python join_tag_and_insert_data.py
Run it once for the training set and once for the validation set.
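The fusion step pairs each tagging example with its insertion counterpart. A hypothetical sketch of what such a join might look like — the record fields and logic of the actual `join_tag_and_insert_data.py` may differ:

```python
# Hypothetical fusion of tagging and insertion examples, keyed by example
# id; illustrative only, not the repository's actual implementation.

def join_examples(tag_examples, insert_examples):
    by_id = {ex["id"]: ex for ex in insert_examples}
    joined = []
    for tag_ex in tag_examples:
        ins_ex = by_id.get(tag_ex["id"])
        if ins_ex is None:
            continue  # drop examples without an insertion counterpart
        joined.append({
            "id": tag_ex["id"],
            "tags": tag_ex["tags"],
            "insertions": ins_ex["insertions"],
        })
    return joined

tag_data = [{"id": 0, "tags": ["KEEP", "DELETE"]}]
insert_data = [{"id": 0, "insertions": [[], ["film"]]}]
print(join_examples(tag_data, insert_data))
```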
There are several flavours of models to train; check the scripts directory for train_sasar_*.sh.
Since the tagger and inserter are separate models, they can be trained independently, so it may be quicker to train them in parallel rather than sequentially.
The same goes for prediction: check the scripts directory for predict_sasar_*.sh.
MIT; see LICENSE for details.