This repo contains the code for the paper PAM: Paraphrase AMR-Centric Evaluation Metric, by Afonso Sousa & Henrique Lopes Cardoso (ACL Findings 2025).
Paraphrasing is rooted in semantics, which makes evaluating paraphrase generation systems hard. Current paraphrase generators are typically evaluated with metrics borrowed from adjacent text-to-text tasks, such as machine translation or text summarization. These metrics tend to be tied to the surface form of the reference text, which is not ideal for paraphrases: we typically want lexical variation while preserving semantics. To address this problem, and inspired by learned similarity evaluation on plain text, we propose PAM, a Paraphrase AMR-Centric Evaluation Metric. PAM uses AMR graphs extracted from the input text; these semantic structures are agnostic to the surface form, making the resulting metric more robust to variations in syntax or lexicon. Additionally, we evaluated PAM on different semantic textual similarity datasets and found that it improves correlation with human semantic scores compared to other AMR-based metrics.
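As a toy illustration of why graph-based comparison is less sensitive to surface form (this is not PAM itself; the AMRs and the naive extractor below are hand-written examples):

```python
import re

def concepts(amr: str) -> set[str]:
    """Extract concept labels (the token after each '/') from a
    linearized AMR string. A naive toy extractor, not a real AMR parser."""
    return set(re.findall(r"/\s*([\w-]+)", amr))

def jaccard(a: set, b: set) -> float:
    """Overlap between two concept sets."""
    return len(a & b) / len(a | b)

# Hand-written AMRs for two paraphrases with different surface realizations:
amr_a = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
amr_b = "(w / want-01 :ARG1 (g / go-02 :ARG0 (b / boy)) :ARG0 b)"

# Despite the reordering, the underlying concepts are identical.
print(jaccard(concepts(amr_a), concepts(amr_b)))  # 1.0
```

Real AMR metrics operate on the full graph structure (concepts, roles, and reentrancies), but the idea is the same: comparison happens at the semantic level rather than on the token sequence.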
First, create a fresh conda environment with all required dependencies:

```shell
conda env create -f environment.yml
```
Additionally, most scripts require a pretrained AMR parser. We used `parse_xfm_bart_large` from here. Download it, rename the directory to `amr_parser`, and place it in the repository root.
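A minimal sketch of that step (the download path is a placeholder; adjust it to wherever you saved the checkpoint):

```shell
# Place the parser checkpoint in the repository root under amr_parser/.
mkdir -p amr_parser
# mv /path/to/parse_xfm_bart_large/* amr_parser/   # uncomment after downloading
ls -d amr_parser
```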
Follow the instructions in data/README to extract the third-party data into the `data/` folder, using the layout below:
```
data
└── dataset_name
    └── main
        └── raw
            ├── src.dev.amr
            ├── src.test.amr
            ├── tgt.dev.amr
            └── tgt.test.amr
```
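A sketch for creating that skeleton for one dataset (`dataset_name` is a placeholder; use your actual dataset's name):

```shell
# Create the expected directory layout and the four AMR split files.
mkdir -p data/dataset_name/main/raw
touch data/dataset_name/main/raw/src.dev.amr \
      data/dataset_name/main/raw/src.test.amr \
      data/dataset_name/main/raw/tgt.dev.amr \
      data/dataset_name/main/raw/tgt.test.amr
ls data/dataset_name/main/raw
```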
Then use merge_dataset.sh to merge the information into a JSON file. For the example above, the output file should be placed under `main/`.
To train/test PAM or any other model referred to in the paper, run the corresponding script. For example:

```shell
sh ./scripts/train_pam.sh
sh ./scripts/test_pam.sh
```
To further fine-tune the trained model on Quora Question Pairs (QQP), run:

```shell
sh ./scripts/paraphrase_finetune.sh
```
For many experiments reported in the paper, we used third-party libraries integrated into our source code, which require you to extract them to the root directory and potentially install the respective packages -- for example, AlignScore.
Others, like WWLK, were computed using the original source code.
Some files were used for smaller, single experiments:
- computing static embeddings
- plotting PAM and SBERT score distribution
- auxiliary code to compute AMR similarity metrics
- computing computational cost
- computing statistics for the ETPC dataset
This project used code and took inspiration from the following open source projects:
This project is released under the MIT License.