Transformations for tractable SFG parsing

Overview

This repository contains the code that was used to train and evaluate the models for my master thesis at Saarland University.

Requirements

Install Typer and NLTK.

pip install typer
pip install --user -U nltk

Follow the official instructions to install the Berkeley parser.

Dataset

Build the ellipsis corpus by converting the original SFGbank to graphs and using the conventional splits of the Penn Treebank:

Training: Sections 2-21 of the WSJ
Development: Section 22 of the WSJ
Testing: Section 23 of the WSJ

The graphs contain main edges and secondary edges pointing to elided constituents.

Usage:

python dataset/build_dataset.py [OPTIONS] INPUT_DIR OUTPUT_DIR

Argument	Type	Description	Default
INPUT_DIR	Required	Path to the directory containing the original SFGbank	-
OUTPUT_DIR	Required	Path to the directory where to save the dataset	-
--gen-str-files	Optional	Exclude sentences that exceed the maximum length allowed by BERT	False
--max-len	Optional	Exclude sentences that do not contain ellipsis	False

Example:

python dataset/build_dataset.py data/sfgbank data/graphs --gen-str-files --max-len

Preprocessing

Convert the graphs to phrase-structure trees, using a specific strategy to encode ellipsis.

Usage:

python preprocessing/graph2tree.py [OPTIONS] INPUT_DIR OUTPUT_DIR

Argument	Type	Description	Default
INPUT_DIR	Required	Path to the directory containing the dataset	-
OUTPUT_DIR	Required	Path to the directory where to save the trees	-
--strategy TEXT	Optional	Strategy for encoding ellipsis: start / start-without-pos / end / end-extra-node	end-extra-node
Example:

python preprocessing/graph2tree.py data/graphs data/trees-end-extra-node --strategy end-extra-node

Training

Train the Berkeley parser with the trees encoding ellipsis.

python src/main.py train --use-bert --model-path-base models/end-extra-node --bert-model "bert-large-uncased" --num-layers 2 --learning-rate 0.00005 --batch-size 32 --eval-batch-size 16 --subbatch-max-tokens 500 --train-path data/end-extra-node/train.txt --dev-path data/end-extra-node/dev.txt --predict-tags

Parsing

Parse the test sentences with the trained model.

python src/main.py parse --model-path-base models/end-extra-node/_dev\=93.12.pt --input-path data/strings/test.txt --output-path data/end-extra-node-predicted/test.txt --eval-batch-size 16

Postprocessing

Convert the predicted trees to graphs, recovering ellipsis with the chosen strategy.

Usage:

python postprocessing/graph2tree.py [OPTIONS] INPUT_DIR OUTPUT_DIR

Argument	Type	Description	Default
INPUT_DIR	Required	Path to the directory containing the dataset	-
OUTPUT_DIR	Required	Path to the directory where to save the trees	-
--strategy TEXT	Optional	Strategy for encoding ellipsis: start / start-without-pos / end / end-extra-node	end-extra-node

Example:

python postprocessing/graph2tree.py data/trees-end-extra-node-predicted data/graphs-end-extra-node-predicted --strategy end-extra-node

Evaluation

Evaluate the predicted graphs. To avoid alignment issues, the graphs are evaluated as dependencies.

Usage:

python eval/eval.py GOLD_FILE PREDICTED_FILE

Argument	Type	Description	Default
GOLD_FILE	Required	Path to the file containing the gold test graphs	-
PREDICTED_FILE	Required	Path to the file containing the predicted test graphs	-

Example:

python eval/eval.py data/graphs/test.json data/graphs-end-extra-node-predicted/test.json

Results

Results for all edges, non-elided edges and elided-edges.

All edges	UP	UR	UF	LP	LR	LF
start	86.5	86.5	86.5	85.1	85.1	85.1
start-without-pos	86.6	86.6	86.6	85.2	85.2	85.2
end	87.7	87.7	87.7	86.1	86.1	86.1
end-extra-node	88.8	88.8	88.8	87.5	87.5	87.5
start-end-extra-node	88.9	88.9	88.9	87.3	87.3	87.3
start-end-extra-node-heuristic	89.1	89.1	89.1	87.6	87.6	87.6
sdp	88.0	88.0	88.0	86.5	86.5	86.5

Elided edges	UP	UR	UF	LP	LR	LF
start	65.6	44.5	53.0	65.1	44.2	52.6
start-without-pos	65.5	40.6	50.1	64.9	40.3	49.7
end	75.6	61.4	67.7	75.6	61.4	67.7
end-extra-node	77.0	60.7	67.9	77.0	60.7	67.9
start-end-extra-node	86.6	62.7	72.7	84.3	61.0	70.8
start-end-extra-node-heuristic	87.0	67.2	75.8	83.2	64.3	72.6
sdp	81.2	77.0	79.0	80.5	76.3	78.3

Non-elided edges	UP	UR	UF	LP	LR	LF
start	90.8	90.8	90.8	89.4	89.4	89.4
start-without-pos	91.0	91.0	91.0	89.5	89.5	89.5
end	89.8	89.8	89.8	88.2	88.2	88.2
end-extra-node	91.1	91.1	91.1	89.7	89.7	89.7
start-end-extra-node	91.2	91.2	91.2	89.5	89.5	89.5
start-end-extra-node-heuristic	91.0	91.0	91.0	89.7	89.7	89.7
sdp	90.1	90.1	90.1	88.6	88.6	88.6

Comparison to previous SFG parsing work

We used the same evaluation method and corpora from Costetchi (2020) to compare its performance with that of our system. To evaluate our system, we preprocessed the text with previous_work_comparison/tokenize.py, parsed the constituency trees and extracted constituent spans from them with previous_work_comparison/ptb2mcg.py. We deleted duplicates from the gold annotated spans and the spans predicted by Costetchi's parser with previous_work_comparison/simplify_gold.py. We only considered constituency annotations and ignored the SFG features as well as ellipsis recovery, since Costetchi's system does not handle it. Below are the results for the OE and OCD corpora and an average of both.

OE corpus	Precision	Recall	FScore
Costetchi (2020)	50.16	97.15	65.73
Our system	54.42	93.75	68.75

OCD corpus	Precision	Recall	FScore
Costetchi (2020)	61.45	90.67	73.25
Our system	76.32	94.61	84.48

Average	Precision	Recall	FScore
Costetchi (2020)	55.81	93.91	69.49
Our system	65.36	94.18	76.61

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
.gitignore		.gitignore
dataset		dataset
eval		eval
postprocessing		postprocessing
preprocessing		preprocessing
previous_work_comparison		previous_work_comparison
README.md		README.md
thesis.pdf		thesis.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Transformations for tractable SFG parsing

Overview

Requirements

Dataset

Preprocessing

Training

Parsing

Postprocessing

Evaluation

Results

Comparison to previous SFG parsing work

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Transformations for tractable SFG parsing

Overview

Requirements

Dataset

Preprocessing

Training

Parsing

Postprocessing

Evaluation

Results

Comparison to previous SFG parsing work

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages