Noun-Noun Compounds

This repository contains several attempts to study entropy using the Bible, including training LSTMs, computing perplexities with GPT-2, and performing word-pasting and word-splitting experiments. A much more limited version of this repository, containing only the word-pasting experiments, is available in the BibleWordPasting repository.

2025 Update: this paper has been published in Coling-Rel, so there is no longer any need for anonymity. However, the link provided in the camera-ready paper is anonymous. The full reference, including the link to the published paper, is:

Mosteiro, P., & Blasi, D. (2025). Word boundaries and the morphology-syntax trade-off. In S. Yagi, S. Yagi, M. Sawalha, B. A. Shawar, A. T. AlShdaifat, N. Abbas, & Organizers (Eds.), Proceedings of the New Horizons in Computational Linguistics for Religious Texts (pp. 86–93). Association for Computational Linguistics. https://aclanthology.org/2025.clrel-1.9/

The GitHub repository is WordOrderBibles, owned by PabloMosUU. This repository also supports the paper West Germanic noun-noun compounds and the morphology-syntax trade-off, by Mosteiro, Blasi, and Paperno. The main code used for this analysis is in nn_pasting.py and NounNounCompounds/11_final_paper_plots.py.

Requirements

Python packages required to run this code can be found in requirements.txt. You may install them all at once with:

pip install -r requirements.txt

Mismatcher

The mismatcher [1] is a Java program used in previous work to compute, for each position in a text, the shortest unseen substring at that position [2]. Those values are then used to compute an approximation to the entropy [3].
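To make the two steps above concrete, here is a minimal Python sketch of the idea. It is not the mismatcher itself (which is a Java program) and is far slower. The function names are hypothetical, and the estimator is written in one common form, the inverse of the average of l_i / log2(i) over positions i >= 2; the exact conventions and normalisation used in [2] and [3] may differ.

```python
import math


def shortest_unseen_lengths(text: str) -> list[int]:
    """For each position i, return the length of the shortest substring
    starting at i that does not also start at an earlier position.
    Naive implementation, for illustration only."""
    lengths = []
    n = len(text)
    for i in range(n):
        length = 1
        # Grow the candidate while it still occurs starting before i.
        # Limiting the search window to text[0:i + length - 1] forces any
        # match to start at a position j <= i - 1.
        while i + length <= n and text.find(text[i:i + length], 0, i + length - 1) != -1:
            length += 1
        lengths.append(length)
    return lengths


def entropy_estimate(lengths: list[int]) -> float:
    """Assumed estimator form: H ~ N / sum_{i=2..N} l_i / log2(i),
    with positions i counted from 1."""
    n = len(lengths)
    if n < 2:
        raise ValueError("need at least two positions")
    total = sum(l / math.log2(i) for i, l in enumerate(lengths, start=1) if i >= 2)
    return n / total
```

For example, shortest_unseen_lengths("abab") gives [1, 1, 3, 2]: at the third position, "a" and "ab" have both been seen before, and the text ends before a longer candidate can be formed, so the value is the match length plus one.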

Running this code

The main entry points are word_pasting.py and word_splitting.py. The usage of these programs is:

python word_pasting.py bible_filename temp_dir output_filename mismatcher_filename

python word_splitting.py bible_filename temp_dir output_filename mismatcher_filename n_merges_full

The parameters are:

bible_filename: the full path of the file containing a single translation of the bible coming from the Parallel Bible Corpus (or any compatible format)
temp_dir: a directory in which temporary files (not the output) will be saved
output_filename: the directory in which output files will be saved
mismatcher_filename: the full path of the Java binary used to compute the lengths of the shortest unseen strings at each position in the bible translation file (see details above)
n_merges_full: the maximum number of merges passed to the BPE algorithm (see details below)
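When running the experiments over many translations, it can help to assemble the argument lists programmatically. The sketch below only builds the commands in the order documented above; the helper names and all paths in the usage note are hypothetical placeholders, not files shipped with this repository.

```python
import sys


def build_pasting_command(bible_filename, temp_dir, output_filename, mismatcher_filename):
    """Argument list for word_pasting.py, in the documented order."""
    return [
        sys.executable, "word_pasting.py",
        str(bible_filename), str(temp_dir),
        str(output_filename), str(mismatcher_filename),
    ]


def build_splitting_command(bible_filename, temp_dir, output_filename,
                            mismatcher_filename, n_merges_full):
    """Argument list for word_splitting.py; n_merges_full goes last."""
    return [
        sys.executable, "word_splitting.py",
        str(bible_filename), str(temp_dir),
        str(output_filename), str(mismatcher_filename),
        str(n_merges_full),
    ]
```

A command built this way can be passed to subprocess.run(cmd, check=True) from the repository root, for instance with build_splitting_command("/path/to/bible.txt", "/path/to/tmp", "/path/to/output", "/path/to/mismatcher.jar", 10000), where every path is a placeholder.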

`n_merges_full`

n_merges_full is the number of merges used to train the BPE tokenizer; it should be large enough to build the entire merge history. In [4] we used 10000 for most bibles. The program prints a warning if this number is too low to generate the entire merge history. For bibles that trigger this warning, we recommend using n_merges_full = 30000.
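The following toy BPE trainer illustrates what a complete merge history means: merging continues until every word has collapsed into a single token, so a merge budget that is too small stops early. This is a simplified sketch and not the tokenizer used in this repository; train_bpe and its return values are hypothetical names introduced for illustration.

```python
from collections import Counter


def train_bpe(words, n_merges):
    """Toy BPE over a list of words. Returns (merges, complete), where
    complete is True if every word ended up as a single token."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(n_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            # Nothing left to merge: the merge history is complete.
            return merges, True
        # Most frequent pair; ties broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    # Budget exhausted: complete only if no word still has several symbols.
    return merges, all(len(symbols) == 1 for symbols in vocab)
```

With words = ["low", "low", "lower"], a budget of 2 merges leaves "lower" split into several tokens (complete is False), while any budget of 4 or more yields exactly 4 merges and a complete history. The warning described above corresponds to the first situation.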

References

[1] Koplenig, A., Meyer, P., Wolfer, S., & Müller-Spitzer, C. (2017). Replication Data for: The statistical trade-off between word order and word structure – large-scale evidence for the principle of least effort. https://doi.org/10.7910/DVN/8KH0GB

[2] Koplenig, A., Meyer, P., Wolfer, S., & Müller-Spitzer, C. (2017). The statistical trade-off between word order and word structure – Large-scale evidence for the principle of least effort. PLOS ONE, 12(3), 1–25. https://doi.org/10.1371/journal.pone.0173614

[3] Kontoyiannis, I., Algoet, P., Suhov, Y., & Wyner, A. (1998). Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text. IEEE Transactions on Information Theory, 44, 1319–1327. https://doi.org/10.1109/18.669425

[4] Mosteiro, P., & Blasi, D. (2025). Word boundaries and the morphology-syntax trade-off. In S. Yagi, S. Yagi, M. Sawalha, B. A. Shawar, A. T. AlShdaifat, N. Abbas, & Organizers (Eds.), Proceedings of the New Horizons in Computational Linguistics for Religious Texts (pp. 86–93). Association for Computational Linguistics. https://aclanthology.org/2025.clrel-1.9/

