Code and data for Evolution on the Lexical Workbench: Disentangling Frequency, Centrality, and Polysemy in Language Evolution
The codebase consists of three steps:
- Sense compilation
Uses a recursive algorithm to traverse the spaCy syntax tree and identify all relevant adjective–noun pairs.
The output is a two-level hierarchy of directories: the first level has one directory per unique word, and the second level has one subdirectory per decade. Each decade subdirectory contains:
(1) A NumPy matrix (.npy) containing the embeddings for all nouns the adjective modified.
(e.g., for the collocations RED car, SMALL car, and BIG car in 1850, the embeddings matrix for 1850 would have shape (3, d), where d is the dimensionality of the BERT embeddings).
(2) A two-column CSV where the first column is the relevant adjective and the second column is all of the nouns included in the embedding matrix.
- Semantic compilation
Takes the path containing the 2-level hierarchy of word/decade directories as input, traverses the directories, and compiles semantic measures (scalar values) for all embeddings matrices.
Result is a wide-format CSV where each row is a word and each column is an instance of measure x in decade y.
(The number of columns will be equal to n_measures * n_decades.)
- Causal compilation
Takes the wide-format CSV containing all words/measures and runs a generalized measures of correlation (GMC) analysis to determine causal relations.
In our case, the null hypothesis H₀ is that there is no unique variance in time series y explained by variance in time series x, above and beyond the unique variance in time series x explained by y.
Rejecting the null means that generalized measures of correlation identify a unique effect of x on y, asymmetric from the effect of y on x, and significantly different from the effect of x on itself.
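For orientation, generalized measures of correlation are usually defined as the share of one variable's variance that is explained by its conditional mean given the other variable. The block below gives that standard definition from the GMC literature as a reading aid only. The estimator and significance test actually used in this codebase are not described here and may differ.

```latex
\mathrm{GMC}(Y \mid X) = 1 - \frac{\mathbb{E}\!\left[\{Y - \mathbb{E}(Y \mid X)\}^{2}\right]}{\operatorname{Var}(Y)},
\qquad
\mathrm{GMC}(X \mid Y) = 1 - \frac{\mathbb{E}\!\left[\{X - \mathbb{E}(X \mid Y)\}^{2}\right]}{\operatorname{Var}(X)}
```

Unlike a squared Pearson correlation, the two quantities are generally not equal, which is what allows the effect of x on y to be compared with the effect of y on x.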
To get started, simply run:
    ./setup_env.sh
You will be prompted to provide a Hugging Face token in order to access BERT (needed for contextual embeddings in step 1). Then run:
    conda activate lexical_evolution_env
Note that you need to provide the folder of COHA CSVs yourself.
Each of the three compilation steps can be run individually in exp_compile.py, or sequentially in one command (run_full_experiment).
    python run_sense_compile.py csv_folder output_dir --decade
Run sense compilation for a folder of CSVs and export the two-level word–decade sense hierarchy to an output directory.
The optional --decade flag restricts the run to the CSV for a single decade (e.g., --decade 1840).
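To give a feel for the recursive traversal that sense compilation relies on, here is a minimal sketch. It is not the code in this repository. The `Node` class and the `adjective_noun_pairs` function are hypothetical names. `Node` is a stand-in that exposes the same attributes as a spaCy token (`text`, `pos_`, `dep_`, `children`), so the example runs without a language model. The sketch also assumes that adjectival modifiers carry the `amod` dependency label.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""
    text: str
    pos_: str
    dep_: str
    children: List["Node"] = field(default_factory=list)


def adjective_noun_pairs(token) -> Iterator[Tuple[str, str]]:
    """Depth-first walk yielding (adjective, noun) for each adjectival modifier."""
    for child in token.children:
        # An ADJ attached to a NOUN by the "amod" relation modifies that noun.
        if token.pos_ == "NOUN" and child.pos_ == "ADJ" and child.dep_ == "amod":
            yield (child.text.lower(), token.text.lower())
        # Recurse so that pairs nested deeper in the tree are found as well.
        yield from adjective_noun_pairs(child)


# "The small red car hit a big wall", written out by hand as a dependency tree.
car = Node("car", "NOUN", "nsubj", [Node("The", "DET", "det"),
                                    Node("small", "ADJ", "amod"),
                                    Node("red", "ADJ", "amod")])
wall = Node("wall", "NOUN", "dobj", [Node("a", "DET", "det"),
                                     Node("big", "ADJ", "amod")])
root = Node("hit", "VERB", "ROOT", [car, wall])

pairs = list(adjective_noun_pairs(root))
print(pairs)
```

Because the function only reads `text`, `pos_`, `dep_`, and `children`, it should also accept the root token of a parsed spaCy sentence, although that has not been checked against this repository's pipeline.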
    python run_semantic_compile.py staging_root output_csv_dir
Run semantic compilation for the folder of CSVs (in the staging_root directory) and export a single wide CSV with all measures to output_csv_dir.
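The wide layout can be pictured with a small sketch that uses only the standard library. The function name `to_wide`, the `measure_decade` column naming, and the numbers are illustrative assumptions, not output from this codebase. The point is that each word becomes one row and the number of value columns is n_measures * n_decades.

```python
import csv
import io
from typing import Dict, List, Tuple

# Hypothetical long-format results: (word, measure, decade) -> scalar value.
Long = Dict[Tuple[str, str, int], float]


def to_wide(long: Long, measures: List[str], decades: List[int]):
    """Pivot scalar measures into one row per word, one column per measure/decade."""
    header = ["word"] + [f"{m}_{d}" for m in measures for d in decades]
    words = sorted({word for word, _, _ in long})
    rows = []
    for word in words:
        # Missing word/decade combinations are left empty rather than guessed.
        rows.append([word] + [long.get((word, m, d), "") for m in measures for d in decades])
    return header, rows


# Made-up values, for illustration only.
long = {
    ("red", "polysemy", 1840): 0.42,
    ("red", "polysemy", 1850): 0.47,
    ("red", "frequency", 1840): 120.0,
    ("small", "polysemy", 1850): 0.31,
}
header, rows = to_wide(long, ["frequency", "polysemy"], [1840, 1850])

# Write the table to an in-memory buffer; a real run would write to a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

Leaving gaps empty keeps every row the same width, which matters if a downstream step expects an aligned time series per word.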