Weighting concordances #2
Description
The weightConcordances.R script contains functions for creating weighted concordances, in which context words captured by a model are in bold, the target item is a <span> element of class 'target' and, if required, PPMI values are added in superscript. Some QLVLNewsCorpus quirks are hard-coded, notably in the cleanWord() function (which, however, shouldn't hurt other corpora much).
The main interface function is weightContexts(), whose main arguments are lemma and github_dir. Assuming the necessary files are in github_dir, it looks for three files created in createClouds.ipynb:
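To make the output format concrete, here is a minimal sketch of how one concordance line could be marked up. The function formatConcordance() and its arguments are hypothetical and are not taken from weightConcordances.R; using <strong> for bold and <sup> for superscript is also an assumption about the rendering.

```r
# Hypothetical helper, not the actual code in weightConcordances.R.
# words:    character vector with the tokens of one concordance line
# target:   index of the target item in `words`
# selected: logical vector, TRUE for context words captured by the model
# ppmi:     optional numeric vector of PPMI values (NA where unavailable)
formatConcordance <- function(words, target, selected, ppmi = NULL) {
  out <- words
  # Context words captured by the model go in bold
  out[selected] <- paste0("<strong>", words[selected], "</strong>")
  # If required, append the PPMI value in superscript
  if (!is.null(ppmi)) {
    has_ppmi <- selected & !is.na(ppmi)
    out[has_ppmi] <- paste0(out[has_ppmi], "<sup>", round(ppmi[has_ppmi], 2), "</sup>")
  }
  # The target item is wrapped in a span of class 'target'
  out[target] <- paste0("<span class='target'>", words[target], "</span>")
  paste(out, collapse = " ")
}

# Made-up tokens and PPMI values, for illustration only
example_line <- formatConcordance(
  words    = c("de", "overheid", "wil", "belasting", "heffen"),
  target   = 5,
  selected = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  ppmi     = c(NA, 1.234, NA, 2.5, NA)
)
```

Passing ppmi = NULL gives the unweighted variant, with bold context words but no superscripts.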
- the variables file ({github_dir}/{lemma}.variables.tsv), which corresponds to the output of weight_data["token_register"];
- the cws_detail file ({github_dir}/{lemma}.cws.detail.tsv), the output of listContextwords();
- and the PPMI file ({github_dir}/{lemma}.ppmi.tsv), the automatically stored side effect of targetPPMI().
You can adjust those paths if your files do not follow these templates. The function also assumes you want to store its output in the same variables file.
For example, you would run something like weightContexts('heffen', '../github/heffen/') and it would create both the raw context and the weighted context per FOC configuration and store them in the variables file.
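As an illustration of the default templates, the three paths could be built along these lines. The helper defaultPaths() and the names of its list elements are hypothetical; the actual script may assemble the paths differently.

```r
# Hypothetical helper: builds the three default input paths from the
# templates listed above. Function and element names are illustrative only.
defaultPaths <- function(lemma, github_dir) {
  list(
    variables  = file.path(github_dir, paste0(lemma, ".variables.tsv")),
    cws_detail = file.path(github_dir, paste0(lemma, ".cws.detail.tsv")),
    ppmi       = file.path(github_dir, paste0(lemma, ".ppmi.tsv"))
  )
}

paths <- defaultPaths("heffen", "../github/heffen")

# Names of the expected files that are not on disk, to report before running
missing_files <- names(paths)[!file.exists(unlist(paths))]
```

If your files are named differently, you would override the relevant element of the list before passing the paths on.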
Things that are kind of hard-coded and should be made more flexible (suggestions are welcome):
- cleanWord() (make it optional? an argument of weightContexts() that takes a function, such as cleanWord() or a custom one?)
- In line 23 (filterFOC()), pos_sel is hard-coded for the lex/all distinction: maybe have an argument that is a named list that the user can provide?
- In line 32 (filterFOC()), the fact that models without PPMI selection end in 'no' is hard-coded.
- Lines 71-73 (weightContexts()) assume that you have multiple variants in the column names of the PPMI dataframe AND that it is based on a window size of 4. This definitely has to be made more flexible.
- In line 99 (weightContexts()), the fact that we want superscripts for models that end in 'weight' is hard-coded.
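One possible direction for some of these points, as a rough sketch only: every function name, argument name, part-of-speech tag and model name below is hypothetical and none of it is taken from weightConcordances.R.

```r
# All names below are hypothetical; they only illustrate the suggestions above.

# 1. Optional cleaning step: pass any function that maps one word to one
#    string, or leave the default `identity` to skip cleaning altogether.
applyCleaning <- function(words, clean_fun = identity) {
  vapply(words, clean_fun, character(1), USE.NAMES = FALSE)
}

# 2. Part-of-speech selection as a user-provided named list: names are the
#    labels used in model names, values are the accepted tags, and NULL means
#    "no restriction". The tags here are invented for the example.
pos_sel <- list(lex = c("noun", "verb", "adj", "adv"), all = NULL)

keepByPos <- function(pos, selection, pos_sel) {
  allowed <- pos_sel[[selection]]
  if (is.null(allowed)) rep(TRUE, length(pos)) else pos %in% allowed
}

# 3. Suffixes passed as arguments instead of literals inside the functions.
hasSuffix <- function(model, suffix) endsWith(model, suffix)
```

With this layout the 'no' and 'weight' suffixes would become default argument values that a user can override. Note that keepByPos() treats an unknown selection label the same as NULL, so a real implementation would probably want to check the label first.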