
Weighting concordances #2

@montesmariana


The weightConcordances.R script contains functions that create weighted concordances: context words captured by a model are shown in bold, the target item is wrapped in a <span> element of class 'target' and, if required, PPMI values are added in superscript. It is somewhat hard-coded around QLVLNewsCorpus quirks, notably the cleanWord() function (which, however, shouldn't hurt other corpora that much).
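To make the intended output concrete, here is a minimal sketch of that formatting in base R. It is not the code in weightConcordances.R: the helper name format_concordance(), its arguments, the use of <strong> and <sup> tags, and the example tokens and weights are all assumptions made for illustration.

```r
# Hypothetical helper, not taken from weightConcordances.R.
# words:   character vector with the tokens of one concordance line
# target:  position of the target token in `words`
# weights: numeric vector parallel to `words`; NA = not captured by the model
format_concordance <- function(words, target, weights, show_weights = FALSE) {
  formatted <- vapply(seq_along(words), function(i) {
    if (i == target) {
      # The target item becomes a span of class 'target'.
      return(sprintf("<span class='target'>%s</span>", words[i]))
    }
    if (is.na(weights[i])) {
      # Words the model did not capture stay as plain text.
      return(words[i])
    }
    # Captured context words go in bold, optionally with the weight in superscript.
    sup <- if (show_weights) sprintf("<sup>%.2f</sup>", weights[i]) else ""
    sprintf("<strong>%s</strong>%s", words[i], sup)
  }, character(1))
  paste(formatted, collapse = " ")
}

# Invented tokens and weights, only to show the shape of the result.
example_line <- format_concordance(
  words = c("de", "gemeente", "heft", "belasting"),
  target = 3,
  weights = c(NA, 1.2, NA, 3.45),
  show_weights = TRUE
)
```

With show_weights = FALSE the same call would give the bold-only version, which is roughly the distinction between models with and without superscripts.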

The main interface function, so to speak, is weightContexts(), whose main arguments are lemma and github_dir. Assuming the necessary files are in github_dir, it looks for three files created in createClouds.ipynb:

  • the variables file ({github_dir}/{lemma}.variables.tsv), which corresponds to the output of weight_data["token_register"];
  • the cws_detail file ({github_dir}/{lemma}.cws.detail.tsv), the output of listContextwords();
  • and the PPMI file ({github_dir}/{lemma}.ppmi.tsv), the automatically-stored side-effect of targetPPMI().

If your files do not follow these templates, you can adjust the paths. The function also assumes that you want to store its output in the same variables file.
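As a rough illustration of the path templates above, the three file names could be built as follows. The helper concordance_paths() is hypothetical and is not part of the script; only the file name templates come from the description.

```r
# Hypothetical helper: builds the three input paths from the templates above.
concordance_paths <- function(lemma, github_dir) {
  # Drop trailing slashes so that '../github/heffen/' and '../github/heffen'
  # produce the same paths.
  github_dir <- sub("/+$", "", github_dir)
  list(
    variables  = file.path(github_dir, paste0(lemma, ".variables.tsv")),
    cws_detail = file.path(github_dir, paste0(lemma, ".cws.detail.tsv")),
    ppmi       = file.path(github_dir, paste0(lemma, ".ppmi.tsv"))
  )
}

paths <- concordance_paths("heffen", "../github/heffen/")

# Report which of the expected files are not there before doing any work.
not_found <- names(paths)[!file.exists(unlist(paths))]
if (length(not_found) > 0) {
  message("Missing input files: ", paste(not_found, collapse = ", "))
}
```

Checking for the files up front gives a clearer message than a failed read halfway through.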

So you would run something like weightContexts('heffen', '../github/heffen/'), and it would create both the raw context and the weighted context for each FOC configuration and store them in the variables file.


Things that are kind of hard-coded and should be made more flexible (so, suggestions are welcome):

  • cleanWord() (make it optional? an argument of weightContexts() that is a function, such as cleanWord() or a custom one?)
  • In line 23 (filterFOC()), pos_sel is hard-coded for the lex/all distinction: maybe have an argument that is a named list that the user can provide?
  • In line 32 (filterFOC()), the fact that models without PPMI selection end in 'no' is hard-coded.
  • Lines 71-73 (weightContexts()) assume that you have multiple variants in the column names of the PPMI dataframe AND that it is based on a window size of 4. This definitely has to be made flexible.
  • In line 99 (weightContexts()), the fact that we want superscripts for models that end in 'weight' is hard-coded.
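One possible direction for the first three points, sketched in base R. Everything here is an assumption: the function names filter_by_pos() and uses_ppmi(), the argument names, the column names word and pos, the POS tags and the example model names are invented and do not reflect the current code.

```r
# Hypothetical sketch; names, columns and tags are assumptions.
# cws:         data frame with (assumed) columns `word` and `pos`
# pos_setting: name of the entry of `pos_sel` to apply
# pos_sel:     user-provided named list; NULL means "keep every part of speech"
# clean_fun:   cleaning function applied to the words (identity = no cleaning)
filter_by_pos <- function(cws, pos_setting, pos_sel, clean_fun = identity) {
  stopifnot(is.function(clean_fun), pos_setting %in% names(pos_sel))
  cws$word <- clean_fun(cws$word)
  allowed <- pos_sel[[pos_setting]]
  if (is.null(allowed)) {
    return(cws)
  }
  cws[cws$pos %in% allowed, , drop = FALSE]
}

# The suffix that marks "no PPMI selection" becomes an argument with a default.
uses_ppmi <- function(model_name, no_ppmi_suffix = "no") {
  !endsWith(model_name, no_ppmi_suffix)
}

# Invented example data.
cws <- data.frame(
  word = c("Belasting", "de", "innen"),
  pos = c("noun", "det", "verb"),
  stringsAsFactors = FALSE
)
pos_sel <- list(all = NULL, lex = c("noun", "adj", "verb"))
lex_cws <- filter_by_pos(cws, "lex", pos_sel, clean_fun = tolower)
```

Passing cleanWord() as clean_fun would keep the current behaviour, while identity switches cleaning off. The 'weight' suffix for superscripts could be handled with an argument in the same way as no_ppmi_suffix.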


Labels: enhancement (New feature or request)
