All scripts are written in Python and follow the order outlined here:
1_nlp_extract_labels.py: pre-processes the data, including lemmatization, lowercasing all strings, and removing stop words. Spelling discrepancies are resolved using “spelling_corrections.xlsx”, and synonyms are replaced using “synonyms.xlsx”. The “string_concatenation.xlsx” file was generated by iteratively searching bigrams and trigrams to extract labels.
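The pre-processing can be sketched as below. This is a minimal illustration with hypothetical helper names and toy mappings; the actual script loads the mappings from the Excel files (e.g. with pandas) and performs lemmatization with an NLP library, which is omitted here.

```python
# Illustrative subsets only -- in the real pipeline these mappings come from
# "spelling_corrections.xlsx" and "synonyms.xlsx".
STOP_WORDS = {"the", "a", "an", "of", "and", "to"}
CORRECTIONS = {"pythn": "python", "modle": "model"}
SYNONYMS = {"ml": "machine learning"}

def preprocess(text: str) -> str:
    """Lowercase, fix spelling, replace synonyms, and drop stop words.
    Lemmatization is omitted from this sketch."""
    tokens = text.lower().split()
    tokens = [CORRECTIONS.get(t, t) for t in tokens]     # resolve spelling discrepancies
    tokens = [SYNONYMS.get(t, t) for t in tokens]        # replace synonyms
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return " ".join(tokens)

print(preprocess("The pythn modle of ml"))
```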
2_nlp_manual_tagging.py: samples the data to generate the example sets used for one-shot learning in multi-label text classification by the GPT model. These examples must be labeled manually.
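The sampling step amounts to drawing a reproducible random subset of the free-text records for hand labeling. A hedged sketch (function and variable names are hypothetical, not taken from the script):

```python
import random

def sample_examples(records, n, seed=42):
    """Draw a reproducible random sample of records to be labeled by hand
    and later used as one-shot examples for the GPT model."""
    rng = random.Random(seed)          # fixed seed keeps the sample reproducible
    return rng.sample(records, min(n, len(records)))

records = [f"response {i}" for i in range(100)]
examples = sample_examples(records, 5)
print(examples)
```

The fixed seed matters: the same sample must be recoverable so that the manual labels in the spreadsheet stay aligned with the sampled texts.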
3_nlp_textcat_llm.py: performs multi-label text classification using the GPT model and the labels specified in the config files, “zeroshot_all.cfg” and “oneshot_all.cfg”.
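The shape of the zero-shot vs. one-shot classification step is sketched below. The GPT call is stubbed out with a placeholder function, and the label set and helper names are illustrative assumptions; the actual script drives the model through the configuration files.

```python
LABELS = ["cost", "safety", "schedule"]  # illustrative labels, not the real set

def build_prompt(text, labels, example=None):
    """Assemble a zero-shot prompt, or a one-shot prompt if a manually
    labeled example is supplied."""
    prompt = f"Assign all applicable labels from {labels} to the text.\n"
    if example is not None:  # one-shot: prepend the hand-labeled example
        prompt += f"Example: {example['text']} -> {example['labels']}\n"
    prompt += f"Text: {text}\nLabels:"
    return prompt

def classify(text, llm, example=None):
    """Return every known label mentioned in the model's response."""
    response = llm(build_prompt(text, LABELS, example))
    return [lab for lab in LABELS if lab in response]

# Stub standing in for the GPT model:
fake_llm = lambda prompt: "cost, schedule"
print(classify("budget overrun delayed delivery", fake_llm))
```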
4_nlp_evaluate_models.py: evaluates GPT model performance by comparing the GPT-labeled data to the manually labeled data, available in “manual_tagging.xlsx”. Micro-precision, micro-recall, and micro-F1-scores are calculated.
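Micro-averaged scores pool the true/false positives and false negatives over all documents and labels before computing precision and recall. A self-contained sketch (hypothetical function name; the script may equally use scikit-learn's metrics):

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for multi-label data.
    y_true and y_pred are parallel lists of label sets, one per document."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))  # correctly predicted labels
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))  # spurious labels
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))  # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [{"a", "b"}, {"c"}]
y_pred = [{"a"}, {"b", "c"}]
print(micro_scores(y_true, y_pred))
```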
5_nlp_figures.py: generates all figures and tables included in the manuscript.
nlp_functions.py: includes the functions required by 1_nlp_extract_labels.py.