This repository contains code for training and running our models, which placed 2nd (Spanish-to-English track) and 3rd (English-to-Spanish and French-to-English tracks) in the Duolingo SLAM competition. The paper describing our approach can be found here.
Download the data from here and unzip it into the "data" folder.
To preprocess the data, run reprocess_syntax.py on each data file; see that
file's docstring for details on setting up Google SyntaxNet. Then run
translate_frequency.py to generate external word-frequency features.
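As a rough sketch of the kind of feature this step produces (the exact features are defined in translate_frequency.py; the frequency table, fallback floor, and log transform below are illustrative assumptions, not the script's actual values):

```python
import math

# Hypothetical external frequency counts (per million tokens); the real
# script derives these from an external corpus.
FREQ_PER_MILLION = {"the": 50000.0, "dog": 120.0, "perro": 95.0}

def log_frequency_feature(token, table=FREQ_PER_MILLION, floor=0.01):
    """Map a token to a log-scaled word-frequency feature.

    Unseen tokens fall back to a small floor count so that log() stays
    defined and rare words receive a low feature value.
    """
    return math.log(table.get(token.lower(), floor))
```

Common words thus get a higher feature value than rare or unseen ones, which the gradient-boosted model can exploit directly.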
The model can then be trained to produce predictions on the dev set with
lightgbm_dev.py, or on the test set with lightgbm_script.py. The language
trained on (en_es, fr_en, es_en, or all) and the number of users trained
on can be controlled with the --lang and --users flags.
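A minimal sketch of how those two flags could be declared with argparse (the flag names and language codes come from this README; the choices, defaults, and help strings are assumptions):

```python
import argparse

def build_parser():
    """Build a parser for the training scripts' command-line flags."""
    parser = argparse.ArgumentParser(description="Train the SLAM model.")
    # Restrict --lang to the four values listed above.
    parser.add_argument("--lang", choices=["en_es", "fr_en", "es_en", "all"],
                        default="all", help="which track's data to train on")
    # None is treated as "use every user in the training data".
    parser.add_argument("--users", type=int, default=None,
                        help="limit training to this many users")
    return parser
```

An invocation like `python lightgbm_dev.py --lang fr_en --users 100` would then train on the first 100 French-to-English users.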
Models trained on each individual language can be averaged with a model trained
on all languages using the average_models.py script.
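The averaging amounts to blending per-instance probabilities; a minimal sketch under the assumption of an equal-weight blend (the actual weighting used by average_models.py may differ):

```python
def average_predictions(per_lang, all_langs, weight=0.5):
    """Blend probabilities from a language-specific model with those from
    a model trained on all languages.

    Both arguments map instance IDs to predicted probabilities; the 0.5
    default weight is illustrative only.
    """
    return {iid: weight * per_lang[iid] + (1 - weight) * all_langs[iid]
            for iid in per_lang}
```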
To test the effect of removing different feature sets, first run
preprocess_to_pickle.py to create a pickled version of the data, which cuts
down on preprocessing time across lesion experiments. Then run run_lesion.py,
using the --lesion flag to choose which lesion experiment to conduct; see the
code or the paper for the list of options.
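Conceptually, a lesion just drops one group of features before training; a minimal sketch (the group names and feature names here are hypothetical — the real groups are defined in run_lesion.py):

```python
# Hypothetical feature grouping for illustration only.
FEATURE_GROUPS = {
    "syntax": ["pos_tag", "dependency_label"],
    "frequency": ["log_word_frequency"],
    "user": ["user_id", "days_in_course"],
}

def apply_lesion(features, lesion):
    """Return a copy of a feature dict with one group's features removed."""
    dropped = set(FEATURE_GROUPS.get(lesion, []))
    return {name: value for name, value in features.items()
            if name not in dropped}
```

Comparing the lesioned model's score against the full model's score measures that feature group's contribution.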
The results of the lesion experiments can be plotted with graph_lesions.r (an R script, not Python).