This repository contains the LMentry benchmark from "LMentry: A Language Model Benchmark of Elementary Language Tasks", as well as the code to evaluate it.
For any questions, feel free to open a GitHub issue or to contact us at avia.efrat@gmail.com 😊
Simply clone the repo:
git clone https://github.com/aviaefrat/lmentry.git
The data is in the data directory.
We provide functions for generating predictions with Hugging Face and OpenAI models, but you can generate predictions with any method of your choosing.
For Hugging Face and OpenAI models, you can use the generate_all_hf_predictions and generate_all_openai_predictions functions from predict.py. These are what we used in our experiments.
The easiest and recommended way to evaluate predictions is to use evaluate.py:
python -m lmentry.evaluate
Don't forget to activate the lmentry environment (created from environment.yml) beforehand.
Using the --num-procs=N optional argument will score the predictions much faster.
evaluate.py will also automatically create files analyzing the results in a separate results directory.
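If you prefer to launch the scoring from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_evaluate_command` is a hypothetical helper, and defaulting to one process per CPU core is an assumption rather than a documented recommendation. It only uses the `lmentry.evaluate` module name and the `--num-procs=N` argument mentioned above.

```python
import os
import subprocess
import sys


def build_evaluate_command(num_procs=None):
    """Build the command line for `python -m lmentry.evaluate` (hypothetical helper)."""
    if num_procs is None:
        # Assumption: one scoring process per available CPU core.
        num_procs = os.cpu_count() or 1
    return [sys.executable, "-m", "lmentry.evaluate", f"--num-procs={num_procs}"]


# To actually run the scoring (with the lmentry environment activated):
# subprocess.run(build_evaluate_command(8), check=True)
```

Using `sys.executable` keeps the call inside whichever Python interpreter is currently active, which should be the one from the lmentry environment.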
To use evaluate.py, the predictions must follow the same structure as lmentry_predictions.zip (if you used our functions from predict.py, your predictions already follow this structure):
- The top-level directory should be named predictions.
- predictions needs to contain exactly 41 directories, named after the 41 files in data (the 25 task names + the 16 files for the argument content robustness).
- Each of the 41 task predictions directories should contain a prediction file for each model you want to evaluate. For example, to evaluate the predictions of a model named my-model, each of the 41 directories should contain a file named my-model.json with the model's predictions for this task.
- Each predictions file should contain values in the form "<id>": {"prediction": <prediction>}, where the ids correspond to those in the task's file in data.
- Clone the repository.
- Unzip lmentry_predictions.zip into the top-level lmentry directory.
- Run evaluate.py (preferably with a not-very-small value for --num-procs, as there are 656 files to score...)
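Before starting a long scoring run, it can be worth checking that the archive was unzipped where expected. The sketch below is a hypothetical sanity check, not part of the repository. It assumes the predictions sit in a predictions directory with one subdirectory per task and one .json file per model. With the numbers given above, the released predictions should yield 41 task directories and 656 files.

```python
from pathlib import Path


def count_prediction_files(predictions_dir):
    """Return (number of task directories, number of .json prediction files)."""
    root = Path(predictions_dir)
    # Each immediate subdirectory is assumed to hold one task's predictions.
    task_dirs = [p for p in root.iterdir() if p.is_dir()]
    n_files = sum(1 for d in task_dirs for _ in d.glob("*.json"))
    return len(task_dirs), n_files


# Example call, from the top-level lmentry directory:
# n_tasks, n_files = count_prediction_files("predictions")
```

If the counts differ from what you expect, the archive was probably extracted into a different location or nested one level too deep.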