Add xnli #134
Conversation
jon-tow left a comment:
Hi, @gentaiscool 👋 Thanks for adding this!
I left a few change requests that boil down to promptsource not supporting non-English XNLI sets. To make sure this task works out of the box, let's remove all non-English tasks for now.
class XNLIFr(XNLI):
    DATASET_NAME = "fr"


class XNLIEs(XNLI):
    DATASET_NAME = "es"


class XNLIDe(XNLI):
    DATASET_NAME = "de"


class XNLIEl(XNLI):
    DATASET_NAME = "el"


class XNLIBg(XNLI):
    DATASET_NAME = "bg"


class XNLIRu(XNLI):
    DATASET_NAME = "ru"


class XNLITr(XNLI):
    DATASET_NAME = "tr"


class XNLIAr(XNLI):
    DATASET_NAME = "ar"


class XNLIVi(XNLI):
    DATASET_NAME = "vi"


class XNLITh(XNLI):
    DATASET_NAME = "th"


class XNLIZh(XNLI):
    DATASET_NAME = "zh"


class XNLIHi(XNLI):
    DATASET_NAME = "hi"


class XNLISw(XNLI):
    DATASET_NAME = "sw"


class XNLIUr(XNLI):
    DATASET_NAME = "ur"
Remove these classes. Unfortunately, English is currently the only language with promptsource support on the eval-hackathon branch (see here).
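For reference, the English subset that stays would follow the same pattern; a sketch (the exact class name in the PR may differ):

class XNLIEn(XNLI):
    DATASET_NAME = "en"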
def training_docs(self):
    if self.has_training_docs():
        return self.dataset["train"]

def validation_docs(self):
    if self.has_validation_docs():
        return self.dataset["validation"]
Add a test_docs method, since the test split is available in the HuggingFace dataset.
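A minimal sketch of what that could look like, mirroring the two methods above (assuming the harness base class provides has_test_docs, as it does for the train/validation checks):

def test_docs(self):
    if self.has_test_docs():
        return self.dataset["test"]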
    XNLIFr,
    XNLIEs,
    XNLIDe,
    XNLIEl,
    XNLIBg,
    XNLIRu,
    XNLITr,
    XNLIAr,
    XNLIVi,
    XNLITh,
    XNLIZh,
    XNLIHi,
    XNLISw,
    XNLIUr
Remove these tasks (see comment above about lack of promptsource support for non-English tasks).
def construct_tasks() -> typing.Dict[str, XNLI]:
    """
    Returns a dictionary of tasks keyed by task name, for example:
    "GEM/wiki_lingua_ar"
Change this key to a matching XNLI example, e.g. "xnli_en".
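A sketch of the suggested docstring change (illustrative wording only):

def construct_tasks() -> typing.Dict[str, XNLI]:
    """
    Returns a dictionary of tasks keyed by task name, for example:
    "xnli_en"
    """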
| "xnli_fr": xnli.XNLIFr, | ||
| "xnli_es": xnli.XNLIEs, | ||
| "xnli_de": xnli.XNLIDe, | ||
| "xnli_el": xnli.XNLIEl, | ||
| "xnli_bg": xnli.XNLIBg, | ||
| "xnli_ru": xnli.XNLIRu, | ||
| "xnli_tr": xnli.XNLITr, | ||
| "xnli_ar": xnli.XNLIAr, | ||
| "xnli_vi": xnli.XNLIVi, | ||
| "xnli_th": xnli.XNLITh, | ||
| "xnli_zh": xnli.XNLIZh, | ||
| "xnli_hi": xnli.XNLIHi, | ||
| "xnli_sw": xnli.XNLISw, | ||
| "xnli_ur": xnli.XNLIUr, |
Remove these tasks (see comment above about lack of promptsource support for non-English tasks).
I would love to be part of this conversation as well. Right now the multilingual modeling group is trying to perform evaluation on non-English tasks, and it seems like we have to fork both Eval-Harness and PromptSource to extend prompt-based evaluation to non-EN tasks. Am I right?
Hi @yongzx! That's one way to do it. You'd have to:
Lastly, make sure your templates can be accessed from the harness. For example, using the XNLI French subset, run the snippet below in a Python interpreter. Once you see the templates listed, you should be ready to evaluate as usual. Let me know if you run into any issues (most problems stem from setting up a consistent Python virtual environment). I'll be glad to help!
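A sketch of such a check, assuming your promptsource fork registers French XNLI templates under the ("xnli", "fr") dataset/subset pair:

from promptsource.templates import DatasetTemplates

# Load the prompt templates registered for the XNLI French subset
fr_templates = DatasetTemplates("xnli", "fr")

# Print the available template names; an empty list means the fork
# isn't installed or the templates weren't added
print(fr_templates.all_template_names)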
@jon-tow The Prompt Engineering WG has been successfully running non-English prompted tasks with the eval harness on BLOOM. @Muennighoff, can you explain how you've been running non-English prompts?
I didn't use the eval harness, but https://github.com/Muennighoff/t-zero/blob/muennighoff/upgrdps/evaluation/run_eval.py
Thanks @jon-tow and @Muennighoff!!
@jon-tow I actually did what you suggested. For instance:
But strangely, evaluating with [...]. Will try with Niklas' repo.
Thanks for the updates, @yongzx! Did you obtain significantly different accuracies when using Niklas' repo? Re:
I obtained the same accuracies with the BLOOM model, but with Niklas' repo I got better accuracies using a different model (BLOOMZ). I haven't tried BLOOMZ with the eval harness yet.
Adding xnli to lm-evaluation-harness