Add xnli #134
Conversation
jon-tow left a comment:
Hi, @gentaiscool 👋 Thanks for adding this!
I left a few change requests that boil down to promptsource not supporting non-English XNLI sets. To make sure this task works out of the box, let's remove all non-English tasks for now.
class XNLIFr(XNLI):
    DATASET_NAME = "fr"


class XNLIEs(XNLI):
    DATASET_NAME = "es"


class XNLIDe(XNLI):
    DATASET_NAME = "de"


class XNLIEl(XNLI):
    DATASET_NAME = "el"


class XNLIBg(XNLI):
    DATASET_NAME = "bg"


class XNLIRu(XNLI):
    DATASET_NAME = "ru"


class XNLITr(XNLI):
    DATASET_NAME = "tr"


class XNLIAr(XNLI):
    DATASET_NAME = "ar"


class XNLIVi(XNLI):
    DATASET_NAME = "vi"


class XNLITh(XNLI):
    DATASET_NAME = "th"


class XNLIZh(XNLI):
    DATASET_NAME = "zh"


class XNLIHi(XNLI):
    DATASET_NAME = "hi"


class XNLISw(XNLI):
    DATASET_NAME = "sw"


class XNLIUr(XNLI):
    DATASET_NAME = "ur"
Remove these classes. Unfortunately, English is currently the only language with promptsource support on the eval-hackathon branch (see here).
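For reference, the English subset that stays would follow the same pattern; a sketch (the exact class name in the PR may differ):

class XNLIEn(XNLI):
    DATASET_NAME = "en"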
def training_docs(self):
    if self.has_training_docs():
        return self.dataset["train"]

def validation_docs(self):
    if self.has_validation_docs():
        return self.dataset["validation"]
Add a test_docs method, since the test split is available in the HuggingFace dataset.
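A minimal sketch of what that could look like, mirroring the two methods above (assuming the harness base class provides has_test_docs, as it does for the train/validation checks):

def test_docs(self):
    if self.has_test_docs():
        return self.dataset["test"]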
    XNLIFr,
    XNLIEs,
    XNLIDe,
    XNLIEl,
    XNLIBg,
    XNLIRu,
    XNLITr,
    XNLIAr,
    XNLIVi,
    XNLITh,
    XNLIZh,
    XNLIHi,
    XNLISw,
    XNLIUr
Remove these tasks (see comment above about lack of promptsource support for non-English tasks).
def construct_tasks() -> typing.Dict[str, XNLI]:
    """
    Returns a dictionary of tasks keyed by task name, for example:
    "GEM/wiki_lingua_ar"
Change this key to a matching XNLI example, e.g. "xnli_en".
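A sketch of the suggested docstring change (illustrative wording only):

def construct_tasks() -> typing.Dict[str, XNLI]:
    """
    Returns a dictionary of tasks keyed by task name, for example:
    "xnli_en"
    """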
| "xnli_fr": xnli.XNLIFr, | ||
| "xnli_es": xnli.XNLIEs, | ||
| "xnli_de": xnli.XNLIDe, | ||
| "xnli_el": xnli.XNLIEl, | ||
| "xnli_bg": xnli.XNLIBg, | ||
| "xnli_ru": xnli.XNLIRu, | ||
| "xnli_tr": xnli.XNLITr, | ||
| "xnli_ar": xnli.XNLIAr, | ||
| "xnli_vi": xnli.XNLIVi, | ||
| "xnli_th": xnli.XNLITh, | ||
| "xnli_zh": xnli.XNLIZh, | ||
| "xnli_hi": xnli.XNLIHi, | ||
| "xnli_sw": xnli.XNLISw, | ||
| "xnli_ur": xnli.XNLIUr, |
Remove these tasks (see comment above about lack of promptsource support for non-English tasks).
I would love to be part of this conversation as well. Right now the multilingual modeling group is trying to perform evaluation on non-English tasks, and it seems like we have to fork both Eval-Harness and PromptSource to extend prompt-based evaluation to non-EN tasks. Am I right?
Hi @yongzx! That's one way to do it. You'd have to:
Lastly, make sure your templates can be accessed from the harness. For example, using the XNLI French subset, run the snippet below in a Python interpreter. Once you see the templates listed, you should be ready to evaluate as usual. Let me know if you run into any issues (most problems stem from setting up a consistent Python virtual environment). I'll be glad to help!
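A sketch of such a check, assuming your promptsource fork registers French XNLI templates under the ("xnli", "fr") dataset/subset pair:

from promptsource.templates import DatasetTemplates

# Load the prompt templates registered for the XNLI French subset
fr_templates = DatasetTemplates("xnli", "fr")

# Print the available template names; an empty list means the fork
# isn't installed or the templates weren't added
print(fr_templates.all_template_names)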
@jon-tow The Prompt Engineering WG has been successfully running non-English prompted tasks with the eval harness on BLOOM. @Muennighoff, can you explain how you've been running non-English prompts?
I didn't use the eval harness, but https://github.com/Muennighoff/t-zero/blob/muennighoff/upgrdps/evaluation/run_eval.py
Thanks @jon-tow and @Muennighoff!!
@jon-tow I actually did what you suggested. For instance:
But strangely, evaluating with [...]. Will try with Niklas' repo.
Thanks for the updates, @yongzx! Did you obtain significantly different accuracies when using Niklas' repo? Re:
I obtained the same accuracies with the BLOOM model, but with Niklas' repo I got better accuracies using a different model (BLOOMZ). I haven't tried BLOOMZ with the eval harness yet.
Adding xnli to lm-evaluation-harness