I have tried your code for multiple datasets:
> python multiqa.py train --datasets SQuAD1-1 --cuda_device 0,1
> python multiqa.py train --datasets NewsQA --cuda_device 0,1
> python multiqa.py train --datasets SearchQA --cuda_device 0,1
followed by the corresponding evaluations:
> python multiqa.py evaluate --model model --datasets SQuAD1-1 --cuda_device 0 --models_dir 'models/SQuAD1-1/'
> python multiqa.py evaluate --model model --datasets NewsQA --cuda_device 0 --models_dir 'models/NewsQA/'
> python multiqa.py evaluate --model model --datasets SearchQA --cuda_device 0 --models_dir 'models/SearchQA/'
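For reference, here is how I ran the three train/evaluate pairs in one loop. This is just a sketch that echoes the commands rather than executing them; `multiqa.py` and all flags are taken verbatim from the commands above:

```shell
#!/bin/sh
# Print the train/evaluate command pair for each dataset.
# Commands are echoed (not executed) so the loop structure is
# visible; the flags match the invocations listed above.
print_commands() {
  for ds in SQuAD1-1 NewsQA SearchQA; do
    echo "python multiqa.py train --datasets $ds --cuda_device 0,1"
    echo "python multiqa.py evaluate --model model --datasets $ds --cuda_device 0 --models_dir models/$ds/"
  done
}

print_commands
```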
I am getting relatively low scores (EM | F1):
- SQuAD1.1: 77.19 | 85.28
- NewsQA: 19.51 | 30.51
- SearchQA: 35.68 | 41.02
which suggests that I am not using the proper hyper-parameters. Do you think that explains it?
If so, I would appreciate more clarity on this sentence from your paper: "We emphasize that in all our experiments we use exactly the same training procedure for all datasets, with minimal hyper-parameter tuning." — especially with respect to "minimal hyper-parameter tuning".