Hello,
I think the following questions are lower priority compared to the existing PRs and issue #4, so I am intentionally submitting them as a new issue.
-Simpler first-
Is allennlp actually required?
The README lists allennlp as a requirement, but I couldn't find any code in this repository that actually uses allennlp.
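For what it's worth, this is roughly how I checked (a simple grep over the sources; command is my own, run from the repo root):

```shell
# List every Python file that imports allennlp; if nothing is printed,
# the dependency can probably be dropped from the README.
grep -rn --include='*.py' -E '^(import|from) allennlp' . \
  || echo "no allennlp imports found"
```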
Are four NVIDIA Tesla P100s enough?
I used four NVIDIA Tesla V100 SXM2 GPUs (16 GB of memory each), but hit CUDA out of memory with the configuration provided in the README (train_batch_size=32) on both the SQuAD and TriviaQA datasets.
07/08/2019 20:40:20 - INFO - __main__ - output_dir: out/squad_doc/02
07/08/2019 20:40:23 - INFO - __main__ - torch_version: 1.1.0 device: cuda n_gpu: 4, distributed training: False, 16-bits training: False
07/08/2019 20:40:23 - INFO - __main__ - ***** Preparing model *****
07/08/2019 20:40:24 - INFO - __main__ - Loading model from pretrained checkpoint: bert-base-uncased/pytorch_model.bin
07/08/2019 20:40:24 - INFO - __main__ - Weights of BertForRankingAndReadingAndReranking not initialized from pretrained model: ['rank_affine.weight', 'rank_affine.bias', 'read_affine.weight', 'read_affine.bias', 'rerank_affine.weight', 'rerank_affine.bias', 'rank_ffn.dense.weight', 'rank_ffn.dense.bias', 'rank_ffn.affine.weight', 'rank_ffn.affine.bias', 'rerank_ffn.dense.weight', 'rerank_ffn.dense.bias', 'rerank_ffn.affine.weight', 'rerank_ffn.affine.bias']
07/08/2019 20:40:24 - INFO - __main__ - Weights from pretrained model not used in BertForRankingAndReadingAndReranking: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.gamma', 'cls.predictions.transform.LayerNorm.beta', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
07/08/2019 20:40:29 - INFO - __main__ - ***** Preparing training *****
07/08/2019 20:40:34 - INFO - __main__ - Loading examples from: data/RE3QA/squad/train_8paras_examples.pkl
07/08/2019 20:42:56 - INFO - __main__ - Loading features from: data/RE3QA/squad/train_8paras_384max_128stride_features.pkl
07/08/2019 20:42:56 - INFO - __main__ - Filtering features randomly
07/08/2019 20:42:58 - INFO - __main__ - Num orig examples = 87599
07/08/2019 20:42:58 - INFO - __main__ - Num split features = 746625
07/08/2019 20:42:58 - INFO - __main__ - Num split filtered features = 343850
07/08/2019 20:42:58 - INFO - __main__ - Batch size for ranker = 69
07/08/2019 20:42:58 - INFO - __main__ - Batch size for reader = 32
07/08/2019 20:42:58 - INFO - __main__ - Num steps = 21490
07/08/2019 20:43:22 - INFO - __main__ - ***** Preparing evaluation *****
07/08/2019 20:43:23 - INFO - __main__ - Loading examples from: data/RE3QA/squad/eval_10paras_examples.pkl
07/08/2019 20:43:52 - INFO - __main__ - Loading features from: data/RE3QA/squad/eval_10paras_384max_128stride_features.pkl
07/08/2019 20:43:52 - INFO - __main__ - Filtering features randomly
07/08/2019 20:43:52 - INFO - __main__ - Num orig examples = 10570
07/08/2019 20:43:52 - INFO - __main__ - Num split features = 122413
07/08/2019 20:43:52 - INFO - __main__ - Num split filtered features = 42279
07/08/2019 20:43:52 - INFO - __main__ - Batch size for ranker = 64
07/08/2019 20:43:52 - INFO - __main__ - Batch size for reader = 32
07/08/2019 20:43:56 - INFO - __main__ - ***** Running training distillation *****
07/08/2019 20:43:56 - INFO - __main__ - Processing example: 0
07/08/2019 20:53:28 - INFO - __main__ - Processing example: 345000
07/08/2019 21:02:34 - INFO - __main__ - Processing example: 690000
07/08/2019 21:04:32 - INFO - __main__ - ***** Reconstruct training data at distill_8paras_4best.pkl *****
07/08/2019 21:04:32 - INFO - __main__ - Filtering features based on: out/squad_doc/02/distill_8paras_4best.pkl
07/08/2019 21:43:07 - INFO - __main__ - Num orig examples = 87599
07/08/2019 21:43:07 - INFO - __main__ - Num split features = 746625
07/08/2019 21:43:07 - INFO - __main__ - Num split filtered features = 349167
07/08/2019 21:43:07 - INFO - __main__ - Batch size for ranker = 68
07/08/2019 21:43:07 - INFO - __main__ - Batch size for reader = 32
07/08/2019 21:43:07 - INFO - __main__ - Num steps = 21822
07/08/2019 21:43:32 - INFO - __main__ - ***** Running eval distillation *****
07/08/2019 21:43:32 - INFO - __main__ - Processing example: 0
07/08/2019 21:44:40 - INFO - __main__ - Processing example: 40000
07/08/2019 21:45:49 - INFO - __main__ - Processing example: 80000
07/08/2019 21:46:58 - INFO - __main__ - Processing example: 120000
07/08/2019 21:47:03 - INFO - __main__ - ***** Reconstruct eval data at test_10paras_4best.pkl *****
07/08/2019 21:47:03 - INFO - __main__ - Filtering features based on: out/squad_doc/02/test_10paras_4best.pkl
07/08/2019 21:47:04 - INFO - __main__ - Num orig examples = 10570
07/08/2019 21:47:04 - INFO - __main__ - Num split features = 122413
07/08/2019 21:47:04 - INFO - __main__ - Num split filtered features = 42279
07/08/2019 21:47:04 - INFO - __main__ - Batch size for ranker = 64
07/08/2019 21:47:04 - INFO - __main__ - Batch size for reader = 32
07/08/2019 21:47:07 - INFO - __main__ - ***** Preparing optimizer *****
07/08/2019 21:47:07 - INFO - __main__ - ***** Running training *****
07/08/2019 21:47:07 - INFO - __main__ - ***** Epoch: 1 *****
/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/workspace/RE3QA/bert/run_squad_document_full_e2e.py", line 914, in <module>
    main()
  File "/home/ubuntu/workspace/RE3QA/bert/run_squad_document_full_e2e.py", line 857, in main
    save_path, best_f1, epoch)
  File "/home/ubuntu/workspace/RE3QA/bert/run_squad_document_full_e2e.py", line 487, in run_train_epoch
    input_ids=input_ids, token_type_ids=segment_ids)
  File "/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/RE3QA/bert/custom_modeling.py", line 306, in forward
    all_encoder_layers, _ = self.bert(self.num_hidden_read, input_ids, token_type_ids, attention_mask)
  File "/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/RE3QA/bert/custom_modeling.py", line 166, in forward
    all_encoder_layers = self.encoder(num_hidden_stop, embedding_output, extended_attention_mask)
  File "/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/RE3QA/bert/custom_modeling.py", line 131, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/RE3QA/bert/modeling.py", line 273, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/home/ubuntu/.local/share/virtualenvs/RE3QA-pRGEMyAS/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/RE3QA/bert/modeling.py", line 246, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/ubuntu/workspace/RE3QA/bert/modeling.py", line 35, in gelu
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.75 GiB total capacity; 14.25 GiB already allocated; 30.19 MiB free; 365.16 MiB cached)
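For context, here is my own back-of-envelope estimate of why batch 32 at max_seq_length=384 might not fit in 16 GB for a BERT-base model; all constants below are assumptions (and it naively treats the whole batch as if it sat on one device), not numbers from this repo:

```python
# Rough activation-memory estimate for BERT-base (all constants assumed).
layers, heads, hidden, inter = 12, 12, 768, 3072
batch, seq, fp32 = 32, 384, 4  # fp32 = bytes per float32

# Attention probability maps kept for backward: B x H x S x S per layer.
attn = layers * batch * heads * seq * seq * fp32
# A handful of B x S x hidden activations per layer (Q/K/V, attention
# output, residuals, LayerNorm inputs...); 8 tensors is a rough guess.
per_layer_tensors = 8
hidden_acts = layers * per_layer_tensors * batch * seq * hidden * fp32
# FFN intermediate activations (GELU input and output): B x S x 4*hidden.
ffn_acts = layers * 2 * batch * seq * inter * fp32

total_gb = (attn + hidden_acts + ffn_acts) / 2**30
print(f"~{total_gb:.1f} GiB of activations, very roughly")
```

Even before adding parameters, gradients, and Adam's moment buffers (another couple of GiB for a ~110M-parameter model), and the ranker batch running alongside the reader, this lands uncomfortably close to 16 GB.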
For the above reason, I had to set train_batch_size=16 for both the SQuAD and TriviaQA datasets, and got the results below.
Compared to your reported results, there is a noticeable performance gap, especially on the TriviaQA datasets.
Do you think this is simply due to the smaller training batch size (32 -> 16)?
If you used different configurations to obtain the reported results, could you share them and tell me which table in the paper I should compare against?
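If the batch size is indeed the cause, one workaround I could try on my side is gradient accumulation: two micro-batches of 16 with the loss scaled by 1/2 yield the same gradients as one batch of 32. A minimal PyTorch sketch of the idea (my own illustration, not code from this repo):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data, target = torch.randn(32, 4), torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()  # default 'mean' reduction

# Reference: one full batch of 32.
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Accumulation: two micro-batches of 16, each loss scaled by 1/2,
# gradients summed across backward calls before the optimizer step.
model.zero_grad()
for x_chunk, y_chunk in zip(data.chunk(2), target.chunk(2)):
    (loss_fn(model(x_chunk), y_chunk) / 2).backward()
accum_grad = model.weight.grad.clone()

assert torch.allclose(full_grad, accum_grad, atol=1e-6)
```

The equivalence is exact for mean-reduced losses, though batch-statistics effects (e.g. in dropout masks drawn per micro-batch) can still make runs differ slightly.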
SQuAD -document-
Ranker, type: distill, step: 0, map: 0.492, mrr: 0.510, top_1: 0.309, top_3: 0.610, top_5: 0.814, top_7: 0.932, retrieval_rate: 0.468
Ranker, type: test, step: 0, map: 0.395, mrr: 0.412, top_1: 0.230, top_3: 0.467, top_5: 0.655, top_7: 0.800, retrieval_rate: 0.345
Ranker, type: distill, step: 0, map: 0.492, mrr: 0.510, top_1: 0.309, top_3: 0.610, top_5: 0.814, top_7: 0.932, retrieval_rate: 0.468
Ranker, type: test, step: 0, map: 0.395, mrr: 0.412, top_1: 0.230, top_3: 0.467, top_5: 0.655, top_7: 0.800, retrieval_rate: 0.345
Ranker, step: 21823, map: 0.888, mrr: 0.906, top_1: 0.872, top_3: 0.938, top_5: 0.956, top_7: 0.961
Reader, step: 21823, em: 45.535, f1: 51.553
Ranker, type: distill, step: 21823, map: 0.955, mrr: 0.967, top_1: 0.945, top_3: 0.988, top_5: 0.997, top_7: 0.999, retrieval_rate: 0.468
Ranker, type: test, step: 21823, map: 0.888, mrr: 0.906, top_1: 0.872, top_3: 0.938, top_5: 0.956, top_7: 0.961, retrieval_rate: 0.345
Ranker, step: 43646, map: 0.889, mrr: 0.907, top_1: 0.871, top_3: 0.939, top_5: 0.956, top_7: 0.962
Reader, step: 43646, em: 76.500, f1: 83.212
Ranker, type: test, step: 43646, map: 0.883, mrr: 0.909, top_1: 0.867, top_3: 0.943, top_5: 0.965, top_7: 0.976, retrieval_rate: 0.223
Reader, type: test, step: 43646, em: 77.550, f1: 84.379
TriviaQA -wiki-
Ranker, type: distill, step: 0, map: 0.632, mrr: 0.670, top_1: 0.537, top_3: 0.745, top_5: 0.850, top_7: 0.909, retrieval_rate: 0.406
Ranker, type: dev, step: 0, map: 0.594, mrr: 0.636, top_1: 0.514, top_3: 0.706, top_5: 0.800, top_7: 0.855, retrieval_rate: 0.349
Ranker, type: distill, step: 0, map: 0.632, mrr: 0.670, top_1: 0.537, top_3: 0.745, top_5: 0.850, top_7: 0.909, retrieval_rate: 0.406
Ranker, type: dev, step: 0, map: 0.594, mrr: 0.636, top_1: 0.514, top_3: 0.706, top_5: 0.800, top_7: 0.855, retrieval_rate: 0.349
Ranker, type: distill, step: 0, map: 0.632, mrr: 0.670, top_1: 0.538, top_3: 0.744, top_5: 0.850, top_7: 0.909, retrieval_rate: 0.406
Ranker, type: dev, step: 0, map: 0.595, mrr: 0.636, top_1: 0.514, top_3: 0.707, top_5: 0.801, top_7: 0.855, retrieval_rate: 0.349
Ranker, type: distill, step: 0, map: 0.632, mrr: 0.670, top_1: 0.538, top_3: 0.744, top_5: 0.850, top_7: 0.909, retrieval_rate: 0.406
Ranker, type: dev, step: 0, map: 0.595, mrr: 0.636, top_1: 0.514, top_3: 0.707, top_5: 0.801, top_7: 0.855, retrieval_rate: 0.349
Ranker, step: 22334, loss: 2.610, map: 0.776, mrr: 0.849, top_1: 0.797, top_3: 0.890, top_5: 0.920, top_7: 0.933
Reader, step: 22334, loss: 2.610, em: 40.636, f1: 53.360
Ranker, type: distill, step: 22334, map: 0.839, mrr: 0.903, top_1: 0.852, top_3: 0.941, top_5: 0.970, top_7: 0.984, retrieval_rate: 0.406
Ranker, type: dev, step: 22334, map: 0.776, mrr: 0.849, top_1: 0.797, top_3: 0.890, top_5: 0.920, top_7: 0.933, retrieval_rate: 0.349
Ranker, step: 44668, loss: 2.388, map: 0.784, mrr: 0.855, top_1: 0.807, top_3: 0.894, top_5: 0.921, top_7: 0.933
Reader, step: 44668, loss: 2.388, em: 51.295, f1: 64.485
Ranker, type: dev, step: 44668, map: 0.784, mrr: 0.855, top_1: 0.807, top_3: 0.894, top_5: 0.921, top_7: 0.933, retrieval_rate: 0.349
Reader, type: dev, step: 44668, em: 51.483, f1: 64.621
TriviaQA -unfiltered-
Ranker, type: distill, step: 0, map: 0.778, mrr: 0.809, top_1: 0.720, top_3: 0.872, top_5: 0.929, top_7: 0.957, retrieval_rate: 0.322
Ranker, type: dev, step: 0, map: 0.616, mrr: 0.645, top_1: 0.559, top_3: 0.701, top_5: 0.758, top_7: 0.791, retrieval_rate: 0.294
Ranker, type: distill, step: 0, map: 0.778, mrr: 0.809, top_1: 0.720, top_3: 0.872, top_5: 0.929, top_7: 0.957, retrieval_rate: 0.322
Ranker, type: dev, step: 0, map: 0.616, mrr: 0.645, top_1: 0.559, top_3: 0.701, top_5: 0.758, top_7: 0.791, retrieval_rate: 0.294
Ranker, step: 28727, loss: 2.729, map: 0.735, mrr: 0.780, top_1: 0.748, top_3: 0.804, top_5: 0.824, top_7: 0.832
Reader, step: 28727, loss: 2.729, em: 57.695, f1: 62.731
Ranker, type: distill, step: 28727, map: 0.900, mrr: 0.941, top_1: 0.912, top_3: 0.964, top_5: 0.981, top_7: 0.988, retrieval_rate: 0.322
Ranker, type: dev, step: 28727, map: 0.735, mrr: 0.780, top_1: 0.748, top_3: 0.804, top_5: 0.824, top_7: 0.832, retrieval_rate: 0.294
Ranker, step: 57454, loss: 2.763, map: 0.738, mrr: 0.781, top_1: 0.750, top_3: 0.804, top_5: 0.824, top_7: 0.832
Reader, step: 57454, loss: 2.763, em: 63.794, f1: 69.462
Ranker, type: dev, step: 57454, map: 0.738, mrr: 0.781, top_1: 0.750, top_3: 0.804, top_5: 0.824, top_7: 0.832, retrieval_rate: 0.294
Reader, type: dev, step: 57454, em: 63.714, f1: 69.311
Thank you!