Error during validation when using 2 GPUs (but it's OK when using one GPU) #17

ZHO9504 · 2019-07-20T07:18:56Z

Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
EM: 61.2193, f1: 69.6262, qas_used_fraction: 1.0000, loss: 4.3453 ||: : 17502it [6:26:59, 1.33s/it]
2019-07-20 15:09:22,954 - INFO - allennlp.training.trainer - Validating
EM: 48.9301, f1: 59.0550, qas_used_fraction: 1.0000, loss: 5.1889 ||: : 94it [00:41, 2.15it/s]
Traceback (most recent call last):
File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 21, in
run()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/init.py", line 102, in main
args.func(args)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
args.cache_prefix)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
cache_directory, cache_prefix)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 243, in train_model
metrics = trainer.train()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 493, in train
val_loss, num_batches = self._validation_loss()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss
loss = self.batch_loss(batch_group, for_training=False)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 258, in batch_loss
output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/util.py", line 336, in data_parallel
losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, outputs)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: tensor.ndimension() == static_cast<int64_t>(expected_size.size()) ASSERT FAILED at /pytorch/torch/csrc/cuda/comm.cpp:232, please report a bug to PyTorch. (gather at /pytorch/torch/csrc/cuda/comm.cpp:232)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f6d3dad8441 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f6d3dad7d7a in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::gather(c10::ArrayRefat::Tensor, long, c10::optional) + 0x962 (0x7f6d132be792 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #3: + 0x5a3d1c (0x7f6d33e0bd1c in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x130fac (0x7f6d33998fac in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: _PyMethodDef_RawFastCallKeywords + 0x264 (0x5567e0e3c6e4 in python3.7)
frame #6: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #7: _PyEval_EvalFrameDefault + 0x4e8c (0x5567e0e982bc in python3.7)
frame #8: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #9: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #10: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #11: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #12: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #13: THPFunction_apply(_object, _object*) + 0x6b1 (0x7f6d33c1c301 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #14: PyCFunction_Call + 0xe7 (0x5567e0dffbe7 in python3.7)
frame #15: _PyEval_EvalFrameDefault + 0x5d21 (0x5567e0e99151 in python3.7)
frame #16: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #17: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #18: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #19: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #20: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #21: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #22: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #23: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #24: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #25: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #26: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #27: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #28: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #29: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #30: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #31: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #32: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #33: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #35: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #36: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #37: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #38: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #39: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #40: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #41: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #42: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #43: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #44: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #45: PyEval_EvalCodeEx + 0x44 (0x5567e0dda3c4 in python3.7)
frame #46: PyEval_EvalCode + 0x1c (0x5567e0dda3ec in python3.7)
frame #47: + 0x1e004d (0x5567e0ea304d in python3.7)
frame #48: _PyMethodDef_RawFastCallKeywords + 0xe9 (0x5567e0e3c569 in python3.7)
frame #49: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #50: _PyEval_EvalFrameDefault + 0x4755 (0x5567e0e97b85 in python3.7)
frame #51: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #52: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #53: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #54: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #55: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #56: + 0x222d77 (0x5567e0ee5d77 in python3.7)
frame #57: + 0x23ae95 (0x5567e0efde95 in python3.7)
frame #58: _Py_UnixMain + 0x3c (0x5567e0efdf7c in python3.7)
frame #59: __libc_start_main + 0xf0 (0x7f6d4ea12830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #60: + 0x1e0122 (0x5567e0ea3122 in python3.7)

I don't know why....

ZHO9504 · 2019-07-20T15:21:21Z

The command I run is:
python3.7 -m allennlp.run train /home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/baseline/MRQA_BERTLarge.jsonnet -s Models/large_f5/ -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': '/home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/data/train/TriviaQA-web.jsonl.gz', 'validation_data_path': '/home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/data/dev-indomain/TriviaQA-web.jsonl.gz', 'trainer': {'cuda_device': [0,1], 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '50000'}}}" --include-package mrqa_allennlp

The error occurs whatever the train_data_path is.
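Since the same run is reported to work on one GPU, one way to compare the two setups is to generate the `-o` overrides string with only `cuda_device` changed. The following is a minimal sketch, not part of the original command: `build_overrides` is a hypothetical helper, the data paths are deliberately left out (add your own `train_data_path` and `validation_data_path`), and the remaining keys and values are copied from the command above.

```python
import json
import shlex


def build_overrides(cuda_device):
    """Return a JSON overrides string with the chosen cuda_device.

    Hypothetical helper: keys and values mirror the command above,
    minus the data paths, which are omitted here on purpose.
    """
    overrides = {
        "dataset_reader": {"sample_size": 75000},
        "validation_dataset_reader": {"sample_size": 1000},
        "trainer": {
            "cuda_device": cuda_device,
            "num_epochs": "2",
            "optimizer": {
                "type": "bert_adam",
                "lr": 3e-05,
                "warmup": 0.1,
                "t_total": "50000",
            },
        },
    }
    return json.dumps(overrides)


single_gpu = build_overrides(0)        # the setup reported to work
multi_gpu = build_overrides([0, 1])    # the setup reported to fail

# Shell-quoted so the string can be pasted after -o on the command line.
print("-o", shlex.quote(single_gpu))
print("-o", shlex.quote(multi_gpu))
```

Running both variants with everything else identical should show whether the failure depends only on the number of devices.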

alontalmor · 2019-07-21T06:19:13Z

Hi ZHO9504, I will try to reproduce this, but because it does not happen on 1 GPU, it's likely to be an allennlp problem with multi-GPU. Which version of allennlp are you using? Thanks.
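For anyone debugging this: the assertion in the trace (`tensor.ndimension() == static_cast<int64_t>(expected_size.size())`) reads as a check that every tensor handed to `gather` has the same number of dimensions as the first one. Whether a per-replica `loss` with a different number of dimensions is really what happens here is not established in this thread; the sketch below only illustrates that kind of mismatch. It works on plain shape tuples (for a real tensor `t`, pass `tuple(t.shape)`), and `gather_shape_problems` is a hypothetical helper, not part of allennlp or PyTorch.

```python
def gather_shape_problems(shapes, dim=0):
    """List reasons why tensors with these shapes could not be
    concatenated along `dim`, using the first shape as the reference."""
    problems = []
    expected = tuple(shapes[0])
    for i, shape in enumerate(shapes[1:], start=1):
        shape = tuple(shape)
        # Every tensor must have the same number of dimensions.
        if len(shape) != len(expected):
            problems.append(
                f"replica {i}: {len(shape)} dims, expected {len(expected)}"
            )
            continue
        # Sizes must agree on every axis except the concatenation axis.
        for axis, (got, want) in enumerate(zip(shape, expected)):
            if axis != dim and got != want:
                problems.append(
                    f"replica {i}: size {got} on axis {axis}, expected {want}"
                )
    return problems


# Hypothetical shapes of output['loss'].unsqueeze(0) on two replicas.
print(gather_shape_problems([(1,), (1,)]))     # both 1-D: no problems
print(gather_shape_problems([(1,), (1, 1)]))   # second is 2-D: one problem
```

Logging the shape of each replica's `loss` just before the `gather` call in `training_util.data_parallel` and passing those shapes to a check like this would show whether the replicas disagree.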

ZHO9504 · 2019-07-21T09:22:21Z

Thank you for your reply. The version of allennlp I use:
$ allennlp --version
allennlp 0.8.5-unreleased
I had the same issue with v0.8.4.
torch 1.1.0
Validation works on HotpotQA and SearchQA with one or two GPUs,
but the error occurs when validating on TriviaQA/NaturalQuestionsShort/SearchQA with 2 GPUs.
A little strange...
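For others reporting the same error, a small script like the following can collect the version details requested above in one go. It is a minimal sketch: `package_version` is a hypothetical helper, and `importlib.metadata` requires Python 3.8 or newer (on Python 3.7, as used in this thread, the `importlib_metadata` backport would be needed instead).

```python
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


for pkg in ("allennlp", "torch"):
    print(pkg, package_version(pkg))

# The GPU count is only available when torch itself can be imported.
try:
    import torch
    print("visible GPUs:", torch.cuda.device_count())
except ImportError:
    print("torch is not installed")
```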

alontalmor · 2019-07-21T19:34:21Z

It sounds like some edge case that's a bit difficult to reproduce...
Does it happen when you evaluate only on TriviaQA or NaturalQuestionsShort?

ZHO9504 · 2019-07-22T00:51:21Z

Yes, I evaluated on each of them, but only HotpotQA and SearchQA went well.
As long as the evaluation data includes a dataset such as TriviaQA, the run fails with this error.

alontalmor · 2019-07-23T12:52:16Z

OK, I'm trying to reproduce and solve this, but it may take a few days.

Alex-Fabbri · 2019-08-06T19:10:01Z

I also got this error during multi-GPU validation, but it works fine on a single GPU. Using allennlp v0.8.4 and torch 1.1.0.

Kaimary · 2019-10-26T17:22:27Z

+1.
I also got this error during the multi-GPU validation phase. Using allennlp v0.8.4 and torch 1.1.0.

lucadiliello · 2022-01-12T09:14:29Z

I was able to train on every MRQA task with any number of GPUs using pytorch-lightning. I published the scripts here: https://github.com/lucadiliello/mrqa-lightning

ZHO9504 closed this as completed Jul 21, 2019

ZHO9504 reopened this Jul 21, 2019
