about fp16 #22

Open
666zz666 opened this issue Feb 20, 2019 · 5 comments

Comments

@666zz666
When I use fp16 (16-bit float) and multi-GPU training, the code hangs in SyncBN (comm.py).
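Roughly, the setup is the following (a simplified sketch rather than my real training code; the import names are how I assume this repo's sync_batchnorm package is used, so they may differ slightly):

    # Simplified sketch: an fp16 model wrapped for two-GPU data-parallel
    # training with synchronized batch norm.
    import torch
    from torch import nn
    from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback

    net = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1),
        SynchronizedBatchNorm2d(16),
        nn.ReLU(),
    ).cuda().half()                                      # fp16 weights/activations

    net = DataParallelWithCallback(net, device_ids=[0, 1])  # two GPUs
    out = net(torch.randn(8, 3, 64, 64).cuda().half())      # hangs inside comm.py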

@vacancy
Owner

vacancy commented Feb 20, 2019

I haven’t tried fp16 in PyTorch. Do you think it’s due to a type mismatch, fp32 vs. fp16? It would be great if you could help me by adding a try-catch in the forward method of the batch norm class; we should first check whether an exception is being thrown there.

@666zz666
Author

Thanks for your help. First, I am using two GPUs. Second, I added a try-catch in the forward method of the _SynchronizedBatchNorm class (batchnorm.py). Then I located the error step by step.

1. batchnorm.py:

    if self._parallel_id == 0:
        mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))

2. comm.py:

    results = self._master_callback(intermediates)

The error printed is 'An error occurred.'

My try-catch looks like this:

    except IOError:
        print('An error occurred trying to read the file.')

    except ValueError:
        print('Non-numeric data found in the file.')

    except ImportError:
        print('No module found.')

    except EOFError:
        print('Why did you do an EOF on me?')

    except KeyboardInterrupt:
        print('You cancelled the operation.')

    except:
        print('An error occurred.')

@vacancy
Owner

vacancy commented Feb 20, 2019

Can you give detailed information about the "error"?

For example, you may directly wrap the whole function body of forward() with a try-catch statement:

try:
    # original code here
except:
    import traceback
    traceback.print_exc()

@666zz666
Author

666zz666 commented Feb 21, 2019

The detailed information:

Traceback (most recent call last):
  File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 68, in forward
    mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))
  File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/comm.py", line 125, in run_master
    results = self._master_callback(intermediates)
  File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 108, in _data_parallel_master
    mean, inv_std = self._compute_mean_std(sum_, ssum, sum_size)
  File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 122, in _compute_mean_std
    mean = sum_ / size
RuntimeError: value cannot be converted to type at::Half without overflow: 528392
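If I read the traceback correctly, the failure is the division of the fp16 sums by the element count. A minimal sketch of what I think is happening (not verified, and the behavior may depend on the PyTorch version):

    import torch

    # 528392 is the element count from the traceback; it exceeds fp16's
    # maximum representable value (65504), so converting it to a Half scalar
    # for the division is what appears to raise the RuntimeError above.
    x = torch.randn(4, dtype=torch.half, device='cuda')
    y = x / 528392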

666zz666 reopened this Feb 21, 2019
@vacancy
Owner

vacancy commented Feb 21, 2019

Seems that some values in the tensors exceed the max value of fp16 (65504) ... I guess it's the size? Can you double-check?

I am not an expert on this: is there any solution to this? I think this should be a general problem for fp16 training.
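One possible direction, just a sketch and not something I have tested (the function below is a stand-in with assumed arguments, not the repository's _compute_mean_std): do the statistics arithmetic in fp32 and cast back afterwards.

    import torch

    def _compute_mean_std_fp32(sum_, ssum, size, eps=1e-5):
        # Sketch only: divide in fp32 so the element count `size` (which can
        # exceed fp16's max of 65504) never has to be represented as a Half
        # scalar, then cast the statistics back to the input dtype.
        sum32, ssum32 = sum_.float(), ssum.float()
        mean = sum32 / size
        var = ssum32 / size - mean ** 2
        inv_std = var.clamp(min=eps) ** -0.5
        return mean.to(sum_.dtype), inv_std.to(sum_.dtype)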
