about fp16 #22

Open
666zz666 opened this issue Feb 20, 2019 · 5 comments

Comments

@666zz666
When I use fp16 (16-bit float) and multi-GPU training, the code hangs in SyncBN (comm.py).
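Roughly, the setup is the following (a simplified sketch rather than my real training code; the import names are how I assume this repo's sync_batchnorm package is used, so they may differ slightly):

    # Simplified sketch: an fp16 model wrapped for two-GPU data-parallel
    # training with synchronized batch norm.
    import torch
    from torch import nn
    from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback

    net = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1),
        SynchronizedBatchNorm2d(16),
        nn.ReLU(),
    ).cuda().half()                                      # fp16 weights/activations

    net = DataParallelWithCallback(net, device_ids=[0, 1])  # two GPUs
    out = net(torch.randn(8, 3, 64, 64).cuda().half())      # hangs inside comm.py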

@vacancy
Owner

vacancy commented Feb 20, 2019

I haven’t tried fp16 in PyTorch. Do you think it’s due to a type mismatch, fp32 vs. fp16? It would be great if you could help me by adding a try-catch in the forward method of the batch norm class; we should first check whether an exception is being thrown there.

@666zz666
Author

Thanks for your help. First, I am using two GPUs. Second, I added a try-catch in the forward method of the _SynchronizedBatchNorm class (batchnorm.py). Then I located the error step by step.

1. batchnorm.py:

    if self._parallel_id == 0:
        mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))

2. comm.py:

    results = self._master_callback(intermediates)

The error printed is 'An error occurred.'

My try-catch looks like this:

    except IOError:
        print('An error occurred trying to read the file.')

    except ValueError:
        print('Non-numeric data found in the file.')

    except ImportError:
        print('No module found.')

    except EOFError:
        print('Why did you do an EOF on me?')

    except KeyboardInterrupt:
        print('You cancelled the operation.')

    except:
        print('An error occurred.')

@vacancy
Owner

vacancy commented Feb 20, 2019

Can you give detailed information about the "error"?

For example, you may directly wrap the whole function body of forward() with a try-catch statement:

try:
    # original code here
except:
    import traceback
    traceback.print_exc()

@666zz666
Author

666zz666 commented Feb 21, 2019

The detailed information:

Traceback (most recent call last):
  File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 68, in forward
    mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))
  File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/comm.py", line 125, in run_master
    results = self._master_callback(intermediates)
  File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 108, in _data_parallel_master
    mean, inv_std = self._compute_mean_std(sum_, ssum, sum_size)
  File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 122, in _compute_mean_std
    mean = sum_ / size
RuntimeError: value cannot be converted to type at::Half without overflow: 528392
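If I read the traceback correctly, the failure is the division of the fp16 sums by the element count. A minimal sketch of what I think is happening (not verified, and the behavior may depend on the PyTorch version):

    import torch

    # 528392 is the element count from the traceback; it exceeds fp16's
    # maximum representable value (65504), so converting it to a Half scalar
    # for the division is what appears to raise the RuntimeError above.
    x = torch.randn(4, dtype=torch.half, device='cuda')
    y = x / 528392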

666zz666 reopened this Feb 21, 2019
@vacancy
Owner

vacancy commented Feb 21, 2019

Seems that some values in the tensors exceed the max value of fp16 (65504) ... I guess it's the size? Can you double-check?

I am not an expert on this: is there any solution to this? I think this should be a general problem for fp16 training.
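One possible direction, just a sketch and not something I have tested (the function below is a stand-in with assumed arguments, not the repository's _compute_mean_std): do the statistics arithmetic in fp32 and cast back afterwards.

    import torch

    def _compute_mean_std_fp32(sum_, ssum, size, eps=1e-5):
        # Sketch only: divide in fp32 so the element count `size` (which can
        # exceed fp16's max of 65504) never has to be represented as a Half
        # scalar, then cast the statistics back to the input dtype.
        sum32, ssum32 = sum_.float(), ssum.float()
        mean = sum32 / size
        var = ssum32 / size - mean ** 2
        inv_std = var.clamp(min=eps) ** -0.5
        return mean.to(sum_.dtype), inv_std.to(sum_.dtype)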
