Softmax log backward : Increase precision of fp16's accumulator to fp32 #3427
Conversation
- Make the fp16 accumulator fp32 to increase precision
It doesn't look like this is really making much of a difference in accuracy. Does the original ticket mention what accuracy is expected? One thing I see in the code: it looks like LogAddExp is still always using half right now?
Sorry, I forgot to include the units; the numbers I mention are in ms. I did not change that. The original ticket does mention the tolerance.
Since it accumulates the numbers, using fp16 as the accumulator can be significantly affected by precision loss. Accumulating 1 a total of 4096 times will result in 2048: fp16 starts ignoring relatively small addends quite quickly, much faster than fp32.
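The saturation described above can be reproduced directly with numpy's half type (a standalone demo, not MIOpen code): once the fp16 running sum reaches 2048, the spacing between representable fp16 values is 2, so adding 1 rounds back to 2048 every time.

```python
import numpy as np

# Accumulate 1.0 exactly 4096 times in fp16 vs fp32.
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(4096):
    # Force the partial sum back to half precision each step,
    # mimicking an fp16 accumulator register.
    acc16 = np.float16(acc16 + np.float16(1.0))
    acc32 = acc32 + np.float32(1.0)

print(acc16)  # 2048.0 -- fp16 stalls once ulp(acc) > 1
print(acc32)  # 4096.0 -- fp32 accumulates exactly here
```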
With this ticket being about precision, can you post what the error was before and after this change, so we can see how much the precision has improved?
Added the precision numbers to the PR description.
The accumulator of the fp16 Softmax log backward kernel was also fp16, which caused a precision issue in ONNX's test. This PR widens the accumulator of the fp16 Softmax log backward kernel to fp32. I also initialized y for softmax backward.
old code:
fp16 += fp16*fp16
new code:
fp32 += fp16*fp16
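The one-line change above can be illustrated with a small numpy sketch (hedged: this is a simplified model of the reduction, not the actual MIOpen kernel; the array length 1500 matches the -W 1500 driver case below). Products of two fp16 values are exactly representable in fp32, so widening only the accumulator removes the per-step rounding of the running sum:

```python
import numpy as np

rng = np.random.default_rng(12345678)
y = rng.standard_normal(1500).astype(np.float16)
dy = rng.standard_normal(1500).astype(np.float16)

# Old: fp16 accumulator -- every partial sum is rounded back to half precision.
acc16 = np.float16(0.0)
for a, b in zip(y, dy):
    acc16 = np.float16(acc16 + np.float16(a) * np.float16(b))

# New: fp32 accumulator -- the fp16 products are summed in single precision.
acc32 = np.float32(0.0)
for a, b in zip(y, dy):
    acc32 = np.float32(acc32 + np.float32(a) * np.float32(b))

# fp64 reference over the same fp16 inputs isolates the accumulator error.
ref = np.dot(y.astype(np.float64), dy.astype(np.float64))
print(abs(float(acc16) - ref), abs(float(acc32) - ref))
```

The fp32 accumulator's error stays at the level of fp32 rounding, while the fp16 accumulator's error grows with the number of summed terms.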
Performance of fp32 vs fp16 accumulator:
./bin/MIOpenDriver softmaxfp16 -n 128 -c 1 -H 1 -W 1500 -F 2 -a 2 -m 0 -A 1.000000 -B 0.000000 -t 1
fp16 accumulator, average of 4 runs: 0.013351 ms
fp32 accumulator, average of 4 runs: 0.013316 ms
./bin/MIOpenDriver softmaxfp16 -n 128 -c 32 -H 150 -W 150 -F 2 -a 2 -m 0 -A 1.000000 -B 0.000000 -t 1
fp16 accumulator, average of 4 runs: 2.794 ms
fp32 accumulator, average of 4 runs: 2.783 ms
I did not see performance degradation after changing the accumulator to fp32.
Precision of fp32 vs fp16 accumulator:
fp16 accumulator:
MIOpenDriver softmaxfp16 -n 128 -c 1 -H 1 -W 1500 -F 2 -a 2 -m 0 -A 1.000000 -B 0.000000 -t 1
PRNG seed: 12345678
GPU Kernel Time Backward Softmax Elapsed: 0.013422 ms
Backward Softmax Verifies on CPU and GPU (err=0.000235)
fp32 accumulator:
MIOpenDriver softmaxfp16 -n 128 -c 1 -H 1 -W 1500 -F 2 -a 2 -m 0 -A 1.000000 -B 0.000000 -t 1
PRNG seed: 12345678
GPU Kernel Time Backward Softmax Elapsed: 0.013102 ms
Backward Softmax Verifies on CPU and GPU (err=0.000022)
Precision, fp16 vs fp32 accumulator: err = 0.000235 vs 0.000022