Softmax log backward: Increase precision of fp16's accumulator to fp32 #3427

Open · wants to merge 6 commits into develop
Conversation

@bghimireamd (Contributor) commented Dec 6, 2024:

The accumulator for fp16 Softmax log backward was also fp16, which caused a precision issue in ONNX's test. This PR widens the accumulator for fp16 Softmax log backward to fp32. I also initialized y for softmax backward.

old code: fp16 += fp16*fp16
new code: fp32 += fp16*fp16
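
A minimal numpy sketch of the accumulation pattern (float16/float32 as stand-ins for the kernel's half/float accumulators; the generic dot-product reduction, array size, and data here are illustrative only, not the actual MIOpenSoftmax.cl code):

```python
import numpy as np

# Illustration only: a generic "acc += a[i] * b[i]" reduction over fp16 data,
# once with an fp16 accumulator (old behaviour) and once with an fp32
# accumulator (new behaviour). Sizes and data are arbitrary.
rng = np.random.default_rng(12345678)
a = rng.random(1500).astype(np.float16)
b = rng.random(1500).astype(np.float16)

acc16 = np.float16(0.0)          # old code: fp16 += fp16 * fp16
for i in range(a.size):
    acc16 = np.float16(acc16 + a[i] * b[i])

acc32 = np.float32(0.0)          # new code: fp32 += fp16 * fp16
for i in range(a.size):
    acc32 += np.float32(a[i]) * np.float32(b[i])

# Compare both against a float64 reference reduction.
ref = np.dot(a.astype(np.float64), b.astype(np.float64))
print("fp16 accumulator relative error:", abs(float(acc16) - ref) / ref)
print("fp32 accumulator relative error:", abs(float(acc32) - ref) / ref)
```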

Performance of fp32 vs fp16:
./bin/MIOpenDriver softmaxfp16 -n 128 -c 1 -H 1 -W 1500 -F 2 -a 2 -m 0 -A 1.000000 -B 0.000000 -t 1
fp16 accumulator, average of 4 runs: 0.013351 ms
fp32 accumulator, average of 4 runs: 0.013316 ms

./bin/MIOpenDriver softmaxfp16 -n 128 -c 32 -H 150 -W 150 -F 2 -a 2 -m 0 -A 1.000000 -B 0.000000 -t 1
fp16 accumulator, average of 4 runs: 2.794 ms
fp32 accumulator, average of 4 runs: 2.783 ms

I did not see performance degradation after changing the accumulator to fp32.

Precision of fp32 vs fp16:
fp16 accumulator
MIOpenDriver softmaxfp16 -n 128 -c 1 -H 1 -W 1500 -F 2 -a 2 -m 0 -A 1.000000 -B 0.000000 -t 1
PRNG seed: 12345678
GPU Kernel Time Backward Softmax Elapsed: 0.013422 ms
Backward Softmax Verifies on CPU and GPU (err=0.000235)

fp32 accumulator
MIOpenDriver softmaxfp16 -n 128 -c 1 -H 1 -W 1500 -F 2 -a 2 -m 0 -A 1.000000 -B 0.000000 -t 1
PRNG seed: 12345678
GPU Kernel Time Backward Softmax Elapsed: 0.013102 ms
Backward Softmax Verifies on CPU and GPU (err=0.000022)

Precision, fp16 vs fp32: err = 0.000235 vs err = 0.000022

 - make accumulator of fp16 as fp32 to increase the precision
@BradPepersAMD (Collaborator) commented:
It doesn't look like this really is making much of a difference in the accuracy. Does the original ticket mention what accuracy is expected? One thing I see in the code is that it looks like LogAddExp is still always using half right now?

@bghimireamd (Author) commented Dec 6, 2024:

> It doesn't look like this really is making much of a difference in the accuracy. Does the original ticket mention what accuracy is expected? One thing I see in the code is that it looks like LogAddExp is still always using half right now?

Sorry, I forgot to include the units: the numbers I mention are in ms. I did not change LogAddExp since it is only used in the forward case; the precision issue was seen in the backward case. After I ported my change into ONNX's docker, the test passed.

The original ticket does mention the tolerance: idx: 37504 expected 0.23584, got 0.288086, diff: 0.0522461, tol=0.021792

@CAHEK7 (Contributor) commented Dec 7, 2024:

> It doesn't look like this really is making much of a difference in the accuracy. Does the original ticket mention what accuracy is expected? One thing I see in the code is that it looks like LogAddExp is still always using half right now?

Since it accumulates the numbers, using FP16 as the accumulator can be significantly affected by precision loss. Accumulating 1 a total of 4096 times results in 2048. FP16 starts ignoring relatively small values quite quickly, much faster than FP32.
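
A quick numpy illustration of that effect, using float16 as a stand-in for the kernel's half type (accumulating one element at a time, since numpy's own sum() uses pairwise summation and would mask the loss):

```python
import numpy as np

# Naively accumulate 1.0 into an fp16 vs an fp32 accumulator, 4096 times.
# Above 2048 the fp16 spacing is 2.0, so 2048 + 1 rounds back to 2048 and
# the fp16 sum stops growing; the fp32 sum reaches 4096 exactly.
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(4096):
    acc16 = np.float16(acc16 + np.float16(1.0))
    acc32 = acc32 + np.float32(1.0)

print(acc16)  # 2048.0
print(acc32)  # 4096.0
```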

src/kernels/MIOpenSoftmax.cl: review comment resolved (outdated)
@bpepers-me (Contributor) commented:
With this ticket being about precision, can you post what the error was before and after this change so we can see how much the precision has improved?

@bghimireamd (Author) commented:
> With this ticket being about precision, can you post what the error was before and after this change so we can see how much the precision has improved?

Added the precision numbers to the PR description.
