
Fix nans in gradient from inverse softplus #123

Merged
merged 3 commits into mllam:main from fix/clamping_nan_grads
Feb 12, 2025

Conversation

@SimonKamuk (Contributor) commented Feb 11, 2025

Describe your changes

Fix a bug where the inverse softplus caused gradients to become NaNs.

I believe the issue was that, even though the non_linear_part had clamping inside and its values were later discarded when the input exceeded the threshold, the input x values could still cause numerical instabilities before the threshold check.

To fix it, I clamp the input x values instead. The limits were previously defined as $\exp(x \cdot \beta) - 1 \ge 10^{-6}$ and $x \cdot \beta \le \text{threshold}$, so I changed them to the equivalent $\log(10^{-6} + 1)/\beta \le x \le \text{threshold}/\beta$.
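A minimal sketch of the clamped-input approach (the function signature and default values are assumptions for illustration, not the exact repo code):

```python
import math

import torch


def inverse_softplus(x, beta=1.0, threshold=20.0):
    """Inverse of softplus(x) = log(1 + exp(beta * x)) / beta (sketch)."""
    # Clamp the input so the non-linear branch is only ever evaluated on
    # safe values: exp(x * beta) - 1 >= 1e-6 and x * beta <= threshold,
    # i.e. log(1e-6 + 1) / beta <= x <= threshold / beta. Even branch
    # values that torch.where later discards must stay finite, because
    # autograd still back-propagates through the unselected branch.
    x_clamped = torch.clamp(
        x, min=math.log(1e-6 + 1.0) / beta, max=threshold / beta
    )
    non_linear_part = torch.log(torch.expm1(x_clamped * beta)) / beta
    # Above the threshold, softplus is numerically the identity
    return torch.where(x * beta <= threshold, non_linear_part, x)
```

The key point is that the clamp happens before any op that can overflow or underflow, rather than after.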

Issue Link

closes #119 (Clamping introduces nan if ar_steps_train > 1)

Type of change

  • [x] 🐛 Bug fix (non-breaking change that fixes an issue)
  • [ ] ✨ New feature (non-breaking change that adds functionality)
  • [ ] 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch; if not, update your fork with the changes from the target branch (use pull with the --rebase option if possible).
  • I have performed a self-review of my code.
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values.
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code.
  • I have updated the README to cover introduced code changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (the assignee is responsible for merging). This applies only if you have write access to the repo; otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixed: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@SimonKamuk SimonKamuk added the bug Something isn't working label Feb 11, 2025
@SimonKamuk SimonKamuk added this to the v0.4.0 milestone Feb 11, 2025
@SimonKamuk SimonKamuk self-assigned this Feb 11, 2025
@SimonKamuk (Contributor, Author) commented:
Since our tests now use the reduced danra domain and are relatively quick, what do you think about adding detect_anomaly=True to the pl.Trainer in test_training.py? That would prevent an issue like this from getting through (assuming the training config used for the test includes any such future features). Is it a bit overkill?
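A minimal sketch of what that could look like (assuming the test constructs the pl.Trainer directly; argument values are illustrative):

```python
import pytorch_lightning as pl

# detect_anomaly=True wraps training in torch.autograd.detect_anomaly(),
# so a NaN produced in the backward pass raises an error at the
# offending op instead of silently propagating into the weights.
trainer = pl.Trainer(
    max_epochs=1,  # illustrative value
    detect_anomaly=True,
)
```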

@sadamov (Collaborator) left a comment:

Tested this with two different datasets and ar_steps_train settings; it solves the bug. Thanks, Simon.

@SimonKamuk SimonKamuk merged commit 103c8b7 into mllam:main Feb 12, 2025
8 checks passed
@SimonKamuk SimonKamuk deleted the fix/clamping_nan_grads branch February 12, 2025 08:48
@joeloskarsson (Collaborator):
I think it would be a good idea to add detect_anomaly=True to the tests.
