Move from torch.cuda.amp to torch.autocast; Add tests for amp #838

Open · wants to merge 28 commits into main

Conversation

@misko (Collaborator) commented Sep 9, 2024

This PR updates the deprecated torch.cuda.amp.autocast(args...) and torch.cuda.amp.GradScaler(args...) usage and also adds CPU AMP tests (https://pytorch.org/docs/stable/amp.html).
Recently an eSCN model failed to run under GPU AMP and the current set of tests did not catch it. This PR adds the equivalent tests on CPU AMP, which will catch this going forward.
Additional fixes have been made to eSCN / SCN / GemNet-OC in this PR.
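For context, a minimal sketch of the device-agnostic API this migration targets (illustrative only, not code from this diff; the actual call sites live in the changed files):

import torch

# Old, CUDA-only namespaces (deprecated):
#   with torch.cuda.amp.autocast():
#       ...
#   scaler = torch.cuda.amp.GradScaler()

# Device-agnostic replacements; device_type can be "cuda" or "cpu",
# which is what lets the new CPU AMP tests exercise the same code path.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
with torch.autocast(device_type=device_type, enabled=True):
    pass  # forward pass runs in mixed precision here

scaler = torch.amp.GradScaler(device_type, enabled=device_type == "cuda")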

@misko misko marked this pull request as ready for review September 9, 2024 23:52
@misko misko added the enhancement (New feature or request) and patch (Patch version release) labels Sep 10, 2024
@misko misko mentioned this pull request Sep 10, 2024
@lbluque (Collaborator) left a comment

Thanks @misko, mostly style suggestions

src/fairchem/core/models/equiformer_v2/layer_norm.py (outdated review thread, resolved)
if node_energy.device.type == "cuda":
energy.index_add_(0, data.batch, node_energy.view(-1))
else:
energy.index_add_(0, data.batch, node_energy.float().view(-1))
Collaborator commented:

Do we lose a lot of performance or use too much memory when casting? If not, should we cast regardless?
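A minimal sketch of what casting regardless would look like (illustrative only, not the PR's code); both device paths would then share a single line:

energy.index_add_(0, data.batch, node_energy.float().view(-1))  # cast unconditionally on CPU and GPU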

src/fairchem/core/modules/scaling/fit.py (review thread, resolved)
tests/core/e2e/test_s2ef.py (review thread, resolved)

codecov bot commented Sep 19, 2024

Codecov Report

Attention: Patch coverage is 71.13402% with 28 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...c/fairchem/core/models/equiformer_v2/layer_norm.py | 64.93% | 27 Missing ⚠️
...e/models/equiformer_v2/equiformer_v2_deprecated.py | 0.00% | 1 Missing ⚠️

Files with missing lines | Coverage Δ
...rc/fairchem/applications/cattsunami/core/ocpneb.py | 87.12% <100.00%> (+0.09%) ⬆️
src/fairchem/core/common/relaxation/ase_utils.py | 63.15% <100.00%> (+1.20%) ⬆️
src/fairchem/core/models/escn/escn.py | 95.86% <100.00%> (ø)
src/fairchem/core/models/gemnet_oc/gemnet_oc.py | 89.91% <100.00%> (ø)
src/fairchem/core/models/scn/scn.py | 93.56% <100.00%> (ø)
src/fairchem/core/modules/scaling/fit.py | 72.72% <100.00%> (ø)
src/fairchem/core/trainers/base_trainer.py | 88.86% <100.00%> (+0.94%) ⬆️
src/fairchem/core/trainers/ocp_trainer.py | 69.12% <100.00%> (+0.10%) ⬆️
...e/models/equiformer_v2/equiformer_v2_deprecated.py | 91.15% <0.00%> (ø)
...c/fairchem/core/models/equiformer_v2/layer_norm.py | 57.64% <64.93%> (-0.26%) ⬇️

@misko misko requested a review from lbluque September 19, 2024 23:46
@lbluque (Collaborator) left a comment

lg, just minor suggestions!

@@ -36,6 +36,8 @@ def __init__(
precon=None,
cpu=False,
batch_size=4,
seed=0, # set a seed for reproducibility
amp=None,
Collaborator commented:

Why not just set amp=True as the default? From the lines below it looks like that's the case.

@@ -110,11 +112,13 @@ def __init__(
local_rank=config.get("local_rank", 0),
is_debug=config.get("is_debug", True),
cpu=cpu,
amp=True,
amp=(amp==None or amp), # AMP on by default
Collaborator commented:

Nit: it's more pythonic to check for None with amp is None (None is a singleton).
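A minimal sketch of the quoted line with that nit applied (illustrative only, not a committed change):

amp=(amp is None or amp),  # AMP on by default; identity check for None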

if self.lmax > 0:
num_m_components = (self.lmax + 1) ** 2
feature = node_input.narrow(1, 1, num_m_components - 1)
with torch.autocast(device_type=node_input.device.type, enabled=False):
Collaborator commented:

Is it possible to double up on autocast decorators like the forward method above?
I am just worried about potential incorrect indentation bugs in the future.
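For reference, a minimal sketch (not from this PR; the class and method names are hypothetical) contrasting the two forms. The decorator fixes device_type when the class is defined, while the context manager can follow the input's device at runtime, which is presumably why the quoted code uses the with-statement:

import torch
from torch import nn

class HypotheticalLayer(nn.Module):
    # Decorator form: device_type must be chosen up front.
    @torch.autocast(device_type="cuda", enabled=False)
    def _fp32_cuda_only(self, x: torch.Tensor) -> torch.Tensor:
        return x.float().norm(dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Context-manager form: device_type can follow the input at runtime.
        with torch.autocast(device_type=x.device.type, enabled=False):
            return x.float().norm(dim=-1)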

out = self._forward(batch)
out = {k: v.float() for k, v in out.items()}
Collaborator commented:

Did we decide to always cast predictions to float32? That sounds ok to me, just making sure because this is different than before.
