Kliff master v1 lightning #182
Conversation
…r stress in training loss
…ckpt resume capabilities + docstrings
optimizer_name: Name of the optimizer to use. Default is "Adam".
lr: Learning rate for the optimizer. Default is 0.001.
energy_weight: Weight for the energy loss. Default is 1.0.
forces_weight: Weight for the forces loss. Default is 1.0.
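For context, here is a minimal sketch of how these docstring parameters might be wired into a LightningModule. The class name `PotentialModule` and its structure are illustrative assumptions, not the actual KLIFF trainer code:

```python
# Hypothetical sketch: consuming optimizer_name, lr, energy_weight, forces_weight.
import torch
import pytorch_lightning as pl


class PotentialModule(pl.LightningModule):  # hypothetical class name
    def __init__(self, model, optimizer_name="Adam", lr=0.001,
                 energy_weight=1.0, forces_weight=1.0):
        super().__init__()
        self.model = model
        self.optimizer_name = optimizer_name
        self.lr = lr
        self.energy_weight = energy_weight
        self.forces_weight = forces_weight

    def configure_optimizers(self):
        # Look up the optimizer class by name, e.g. "Adam" -> torch.optim.Adam.
        optimizer_cls = getattr(torch.optim, self.optimizer_name)
        return optimizer_cls(self.model.parameters(), lr=self.lr)
```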
How about `stress_weight`?
TODO: once the stress issue is fixed.
+ forces_weight * torch.mean(per_atom_force_loss) / 3
)  # divide by 3 to get correct MSE

self.log(
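For readers of the thread, here is a small self-contained sketch of the weighted loss being discussed, assuming `per_atom_force_loss` holds the sum of squared errors over the three force components of each atom (so dividing its mean by 3 gives a per-component MSE). The function is illustrative, not the actual implementation; a `stress_weight` term could be added analogously once the stress issue above is fixed:

```python
# Illustrative sketch of the weighted energy + forces loss (not KLIFF's code).
import torch


def weighted_loss(pred_energy, target_energy, pred_forces, target_forces,
                  energy_weight=1.0, forces_weight=1.0):
    # Mean squared error on the total energy.
    energy_loss = torch.mean((pred_energy - target_energy) ** 2)
    # Per-atom sum of squared errors over the 3 force components.
    per_atom_force_loss = torch.sum((pred_forces - target_forces) ** 2, dim=1)
    return (
        energy_weight * energy_loss
        + forces_weight * torch.mean(per_atom_force_loss) / 3  # /3 -> per-component MSE
    )
```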
Besides the sum of the losses, people would typically be more interested in the separate losses on energy and forces.
Also, people can be interested in metrics other than MSE, such as MAE.
Most importantly, it is not interesting to know the losses/metrics of a validation step for a single batch of data; the losses/metrics over all the data at each epoch are what matter. This requires aggregating results across steps. I'd suggest using torchmetrics, which does this automatically.
So, we need to report separate losses and allow other metrics, e.g. using the torchmetrics package.
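As a rough illustration of this suggestion, a sketch of how torchmetrics could aggregate validation metrics across batches and log per-epoch values. The metric names, logging keys, batch layout, and forward signature are assumptions, not existing KLIFF code:

```python
# Sketch: per-epoch validation metrics via torchmetrics (assumed names/structure).
import pytorch_lightning as pl
from torchmetrics import MeanAbsoluteError, MeanSquaredError


class ValidationMetricsModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.val_energy_mse = MeanSquaredError()
        self.val_forces_mae = MeanAbsoluteError()

    def validation_step(self, batch, batch_idx):
        pred_energy, pred_forces = self(batch)  # assumed forward signature
        self.val_energy_mse.update(pred_energy, batch["energy"])
        self.val_forces_mae.update(pred_forces, batch["forces"])
        # Logging the metric objects lets Lightning/torchmetrics handle the
        # cross-batch (and cross-device) aggregation and reset once per epoch.
        self.log("val/energy_mse", self.val_energy_mse, on_step=False, on_epoch=True)
        self.log("val/forces_mae", self.val_forces_mae, on_step=False, on_epoch=True)
```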
torchmetrics looks great. I will add support for it.
Can you explain this a bit:
> Most importantly, it is not interesting to know the losses/metrics of a validation step for a single batch of data; the losses/metrics over all the data at each epoch are what matter.
Currently, the validation loss/metrics are evaluated and reported for each mini-batch of data, which is only part of the validation set. But people generally would be interested in the loss/metrics over all the validation/test data, not each mini-batch.
I don't think so. In logging, when we give `on_epoch=True` and `on_step=False`, then I believe the logger will call the `after_batch_end` function to simply accumulate the loss and only log it in the `after_epoch_end` function. In the log files, too, I could see that validation losses are logged per epoch, not per batch. This is similar to what I am doing in the loss trajectory callback, where `on_validation_batch_end` just gathers the results and `on_validation_epoch_end` writes them to a file. I can recheck the API, but I am quite certain that is the case.
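A hedged sketch of the accumulation pattern described above, i.e. a Lightning `Callback` that gathers per-batch validation losses in `on_validation_batch_end` and writes the per-epoch aggregate in `on_validation_epoch_end`. The output file name and the `"loss"` key in `outputs` are assumptions about what `validation_step` returns, not the actual KLIFF callback:

```python
# Sketch of a gather-per-batch, write-per-epoch validation callback (assumed details).
import pytorch_lightning as pl


class LossTrajectoryCallback(pl.Callback):
    def __init__(self, out_file="loss_trajectory.txt"):
        self.out_file = out_file
        self._batch_losses = []

    def on_validation_batch_end(self, trainer, pl_module, outputs, batch,
                                batch_idx, dataloader_idx=0):
        # Just gather the per-batch result here...
        self._batch_losses.append(float(outputs["loss"]))

    def on_validation_epoch_end(self, trainer, pl_module):
        # ...and only write the aggregated value once per epoch.
        if self._batch_losses:
            epoch_loss = sum(self._batch_losses) / len(self._batch_losses)
            with open(self.out_file, "a") as f:
                f.write(f"{trainer.current_epoch} {epoch_loss}\n")
            self._batch_losses.clear()
```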
@ipcamit Nothing major, but a few clarifying questions and minor tweaks.
Addressed most comments. Need some more time and a meeting with Josh to concretize the loss trajectory. Will address the descriptor dataloader issues with the torch trainer, which uses the descriptor module, so it will be easier to see the design choices.
Summary
Added the base trainer and Lightning trainer, along with new tests for the trainer module.
Additional dependencies introduced (if any)
TODO (if any)
Add KIM Trainer next
Checklist
Before a pull request can be merged, the following items must be checked:
Type check your code.
Note that the CI system will run all the above checks, but it will be much more efficient if you fix most errors before submitting the PR.