
Kliff master v1 lightning #182

Merged: 26 commits into openkim:v1 on Jun 24, 2024

Conversation

@ipcamit commented Jun 18, 2024

Summary

Added the base trainer and the Lightning trainer, along with new tests for the trainer module.

Additional dependencies introduced (if any)

  • torch lightning, for the Lightning trainer (listed in the ["test"] extras)

TODO (if any)

Add the KIM Trainer next.

Checklist

Before a pull request can be merged, the following items must be checked:

  • Make sure your code is properly formatted. isort and black are used for this purpose. The simplest way is to use pre-commit. See instructions here.
  • Docstrings have been added to your code in the Google docstring format.
  • Type annotations are highly encouraged. Run mypy to type check your code.
  • Tests have been added for any new functionality or bug fixes.
  • All linting and tests pass.

Note that the CI system will run all of the above checks, but it is much more efficient if you fix most errors before submitting the PR.

Resolved review threads: .gitignore (one thread) and kliff/dataset/dataset.py (four threads).
optimizer_name: Name of the optimizer to use. Default is "Adam"
lr: Learning rate for the optimizer. Default is 0.001
energy_weight: Weight for the energy loss. Default is 1.0
forces_weight: Weight for the forces loss. Default is 1.0
Collaborator:

How about stress_weight?

Author:

TODO: once the stress issue is fixed.
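For context, a minimal sketch of how these weighted terms combine, with a note on where a stress term could later slot in. This is an illustration, not the PR's actual code: the function name and the energy term are assumptions based on the documented defaults; only the forces term appears verbatim in the quoted diff.

    import torch

    def weighted_loss(energy_loss, per_atom_force_loss,
                      energy_weight=1.0, forces_weight=1.0):
        # Hypothetical combination of the documented loss weights; a stress
        # term (stress_weight * stress_loss) could be added analogously once
        # the stress issue is fixed.
        return (
            energy_weight * energy_loss
            + forces_weight * torch.mean(per_atom_force_loss) / 3  # /3: per-component MSE
        )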

Resolved review threads on kliff/trainer/lightning_trainer.py (three threads).
+ forces_weight * torch.mean(per_atom_force_loss) / 3
) # divide by 3 to get correct MSE

self.log(

Collaborator:

Besides a sum of the loss, people would typically be more interested in separate losses on energy and forces.

Also, people may be interested in metrics other than MSE, such as MAE.

Most importantly, it is not interesting to know losses/metrics of a validation step for a batch of data. The losses/metrics at each epoch for all data matter. This requires aggregating the data between steps. I'd suggest using torchmetrics, which automatically does it.

So we need to report separate losses and allow other metrics, e.g. using the torchmetrics package.
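As a reference for the suggestion above, a minimal sketch of torchmetrics-based aggregation inside a LightningModule. The metric names, batch keys, and model outputs are assumptions for illustration, not KLIFF's actual API.

    import torchmetrics
    import pytorch_lightning as pl

    class ValidationMetricsSketch(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # Separate metrics for energy and forces; torchmetrics accumulates
            # state across batches and computes over the full validation set.
            self.val_energy_mae = torchmetrics.MeanAbsoluteError()
            self.val_forces_mae = torchmetrics.MeanAbsoluteError()

        def validation_step(self, batch, batch_idx):
            energy_pred, forces_pred = self(batch)  # assumed model outputs
            self.val_energy_mae.update(energy_pred, batch["energy"])
            self.val_forces_mae.update(forces_pred, batch["forces"])
            # Logging the metric objects defers compute()/reset() to epoch end,
            # so reported values cover all validation data, not a single batch.
            self.log("val/energy_mae", self.val_energy_mae, on_step=False, on_epoch=True)
            self.log("val/forces_mae", self.val_forces_mae, on_step=False, on_epoch=True)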

Author:

torchmetrics looks great. I will add support for it.
Can you explain this a bit:

Most importantly, it is not interesting to know losses/metrics of a validation step for a batch of data. The losses/metrics at each epoch for all data matter.

Collaborator:

Currently, the validation loss/metrics are evaluated and reported for each mini batch of data, which is only part of the validation data. But people are generally interested in the loss/metrics over all validation/test data, not each mini batch.

Author:

I don't think so. When we log with on_epoch=True and on_step=False, I believe the logger accumulates the loss at the end of each batch and only logs it at the end of the epoch. In the log files, too, I could see that validation losses are logged per epoch, not per batch. This is similar to what I am doing in the loss trajectory callback, where on_validation_batch_end just gathers the results and on_validation_epoch_end writes them to a file. I can recheck the API, but I am quite certain that is the case.
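A sketch of the logging behavior described above, assuming a hypothetical compute_loss helper: with on_step=False and on_epoch=True, Lightning accumulates the per-batch values and logs a single mean-reduced value at epoch end.

    def validation_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        # on_step=False, on_epoch=True: Lightning caches per-batch values and
        # logs one aggregated value (mean-reduced by default) per epoch.
        self.log("val_loss", loss, on_step=False, on_epoch=True)
        return loss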

@mjwen (Collaborator) commented Jun 21, 2024

@ipcamit Nothing major, but a few clarifying questions and minor tweaks.

@ipcamit (Author) commented Jun 23, 2024

Addressed most comments. I need some more time, and a meeting with Josh, to concretize the loss trajectory. I will address the descriptor dataloader issues with the torch trainer, which uses the descriptor module, so it will be easier to see the design choices.

@mjwen merged commit f63d092 into openkim:v1 on Jun 24, 2024 (1 of 4 checks passed).
@ipcamit deleted the kliff-master-v1-lightning branch on June 24, 2024 at 21:21.