Design Review 1: KLIFF Trainer framework #172
Conversation
Codecov Report. Attention: Patch coverage is

```
@@           Coverage Diff            @@
##               v1     #172      +/-   ##
==========================================
- Coverage   63.14%   55.43%   -7.71%
==========================================
  Files          50       56       +6
  Lines        4718     5401     +683
==========================================
+ Hits         2979     2994      +15
- Misses       1739     2407     +668
```

☔ View full report in Codecov by Sentry.
kliff/dataset/dataset.py
Outdated
```diff
@@ -630,10 +638,37 @@ def _read_from_colabfit(
     logger.error(f"{colabfit_dataset} is either empty or does not exist")
     raise DatasetError(f"{colabfit_dataset} is either empty or does not exist")

     if isinstance(weight, Path):
```
Let's use `if not isinstance(weight, Weight):` to avoid the case where a string is passed in as a `Path`.
Alternatively, we can add a utility function to `utils.py` to convert to a `Path`, like this one:

```python
def to_path(path):
    return Path(path).expanduser().resolve()
```

I prefer the latter.
The weight-setting method has been modified, so this comment may no longer be valid.
kliff/dataset/dataset.py
Outdated
```diff
@@ -730,9 +779,37 @@ def _read_from_path(
     parent = path.parent
     all_files = [path]

     if isinstance(weight, Path):
```
The weights-reading logic here is the same as the one used above, so let's factor it out into a function.
kliff/trainer/kliff_trainer.py
Outdated
```python
from kliff.transforms.parameter_transforms import ParameterTransform
from kliff.transforms.property_transforms import PropertyTransform

from ..dataset.weight import Weight
```
It is probably OK to use a relative import from the current directory, but let's change to absolute imports for all other cases like this one (`..dataset.weight`).
```python
    f"Optimizable parameters must be string or value dict. Got {input_params} instead."
)

def setup_optimizer(self):
```
Should this be implemented as `get_optimizer()` below?
`setup_optimizer` was chosen because it might be more involved, e.g. setting up EMA, learning-rate decay, etc. I have no preference either way.
Sorry for the late reply!
I cannot remember; better to check with Ryan. I only have major comments for the trainer.
Trainer:

```python
class Trainer:
    def __init__(self, model: Union[KIMModel, TorchModel, TarModel], transform: Transform, training_config: dict): ...
    def fit(self, train_loader, val_loader): ...
    def test(self, test_loader): ...
    # other necessary functions, such as saving a KIM model
```

As you can see, this is again modeled after Lightning. Moving the instantiation of the model, dataset/loader, and transform out of the Trainer makes it much simpler, which should be easier to maintain.
I can look into the hydra library, but for datasets, models, etc., do you think it would be better to have separate utility functions to set up datasets and models? The whole idea is to make sure that, using this single yaml file, you are able to get the same results in terms of dataset splits etc.
Yes, I'm still imagining a single yaml file to configure all the components, but that single yaml file can be composed from a couple of separate files. If one does not like to use multiple files, all the parts can still be put in a single file.
I guess we can provide these low-level parts to offer more granularity; they would be extremely useful if one wants to build a new trainer for a new model and such. At the same time, we can use the low-level functionality to achieve one-yaml-file style setups for reproducibility. This can be done, e.g., via an example training script that first instantiates all the components, such as the dataset and model, and then passes them to the trainer to train the model and log metrics. I guess the major difference is instantiating the classes inside the Trainer or outside; I think the latter is slightly better.
I have mixed feelings about hydra: it can simplify config management, but tracking and debugging can be a bit more involved. I am not sure whether we want to use it here; perhaps you can explore it a bit and then we can decide. But even if we don't use hydra, I think we still need to use OmegaConf to read, merge, and write config files, instead of the vanilla yaml module.
This condition is already satisfied to some extent. The configuration is split into 6 independent blocks.
I guess I can try splitting it into 3 modules.
The optimizer can be a separate module, but it would be difficult to organize it as one owing to scipy's different API. We can make a private optimizer module for torch. Thoughts, suggestions? Reason for the above division:
Yes, the previous yaml file does serve the purpose of separating them into different parts; we just need to organize the internal code in that direction a bit. I agree with the three-module approach. In particular, we want to integrate the optimizer with a trainer for the exact reason you've mentioned. I also agree that workspace settings need to be separate and not associated with a trainer.
A note on the glossary: everywhere, the configuration dictionary is called a "manifest", as "configuration" was becoming too confusing; for example, "dataset_from_configuration" creates confusion about whether it uses the
Regarding OmegaConf: I did integrate it, but had to remove it afterwards, as it adds a lot of needless complexity with no clear advantage. The biggest issue was that it only supports "primitive" datatypes in Python; something as basic as a numpy array needs custom solutions. I think a dict is flexible enough to be the better option. There might be some rough edges, which will clear up as we add more functionality like the ML trainer.
Disregard this for the time being; I need to work a bit more on it before opening the PR again.
Idea: a coherent, fully reproducible framework to train all supported OpenKIM models (including ML ones), with the ability to archive the exact training methodology for reuse and make it available from OpenKIM.
The easiest way is to create a yaml file that can be hashed to ensure integrity; this hash can then be included in
kimspec.edn
etc. for provenance. This PR contains the first draft of the framework, along with its implementation for KIM physics-based models. I think it is better to have a look at it now than to review huge changes all at once. An example yaml file is also at the bottom.

Core changes and contributions:
`Dataset`
: The dataset object can now load weights from a file (I think this was also on one of the todo lists). The weights file has to be a 4-column, whitespace-separated ASCII file that numpy can load directly. The 4 columns represent the configuration, energy, forces, and stress weights. The file must have either 1 row (all weights are the same) or n rows (where n is the number of configurations the weights are to be set for). In both ML- and KIM-based training, these per-configuration weights will be used.

`_exceptions`
: This is just a proposal, and not a very strong one at that, but I think it might be useful to collect all of the file-based exceptions under a single file. This gives the flexibility to use them anywhere in KLIFF more freely. Currently we might run into circular imports if two related modules want to use the same exception (for example, the Trainer base class and the KIM trainer should both raise a TrainerError). Let me know your thoughts.

`kim_residuals`
: A simple collection of the most-used loss functions. Another idea is to save a pickled version of the loss function for reproducibility.

`kliff_trainer`
: This is the base class for the trainer framework. All trainers must inherit this class. It includes some basic functionality:
a. Initialize the class members from yaml / dict.
b. Seed all modules: numpy, scipy, torch, etc.
c. Set up the workspace and model folders, where each independent run is automatically timestamped and saved.
d. Set up the dataset: load the data, apply any transformation of properties (like energy normalization), compute fingerprint objects, hash the dataset based on path, transforms, and weights, and save the processed dataset as a dill pickle object. The next time any trainer needs the same dataset configuration, it can be reloaded directly.
e. Set up the test-train split, either based on index files or ranges, splitting the dataset into test data and train data.
f. Call trainer-specific functions: `setup_model`, `setup_parameter_transforms`, `setup_optimizer`.
g. Hash and archive the training configuration.

`kim_trainer`
: First example of subclassing this trainer class, for KIM physics-based models. It basically extends `setup_model`, `setup_parameter_transforms`, `setup_optimizer`, and `save_kim_model`.

`Tar model`
: A KIM model tarred into a single file, for future-proofing against LLNL/KIMKit.
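The weights-file convention described under `Dataset` above could be read with plain numpy along these lines. This is a hedged sketch: the function name and error handling are illustrative, not KLIFF's actual API.

```python
import numpy as np

def load_config_weights(path, n_configs):
    """Load a 4-column (config, energy, forces, stress) whitespace-separated
    weights file; a single row is broadcast to all configurations."""
    w = np.loadtxt(path, ndmin=2)  # shape (1, 4) or (n_configs, 4)
    if w.shape[1] != 4:
        raise ValueError("weights file must have exactly 4 columns")
    if w.shape[0] == 1:
        w = np.tile(w, (n_configs, 1))  # same weights for every configuration
    elif w.shape[0] != n_configs:
        raise ValueError(f"weights file must have 1 or {n_configs} rows")
    return w
```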
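A minimal sketch of the `_exceptions` idea above: one module that both the Trainer base class and a KIM trainer could import, avoiding circular imports. `DatasetError` and `TrainerError` are named in the text; the `KLIFFError` base class is an assumed name.

```python
# Sketch of a shared _exceptions module; KLIFFError is an assumed base name.
class KLIFFError(Exception):
    """Base class for all KLIFF errors."""

class DatasetError(KLIFFError):
    """Raised for dataset loading/parsing problems."""

class TrainerError(KLIFFError):
    """Raised by the Trainer base class and its subclasses."""
```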
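As an illustration of the kind of function a `kim_residuals` collection might hold, here is a hedged sketch of a weighted least-squares residual; the name and signature are illustrative, not necessarily KLIFF's actual API.

```python
import numpy as np

def weighted_residual(predicted, reference, weight=1.0):
    """Elementwise weighted residual; squaring and summing its entries
    gives a weighted sum-of-squares loss usable by least-squares optimizers."""
    return np.sqrt(weight) * (np.asarray(predicted, dtype=float)
                              - np.asarray(reference, dtype=float))
```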
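Steps a-g of the base trainer can be sketched as a class skeleton. This is a hedged sketch of the flow described above, not the actual `kliff_trainer` code: the hook names `setup_model`, `setup_parameter_transforms`, and `setup_optimizer` come from the text, and everything else is illustrative.

```python
import hashlib, json, random

class TrainerBase:
    """Skeleton of the base-trainer flow; subclasses implement the hooks."""

    def __init__(self, manifest: dict):
        self.manifest = manifest                   # a. init members from yaml/dict
        self.seed_all(manifest.get("seed", 0))     # b. seed all modules
        self.setup_workspace()                     # c. timestamped run folders
        self.setup_dataset()                       # d. load/transform/hash/cache data
        self.setup_test_train_split()              # e. split into train/test
        self.setup_model()                         # f. trainer-specific hooks
        self.setup_parameter_transforms()
        self.setup_optimizer()
        self.config_hash = self.hash_manifest()    # g. hash/archive the config

    def seed_all(self, seed: int):
        random.seed(seed)  # a real trainer would also seed numpy/scipy/torch

    def hash_manifest(self) -> str:
        blob = json.dumps(self.manifest, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    # default no-op / abstract hooks for subclasses such as a KIM trainer
    def setup_workspace(self): pass
    def setup_dataset(self): pass
    def setup_test_train_split(self): pass
    def setup_model(self): raise NotImplementedError
    def setup_parameter_transforms(self): pass
    def setup_optimizer(self): raise NotImplementedError
```

A subclass only needs to fill in the model- and optimizer-specific hooks; the base class drives the common sequence and records the manifest hash for provenance.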
Question: Is there an easy way for kimpy to tell me which model driver a model is using?
Future direction: I would like to eventually have a Trainer for each model that can be submitted to OpenKIM. For v1 we will have Torch and KIM; the Torch trainer will also have a companion Lightning-based trainer in v1.
For future versions, I think it would make sense to have Trainers that wrap around libraries like QUIP, to provide a complete training surface for newer QUIP-driver-based models, and so on. I would also like to have a Trainer built around the bootstrapping and MCMC sampling that Yonathan contributed, but that will take a little more time, hence perhaps v1.2. For v1 the UC part will remain unchanged.
Let me know your thoughts and questions. Sorry for the huge wall of text, but I think had I delayed it more, it would only have grown!
Example yaml file: