
Hyperparam tuning #54

Open · wants to merge 15 commits into main
Conversation

tpritsky (Collaborator)

Hi guys, I've made some edits to add hyperparameter tuning with Ray Tune. Would love to get your feedback.

Some notes:

  1. There is a risk of running out of memory, since Ray saves large checkpoint files in ray_results; deleting this directory between runs may help.
  2. Please feel free to adjust the ranges for the hyperparameters (line 915, the search space; a sketch follows this list).
  3. I had to comment out TensorBoard logging due to pickling errors with Ray Tune.
  4. I had to add a home_dir path on line 778 to work with my local directory (please feel free to edit accordingly).
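For reference, here is a minimal sketch of what a Ray Tune search space of this kind can look like; the parameter names and ranges below are illustrative assumptions, not the exact contents of line 915:

```python
# Illustrative sketch only -- the PR's actual search space (around line 915 of
# multitask_classifier.py) may use different hyperparameters and ranges.
from ray import tune

search_space = {
    "lr": tune.loguniform(1e-5, 1e-3),        # learning rate, sampled on a log scale
    "batch_size": tune.choice([16, 32, 64]),  # discrete batch-size options
    "hidden_dropout_prob": tune.uniform(0.1, 0.5),
}
```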

@JosselinSomervilleRoberts (Owner) left a comment

Can you paste the error with TensorBoard and also say which line you commented out to solve the issue? Have you checked that this PR does not break our existing build? Maybe add a parameter --ray_tune so that if we don't specify it, the behavior is the same as currently (I'm a bit skeptical regarding the new parameters).

Also, could you give us some instructions on how to install Ray Tune? (It seems like you struggled, so I would love to get your opinion on how to do this :) )

@JosselinSomervilleRoberts linked an issue Mar 16, 2023 that may be closed by this pull request
@tpritsky (Collaborator, Author)

To address your points:

> Can you paste the error with TensorBoard and also say which line you commented out to solve the issue? Have you checked that this PR does not break our existing build? Maybe add a parameter --ray_tune so that if we don't specify it, the behavior is the same as currently (I'm a bit skeptical regarding the new parameters).
Yes, here's the tensorboard error:
```
Serializing 'writer' <torch.utils.tensorboard.writer.SummaryWriter object at 0x7f1c93b6efd0>...
!!! FAIL serialization: cannot pickle 'tensorflow.python.lib.io._pywrap_file_io.WritableFile' object
Serializing '_annotated' FunctionTrainable...
================================================================================
Variable:

    FailTuple(writer [obj=<torch.utils.tensorboard.writer.SummaryWriter object at 0x7f1c93b6efd0>, parent=<function train_multitask at 0x7f1c93b644c0>])

was found to be non-serializable. There may be multiple other undetected variables that were non-serializable.
Consider either removing the instantiation/imports of these variables or moving the instantiation into the scope of the function/class.
```

Once I commented out all references to writer (e.g. writer.add_scalar) in train_multitask and model_eval_multitask (which is called by train_multitask), the code runs fine.
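For context, here is a minimal sketch of an alternative to commenting the calls out: only create and use the writer when not tuning, so the Ray Tune trainable never captures the unpicklable file handle. The tune_hyperparameters key below mirrors the --tune_hyperparameters flag mentioned later in this thread and is an assumption, not the PR's actual code.

```python
# Hypothetical sketch, not the PR's actual diff: guard the SummaryWriter instead
# of commenting out every writer.add_scalar call.
from torch.utils.tensorboard import SummaryWriter

def train_multitask(config):
    writer = None
    if not config.get("tune_hyperparameters", False):  # assumed flag/key name
        writer = SummaryWriter(log_dir=config.get("log_dir", "runs"))

    for step in range(config.get("num_steps", 10)):
        loss = 0.0  # placeholder for the real training step
        if writer is not None:
            writer.add_scalar("train/loss", loss, step)

    if writer is not None:
        writer.close()
```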

> Also, could you give us some instructions on how to install Ray Tune? (It seems like you struggled, so I would love to get your opinion on how to do this :) )

Ray Tune was actually simple enough to install on AWS with pip. Here's the command:
`pip install -U "ray[tune]"` # installs Ray + dependencies for Ray Tune

@marie-huynh (Collaborator) left a comment


Is it possible to refactor the tuning code into another file, rather than keeping it in multitask_classifier's main?

@tpritsky (Collaborator, Author)

tpritsky commented Mar 18, 2023

Hi @marie-huynh, Ray Tune serves as a wrapper function and requires the training function (train_multitask) to be modified to accept a config dictionary passed by Ray and to use the arguments within it. So I don't think I can run Ray Tune without directly editing multitask_classifier.

However, I have not made functional edits to train_multitask: I only changed its input argument to be the config dictionary passed by Ray instead of args, and updated it internally to read its arguments from that dictionary.
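As a rough illustration of that change (key names such as lr and batch_size below are assumptions, not necessarily the PR's exact keys):

```python
# Sketch only: train_multitask now receives the config dict that Ray Tune passes
# to the trainable, instead of the argparse args object.
def train_multitask(config):
    lr = config["lr"]                  # hyperparameters come from the Ray Tune sample
    batch_size = config["batch_size"]
    # ... rest of the training loop unchanged, using lr / batch_size ...
```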

Finally, I've added two new command-line flags (a rough sketch of how they wire into Ray Tune follows the list):

  • --tune_hyperparameters: if this isn't provided, training runs normally
  • --num_tuning_runs: Sets the number of hyperparameter training experiments to run
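
For reference, a rough sketch of how these flags might wire into Ray Tune, assuming the train_multitask trainable and a search_space dict like the one sketched above (tune.run signature as in Ray 2.x; the PR's actual call may differ):

```python
import argparse
from ray import tune

parser = argparse.ArgumentParser()
parser.add_argument("--tune_hyperparameters", action="store_true",
                    help="if omitted, training runs exactly as before")
parser.add_argument("--num_tuning_runs", type=int, default=10,
                    help="number of hyperparameter trials to launch")
args = parser.parse_args()

if args.tune_hyperparameters:
    # launch args.num_tuning_runs trials over the search space
    tune.run(train_multitask, config=search_space, num_samples=args.num_tuning_runs)
else:
    train_multitask(vars(args))   # normal single training run, as before
```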

@tpritsky (Collaborator, Author)

I've updated my script to avoid out-of-memory errors by saving directly to disk rather than to RAM. As long as you update your disk size, it should run OK.
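
A hedged sketch of the kind of setting this can involve (the PR's actual mechanism may differ): pointing Ray Tune's results and checkpoints at a disk path and limiting how many checkpoints each trial keeps.

```python
# Sketch only -- illustrative tune.run arguments, not necessarily the PR's exact fix.
tune.run(
    train_multitask,
    config=search_space,
    num_samples=args.num_tuning_runs,
    local_dir="/home/ubuntu/ray_results",  # assumed disk path for results/checkpoints
    keep_checkpoints_num=1,                # keep only the latest checkpoint per trial
)
```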

Successfully merging this pull request may close these issues.

Implement Hyperparameter tuning