Updated Model Training #42

Open · wants to merge 42 commits into main

Conversation

@wendywangwwt (Contributor) commented Aug 14, 2024

This PR includes two major updates to model training:

  1. Implementation of Attention UNet compatible with the current framework: Attention UNet can now be used as a network architecture (unet_512_attention)
  2. Validation loss and metrics calculation during training: with the flag --with-val, validation losses (the same loss types as in training) and metrics (cell count metrics computed through the postprocess function) are calculated as training progresses; corresponding support in the visdom visualizer is also implemented
    • At the moment, cell count metrics are only calculated for DeepLIIF models.
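Conceptually, the per-epoch validation pass in item 2 mirrors the training losses on held-out data and optionally runs a postprocess step for metrics. A rough sketch follows; all names here are illustrative stand-ins, not the actual DeepLIIF API:

```python
def validation_pass(model, val_batches, loss_fns, postprocess=None):
    """Compute training-style losses (and, optionally, metrics) on held-out data.

    Illustrative sketch only: the real implementation lives in the DeepLIIF
    training loop and reports its results to the visdom visualizer.
    """
    totals = {name: 0.0 for name in loss_fns}
    n_batches = 0
    for inputs, targets in val_batches:
        outputs = model(inputs)
        for name, fn in loss_fns.items():
            totals[name] += fn(outputs, targets)
        n_batches += 1
    losses = {name: total / max(n_batches, 1) for name, total in totals.items()}
    metrics = postprocess(val_batches) if postprocess else {}
    return losses, metrics

# Toy usage with stand-in components:
toy_model = lambda x: 2 * x
batches = [(1.0, 2.0), (3.0, 6.0)]
losses, metrics = validation_pass(toy_model, batches, {'L1': lambda o, t: abs(o - t)})
print(losses['L1'])  # 0.0
```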

Others

  • (cli.py) Allowed specifying the generator architecture for each individual generator, in order (accepts a comma-separated configuration)
  • (cli.py) Debug mode is now available for model training: enable it by passing --debug to python cli.py train. The approximate number of steps/images to run per epoch in debug mode can be changed with --debug-data-size (default: 10). This helps to quickly check whether training runs as expected.
  • Allowed returning the generated segmentation output from each individual modality (accessible from infer_modalities())
  • Added test cases for training (--optimizer, --net-g, --net-gs, --with-val) and trainlaunch (GPU test cases only)
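One plausible way to interpret the comma-separated per-generator architecture spec is sketched below; this is a guess at the behavior, not the actual parsing code in cli.py:

```python
def parse_arch_spec(spec, n_generators):
    """Split a comma-separated architecture spec into one arch per generator.

    Sketch under assumed semantics: a single value applies to all generators;
    otherwise the number of values must match the number of generators.
    """
    parts = [p.strip() for p in spec.split(',')]
    if len(parts) == 1:
        return parts * n_generators
    if len(parts) != n_generators:
        raise ValueError(f'expected 1 or {n_generators} archs, got {len(parts)}')
    return parts

print(parse_arch_spec('unet_512_attention', 4))
print(parse_arch_spec('unet_512,unet_512_attention,unet_512,unet_512', 4))
```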

Notes:

  • Files needed for with-val mode:
    i) val images, in the same format as the training images
    ii) ground-truth cell count metrics in JSON; this can be generated by running get_cell_count_metrics():
from deepliif.stat import get_cell_count_metrics

dir_img = 'Datasets/Sample_Dataset/val'
get_cell_count_metrics(dir_img, model='DeepLIIF', tile_size=512)

The code generates a metrics.json file for the validation data in the same directory as the images.
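Since the output is plain JSON, the generated file can be sanity-checked with the standard json module. Note that this PR does not document the metrics.json schema, so the key structure below is entirely hypothetical:

```python
import json
import os
import tempfile

# Stand-in for the val directory; in practice this would be the dir_img passed
# to get_cell_count_metrics(). The schema written here is made up purely for
# illustration -- the real file's keys are produced by DeepLIIF.
dir_img = tempfile.mkdtemp()
with open(os.path.join(dir_img, 'metrics.json'), 'w') as f:
    json.dump({'image_001': {'num_total': 42}}, f)

# Loading the file back for inspection:
with open(os.path.join(dir_img, 'metrics.json')) as f:
    metrics = json.load(f)
print(sorted(metrics))  # ['image_001']
```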

  • To run multiple tests in parallel (e.g., run latest/ext/sdg at the same time), make sure to use a different tmp directory in --basetemp for each run, so that the pytest processes do not delete or modify a temp folder created or used by another process. For example:
pytest -v -s --basetemp=../tmp/latest --model_type latest 2>&1 | tee ../log/pytest_latest_20240808.log
pytest -v -s --basetemp=../tmp/ext --model_type ext 2>&1 | tee ../log/pytest_ext_20240808.log
pytest -v -s --basetemp=../tmp/sdg --model_type sdg 2>&1 | tee ../log/pytest_sdg_20240808.log

@wendywangwwt (Contributor Author) commented:

Test environment:

  • py 3.9
  • pytorch 2.4

All tests passed. I ran the ext tests twice and did not see a GPU OOM failure. Test logs are in the OneDrive folder DeepLIIF PR#42 attachments.
