issue resuming unfinished model training cont'd

Hello, 

I am following up on [issue 1042](https://github.com/google/deepvariant/issues/1042). I apologize for the very delayed reply; I was waiting for server time and could not test the suggested action until now. 

As suggested, I changed my script to point to what I thought was the latest checkpoint, and the output file of the resubmission does say "restarted from checkpoint." However, the first run was about 90% complete before it timed out, and once the resubmission began, it appeared to start again from zero, with an ETA comparable to the first submission. Am I pointing to the wrong checkpoint? Is there additional information I need to include? 

Thank you! 

Here is the resubmission script: 

> apptainer exec --bind /90daydata,$TMPDIR --nv /90daydata/pbarc/haley.arnold/AI_ModelTraining/Samples/deepvariant_1.6.0.sif train \
> --config="/90daydata/pbarc/haley.arnold/AI_ModelTraining/Samples/dv_config.py:base" \
> --config.train_dataset_pbtxt="shuffledFM2s_chr5.with_label_channelsize.shuffled.pbtxt" \
> --config.tune_dataset_pbtxt="shuffledFM1s_chr5.with_label_channelsize.shuffled.pbtxt" \
> --config.init_checkpoint=/90daydata/pbarc/haley.arnold/AI_ModelTraining/Illumina_Tephritids/zcuc10x_modeltraining/model_chr5trainandval_newertruth_out/checkpoints
> /ckpt-200000 \
> --config.num_epochs=10 \
> --config.learning_rate=0.02 \
> --config.num_validation_examples=0 \
> --experiment_dir="model_chr5trainandval_newertruth_out/" \
> --strategy=mirrored \
> --config.batch_size=32

Here is the "checkpoint" file in its current state, with the resubmission still running. At the time of resubmission, the model_checkpoint_path was "ckpt-200000", which I understood to be the best/most recent checkpoint. 

> model_checkpoint_path: "ckpt-110032"
> all_model_checkpoint_paths: "ckpt-27508"
> all_model_checkpoint_paths: "ckpt-55016"
> all_model_checkpoint_paths: "ckpt-200000"
> all_model_checkpoint_paths: "ckpt-100000"
> all_model_checkpoint_paths: "ckpt-110032"
> all_model_checkpoint_timestamps: 1768907198.731081
> all_model_checkpoint_timestamps: 1768963401.6893659
> all_model_checkpoint_timestamps: 1769269562.6065133
> all_model_checkpoint_timestamps: 1770005859.406302
> all_model_checkpoint_timestamps: 1770038952.1190722
> last_preserved_timestamp: 1768836072.2734745

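To make the ordering in that file easier to read, here is a minimal Python sketch that pairs each checkpoint path with its timestamp and lists them in the order they were saved. It assumes the `all_model_checkpoint_paths` and `all_model_checkpoint_timestamps` entries are parallel lists (same order), which is my understanding of the TensorFlow checkpoint state file; the helper name `pair_checkpoints` is purely illustrative and not part of DeepVariant.

```python
import re
from datetime import datetime, timezone

# Contents of the "checkpoint" file, copied from above.
CHECKPOINT_TEXT = """\
model_checkpoint_path: "ckpt-110032"
all_model_checkpoint_paths: "ckpt-27508"
all_model_checkpoint_paths: "ckpt-55016"
all_model_checkpoint_paths: "ckpt-200000"
all_model_checkpoint_paths: "ckpt-100000"
all_model_checkpoint_paths: "ckpt-110032"
all_model_checkpoint_timestamps: 1768907198.731081
all_model_checkpoint_timestamps: 1768963401.6893659
all_model_checkpoint_timestamps: 1769269562.6065133
all_model_checkpoint_timestamps: 1770005859.406302
all_model_checkpoint_timestamps: 1770038952.1190722
last_preserved_timestamp: 1768836072.2734745
"""


def pair_checkpoints(text):
    """Return (timestamp, path) tuples sorted by save time.

    Assumption: the i-th path corresponds to the i-th timestamp.
    """
    paths = re.findall(
        r'^all_model_checkpoint_paths: "([^"]+)"', text, flags=re.M
    )
    stamps = [
        float(s)
        for s in re.findall(
            r"^all_model_checkpoint_timestamps: ([0-9.]+)", text, flags=re.M
        )
    ]
    return sorted(zip(stamps, paths))


# Print each checkpoint with a human-readable save time (UTC).
for stamp, path in pair_checkpoints(CHECKPOINT_TEXT):
    saved = datetime.fromtimestamp(stamp, tz=timezone.utc)
    print(f"{saved:%Y-%m-%d %H:%M} UTC  {path}")
```

If that pairing assumption holds, ckpt-200000 was written before ckpt-100000 and ckpt-110032, which would be consistent with the step counter having restarted from zero in the resubmission rather than continuing from step 200000.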

Here is the log file from the first submission: 
[deepvariant_modeltrain-19001099-atlas-0245.err.txt](https://github.com/user-attachments/files/25057697/deepvariant_modeltrain-19001099-atlas-0245.err.txt)

And here is the current log file from the ongoing resubmission: 
[deepvariant_modeltrain-19062375-atlas-0244.err.txt](https://github.com/user-attachments/files/25057708/deepvariant_modeltrain-19062375-atlas-0244.err.txt)

