issue resuming unfinished model training cont'd #1051

@helizabeth1103

Hello,

I am following up on issue 1042. I apologize for the very delayed reply; I have been waiting for server time and was not able to test the suggested action until now.

As suggested, I changed my script to point to what I thought was the latest checkpoint, and the output file of the resubmission does say "restarted from checkpoint." However, the first run was about 90% complete before it timed out, and once the resubmission began, it appeared to start again from zero, with an ETA comparable to the first submission. Am I pointing to the wrong checkpoint? Is there additional information I need to include?

Thank you!

Here is the resubmission script:

apptainer exec --bind /90daydata,$TMPDIR --nv /90daydata/pbarc/haley.arnold/AI_ModelTraining/Samples/deepvariant_1.6.0.sif train \
--config="/90daydata/pbarc/haley.arnold/AI_ModelTraining/Samples/dv_config.py:base" \
--config.train_dataset_pbtxt="shuffledFM2s_chr5.with_label_channelsize.shuffled.pbtxt" \
--config.tune_dataset_pbtxt="shuffledFM1s_chr5.with_label_channelsize.shuffled.pbtxt" \
--config.init_checkpoint=/90daydata/pbarc/haley.arnold/AI_ModelTraining/Illumina_Tephritids/zcuc10x_modeltraining/model_chr5trainandval_newertruth_out/checkpoints/ckpt-200000 \
--config.num_epochs=10 \
--config.learning_rate=0.02 \
--config.num_validation_examples=0 \
--experiment_dir="model_chr5trainandval_newertruth_out/" \
--strategy=mirrored \
--config.batch_size=32

Here is the "checkpoint" file in its current state, with the resubmission still running. At the time of resubmission, the model_checkpoint_path was "ckpt-200000", which I understood to be the best/most recent checkpoint.

model_checkpoint_path: "ckpt-110032"
all_model_checkpoint_paths: "ckpt-27508"
all_model_checkpoint_paths: "ckpt-55016"
all_model_checkpoint_paths: "ckpt-200000"
all_model_checkpoint_paths: "ckpt-100000"
all_model_checkpoint_paths: "ckpt-110032"
all_model_checkpoint_timestamps: 1768907198.731081
all_model_checkpoint_timestamps: 1768963401.6893659
all_model_checkpoint_timestamps: 1769269562.6065133
all_model_checkpoint_timestamps: 1770005859.406302
all_model_checkpoint_timestamps: 1770038952.1190722
last_preserved_timestamp: 1768836072.2734745
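
To make the ordering in this file easier to read, here is a minimal sketch that pairs each checkpoint path with its timestamp and compares "newest by write time" against "highest step number". It assumes the path and timestamp entries line up by position, which is how the file appears to be laid out; the function names and the `CHECKPOINT_STATE` constant are hypothetical helpers, not part of DeepVariant or TensorFlow.

```python
import re

# Contents of the "checkpoint" state file as posted above.
CHECKPOINT_STATE = '''\
model_checkpoint_path: "ckpt-110032"
all_model_checkpoint_paths: "ckpt-27508"
all_model_checkpoint_paths: "ckpt-55016"
all_model_checkpoint_paths: "ckpt-200000"
all_model_checkpoint_paths: "ckpt-100000"
all_model_checkpoint_paths: "ckpt-110032"
all_model_checkpoint_timestamps: 1768907198.731081
all_model_checkpoint_timestamps: 1768963401.6893659
all_model_checkpoint_timestamps: 1769269562.6065133
all_model_checkpoint_timestamps: 1770005859.406302
all_model_checkpoint_timestamps: 1770038952.1190722
last_preserved_timestamp: 1768836072.2734745
'''


def parse_checkpoint_state(text):
    """Return (checkpoint_name, timestamp) pairs in file order.

    Assumption: the i-th path entry corresponds to the i-th timestamp entry.
    """
    paths = re.findall(
        r'^all_model_checkpoint_paths:\s*"([^"]+)"', text, flags=re.MULTILINE
    )
    stamps = re.findall(
        r'^all_model_checkpoint_timestamps:\s*([0-9.]+)', text, flags=re.MULTILINE
    )
    if len(paths) != len(stamps):
        raise ValueError("path and timestamp counts differ")
    return [(p, float(s)) for p, s in zip(paths, stamps)]


def step_of(name):
    """Extract the step number from a name such as 'ckpt-200000'."""
    return int(name.rsplit("-", 1)[1])


def newest_by_time(pairs):
    """Checkpoint with the latest write timestamp."""
    return max(pairs, key=lambda pair: pair[1])


def highest_step(pairs):
    """Checkpoint with the largest step number in its name."""
    return max(pairs, key=lambda pair: step_of(pair[0]))


pairs = parse_checkpoint_state(CHECKPOINT_STATE)
for name, stamp in sorted(pairs, key=lambda pair: pair[1]):
    print(name, stamp)
print("newest by time:", newest_by_time(pairs)[0])
print("highest step:", highest_step(pairs)[0])
```

Under that pairing assumption, ckpt-200000 carries an earlier timestamp than ckpt-100000 and ckpt-110032, so the two lower-numbered checkpoints would have been written after it. That would be consistent with the step counter restarting in the resubmission rather than continuing from 200000, though the log files are needed to confirm it.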

Here is the log file from the first submission:
deepvariant_modeltrain-19001099-atlas-0245.err.txt

And here is the current log file from the ongoing resubmission:
deepvariant_modeltrain-19062375-atlas-0244.err.txt
