Hello,
I am following up on issue 1042. I apologize for the very delayed reply; I have been waiting for server time and did not get to test the suggested action until now.
As suggested, I changed my script to point to what I thought was the latest checkpoint, and the output file of the resubmission does say "restarted from checkpoint." However, the first run was about 90% complete before it timed out, and once the resubmission began, it appeared to start again from zero, with an ETA comparable to the first submission. Am I pointing to the wrong checkpoint? Is there additional information I need to include?
Thank you!
Here is the resubmission script:
apptainer exec --bind /90daydata,$TMPDIR --nv /90daydata/pbarc/haley.arnold/AI_ModelTraining/Samples/deepvariant_1.6.0.sif train \
  --config="/90daydata/pbarc/haley.arnold/AI_ModelTraining/Samples/dv_config.py:base" \
  --config.train_dataset_pbtxt="shuffledFM2s_chr5.with_label_channelsize.shuffled.pbtxt" \
  --config.tune_dataset_pbtxt="shuffledFM1s_chr5.with_label_channelsize.shuffled.pbtxt" \
  --config.init_checkpoint=/90daydata/pbarc/haley.arnold/AI_ModelTraining/Illumina_Tephritids/zcuc10x_modeltraining/model_chr5trainandval_newertruth_out/checkpoints/ckpt-200000 \
  --config.num_epochs=10 \
  --config.learning_rate=0.02 \
  --config.num_validation_examples=0 \
  --experiment_dir="model_chr5trainandval_newertruth_out/" \
  --strategy=mirrored \
  --config.batch_size=32
Here is the "checkpoint" file in its current state, with the resubmission still running. At the time of resubmission, the model_checkpoint_path was "ckpt-200000", which I understood to be the best/most recent checkpoint.
model_checkpoint_path: "ckpt-110032"
all_model_checkpoint_paths: "ckpt-27508"
all_model_checkpoint_paths: "ckpt-55016"
all_model_checkpoint_paths: "ckpt-200000"
all_model_checkpoint_paths: "ckpt-100000"
all_model_checkpoint_paths: "ckpt-110032"
all_model_checkpoint_timestamps: 1768907198.731081
all_model_checkpoint_timestamps: 1768963401.6893659
all_model_checkpoint_timestamps: 1769269562.6065133
all_model_checkpoint_timestamps: 1770005859.406302
all_model_checkpoint_timestamps: 1770038952.1190722
last_preserved_timestamp: 1768836072.2734745
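In case it helps with reading the file above, here is a small sketch (my own helper, not part of DeepVariant) that pairs each checkpoint name with its save time. It assumes the i-th all_model_checkpoint_timestamps entry belongs to the i-th all_model_checkpoint_paths entry; the function name parse_checkpoint_state is hypothetical.

```python
import re
from datetime import datetime, timezone

# Contents of the "checkpoint" file, copied from above.
CHECKPOINT_STATE = '''\
model_checkpoint_path: "ckpt-110032"
all_model_checkpoint_paths: "ckpt-27508"
all_model_checkpoint_paths: "ckpt-55016"
all_model_checkpoint_paths: "ckpt-200000"
all_model_checkpoint_paths: "ckpt-100000"
all_model_checkpoint_paths: "ckpt-110032"
all_model_checkpoint_timestamps: 1768907198.731081
all_model_checkpoint_timestamps: 1768963401.6893659
all_model_checkpoint_timestamps: 1769269562.6065133
all_model_checkpoint_timestamps: 1770005859.406302
all_model_checkpoint_timestamps: 1770038952.1190722
last_preserved_timestamp: 1768836072.2734745
'''


def parse_checkpoint_state(text):
    """Return (checkpoint_name, unix_timestamp) pairs in file order.

    Assumption: paths and timestamps are listed in the same order,
    so they can be paired by position.
    """
    paths = re.findall(
        r'^all_model_checkpoint_paths: "([^"]+)"', text, flags=re.M
    )
    stamps = [
        float(s)
        for s in re.findall(
            r'^all_model_checkpoint_timestamps: ([0-9.]+)', text, flags=re.M
        )
    ]
    return list(zip(paths, stamps))


# Print each checkpoint with its save time in UTC.
for name, ts in parse_checkpoint_state(CHECKPOINT_STATE):
    saved = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(f"{name:>12}  {saved:%Y-%m-%d %H:%M} UTC")
```

Read this way, ckpt-100000 and ckpt-110032 carry later timestamps than ckpt-200000, which seems consistent with the step counter having restarted from zero in the resubmission rather than continuing from step 200000.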
Here is the log file from the first submission:
deepvariant_modeltrain-19001099-atlas-0245.err.txt
And here is the current log file from the ongoing resubmission:
deepvariant_modeltrain-19062375-atlas-0244.err.txt