resume_from_checkpoint: learning rate is not resumed #288
Replies: 3 comments
-
Could you paste your script here so we can take a look?
-
Please read the wiki (https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/pt_scripts_zh) carefully before running the script.

```shell
lr=2e-4
pretrained_model=/remote-home/lvliuzh/llama/models_hf/7B
ckpt=/remote-home/share/uniref50/ckpts_CN/checkpoint-1000/
export NCCL_P2P_LEVEL=NVL  # the original post wrote "NCC_P2P_LEVEL", which NCCL would silently ignore
```
-
I found that adding the following to the DeepSpeed config file solves the problem: you have to set `last_batch_iteration` in the scheduler params. But I've looked at many code repositories and none of them set this option, and I don't know why. If I don't set it, the learning rate does not continue from where it left off.
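For reference, a minimal sketch of what such a DeepSpeed scheduler section might look like when resuming from step 1000 (the scheduler type and all values here are illustrative assumptions, not taken from the original poster's config):

```json
{
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 100000,
      "last_batch_iteration": 1000
    }
  }
}
```

`last_batch_iteration` tells the DeepSpeed scheduler which step it was on, so the resumed run computes the learning rate from step 1000 onward instead of restarting the schedule at step 0.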
-
At step 1000 I saved the checkpoint-1000 checkpoint, then resumed training with resume_from_checkpoint.
Since an lr_scheduler is in use, the learning rate should continue evolving from its value at step 1000.
In practice, however, the learning rate starts over from the beginning (i.e., from step 0).
Has anyone else run into this?
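The symptom described above can be reproduced with plain PyTorch: a freshly constructed scheduler always starts at step 0 unless its saved state is restored with `load_state_dict`, which is what a correct resume should do. A minimal sketch (the model, optimizer, and schedule shape are illustrative assumptions, not the repo's actual training setup):

```python
import torch

# Toy model and optimizer; a linear decay over 10000 steps stands in for
# the real pretraining schedule.
model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=2e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda step: 1.0 - step / 10000)

# Train for 1000 steps, then capture the state a checkpoint should contain.
for _ in range(1000):
    opt.step()
    sched.step()
state = sched.state_dict()
lr_at_1000 = sched.get_last_lr()[0]

# "Resume" with a fresh scheduler: without restoring state it is back at step 0,
# i.e. the learning rate restarts from the beginning -- the reported bug.
opt2 = torch.optim.SGD(model.parameters(), lr=2e-4)
sched2 = torch.optim.lr_scheduler.LambdaLR(opt2, lambda step: 1.0 - step / 10000)
assert sched2.get_last_lr()[0] == 2e-4

# Restoring the saved scheduler state continues from step 1000 as expected.
sched2.load_state_dict(state)
assert abs(sched2.get_last_lr()[0] - lr_at_1000) < 1e-12
```

So when the learning rate restarts from zero after `resume_from_checkpoint`, it suggests the scheduler state in the checkpoint is not being loaded (or, with DeepSpeed, that the scheduler is reconstructed from the config without its step counter).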