pt_scripts_en

ymcui edited this page Jan 28, 2024 · 3 revisions

Pre-training Scripts (to be updated)

⚠️Important Reminder⚠️

  • This code only works with a specific PEFT version. Install PEFT from source at commit id 13e53fc. We cannot guarantee that the model will train normally with other versions of PEFT.

  • Make sure to pull the latest version of the repository before running: git pull
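One way to pin PEFT to that commit is sketched below. The repository URL (the upstream huggingface/peft repository on GitHub) and the local directory name peft_src are assumptions, not taken from this page; only the commit id comes from the reminder above.

```shell
# Commit id from the reminder above
peft_commit=13e53fc
# Assumption: the upstream PEFT repository on GitHub
peft_repo=https://github.com/huggingface/peft.git

install_pinned_peft() {
  # Clone the repository, check out the pinned commit,
  # then install from the local working tree.
  git clone "$peft_repo" peft_src &&
    git -C peft_src checkout "$peft_commit" &&
    pip install ./peft_src
}
```

Run install_pinned_peft inside the Python environment used for training.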

Training Steps

Training script: scripts/training/run_clm_pt_with_peft.py

Go to the scripts/training directory of the project and run bash run_pt.sh to start pre-training. A single GPU is used by default. Before running, modify the script and set the relevant parameters; the values shown in the script are for debugging reference only. The content of run_pt.sh is as follows:

########Parameter settings########
lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=path/to/hf/llama-2/dir
chinese_tokenizer_path=path/to/chinese/llama-2/tokenizer/dir
dataset_dir=path/to/pt/data/dir
data_cache=temp_data_cache_dir
per_device_train_batch_size=1
training_steps=100
gradient_accumulation_steps=1
output_dir=output_dir
block_size=512

deepspeed_config_file=ds_zero2_no_offload.json

########Launch command########
torchrun --nnodes 1 --nproc_per_node 1 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --do_train \
    --seed $RANDOM \
    --fp16 \
    --max_steps ${training_steps} \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 500 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size ${block_size} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --save_safetensors False \
    --load_in_kbits 16 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False

Some of the parameters are explained below:

  • --dataset_dir: Directory of pre-training data, which can contain multiple plain text files ending with txt
  • --data_cache_dir: Specify a directory for storing data cache files
  • --use_flash_attention_2: Enables FlashAttention-2 for training
  • --load_in_kbits: Accepts 16, 8, or 4, meaning fp16 training or training with 8-bit or 4-bit quantization. The default is fp16 training.

The other training-related hyperparameters listed, especially the learning rate and the parameters that determine the total batch size, are for reference only. Configure them according to your data and hardware.
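As a rough guide to how the batch-size parameters combine, the sketch below computes the total batch size and the number of tokens per optimizer step from the script's values. The variable num_gpus is introduced here for illustration (it stands for --nnodes times --nproc_per_node), and the token count assumes every training sample is a full block of block_size tokens.

```shell
# Values copied from run_pt.sh
per_device_train_batch_size=1
gradient_accumulation_steps=1
block_size=512
training_steps=100
# Hypothetical helper variable: --nnodes * --nproc_per_node
num_gpus=1

# Sequences consumed per optimizer step across all GPUs
total_batch_size=$((per_device_train_batch_size * gradient_accumulation_steps * num_gpus))
# Tokens per optimizer step, assuming each sample is a full block
tokens_per_step=$((total_batch_size * block_size))
# Tokens seen over the whole run
total_tokens=$((tokens_per_step * training_steps))

echo "total batch size: ${total_batch_size}"
echo "tokens per step:  ${tokens_per_step}"
echo "tokens in run:    ${total_tokens}"
```

With the debugging values from the script, this is a total batch size of 1, 512 tokens per step, and 51200 tokens over 100 steps, which is far too small for real pre-training.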

Supported Training Modes

[Check carefully] The table below lists the training modes supported by the script. Set model_name_or_path according to your case. In this project, the LLaMA-2 and Alpaca-2 models use the same tokenizer, so no distinction is made between them. Modes not listed in the table are not supported; if you modify the script to use them, you will need to debug it yourself.

| Purpose | model_name_or_path | tokenizer_name_or_path | Final model vocabulary size |
| --- | --- | --- | --- |
| Train Chinese LLaMA-2 LoRA based on original LLaMA-2 | Original HF format LLaMA-2 | Chinese LLaMA-2's tokenizer (55296) | 55296 |
| Continue pre-training on new LoRA based on Chinese LLaMA-2 | HF format complete Chinese LLaMA-2 | Chinese LLaMA-2's tokenizer (55296) | 55296 |
| Continue pre-training on new LoRA based on Chinese Alpaca-2 | HF format complete Chinese Alpaca-2 | Chinese LLaMA-2's tokenizer (55296) | 55296 |

Tips to Save Memory

  • If GPU memory is tight, you can remove the line --modules_to_save ${modules_to_save} \ from the script. This skips training embed_tokens and lm_head (both have a large number of parameters) and trains only the LoRA parameters.
    • This is only possible when training on top of Chinese LLaMA-2 or Chinese Alpaca-2.
  • Reducing block_size can also reduce memory usage during training, such as setting block_size to 256.
  • Enabling gradient_checkpointing can effectively reduce VRAM usage, but it may slow down the training speed.
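To give a sense of how large embed_tokens and lm_head are, the sketch below counts their parameters. The vocabulary size of 55296 comes from the table of supported modes; the hidden size of 4096 is an assumption that matches a 7B LLaMA-2 model (a 13B model uses 5120). The figure covers fp16 weights only; gradients and optimizer states for these trainable modules take additional memory.

```shell
vocab_size=55296     # from the table of supported modes
hidden_size=4096     # assumption: 7B model
bytes_per_param=2    # fp16

# embed_tokens and lm_head are each a vocab_size x hidden_size matrix
params=$((2 * vocab_size * hidden_size))
weight_mib=$((params * bytes_per_param / 1024 / 1024))

echo "embed_tokens + lm_head parameters: ${params}"
echo "fp16 weight size: ${weight_mib} MiB"
```

Under these assumptions the two modules hold 452984832 parameters, or 864 MiB of fp16 weights.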

Multi-node Multi-GPU Training

Please refer to the following launch method:

torchrun \
  --nnodes ${num_nodes} \
  --nproc_per_node ${num_gpu_per_node} \
  --node_rank ${node_rank} \
  --master_addr ${master_addr} \
  --master_port ${master_port} \
  run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    ...
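The launch method above has to be run once on every node, with a different node_rank on each. The sketch below prints the command for each node of a hypothetical cluster; the node count, GPU count, address, and port are placeholder values chosen for illustration, and the remaining script arguments stay elided as in the original.

```shell
# Hypothetical cluster layout: replace with your own values
num_nodes=2
num_gpu_per_node=8
master_addr=10.0.0.1
master_port=29500

# Total number of training processes across the cluster
world_size=$((num_nodes * num_gpu_per_node))
echo "world size: ${world_size}"

# Print the command to run on each node; only node_rank differs
for node_rank in $(seq 0 $((num_nodes - 1))); do
  echo "node ${node_rank}: torchrun --nnodes ${num_nodes} --nproc_per_node ${num_gpu_per_node}" \
    "--node_rank ${node_rank} --master_addr ${master_addr} --master_port ${master_port}" \
    "run_clm_pt_with_peft.py ..."
done
```

The node with node_rank 0 should be the machine reachable at master_addr.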

Post-training File Organization

After training, the LoRA weights and configuration are stored in ${output_dir}/pt_lora_model and can be used in the subsequent merging step.
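Before moving on to merging, it can help to confirm that the directory was actually written. The helper below is a minimal sketch; check_lora_output is a hypothetical name, and it only checks that the directory exists and is not empty, without assuming specific file names.

```shell
check_lora_output() {
  # $1 is the value passed to --output_dir
  local lora_dir="$1/pt_lora_model"
  if [ -d "$lora_dir" ] && [ -n "$(ls -A "$lora_dir")" ]; then
    echo "found: $lora_dir"
  else
    echo "missing or empty: $lora_dir" >&2
    return 1
  fi
}
```

For example, check_lora_output output_dir returns a non-zero status if training did not save the LoRA model.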