Open
Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
When training Eagle3 for Qwen2.5-VL, --build-dataset-num-proc in the training script has to be set to 0. Otherwise a deadlock is triggered during dataset building and the map step hangs indefinitely.
Reproduction
torchrun \
--standalone \
--nproc_per_node $NUM_GPUS \
$ROOT_DIR/scripts/train_eagle3.py \
--target-model-path Qwen/Qwen2.5-VL-7B-Instruct \
--target-model-backend hf \
--draft-model-config $ROOT_DIR/configs/qwen2-5-vl-eagle3.json \
--build-dataset-num-proc 8 \
--train-data-path $ROOT_DIR/cache/dataset/allava4v_train.jsonl \
--output-dir $ROOT_DIR/outputs/Qwen2.5-VL-7B-eagle3 \
--num-epochs 10 \
--batch-size 1 \
--learning-rate 1e-4 \
--max-length 8192 \
--dist-timeout 360 \
--chat-template qwen2-vl \
--cache-dir $ROOT_DIR/cache \
--embedding-key model.embed_tokens.weight \
--tp-size 1 \
--is-vlm \
--min-pixels 50176 \
--max-pixels 802816
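To narrow down where the map step hangs, one option is to have the process dump its Python stacks on demand. The following is a minimal sketch using only the standard-library `faulthandler` module. The helper name `install_stack_dump_handler` is hypothetical and not part of SpecForge, and calling it near the top of the training script is an assumption about where it would be useful. It relies on POSIX signals, so it does not apply on Windows.

```python
import faulthandler
import os
import signal
import sys


def install_stack_dump_handler(file=sys.stderr, signum=signal.SIGUSR1):
    """Hypothetical helper: dump the Python stacks of all threads in this
    process to `file` whenever `signum` is delivered. Returns the PID."""
    # faulthandler.register installs a C-level signal handler, so the dump
    # can be written even while the interpreter is blocked on a lock.
    faulthandler.register(signum, file=file, all_threads=True, chain=False)
    return os.getpid()


if __name__ == "__main__":
    pid = install_stack_dump_handler()
    print(f"run `kill -USR1 {pid}` to dump stacks", file=sys.stderr)
```

With this in place, sending `kill -USR1 <pid>` to a stuck process should print a traceback for each of its threads. The dump covers only the process that receives the signal, so a hung worker would need to be signalled by its own PID. Whether the workers inherit the handler depends on how they are started, which is not confirmed here.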
Environment
sglang 0.5.3