forked from Dao-AILab/flash-attention
Problem Description
I'm testing an LLM training script on 8 MI300X GPUs, but training fails with strange loss spikes followed by NaN issues. This is likely a hardware-level or lower-level code issue, since the problem reappears at the same point after every restart.
As mentioned in the title, I went back to check the Docker image configuration and found that many tests fail. The problem can be reproduced with the following Dockerfile and command:
FROM docker.io/rocm/pytorch:rocm6.4_ubuntu24.04_py3.12_pytorch_release_2.5.1
WORKDIR /workspace
RUN mkdir /scratch0
# limit the number of CPUs in the container, otherwise libgomp error.
ENV OMP_NUM_THREADS=4
ENV TORCH_NUM_THREADS=4
RUN apt-get -y update
# other Python packages
RUN pip install pytorch-lightning tqdm numpy biopython pandas matplotlib einops ninja packaging numba scipy
RUN git clone https://github.com/ROCm/flash-attention.git &&\
cd flash-attention &&\
GPU_ARCHS=gfx942 python setup.py install
# set working dir
WORKDIR /workspace/flash-attention
Then, inside the running container:
pytest tests/test_flash_attn_ck.py
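When the test suite fails like this, it can help to cross-check the GPU kernel against a plain CPU reference on small inputs. Below is a minimal sketch of a reference softmax attention in NumPy (a hypothetical debugging helper, not part of the flash-attention repo) that can be compared element-wise against flash-attention output and checked for NaNs:

```python
import numpy as np

def reference_attention(q, k, v):
    """Plain softmax attention in float32, usable as a ground truth
    to compare against flash-attention outputs on small shapes."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.transpose(0, 2, 1)) * scale      # (batch, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (batch, seq, head_dim)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 16, 64), dtype=np.float32) for _ in range(3))
out = reference_attention(q, k, v)
assert not np.isnan(out).any(), "reference attention produced NaNs"
```

A result from the failing kernel that diverges from this reference (or contains NaNs where the reference does not) would point at the kernel rather than the training script.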
Operating System
Ubuntu
CPU
AMD EPYC 9654 96-Core Processor
GPU
MI300X
ROCm Version
6.4.0
ROCm Component
No response
Steps to Reproduce
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response