[Issue]: FA2 test failed when build CK/Triton backend on top of a rocm official docker image. #143

@DerienFe

Description

Problem Description

I'm running an LLM training script on 8 MI300X GPUs, but training fails with sudden loss spikes followed by NaNs. This looks like a hardware-level or lower-level software issue, since the failure reappears at exactly the same point no matter how many times the run is restarted.
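As a side note, when a run degrades deterministically like this, a guard that aborts on the first non-finite loss makes the failing step easy to pin down before digging into the backend. A minimal stdlib-only sketch (the `check_finite` helper and the simulated loss values are hypothetical, not from the actual training script):

```python
import math

def check_finite(step, loss):
    """Raise as soon as the loss becomes NaN/Inf, pinning down the failing step."""
    if not math.isfinite(loss):
        raise RuntimeError(f"non-finite loss at step {step}: {loss}")

# Simulated loss curve: the run degrades at the same step on every restart.
losses = [2.1, 1.8, 1.7, float("nan"), 1.5]

failed_at = None
for step, loss in enumerate(losses):
    try:
        check_finite(step, loss)
    except RuntimeError:
        failed_at = step  # first step with a non-finite loss
        break
```

Logging `failed_at` across restarts confirms whether the failure is deterministic, which is what points toward a hardware or kernel-level problem rather than ordinary training instability.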

As mentioned in the title, I went back to verify the Docker image configuration and found that many tests fail. The problem can be reproduced with the following Dockerfile:

FROM docker.io/rocm/pytorch:rocm6.4_ubuntu24.04_py3.12_pytorch_release_2.5.1

WORKDIR /workspace
RUN mkdir /scratch0


# Limit the number of CPU threads in the container; otherwise libgomp errors occur.
ENV OMP_NUM_THREADS=4
ENV TORCH_NUM_THREADS=4

RUN apt-get -y update
# Other Python packages.
RUN pip install pytorch-lightning tqdm numpy biopython pandas matplotlib einops ninja packaging numba scipy


RUN git clone https://github.com/ROCm/flash-attention.git &&\
    cd flash-attention &&\
    GPU_ARCHS=gfx942 python setup.py install


# set working dir
WORKDIR /workspace/flash-attention

Then, inside a container built from this image, run the failing test suite:

pytest tests/test_flash_attn_ck.py
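For reference, the image can be built and the tests invoked roughly as follows. The `fa2-repro` tag is an assumption (not from the original report); the `--device`/`--group-add` flags are the standard ones for exposing AMD GPUs to a ROCm container:

```shell
# Build the image from the Dockerfile above (tag name is hypothetical).
docker build -t fa2-repro .

# Run the CK flash-attention tests inside the container.
# /dev/kfd and /dev/dri expose the GPUs on ROCm hosts.
docker run --rm -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  fa2-repro \
  pytest tests/test_flash_attn_ck.py
```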

Operating System

Ubuntu

CPU

AMD EPYC 9654 96-Core Processor

GPU

MI300X

ROCm Version

6.4.0

ROCm Component

No response

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
