forked from Dao-AILab/flash-attention
Problem Description
I'm testing an LLM training script on 8 MI300X GPUs, but training fails with strange loss spikes followed by NaN issues. This is likely a hardware-level or lower-level code issue, since the problem reappears at the same point after every restart.
As mentioned in the title, I went back to check the Docker image configuration and found that many tests fail. The problem can be reproduced with the following Dockerfile and command:
FROM docker.io/rocm/pytorch:rocm6.4_ubuntu24.04_py3.12_pytorch_release_2.5.1
WORKDIR /workspace
RUN mkdir /scratch0
# limit the number of CPUs in the container, otherwise libgomp error.
ENV OMP_NUM_THREADS=4
ENV TORCH_NUM_THREADS=4
RUN apt-get -y update
# other Python packages
RUN pip install pytorch-lightning tqdm numpy biopython pandas matplotlib einops ninja packaging numba scipy
RUN git clone https://github.com/ROCm/flash-attention.git &&\
cd flash-attention &&\
GPU_ARCHS=gfx942 python setup.py install
# set working dir
WORKDIR /workspace/flash-attention
Then, inside the running container:
pytest tests/test_flash_attn_ck.py
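When the test suite fails like this, it can help to cross-check the GPU kernel against a plain CPU reference on small inputs. Below is a minimal sketch of a reference softmax attention in NumPy (a hypothetical debugging helper, not part of the flash-attention repo) that can be compared element-wise against flash-attention output and checked for NaNs:

```python
import numpy as np

def reference_attention(q, k, v):
    """Plain softmax attention in float32, usable as a ground truth
    to compare against flash-attention outputs on small shapes."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.transpose(0, 2, 1)) * scale      # (batch, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (batch, seq, head_dim)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 16, 64), dtype=np.float32) for _ in range(3))
out = reference_attention(q, k, v)
assert not np.isnan(out).any(), "reference attention produced NaNs"
```

A result from the failing kernel that diverges from this reference (or contains NaNs where the reference does not) would point at the kernel rather than the training script.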
Operating System
Ubuntu
CPU
AMD EPYC 9654 96-Core Processor
GPU
MI300X
ROCm Version
6.4.0
ROCm Component
No response
Steps to Reproduce
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response