feat(bench): Add pipeline FlashAttention-2 implementation. #23
Conversation
@microsoft-github-policy-service agree company="Microsoft"
Force-pushed from 7771636 to d3fccac.
.vscode/settings.json (outdated)
"gotoSymbolStack.currentStackPosition": 0, | ||
"gotoSymbolStack.maxStackPosition": 0, | ||
"gotoSymbolStack.filePositionInfo": [] | ||
} |
I am curious why the pre-commit hooks (see: https://github.com/microsoft/TileFusion/blob/master/.pre-commit-config.yaml#L28) do not catch these invisible characters, which are often introduced by differences between IDEs. I have observed this issue several times; the hook is supposed to fix it automatically before a PR is filed.
I just ran pre-commit run --all-files to fix the issues automatically, but it seems that when I commit with Git, the pre-commit hook does not fix all of the files first. I will look into the reason later.
Force-pushed from b214a27 to 29f47eb.
Force-pushed from 44fba6a to e57fa5c.
# --------------------------------------------------------------------------

cmake_minimum_required(VERSION 3.25 FATAL_ERROR)
project(gemm_bench LANGUAGES C CXX CUDA)
The project name gemm_bench should be updated.
Oops! I forgot to make the modifications, but they have been made now.
include_directories("${THIRD_PARTY_DIR}/cutlass/include")

add_executable(flash_attn main.cu)
target_link_libraries(flash_attn ${CUDA_CUBLAS_LIBRARIES})
Is cuBLAS actually used in this code? It doesn't appear to be. Do we need to link against it?
Fixed.
LGTM. 😊
This is a basic version of the pipelined FlashAttention-2 implementation, and I would like to first merge these changes into the master branch.

The current version of FlashAttention has the following features:
- Data loading uses async_copy, which improves the utilization of the compute units (Tensor Cores on the Ampere architecture); a minimal sketch of this pattern is given after this list.
- load_q_once has been implemented for the case where kTK == kK. In this situation, the k dimension is not partitioned within a single SM block, and the Q matrix only needs to be loaded once.
- Loading the V matrix is partitioned once for kN in the outer loop and once for kTN in the inner loop; the inner-loop partitioning has not been implemented yet.

The current implementation is not a final version; I will continue to add more features in subsequent PRs.
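For context, here is a minimal, self-contained sketch of the double-buffered async-copy pattern referred to above, assuming an Ampere (sm_80+) GPU and CUDA's cuda_pipeline.h primitives. The kernel, the tile size kTileElems, and the trivial "compute" step are illustrative placeholders, not TileFusion's actual implementation; the point is only that the copy for the next tile is issued before the current tile is consumed, so global-memory transfers overlap with computation.

```cuda
// Minimal sketch of double-buffered asynchronous copies (Ampere, sm_80+).
// Tile size and the placeholder "compute" step are illustrative only.
#include <cuda_pipeline.h>
#include <cstdio>

constexpr int kTileElems = 128;  // elements staged per stage (one per thread)

__global__ void pipelined_copy(const float* __restrict__ src,
                               float* __restrict__ dst, int num_tiles) {
    // Two shared-memory buffers: copy into one while the other is consumed.
    __shared__ float smem[2][kTileElems];
    const int tid = threadIdx.x;

    // Prologue: issue the async copy for tile 0 into buffer 0.
    __pipeline_memcpy_async(&smem[0][tid], &src[tid], sizeof(float));
    __pipeline_commit();

    for (int tile = 0; tile < num_tiles; ++tile) {
        const int cur = tile & 1;
        const int nxt = cur ^ 1;

        // Prefetch the next tile into the other buffer while this one is used.
        if (tile + 1 < num_tiles) {
            __pipeline_memcpy_async(&smem[nxt][tid],
                                    &src[(tile + 1) * kTileElems + tid],
                                    sizeof(float));
            __pipeline_commit();
        }

        // Wait until the copy for the current tile has landed in shared memory
        // (i.e. all committed batches except the most recent prefetch).
        __pipeline_wait_prior(tile + 1 < num_tiles ? 1 : 0);
        __syncthreads();

        // Placeholder "compute": write the staged tile back out.
        dst[tile * kTileElems + tid] = smem[cur][tid];
        __syncthreads();  // ensure the buffer is free before it is refilled
    }
}

int main() {
    const int num_tiles = 4;
    const int n = num_tiles * kTileElems;
    float *src = nullptr, *dst = nullptr;
    cudaMallocManaged(&src, n * sizeof(float));
    cudaMallocManaged(&dst, n * sizeof(float));
    for (int i = 0; i < n; ++i) src[i] = static_cast<float>(i);

    pipelined_copy<<<1, kTileElems>>>(src, dst, num_tiles);
    cudaDeviceSynchronize();
    printf("dst[0] = %.1f, dst[%d] = %.1f\n", dst[0], n - 1, dst[n - 1]);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

Built with nvcc -arch=sm_80, the kernel keeps the next tile's copy in flight while the current tile is processed, which is the same overlap that pipelining the FlashAttention loads is meant to exploit.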