I'm not sure how to enable Flash Attention for BERT when starting the Triton Inference Server in order to accelerate inference.

Replies: 1 comment

- It's probably not what you are looking for, but compiling your model to TensorRT will use Flash Attention.
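As a sketch of that suggestion (not something from this thread): the TensorRT 8.x Python API can build an engine from an ONNX export of BERT, and enabling FP16 lets TensorRT select its fused multi-head attention kernels on supported GPUs. The file paths, input names, and shape ranges below are assumptions about a typical Hugging Face BERT export; adjust them to your model.

```python
import tensorrt as trt

# Paths are assumptions; point them at your own ONNX export and output location.
ONNX_PATH = "bert_base.onnx"
ENGINE_PATH = "model.plan"

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph into the TensorRT network definition.
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
# FP16 lets TensorRT pick its fused attention kernels on supported GPUs.
config.set_flag(trt.BuilderFlag.FP16)

# Dynamic (batch, sequence) shapes; input names assume a typical BERT export.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask", "token_type_ids"):
    profile.set_shape(name, (1, 1), (8, 128), (32, 512))
config.add_optimization_profile(profile)

# Serialize the engine so Triton's TensorRT backend can load it at start-up.
engine = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(engine)
```

Dropping the serialized engine into the model repository as `models/<model_name>/1/model.plan`, with a `config.pbtxt` that sets `platform: "tensorrt_plan"`, lets Triton load it when the server starts.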