I'm not sure how to enable Flash Attention for BERT when starting the Triton Inference Server in order to accelerate inference.

Replies: 1 comment

- It's probably not what you are looking for, but compiling your model to TensorRT will use Flash Attention.
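As a sketch of that suggestion (not something from this thread): the TensorRT 8.x Python API can build an engine from an ONNX export of BERT, and enabling FP16 lets TensorRT select its fused multi-head attention kernels on supported GPUs. The file paths, input names, and shape ranges below are assumptions about a typical Hugging Face BERT export; adjust them to your model.

```python
import tensorrt as trt

# Paths are assumptions; point them at your own ONNX export and output location.
ONNX_PATH = "bert_base.onnx"
ENGINE_PATH = "model.plan"

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph into the TensorRT network definition.
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
# FP16 lets TensorRT pick its fused attention kernels on supported GPUs.
config.set_flag(trt.BuilderFlag.FP16)

# Dynamic (batch, sequence) shapes; input names assume a typical BERT export.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask", "token_type_ids"):
    profile.set_shape(name, (1, 1), (8, 128), (32, 512))
config.add_optimization_profile(profile)

# Serialize the engine so Triton's TensorRT backend can load it at start-up.
engine = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(engine)
```

Dropping the serialized engine into the model repository as `models/<model_name>/1/model.plan`, with a `config.pbtxt` that sets `platform: "tensorrt_plan"`, lets Triton load it when the server starts.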