Description
`nn.functional.scaled_dot_product_attention` is a very efficient implementation of attention.
It is significantly faster and more memory-efficient than the naive implementation, and adopting it shouldn't require any new dependencies or any changes outside the module.
From the documentation:
There are currently three supported implementations of scaled dot product attention:
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Memory-Efficient Attention
- A PyTorch implementation defined in C++ matching the above formulation
The function may call optimized kernels for improved performance when using the CUDA backend. For all other backends, the PyTorch implementation will be used.
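For reference, SDPA with default arguments is numerically equivalent to the naive formulation used in the module. A minimal sketch (not taken from the issue, dropout omitted):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Small random tensors shaped (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(1, 2, 4, 8) for _ in range(3))

# Naive attention, as currently written in the module
scale = q.shape[-1] ** -0.5
attn = ((q * scale) @ k.transpose(-2, -1)).softmax(dim=-1)
naive = attn @ v

# Fused implementation; its default scale is also 1/sqrt(head_dim)
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-5))
```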
The following snippets:
waldo/deepcheat/VideoMAEv2/models/modeling_finetune.py, lines 185 to 191 in 30dcb63:

```python
q = q * self.scale
attn = (q @ k.transpose(-2, -1))
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
```
waldo/deepcheat/VideoMAEv2/models/modeling_finetune.py, lines 124 to 135 in 30dcb63:

```python
attn = (
    F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1))
# torch.log(torch.tensor(1. / 0.01)) = 4.6052
logit_scale = torch.clamp(self.scale, max=4.6052).exp()
attn = attn * logit_scale
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
```
could be simplified to:

```python
x = F.scaled_dot_product_attention(
    q,
    k,
    v,
    scale=self.scale,
    dropout_p=(self.attn_drop.p if self.training else 0.0)
).transpose(1, 2).reshape(B, N, -1)
```

and

```python
# scale= expects a Python float, so the tensor logit scale is
# folded into q instead and the internal scaling is disabled
logit_scale = torch.clamp(self.scale, max=4.6052).exp()
x = F.scaled_dot_product_attention(
    F.normalize(q, dim=-1) * logit_scale,
    F.normalize(k, dim=-1),
    v,
    scale=1.0,
    dropout_p=(self.attn_drop.p if self.training else 0.0)
).transpose(1, 2).reshape(B, N, -1)
```

(Be aware, I didn't test the proposed snippets.)