CUDA implementation of transformers
Only the forward passes are implemented for now.
The following layers are implemented:
- Matmul forward
- GELU forward
- Feed forward layer forward
- Softmax forward
- CrossEntropyLoss forward
- LayerNorm forward
- Encoder forward
- FlashAttention 2 forward
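
As an illustration of what a forward-only layer of this kind can look like, here is a minimal sketch of a GELU forward kernel. It is not taken from this repository: the names `gelu_forward_kernel` and `gelu_forward`, the choice of the tanh approximation, and the launch configuration of 256 threads per block are all assumptions made for the example.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not the repository's code): one thread per element.
// Assumes the tanh approximation of GELU:
//   gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
__global__ void gelu_forward_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard threads past the end of the array
        const float k = 0.7978845608f;  // sqrt(2 / pi)
        float x = in[i];
        float inner = k * (x + 0.044715f * x * x * x);
        out[i] = 0.5f * x * (1.0f + tanhf(inner));
    }
}

// Hypothetical host wrapper: copies the input to the device, launches the
// kernel, and copies the result back. Error checking is omitted for brevity.
std::vector<float> gelu_forward(const std::vector<float>& h_in) {
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    size_t bytes = n * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    gelu_forward_kernel<<<blocks, threads>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `gelu_forward({0.0f, 1.0f, -1.0f})` from host code should return values close to 0, 0.8412 and -0.1588. A production kernel would normally keep tensors on the device between layers instead of copying for every call.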