Flash Attention is not installed? #595

Open

ObliviousDonkey opened this issue Sep 6, 2024 · 7 comments

@ObliviousDonkey

```
    raise NotImplementedError("flash attention is not installed")
NotImplementedError: flash attention is not installed
2024-09-06T05:28:11.268655Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /opt/conda/conda-bld/pytorch_1720538438429/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
2024-09-06 05:28:02.507 | INFO     | lorax_server.utils.state:<module>:11 - Prefix caching = False
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 87, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 408, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 274, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 165, in get_model
    from lorax_server.models.flash_llama import FlashLlama
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_llama.py", line 9, in <module>
    from lorax_server.models.custom_modeling.flash_llama_modeling import (
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_llama_modeling.py", line 32, in <module>
    from lorax_server.utils import flash_attn, paged_attention
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 330, in <module>
    raise NotImplementedError("flash attention is not installed")
NotImplementedError: flash attention is not installed
 rank=0
2024-09-06T05:28:11.324488Z ERROR lorax_launcher: Shard 0 failed to start
2024-09-06T05:28:11.324520Z  INFO lorax_launcher: Shutting down shards
Error: ShardCannotStart
```

What might be causing the issue? I'm on an RTX 3060.

@poddarabhinav

Try using this image; they have updated it to CUDA 12.4, and there is a driver issue with the older one.
Docker image: ghcr.io/predibase/lorax:07addea
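
A minimal launch sketch with that image, following the `docker run ... --model-id` pattern from the LoRAX README (the model ID, port, and volume path are placeholders; adjust for your setup):

```shell
# Pull the CUDA 12.4 build of the LoRAX image and launch it.
# --gpus all requires the NVIDIA Container Toolkit on the host;
# the model ID, port, and volume path below are placeholders.
docker pull ghcr.io/predibase/lorax:07addea

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/predibase/lorax:07addea \
  --model-id mistralai/Mistral-7B-Instruct-v0.1
```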

@2022dc04025

When using mistralai/Mistral-7B-Instruct-v0.1, I am getting the error below even after changing the Docker image to ghcr.io/predibase/lorax:07addea:

```
ImportError: Mistral model requires flash attn v2
rank=0
Error: ShardCannotStart
```

@ObliviousDonkey
Author

Any solutions?

@poddarabhinav

I tried this with Llama 3.1, and the issue was that the Nvidia driver didn't support the CUDA version in the Lorax Docker image. In my case, when I executed commands in the Docker image, CUDA was not available, so I tried different Docker images provided by them. Luckily, ghcr.io/predibase/lorax:07addea worked. Apparently, CUDA 12.4 had a mismatch with Nvidia driver version 550.9.
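
For anyone who wants to reproduce that check, a minimal sketch, assuming the image's entrypoint is the launcher and needs overriding to reach Python:

```shell
# Override the image entrypoint to ask PyTorch whether it can see the GPU.
# False here means the host driver and the image's CUDA runtime don't match.
docker run --rm --gpus all --entrypoint python \
  ghcr.io/predibase/lorax:07addea \
  -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```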

@ObliviousDonkey
Author

ObliviousDonkey commented Sep 18, 2024

[Screenshot of the error output from the SSH client]

> I tried this with Llama 3.1, and the issue was that the Nvidia driver didn't support the CUDA version in the Lorax Docker image. In my case, when I executed commands in the Docker image, CUDA was not available, so I tried different Docker images provided by them. Luckily, ghcr.io/predibase/lorax:07addea worked. Apparently, CUDA 12.4 had a mismatch with Nvidia driver version 550.9.

Thanks for the help. But after switching to the Docker image you gave, I'm facing new errors when running unsloth/Meta-Llama-3.1-8B-bnb-4bit.

Driver version: 535.183.01, CUDA version: 12.2
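
(I'm reading those versions from `nvidia-smi`; roughly:)

```shell
# The nvidia-smi header shows the installed driver version and the highest
# CUDA runtime that driver supports. Driver 535.x tops out at CUDA 12.2,
# so a CUDA 12.4 image may still hit driver-mismatch errors on this host.
nvidia-smi
```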

Did you succeed running this one?

@dumbPy

dumbPy commented Sep 18, 2024

I too had this issue with that particular model.
You may try other models like hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4, which gives me 55 tokens/sec on an A100, or meta-llama/Meta-Llama-3.1-8B-Instruct, which gives 60 tokens/sec on an A100.
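
If it helps, here is how I'd sketch the launch for the AWQ model with the same image. Note that `--quantize awq` is my assumption for a pre-quantized AWQ checkpoint; check `lorax-launcher --help` if the model loads without it:

```shell
# Launch the AWQ-quantized Llama 3.1 8B Instruct (placeholder port/volume).
# --quantize awq is assumed here for the pre-quantized checkpoint.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/predibase/lorax:07addea \
  --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantize awq
```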

@vsokolovskii

I tried the image you suggested; the base model is unsloth's Llama 3.1 70B.

I'm getting some dimension mismatches:

```
AssertionError: [41943040, 1] != [10240, 8192]
 rank=0
Error: ShardCannotStart
```
