Flash Attention is not installed? #595

Open

ObliviousDonkey opened this issue Sep 6, 2024 · 7 comments

@ObliviousDonkey

```
    raise NotImplementedError("flash attention is not installed")
NotImplementedError: flash attention is not installed
2024-09-06T05:28:11.268655Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /opt/conda/conda-bld/pytorch_1720538438429/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
2024-09-06 05:28:02.507 | INFO     | lorax_server.utils.state:<module>:11 - Prefix caching = False
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 87, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 408, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 274, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 165, in get_model
    from lorax_server.models.flash_llama import FlashLlama
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_llama.py", line 9, in <module>
    from lorax_server.models.custom_modeling.flash_llama_modeling import (
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_llama_modeling.py", line 32, in <module>
    from lorax_server.utils import flash_attn, paged_attention
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 330, in <module>
    raise NotImplementedError("flash attention is not installed")
NotImplementedError: flash attention is not installed
 rank=0
2024-09-06T05:28:11.324488Z ERROR lorax_launcher: Shard 0 failed to start
2024-09-06T05:28:11.324520Z  INFO lorax_launcher: Shutting down shards
Error: ShardCannotStart
```

What might be causing the issue? I'm on an RTX 3060.

@poddarabhinav

Try using this image; they have updated it to CUDA 12.4, and there is a driver issue with the older one.
Docker image: ghcr.io/predibase/lorax:07addea
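
A minimal launch sketch with that image, following the `docker run ... --model-id` pattern from the LoRAX README (the model ID, port, and volume path are placeholders; adjust for your setup):

```shell
# Pull the CUDA 12.4 build of the LoRAX image and launch it.
# --gpus all requires the NVIDIA Container Toolkit on the host;
# the model ID, port, and volume path below are placeholders.
docker pull ghcr.io/predibase/lorax:07addea

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/predibase/lorax:07addea \
  --model-id mistralai/Mistral-7B-Instruct-v0.1
```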

@2022dc04025

When using mistralai/Mistral-7B-Instruct-v0.1, I am getting the error below even after changing the Docker image to ghcr.io/predibase/lorax:07addea:

```
ImportError: Mistral model requires flash attn v2
rank=0
Error: ShardCannotStart
```

@ObliviousDonkey
Author

Any solutions?

@poddarabhinav

I tried this with Llama 3.1, and the issue was that the Nvidia driver didn't support the CUDA version in the Lorax Docker image. In my case, when I executed commands in the Docker image, CUDA was not available, so I tried different Docker images provided by them. Luckily, ghcr.io/predibase/lorax:07addea worked. Apparently, CUDA 12.4 had a mismatch with Nvidia driver version 550.9.
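
For anyone who wants to reproduce that check, a minimal sketch, assuming the image's entrypoint is the launcher and needs overriding to reach Python:

```shell
# Override the image entrypoint to ask PyTorch whether it can see the GPU.
# False here means the host driver and the image's CUDA runtime don't match.
docker run --rm --gpus all --entrypoint python \
  ghcr.io/predibase/lorax:07addea \
  -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```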

@ObliviousDonkey
Author

ObliviousDonkey commented Sep 18, 2024

[Screenshot of the error output from the SSH client]

> I tried this with Llama 3.1, and the issue was that the Nvidia driver didn't support the CUDA version in the Lorax Docker image. In my case, when I executed commands in the Docker image, CUDA was not available, so I tried different Docker images provided by them. Luckily, ghcr.io/predibase/lorax:07addea worked. Apparently, CUDA 12.4 had a mismatch with Nvidia driver version 550.9.

Thanks for the help. But after switching to the Docker image you gave, I'm facing new errors when running unsloth/Meta-Llama-3.1-8B-bnb-4bit.

Driver version: 535.183.01, CUDA version: 12.2
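
(I'm reading those versions from `nvidia-smi`; roughly:)

```shell
# The nvidia-smi header shows the installed driver version and the highest
# CUDA runtime that driver supports. Driver 535.x tops out at CUDA 12.2,
# so a CUDA 12.4 image may still hit driver-mismatch errors on this host.
nvidia-smi
```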

Did you succeed running this one?

@dumbPy

dumbPy commented Sep 18, 2024

I too had this issue with that particular model.
You may try other models like hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4, which gives me 55 tokens/sec on an A100, or meta-llama/Meta-Llama-3.1-8B-Instruct, which gives 60 tokens/sec on an A100.
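
If it helps, here is how I'd sketch the launch for the AWQ model with the same image. Note that `--quantize awq` is my assumption for a pre-quantized AWQ checkpoint; check `lorax-launcher --help` if the model loads without it:

```shell
# Launch the AWQ-quantized Llama 3.1 8B Instruct (placeholder port/volume).
# --quantize awq is assumed here for the pre-quantized checkpoint.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/predibase/lorax:07addea \
  --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantize awq
```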

@vsokolovskii

I tried the image you suggested; the base model is unsloth's Llama 3.1 70B.

I'm getting some dimension mismatches:

```
AssertionError: [41943040, 1] != [10240, 8192]
 rank=0
Error: ShardCannotStart
```
