Killed: Out of Memory on Jetson Orin #261

Open
sfatimakhan opened this issue Feb 14, 2025 · 1 comment

sfatimakhan commented Feb 14, 2025

Thank you for the great work, I really appreciate it.

  1. Conda environment: Python 3.10
  2. Device: Jetson Orin Nano Developer Kit 8GB, JetPack 6.0
  3. Model: [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), as mentioned in the README

I followed the steps as described, but when running TinyChat I repeatedly ran out of memory:

  1. Initially, I tried to "Perform the AWQ search", but it ran out of memory and the process was "Killed" (see the rough memory estimate after this list).
  2. I then ran TinyChat from the pre-saved results instead (the .pt files provided in awq_cache), but eventually hit the same issue ("Killed").
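
For scale, here is a rough estimate of the weight memory involved. This is only a back-of-the-envelope sketch: the parameter count is an approximation for Llama-2-7B, and activation / KV-cache memory is ignored.

```python
# Rough estimate of why the fp16 model alone is already too large for
# the Orin Nano's 8 GB of unified memory, while the 4-bit weights fit.
# The parameter count is approximate; runtime overheads are ignored.

PARAMS = 6.7e9  # approx. parameter count of Llama-2-7B

fp16_weights_gb = PARAMS * 2 / 1024**3    # 2 bytes per fp16 weight
int4_weights_gb = PARAMS * 0.5 / 1024**3  # ~0.5 bytes per 4-bit weight (excluding scales/zeros)

print(f"fp16 weights : ~{fp16_weights_gb:.1f} GB")   # ~12.5 GB -> exceeds 8 GB
print(f"4-bit weights: ~{int4_weights_gb:.1f} GB")   # ~3.1 GB  -> fits, plus runtime overhead
```

If this is roughly right, the fp16 model needed by the AWQ search step does not fit on the board at all, which would explain why that step is killed regardless of group size.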

I have tried the following workarounds, but none of them resolved the issue:

  1. I added 16 GB and then 64 GB of swap space, but the process was still killed. I suspect it may have timed out and was eventually terminated.
  2. I then reduced the quantization group size from 128 to 64, but this did not resolve the issue either.

Sometimes these workarounds get the demo running, but it is still killed after a few conversation turns. Could you please suggest a possible solution or optimization to resolve this memory issue? I've attached a screenshot for reference.


ys-2020 (Contributor) commented Feb 18, 2025

Hi, it seems you did not enable flash attention here. The memory of the Orin Nano is very limited, and as the sequence length gets larger, the memory consumption of attention grows quadratically (if flash attention is not enabled). That's why the program runs into OOM after several rounds of conversation. We also do not recommend performing the quantization itself on a Jetson Orin Nano.
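
To make the quadratic growth concrete, here is a small illustrative sketch. The head count and fp16 score dtype are assumptions based on Llama-2-7B, not measurements from TinyChat.

```python
# Size of one layer's full attention-score matrix as a function of
# sequence length, when the matrix is materialized (no flash attention).
# Shapes assume a Llama-2-7B-like model (32 heads) with fp16 scores.

N_HEADS = 32
BYTES_FP16 = 2

for seq_len in (512, 1024, 2048, 4096):
    scores_mb = N_HEADS * seq_len * seq_len * BYTES_FP16 / 1024**2
    print(f"seq_len={seq_len:5d} -> ~{scores_mb:7.1f} MB of attention scores per layer")

# Flash attention computes the same result in tiles without ever holding the
# full seq_len x seq_len matrix, so this term stops growing quadratically.
```

For reference, in plain Hugging Face transformers flash attention can usually be requested with `attn_implementation="flash_attention_2"` in `from_pretrained`; how TinyChat exposes it may differ, so check the demo's own options.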
