[WIP][Llama2] Add KVCache for prefill stage + interactive chat mode in llm_runner + StreamingLLM. #299

Merged: 13 commits into nod-ai:main on Jan 5, 2024

Conversation

@raikonenfnu (Member) commented Dec 23, 2023

Currently our decode initialization/prefill stage is suboptimal compared to the decoding phase. When the token length is fairly small, our initialization/prefill is quick, but once it grows a bit it starts becoming rather slow.

It takes around 2.2-2.8 seconds at the 500-token mark, and around 5-5.5 seconds when the token length/history is at 1000. This is not very good for multi-dialogue/interactive prompting cases.

KV-Cache has always been known to help our performance during the decoding phase. Currently, we recompute all the past key/values (PKV) after every new prompt/round at the prefill stage, which is redundant. This PR introduces the use of the KV-Cache at the initialization/prefill stage, which keeps the time to first token at around 0.2 seconds even past the 1000-token mark.
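
For context, a minimal sketch of what prefill-with-cache looks like at the torch level, assuming an HF-style model interface (the class and method names here are illustrative, not the actual llm_runner code):

```python
import torch

class ChatSession:
    """Illustrative only; names and cache layout are hypothetical."""

    def __init__(self, model):
        self.model = model
        self.past_key_values = None  # K/V cache accumulated over previous rounds

    def prefill(self, new_prompt_ids: torch.Tensor) -> torch.Tensor:
        # Without a prefill cache, every round re-runs the model over the entire
        # conversation history. Here only the *new* prompt tokens are fed and the
        # previously computed K/V entries are reused, so time-to-first-token stays
        # roughly flat as the history grows.
        with torch.no_grad():
            out = self.model(
                input_ids=new_prompt_ids,
                past_key_values=self.past_key_values,
                use_cache=True,
            )
        self.past_key_values = out.past_key_values
        return out.logits[:, -1, :]  # logits used to sample the first new token
```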

This PR has been extended to introduce StreamingLLM functionality, which will allow us to generate an unbounded number of tokens under controlled memory growth.

Future work for StreamingLLM:

  • Make the window size configurable through Python. Everything is there, but we'd need to initialize with a default value, which will only be possible once we let _create_initial_value take in an initial value from a GlobalAttribute somewhere here.
  • Get flow.move to allow overlap between the sliding window and the source of the data. (Currently we need to evict only once the cache is at least 2x the window size.)
  • Introduce rerotation of RoPE, as seen here, to remove the invasive modification of the LlamaAttention module for StreamingLLM (a sketch of the idea follows this list).
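
A rough sketch of that rerotation idea, assuming the standard rotate-half RoPE formulation (helper names are hypothetical; the PR does not implement this):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Standard RoPE helper: pairs dimension i with dimension i + d/2.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rerotate_keys(keys: torch.Tensor, cos_delta: torch.Tensor, sin_delta: torch.Tensor) -> torch.Tensor:
    # `keys` were rotated at their original positions when first cached. Because
    # RoPE rotations compose (rotating by p1 and then by p2 - p1 equals rotating
    # by p2), applying one extra rotation by the position delta re-bases the cached
    # keys onto their post-eviction positions without touching LlamaAttention.
    # cos_delta/sin_delta are the cos/sin tables evaluated at (new_pos - old_pos).
    return keys * cos_delta + rotate_half(keys) * sin_delta
```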

This PR also introduces:
1. Set capability for GlobalScalars
2. Inheritance of exports/globals for CompiledModule subclasses
3. READMEs for llm_runner and stateless_llama
4. e2e test refactoring

- Added LlamaAttn modification for shifted positions.
- Made the app evict only once we have at least 2x the window size, to circumvent overlap between the copy's src and dst data (see the sketch below).
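
A minimal sketch of that eviction policy, assuming a per-layer cache tensor laid out along the sequence dimension (names are hypothetical, and this version copies into a fresh tensor; the 2x rule matters for the in-place flow.move case where src and dst share storage):

```python
import torch

def maybe_evict(cache: torch.Tensor, sink_size: int, window_size: int) -> torch.Tensor:
    """cache: [batch, seq_len, ...] keys or values for one layer."""
    seq_len = cache.shape[1]
    if seq_len < 2 * window_size:
        # Deferring eviction until the cache is at least 2x the window size keeps
        # the copied region (the most recent tokens) from overlapping the slots it
        # is copied into, which is the constraint flow.move currently imposes.
        return cache
    sinks = cache[:, :sink_size]                              # always-kept "attention sink" tokens
    recent = cache[:, seq_len - (window_size - sink_size):]   # most recent tokens
    return torch.cat([sinks, recent], dim=1)                  # compacted cache of length window_size
```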

TODO:
- Implement cache eviction + checks in the decode step as well
- Make the window size configurable through Python
- Fix flow.move
- Compile down SinkCache from upstream
@stellaraccident (Contributor) left a comment:

Cool. Can you add two unit tests for the framework changes?

@dan-garvey (Member) left a comment:

I'm fine with the file structure in turbine_models. Please add tests for modify llama and the app.

Outdated review threads (resolved) on: python/shark_turbine/aot/compiled_module.py and python/turbine_models/custom_models/llm_app.py (x2).
@raikonenfnu (Member, Author) commented:

> I'm fine with the file structure in turbine_models. Please add tests for modify llama and the app.

Hey @dan-garvey, I think I am missing something; can you elaborate a bit on how we should test the app?

@IanNod (Contributor) commented Jan 4, 2024:

> Hey @dan-garvey, I think I am missing something; can you elaborate a bit on how we should test the app?

You can look at how we are e2e testing the current stateless_llama app here https://github.com/nod-ai/SHARK-Turbine/blob/main/python/turbine_models/tests/stateless_llama_test.py.

@raikonenfnu (Member, Author) commented Jan 5, 2024:

> I'm fine with the file structure in turbine_models. Please add tests for modify llama and the app.

> You can look at how we are e2e testing the current stateless_llama app here https://github.com/nod-ai/SHARK-Turbine/blob/main/python/turbine_models/tests/stateless_llama_test.py.

Thanks Dan and Ian! I added the e2e compile-and-run for the StreamingLLM vmfb, as well as a comparison of the results between modified_llama and regular llama in torch.
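
For reference, the torch-level comparison amounts to something like the following (a hedged sketch; the model arguments and tolerance here are hypothetical, not the actual code in stateless_llama_test.py):

```python
import torch

def assert_modified_llama_matches(regular_model, modified_model, input_ids, atol=1e-5):
    # Run the unmodified llama and the StreamingLLM-modified llama on the same
    # prompt and check that the logits agree within tolerance.
    with torch.no_grad():
        expected = regular_model(input_ids=input_ids).logits
        actual = modified_model(input_ids=input_ids).logits
    assert torch.allclose(expected, actual, atol=atol), "modified llama diverged from the reference"
```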

@raikonenfnu (Member, Author) commented:

> Cool. Can you add two unit tests for the framework changes?

Thanks for taking a break from your break to look at this, Stella! I added the unit tests for the framework changes.

@raikonenfnu raikonenfnu changed the title [WIP][Llama2] Add KVCache for prefill stage + Simple interactive app runner. [WIP][Llama2] Add KVCache for prefill stage + interactive chat mode in llm_runner + StreamingLLM. Jan 5, 2024
@raikonenfnu raikonenfnu requested a review from IanNod January 5, 2024 17:48
@dan-garvey (Member) left a comment:

looks great! thanks!

@raikonenfnu raikonenfnu merged commit 432fa0d into nod-ai:main Jan 5, 2024
4 checks passed