[WIP][Llama2] Add KVCache for prefill stage + interactive chat mode in llm_runner + StreamingLLM. #299
Conversation
- Added LlamaAttn modification for shifted positions.
- Made the app only evict once the cache holds at least 2x the window size, so the copy source and destination data never overlap (see the sketch below).

TODO:
- Implement evict cache + check in the decode step as well.
- Make window size configurable through Python.
- Fix flow.move.
- Compile down SinkCache from upstream.
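A minimal sketch of that eviction condition, assuming a cache laid out as `[..., seq, head_dim]` and illustrative `sink_size`/`window_size` parameters (these names are placeholders, not the actual app code):

```python
import torch

def maybe_evict(kv_cache: torch.Tensor, seq_len: int, sink_size: int, window_size: int) -> int:
    """Evict the middle of the cache in place; returns the new sequence length.

    kv_cache is assumed to be [..., max_seq, head_dim]. We keep `sink_size`
    initial "sink" tokens plus the most recent `window_size` tokens, but only
    evict once at least 2x the window has accumulated past the sinks: that
    guarantees the retained slice we copy forward never overlaps its destination.
    """
    if seq_len < sink_size + 2 * window_size:
        return seq_len  # not enough history yet; skip eviction
    # src starts at seq_len - window_size >= sink_size + window_size, which is
    # exactly where dst ends, so the in-place copy is overlap-free.
    src = kv_cache[..., seq_len - window_size : seq_len, :]
    kv_cache[..., sink_size : sink_size + window_size, :] = src
    return sink_size + window_size
```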
Cool. Can you add two unit tests for the framework changes?
I'm fine with the file structure in turbine_models. Please add tests for the modified llama and for the app.
Hey @dan-garvey, I think I am missing something; can you elaborate a bit on how we should test the app?
You can look at how we e2e test the current stateless_llama app here: https://github.com/nod-ai/SHARK-Turbine/blob/main/python/turbine_models/tests/stateless_llama_test.py
Thanks Dan and Ian! I added the e2e compile-and-run for the streamingLLM vmfb, as well as a comparison of the results between modified_llama and regular llama in torch.
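For readers following along, a bare-bones skeleton of what such a parity test can look like; the `run_regular_llama` / `run_modified_llama` helpers are hypothetical placeholders, not the actual turbine_models test utilities (see the linked stateless_llama_test.py for the real setup):

```python
import unittest
import torch

# Placeholder hooks: in a real test these would build the unmodified torch
# Llama and the StreamingLLM-modified Llama and run the same prompt through
# both in eager mode. They are stubs here so the skeleton is self-contained.
def run_regular_llama(prompt: str) -> torch.Tensor:
    raise NotImplementedError("wire up the unmodified torch Llama here")

def run_modified_llama(prompt: str) -> torch.Tensor:
    raise NotImplementedError("wire up the modified (StreamingLLM) Llama here")

class ModifiedLlamaParityTest(unittest.TestCase):
    def test_modified_llama_matches_regular_llama(self):
        prompt = "Hello, tell me a story about llamas."
        reference = run_regular_llama(prompt)
        modified = run_modified_llama(prompt)
        # Allow small numerical drift between the two attention implementations.
        torch.testing.assert_close(reference, modified, rtol=1e-3, atol=1e-3)

if __name__ == "__main__":
    unittest.main()
```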
Thanks for taking a break from your break to look at this, Stella. I added the unit tests for the framework changes.
looks great! thanks!
Currently our initialization/prefill stage is suboptimal compared to the decoding phase. When the token length is small, initialization/prefill is quick, but once the history grows it becomes noticeably slow: around 2.2-2.8 seconds at the 500-token mark, and around 5-5.5 seconds when the token history reaches 1000. This is not good for multi-turn/interactive prompting.
The KV cache is well known to help our performance during the decoding phase. Currently, however, we recompute all of the past key/values (PKV) after every new prompt/round at the prefill stage, which is redundant. This PR introduces reuse of the KV cache at the initialization/prefill stage, which keeps time-to-first-token at around 0.2 seconds even past the 1000-token mark.
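As a rough illustration of the idea (shown against a HuggingFace-style eager `model`, not the compiled vmfb path this PR targets), prefill only needs to run over the newly appended prompt tokens when the KV cache from previous rounds is passed back in:

```python
import torch

@torch.no_grad()
def prefill_with_cache(model, new_token_ids: torch.Tensor, past_key_values=None):
    """Prefill for a new user prompt without recomputing the whole history.

    Only `new_token_ids` (the newly appended prompt tokens, shape [1, n_new])
    are run through the model; `past_key_values` carries the KV cache from the
    earlier conversation rounds. Works with HuggingFace-style causal LMs that
    accept and return `past_key_values`.
    """
    outputs = model(
        input_ids=new_token_ids,
        past_key_values=past_key_values,
        use_cache=True,
    )
    # Greedy pick of the first generated token; return the updated cache so the
    # next round (or the decode loop) can keep extending it.
    next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    return next_token, outputs.past_key_values
```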
This PR has also been extended to introduce streamingLLM functionality, which lets us generate an unbounded number of tokens under controlled memory growth.
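For context, a simplified eager-mode sketch of that retention policy (a few attention-sink tokens plus a sliding window), assuming the legacy HuggingFace tuple-of-(key, value) cache layout; the real path also relies on the shifted-position attention change mentioned above so RoPE positions stay consistent after eviction:

```python
import torch

def evict_kv(past_key_values, sink_size: int, window_size: int):
    """Keep only `sink_size` initial tokens plus the most recent `window_size`
    tokens of each layer's cache (shapes assumed [batch, heads, seq, head_dim])."""
    trimmed = []
    for k, v in past_key_values:
        if k.shape[2] <= sink_size + window_size:
            trimmed.append((k, v))
        else:
            k = torch.cat([k[:, :, :sink_size], k[:, :, -window_size:]], dim=2)
            v = torch.cat([v[:, :, :sink_size], v[:, :, -window_size:]], dim=2)
            trimmed.append((k, v))
    return tuple(trimmed)

@torch.no_grad()
def generate_streaming(model, input_ids, max_new_tokens, sink_size=4, window_size=1024):
    """Greedy decode loop whose KV cache size is bounded, so memory does not
    grow with the number of generated tokens. NOTE: a stock Llama would also
    need the shifted-position attention modification for this to stay correct."""
    past, next_ids, out_tokens = None, input_ids, []
    for _ in range(max_new_tokens):
        out = model(input_ids=next_ids, past_key_values=past, use_cache=True)
        next_ids = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        out_tokens.append(next_ids.item())
        past = evict_kv(out.past_key_values, sink_size, window_size)
    return out_tokens
```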
Future work for streamingLLM:
- Enable `_create_initial_value` to take an initial value from a GlobalAttribute (somewhere here).

This PR also introduces:
1. Set capabilities for GlobalScalars
2. Inheritance of exports/globals for CompiledModule subclasses
3. READMEs for llm_runner and stateless_llama
4. e2e test refactoring