[WIP][Llama2] Add KVCache for prefill stage + interactive chat mode in llm_runner + StreamingLLM. #299

Merged: 13 commits into nod-ai:main on Jan 5, 2024

Conversation

@raikonenfnu (Member) commented Dec 23, 2023

Currently our decode initialization/prefill stage is suboptimal compared to the decoding phase. When the token length is fairly small, our initialization/prefill is quick, but once it grows a bit it starts becoming rather slow.

It takes around 2.2-2.8 seconds at the 500-token mark, and around 5-5.5 seconds when the token length/history is at 1000. This is not very good for multi-dialogue/interactive prompting cases.

KV-Cache has always been known to help our performance during the decoding phase. Currently, we recompute all the past key/values (PKV) after every new prompt/round at the prefill stage, which is redundant. This PR introduces the use of the KV-Cache at the initialization/prefill stage, which keeps the time to first token at around 0.2 seconds even past the 1000-token mark.
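
For context, a minimal sketch of what prefill-with-cache looks like at the torch level, assuming an HF-style model interface (the class and method names here are illustrative, not the actual llm_runner code):

```python
import torch

class ChatSession:
    """Illustrative only; names and cache layout are hypothetical."""

    def __init__(self, model):
        self.model = model
        self.past_key_values = None  # K/V cache accumulated over previous rounds

    def prefill(self, new_prompt_ids: torch.Tensor) -> torch.Tensor:
        # Without a prefill cache, every round re-runs the model over the entire
        # conversation history. Here only the *new* prompt tokens are fed and the
        # previously computed K/V entries are reused, so time-to-first-token stays
        # roughly flat as the history grows.
        with torch.no_grad():
            out = self.model(
                input_ids=new_prompt_ids,
                past_key_values=self.past_key_values,
                use_cache=True,
            )
        self.past_key_values = out.past_key_values
        return out.logits[:, -1, :]  # logits used to sample the first new token
```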

This PR has been extended to introduce StreamingLLM functionality, which will allow us to generate an unbounded number of tokens under controlled memory growth.

Future work for StreamingLLM:

  • Make the window size configurable through Python. Everything is there, but we'd need to initialize with a default value, which will only be possible once we let _create_initial_value take in an initial value from a GlobalAttribute somewhere here.
  • Get flow.move to allow overlap between the sliding window and the source of the data. (Currently we need to evict only once the cache is at least 2x the window size.)
  • Introduce rerotation of RoPE, as seen here, to remove the invasive modification of the LlamaAttention module for StreamingLLM (a sketch of the idea follows this list).
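
A rough sketch of that rerotation idea, assuming the standard rotate-half RoPE formulation (helper names are hypothetical; the PR does not implement this):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Standard RoPE helper: pairs dimension i with dimension i + d/2.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rerotate_keys(keys: torch.Tensor, cos_delta: torch.Tensor, sin_delta: torch.Tensor) -> torch.Tensor:
    # `keys` were rotated at their original positions when first cached. Because
    # RoPE rotations compose (rotating by p1 and then by p2 - p1 equals rotating
    # by p2), applying one extra rotation by the position delta re-bases the cached
    # keys onto their post-eviction positions without touching LlamaAttention.
    # cos_delta/sin_delta are the cos/sin tables evaluated at (new_pos - old_pos).
    return keys * cos_delta + rotate_half(keys) * sin_delta
```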

This PR also introduces:
1. Set capability for GlobalScalars
2. Inheritance of exports/globals for CompiledModule subclasses
3. READMEs for llm_runner and stateless_llama
4. e2e test refactoring

- Added LlamaAttn modification for shifted positions.
- Made the app evict only once we have at least 2x the window size, to circumvent overlap between the copy's src and dst data (see the sketch below).
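
A minimal sketch of that eviction policy, assuming a per-layer cache tensor laid out along the sequence dimension (names are hypothetical, and this version copies into a fresh tensor; the 2x rule matters for the in-place flow.move case where src and dst share storage):

```python
import torch

def maybe_evict(cache: torch.Tensor, sink_size: int, window_size: int) -> torch.Tensor:
    """cache: [batch, seq_len, ...] keys or values for one layer."""
    seq_len = cache.shape[1]
    if seq_len < 2 * window_size:
        # Deferring eviction until the cache is at least 2x the window size keeps
        # the copied region (the most recent tokens) from overlapping the slots it
        # is copied into, which is the constraint flow.move currently imposes.
        return cache
    sinks = cache[:, :sink_size]                              # always-kept "attention sink" tokens
    recent = cache[:, seq_len - (window_size - sink_size):]   # most recent tokens
    return torch.cat([sinks, recent], dim=1)                  # compacted cache of length window_size
```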

TODO:
- Implement cache eviction + checks in the decode step as well
- Make the window size configurable through Python
- Fix flow.move
- Compile down SinkCache from upstream
@stellaraccident (Contributor) left a comment:

Cool. Can you add two unit tests for the framework changes?

@dan-garvey (Member) left a comment:

I'm fine with the file structure in turbine_models. Please add tests for modify llama and the app.

Outdated review threads (resolved) on: python/shark_turbine/aot/compiled_module.py and python/turbine_models/custom_models/llm_app.py (x2).
@raikonenfnu (Member, Author) commented:

> I'm fine with the file structure in turbine_models. Please add tests for modify llama and the app.

Hey @dan-garvey, I think I am missing something; can you elaborate a bit on how we should test the app?

@IanNod (Contributor) commented Jan 4, 2024:

> Hey @dan-garvey, I think I am missing something; can you elaborate a bit on how we should test the app?

You can look at how we are e2e testing the current stateless_llama app here https://github.com/nod-ai/SHARK-Turbine/blob/main/python/turbine_models/tests/stateless_llama_test.py.

@raikonenfnu (Member, Author) commented Jan 5, 2024:

> I'm fine with the file structure in turbine_models. Please add tests for modify llama and the app.

> You can look at how we are e2e testing the current stateless_llama app here https://github.com/nod-ai/SHARK-Turbine/blob/main/python/turbine_models/tests/stateless_llama_test.py.

Thanks Dan and Ian! I added the e2e compile-and-run for the StreamingLLM vmfb, as well as a comparison of the results between modified_llama and regular llama in torch.
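
For reference, the torch-level comparison amounts to something like the following (a hedged sketch; the model arguments and tolerance here are hypothetical, not the actual code in stateless_llama_test.py):

```python
import torch

def assert_modified_llama_matches(regular_model, modified_model, input_ids, atol=1e-5):
    # Run the unmodified llama and the StreamingLLM-modified llama on the same
    # prompt and check that the logits agree within tolerance.
    with torch.no_grad():
        expected = regular_model(input_ids=input_ids).logits
        actual = modified_model(input_ids=input_ids).logits
    assert torch.allclose(expected, actual, atol=atol), "modified llama diverged from the reference"
```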

@raikonenfnu (Member, Author) commented:

> Cool. Can you add two unit tests for the framework changes?

Thanks for taking a break from your break to look at this, Stella! I added the unit tests for the framework changes.

@raikonenfnu raikonenfnu changed the title [WIP][Llama2] Add KVCache for prefill stage + Simple interactive app runner. [WIP][Llama2] Add KVCache for prefill stage + interactive chat mode in llm_runner + StreamingLLM. Jan 5, 2024
@raikonenfnu raikonenfnu requested a review from IanNod January 5, 2024 17:48
@dan-garvey (Member) left a comment:

looks great! thanks!

@raikonenfnu raikonenfnu merged commit 432fa0d into nod-ai:main Jan 5, 2024
4 checks passed