Local LLM performance issue #59
Replies: 1 comment
@dg6546 Thank you for sharing this challenge! llama.cpp might not be ideal for a lot of on-device use cases, and we don't do a great job optimizing for this in the current state; it would be great to explore the new improvements that Apple shipped with iOS 18 and use Core ML and other Apple technologies for local LLM execution. You can find documentation in the SpeziLLM repo on how to load a model file from anywhere within your app and without a UI. Feel free to create any issues that you encounter in the SpeziLLM repo; PRs and contributions are always more than welcome!
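For illustration, here is a minimal sketch of what loading a model file programmatically, without any SpeziLLM UI component, can look like. It assumes the llama.cpp-based `SpeziLLMLocal` API (`LLMRunner` resolving an `LLMLocalSchema(modelPath:)` to an `LLMLocalSession`); the exact API surface may differ in the current release, so please double-check against the documentation:

```swift
import SpeziLLM
import SpeziLLMLocal
import SwiftUI

struct LocalLLMDemoView: View {
    // `LLMRunner` is injected into the environment by the Spezi `Configuration`.
    @Environment(LLMRunner.self) var runner
    @State private var responseText = ""

    var body: some View {
        Text(responseText)
            .task {
                do {
                    // Point the schema at any GGUF file bundled with, or previously
                    // downloaded by, the app; no SpeziLLM UI component is required.
                    let session: LLMLocalSession = runner(
                        with: LLMLocalSchema(
                            modelPath: URL(fileURLWithPath: "/path/to/model.gguf")  // placeholder path
                        )
                    )

                    // Stream the generated tokens into the view.
                    for try await token in try await session.generate() {
                        responseText.append(token)
                    }
                } catch {
                    responseText = "Generation failed: \(error)"
                }
            }
    }
}
```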
What Stanford Spezi module is your challenge related to?
SpeziML
Description
Hi, my team is interested in your framework, and I am currently exploring it, especially the LLM features.
I am new to this framework, so I have some questions.
I was trying out the local LLM module, and token generation is very slow, even with a Q2_K Llama 2 model (up to 1 minute per token in the worst case). Before integrating with ChatView, the performance was much better. I am 80% sure that it is not a hardware issue, since I have tried it on both my phone and the simulator. May I request a code review of my sample project to check whether I used the wrong functions or configuration?
I also observed that the model is only loaded when the first user prompt is sent. Is there any way to load the model beforehand (e.g., when the app launches)?
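To illustrate what I mean, here is a sketch that resolves the session eagerly when the view appears. I do not know whether constructing the session actually loads the weights at this point, or whether they load lazily on the first `generate()` call; that is exactly my question:

```swift
import SpeziLLM
import SpeziLLMLocal
import SwiftUI

struct PreloadingChatView: View {
    @Environment(LLMRunner.self) var runner
    @State private var session: LLMLocalSession?

    var body: some View {
        Text(session == nil ? "Loading model…" : "Model ready")
            .task {
                // Resolve the schema to a session as soon as the view appears,
                // rather than on the first user prompt. The path is a placeholder
                // for wherever the app stores its GGUF file.
                let modelURL = URL(fileURLWithPath: "/path/to/model.gguf")
                session = runner(with: LLMLocalSchema(modelPath: modelURL))
            }
    }
}
```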
I have also considered a future use case: updating the LLM model. The onboarding view seems like the best choice for downloading the model the first time; however, it does not support going back to a previous step. Currently I have implemented a remove-model button in the chat view, with a bound Boolean choosing whether to show the chat view or the download view, as sketched below. Is there a better practice for doing this?
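This is roughly my current setup (a minimal, self-contained sketch; `ChatScreen` and `DownloadScreen` are placeholders for my actual views):

```swift
import SwiftUI

/// Switches between the chat UI and the model-download UI based on a single Boolean.
struct ModelContainerView: View {
    // In a real app this could be derived from a FileManager check for the model file.
    @State private var modelAvailable = false

    var body: some View {
        if modelAvailable {
            ChatScreen(onRemoveModel: { modelAvailable = false })
        } else {
            DownloadScreen(onDownloadFinished: { modelAvailable = true })
        }
    }
}

// Minimal stand-ins so the sketch compiles on its own.
struct ChatScreen: View {
    let onRemoveModel: () -> Void
    var body: some View {
        Button("Remove model", action: onRemoveModel)
    }
}

struct DownloadScreen: View {
    let onDownloadFinished: () -> Void
    var body: some View {
        Button("Download model", action: onDownloadFinished)
    }
}
```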
I have uploaded the sample project to GitHub; please find the repository at the link below or via my profile.
Thank you so much for maintaining an open-source project and answering my questions!
Reproduction
https://github.com/dg6546/spezillm_demo
Expected behavior
Is faster generation possible?
Additional context
N/A
Code of Conduct
I agree to follow this project's Code of Conduct.