
Real-time Stream Based AI Assistant #20

Open
lucasjinreal opened this issue Jan 22, 2025 · 5 comments

Comments

@lucasjinreal (Owner) commented Jan 22, 2025

Hello, this is one of my initial proposals for implementing a real-time, stream-based AI assistant powered by pure Rust. Given Kokoro's significant role in text-to-speech and the rapid evolution of Large Language Models (LLMs), here are my thoughts on how to achieve this. I will present the model selection and the overall architecture. If you are interested, please comment below and share how you can contribute; together, we can build it. The ultimate goal could be a terminal voice AI assistant as a prototype.

Goal

A voice-based AI assistant (agent). It will possess voice-understanding (ASR+) and Text-to-Speech (TTS) capabilities (currently mainly in Chinese, with stream mode). In addition to perception (hearing and speaking), it can have the following abilities:

  • Calling tools such as your file explorer, calendar, computer browser, etc.
  • Having memories, not in the form of Retrieval-Augmented Generation (RAG), but through memory extraction of some of your main ideas (similar to short, shared memories).
  • Having an interface to control more things, such as your home intelligent devices.

With these three main goals, I believe this will be an assistant that lives with you, understands you, and helps you with many daily tasks.
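To make the first ability (tool calling) concrete, here is a minimal Rust sketch of how the LLM layer might dispatch to registered tools. All of the names here (`Tool`, `ToolRegistry`, the `Echo` stub) are hypothetical placeholders for illustration, not an existing API.

```rust
use std::collections::HashMap;

/// A capability the assistant can invoke (file explorer, calendar, browser, ...).
trait Tool {
    fn name(&self) -> &'static str;
    fn call(&self, args: &str) -> Result<String, String>;
}

/// Trivial echo tool, used only to demonstrate dispatch.
struct Echo;
impl Tool for Echo {
    fn name(&self) -> &'static str { "echo" }
    fn call(&self, args: &str) -> Result<String, String> { Ok(args.to_string()) }
}

/// Registry the LLM layer consults when it decides to call a tool.
struct ToolRegistry {
    tools: HashMap<&'static str, Box<dyn Tool>>,
}

impl ToolRegistry {
    fn new() -> Self { Self { tools: HashMap::new() } }
    fn register(&mut self, tool: Box<dyn Tool>) {
        self.tools.insert(tool.name(), tool);
    }
    fn dispatch(&self, name: &str, args: &str) -> Result<String, String> {
        self.tools
            .get(name)
            .ok_or_else(|| format!("unknown tool: {name}"))
            .and_then(|t| t.call(args))
    }
}

fn main() {
    let mut reg = ToolRegistry::new();
    reg.register(Box::new(Echo));
    assert_eq!(reg.dispatch("echo", "hi"), Ok("hi".to_string()));
    assert!(reg.dispatch("calendar", "today").is_err());
    println!("tool dispatch ok");
}
```

Using trait objects keeps the registry open: a calendar or home-device tool would just be another `impl Tool` registered at startup.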

On the engineering side, two rules should be followed:

  • Models need to be a combination of cloud and local. Tiny models, such as Voice Activity Detection (VAD), Automatic Speech Recognition (ASR), and TTS, should run fast locally.
  • Agents should be reusable.
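The cloud/local rule could be encoded as a simple routing function. This is only an illustrative sketch; the types and model paths (e.g. `models/kokoro.onnx`, the endpoint URL) are made-up placeholders.

```rust
/// Where a model runs: tiny latency-critical models (VAD, ASR, TTS) stay
/// local; the large LLM may live behind a cloud endpoint.
#[derive(Debug, PartialEq)]
enum Backend {
    Local { model_path: &'static str },
    Cloud { endpoint: &'static str },
}

/// The pipeline stages that need to be placed on a backend.
#[derive(Clone, Copy)]
enum Stage { Vad, Asr, Tts, Llm }

/// Routing rule: everything except the big LLM runs locally.
/// (Paths and endpoint below are hypothetical examples.)
fn route(stage: Stage) -> Backend {
    match stage {
        Stage::Vad => Backend::Local { model_path: "models/vad.onnx" },
        Stage::Asr => Backend::Local { model_path: "models/asr.onnx" },
        Stage::Tts => Backend::Local { model_path: "models/kokoro.onnx" },
        Stage::Llm => Backend::Cloud { endpoint: "https://api.example.com/v1" },
    }
}

fn main() {
    assert_eq!(route(Stage::Vad), Backend::Local { model_path: "models/vad.onnx" });
    assert_eq!(route(Stage::Llm), Backend::Cloud { endpoint: "https://api.example.com/v1" });
    println!("routing ok");
}
```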

Checkpoints

  • Stage 1: A workable version that stitches components such as ASR, LLM, and TTS.
  • Stage 2: An audio model that combines LLM and audio encoder to understand audio input and perform TTS.
  • Stage 3: An end-to-end multimodal model, similar to GPT-4o, that can understand voice and speak with a clear, natural, and expressive voice.
  • Stage 4: Become a Human Experience Replicator (HER).
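Stage 1 (stitching ASR, LLM, and TTS) can be sketched as a generic pipeline over three traits. Everything here is a placeholder under assumed interfaces; the stubs only exist so the shape of one conversational turn is clear.

```rust
// Hypothetical stage interfaces; real implementations would wrap actual models.
trait Asr { fn transcribe(&self, pcm: &[f32]) -> String; }
trait Llm { fn reply(&self, prompt: &str) -> String; }
trait Tts { fn synthesize(&self, text: &str) -> Vec<f32>; }

/// Stage-1 pipeline: audio in -> text -> reply text -> audio out.
struct Pipeline<A: Asr, L: Llm, T: Tts> { asr: A, llm: L, tts: T }

impl<A: Asr, L: Llm, T: Tts> Pipeline<A, L, T> {
    /// One conversational turn.
    fn turn(&self, pcm: &[f32]) -> Vec<f32> {
        let text = self.asr.transcribe(pcm);
        let answer = self.llm.reply(&text);
        self.tts.synthesize(&answer)
    }
}

// Stub implementations so the sketch compiles and can be exercised.
struct StubAsr;
struct StubLlm;
struct StubTts;
impl Asr for StubAsr { fn transcribe(&self, _pcm: &[f32]) -> String { "hello".to_string() } }
impl Llm for StubLlm { fn reply(&self, prompt: &str) -> String { format!("you said: {prompt}") } }
impl Tts for StubTts { fn synthesize(&self, text: &str) -> Vec<f32> { vec![0.0; text.len()] } }

fn main() {
    let p = Pipeline { asr: StubAsr, llm: StubLlm, tts: StubTts };
    let out = p.turn(&[0.0; 160]);
    assert_eq!(out.len(), "you said: hello".len());
    println!("stage-1 pipeline ok");
}
```

A real stream-mode version would swap the `&[f32]` buffers for channels of audio chunks, but the stage boundaries stay the same.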

Leave comments below and let me know your ideas.


Useful links

  1. SLAM-Omni: https://github.com/X-LANCE/SLAM-LLM — pure end-to-end, though I'm not sure what's under the hood.
lucasjinreal pinned this issue Jan 22, 2025
@lucasjinreal (Owner, Author)

output2_added_subtitle.mp4

This is how it looks for now.

@devilankur18

@lucasjinreal Sounds really interesting. Why are you limiting this to Rust? Do you have any thoughts on running it in the browser using WASM / ONNX?

@lucasjinreal (Owner, Author)

@devilankur18 Pleased to learn that you are interested in this topic.

Why limit it to Rust? There are several reasons:

  • I aim to deploy this "model" or an intelligent "hub" more easily. Python may save time in development, but it becomes truly annoying when dealing with large projects. Rust deployment can be a single binary file.
  • I believe that in some scenarios, Rust can be much faster.
  • Rust compiles readily to WebAssembly (wasm), which makes it extremely easy to build for and run on any platform.

Regarding the model part, it may not have just one model, so it could run through ONNX or Candle.

@devilankur18

@lucasjinreal I was trying to find some Rust benchmarks using Candle / Burn and tried the browser examples, but I'm not sure the gains are significant as of today. I'm pretty new to Rust ML. Do you have any benchmarks for LLM models?

Also let me know where I can be of help.

@lucasjinreal (Owner, Author)

@devilankur18 I think the way to compare is to run the same model, such as Qwen 2B, with both llama.cpp and Candle and measure the time consumed. Burn is mainly used for training. I would also like to see your results; hoping to see your updates on this!
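A minimal timing harness in plain Rust could drive such a comparison. This is only a sketch: `bench` and the dummy workload are made up here, and a real run would put the llama.cpp and Candle token-generation calls behind the closure instead.

```rust
use std::time::Instant;

/// Time a closure over `iters` runs and report the mean in milliseconds.
fn bench<F: FnMut()>(label: &str, iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    let mean_ms = start.elapsed().as_secs_f64() * 1000.0 / iters as f64;
    println!("{label}: {mean_ms:.3} ms/iter");
    mean_ms
}

fn main() {
    // Placeholder workload standing in for one token-generation step;
    // black_box keeps the optimizer from deleting it.
    let mean = bench("dummy-workload", 100, || {
        let v: Vec<u64> = (0..10_000).collect();
        std::hint::black_box(v.iter().sum::<u64>());
    });
    assert!(mean >= 0.0);
}
```

Running the same closure body against each backend (same model, same prompt, same token count) would give a like-for-like ms/token comparison.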
