A voice-controlled operating system that is general-purpose, low-latency, transparent, and user-friendly, with search and screen-analysis capabilities.
conda activate berkos
export OPENAI_API_KEY=<OPENAI_API_KEY>
export MISTRAL_API_KEY=<MISTRAL_API_KEY>
export NVIDIA_API_KEY=<NVIDIA_API_KEY>
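Because the processor calls three different providers, a missing key tends to surface only when the first `ANALYSIS` runs. The following is a minimal sketch of a start-up check, not part of LLMaOS itself; the helper name `missing_api_keys` is hypothetical, and only the three variable names come from the setup commands above.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "MISTRAL_API_KEY", "NVIDIA_API_KEY")


def missing_api_keys(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to the real process environment; passing a dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_KEYS if not env.get(name)]
```

A launcher could call `missing_api_keys()` once and exit with a readable message if the returned list is non-empty.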
-
User interface: This (application) layer handles voice control and is powered by OpenAI's RealTime API. During a RealTime session, user transcripts are decoded and sent to the Assembler for code generation.
-
Assembler: The purpose of this (operating system-compiler) layer is to generate "assembly-like" instructions for the processor. Such instructions include any non-dangerous UNIX commands, device commands (`LEFT_CLICK x y`, `KEYBOARD string`) and the screen-processing command `ANALYSIS`.
-
Processor: The processor executes the instructions generated by the assembler. For example, the special instruction `ANALYSIS` takes a screenshot and uses three AI models (NVIDIA's NeVA, Mistral AI's Pixtral, OpenAI's GPT) in parallel (via Python's asyncio) to extract information from it. The collated information is fed back to both the assembler and RealTime. This layer of LLMaOS deviates from traditional computer architecture in that instructions are generated on the fly. For example, `ANALYSIS` must run on a screenshot before the `x` and `y` arguments of the next `LEFT_CLICK` instruction can be determined. Much like a motherboard, the processor can offload tasks to large models' API endpoints, analogous to specialised hardware accelerators.
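The parallel fan-out behind `ANALYSIS` can be sketched with `asyncio.gather`. This is an illustrative sketch under stated assumptions, not the project's actual code: `ask_model` is a hypothetical stand-in that simulates a vision-model request, and the real provider calls, prompts, and collation format are not shown in this README.

```python
import asyncio

# Model labels taken from the description above.
MODELS = ("NeVA", "Pixtral", "GPT")


async def ask_model(name: str, screenshot: bytes) -> str:
    """Hypothetical stand-in for one vision-model API call."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"{name}: description of {len(screenshot)}-byte screenshot"


async def analysis(screenshot: bytes) -> str:
    """Query all models concurrently and collate their answers."""
    results = await asyncio.gather(
        *(ask_model(name, screenshot) for name in MODELS),
        return_exceptions=True,  # one failing model should not sink the rest
    )
    # gather preserves argument order, so answers line up with MODELS.
    answers = [r for r in results if not isinstance(r, BaseException)]
    return "\n".join(answers)
```

Running `asyncio.run(analysis(screenshot_bytes))` waits roughly as long as the slowest model rather than the sum of all three, which is the point of issuing the requests in parallel.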
(1) "Play me the song Espresso"
- LLMaOS launches Chrome, enters youtube.com in the URL bar, enters Espresso in the YouTube search bar, clicks the first non-ad entry, clicks Skip Ads, and enters fullscreen.
(2) "What is the score between Manchester City and Real Madrid?"
- LLMaOS launches Chrome, enters google.com in the URL bar, enters "Man City vs Real Madrid" in the search bar, analyses the screen, and tells you the score.
(3) "When is the next Codeforces contest?"
- LLMaOS launches Chrome, enters codeforces.com in the URL bar, analyses the screen, and tells you the time of the next contest.
LLMaOS is voice-controlled, transparent (you keep a log of its "assembly-level" instructions and can see what it is doing), and possesses screen-processing capabilities.
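To illustrate how an "assembly-level" log could work, here is a minimal sketch that parses instruction lines and records each one before it would be executed. The wire format (one instruction per line, opcode followed by space-separated arguments) and the names `Instruction`, `parse_instruction`, and `assemble_with_log` are assumptions made for this example; the filter that rejects dangerous UNIX commands is not shown.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    opcode: str
    args: tuple


def parse_instruction(line: str) -> Instruction:
    """Parse one instruction line (assumed format: OPCODE [args...])."""
    parts = line.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty instruction")
    opcode = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    if opcode == "LEFT_CLICK":
        x, y = rest.split()  # raises ValueError unless exactly two arguments
        return Instruction(opcode, (int(x), int(y)))
    if opcode == "KEYBOARD":
        return Instruction(opcode, (rest,))  # keep the typed string intact
    if opcode == "ANALYSIS":
        return Instruction(opcode, ())
    # Anything else is treated as a UNIX command in this sketch.
    return Instruction("UNIX", tuple(shlex.split(line)))


def assemble_with_log(lines, log):
    """Parse each line and append a numbered, human-readable entry to `log`."""
    program = []
    for line in lines:
        instr = parse_instruction(line)
        log.append(f"{len(log):04d} {instr.opcode} {' '.join(map(str, instr.args))}".rstrip())
        program.append(instr)
    return program
```

Writing the log entry before execution means the record still shows the last attempted instruction if a step fails.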