AI Integration Plan #9

@MrScripty

Development Proposal: AI-Powered Text Generation (LLM Integration) for Studio-Whip

1. Introduction & Purpose

This proposal outlines the plan to integrate Large Language Model (LLM) capabilities into the Studio-Whip application. The primary goal is to empower users with AI-assisted text generation directly within their creative workflows, such as scriptwriting and story development. This feature will allow users to leverage local LLMs to generate, continue, or modify text content seamlessly within Studio-Whip's existing text editing interface.

The initial implementation will focus on supporting GGUF-formatted LLMs (e.g., Qwen3, Gemma3, QWQ, Mistral) via the llama-cpp-2 library, targeting NVIDIA CUDA-enabled GPUs and CPU inference. User experience is paramount, with features like real-time streaming of generated text and the ability to instantly cancel ongoing generations.

2. Goals & Scope

  • Core Functionality:
    • Load user-specified GGUF LLM models.
    • Perform text generation based on user prompts within existing Studio-Whip text objects.
    • Stream generated text tokens in real-time into the target text object.
    • Allow users to cancel ongoing text generation requests, with cancellation perceived as instantaneous.
  • Technical Scope (Phase 1):
    • Integration of the llama-cpp-2 Rust library.
    • Support for CUDA (NVIDIA GPU) and CPU-based inference.
    • Configuration of models via TOML files in a user/ai_models/ directory.
    • Robust error handling and logging using bevy_log.
  • Out of Scope (Phase 1):
    • Advanced prompt engineering UI (initial input will be simple text prompts).
    • Concurrent generation from multiple LLMs (generation will be sequential initially).
    • Vulkan compute backend for llama.cpp (deferred; would enable cross-platform GPU support).
    • Support for other AI modalities (image, audio, etc. – this LLM system will serve as a foundational pattern).

3. Proposed Architecture & Design

The AI functionality will be encapsulated within a new, dedicated Rust module (src/ai/) within the existing rusty_whip crate. This module will integrate with the Bevy application primarily through Bevy's ECS (Entity Component System) and event system.

3.1. Core Components of the AI Module:

  • AiPlugin (Bevy Plugin):
    • Purpose: Initializes AI-related resources and systems.
    • Functionality: Manages the lifecycle of the AI module within the Bevy app.
  • AiModelManager (Bevy Resource):
    • Purpose: Manages the loading, unloading, and access to AI models.
    • Functionality:
      • Discovers available models from user configuration files.
      • Handles asynchronous model loading (to prevent UI freezes).
      • Tracks loaded models (HashMap<ModelId, Arc<dyn LlmModel>>).
      • Manages active generation tasks and their cancellation flags (HashMap<Uuid, Arc<AtomicBool>>).
      • Provides an interface for other systems to request model operations.
  • Model Abstractions (ai::llm::model):
    • LlmModel Trait: Defines a common interface for LLMs (e.g., stream_generate, metadata); see the sketch after this list.
    • LlamaCppModel Struct: Implements LlmModel using the llama-cpp-2 backend. Handles the specifics of token-by-token generation and cancellation polling.
  • Backend Wrapper (ai::backends::llama_cpp_2):
    • Purpose: Provides a safe and ergonomic Rust interface over the llama-cpp-2 library.
    • Functionality: Manages llama.cpp model and context lifecycles, parameter translation, GPU offloading, and the core inference loop.
  • Event System (ai::events):
    • Purpose: Decouples AI operations from direct GUI calls, enabling asynchronous processing.
    • Key Events:
      • LlmLoadRequest: Signals a request to load a model.
      • LlmLoadResult: Reports the outcome of a model load attempt.
      • LlmGenerateRequest: Signals a request to generate text.
      • LlmCancelRequest: Signals a request to cancel an ongoing generation.
      • LlmTokenStreamEvent: Carries individual generated tokens to be appended to the UI.
      • LlmGenerationComplete: Signals the end or failure of a generation task.
  • Configuration (user/ai_models/*.toml & ai::llm::config):
    • Purpose: Allows users to define which models to load and their specific parameters (e.g., GGUF path, GPU layers).
    • LlmModelUserConfig Struct: Deserializes these TOML files.
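
To make the abstractions above concrete, here is a minimal sketch of the LlmModel trait and the key request/streaming events. None of these types exist yet, so the exact fields, the error variants, and the use of the uuid and thiserror crates are illustrative assumptions; the sketch also assumes a recent Bevy where events use #[derive(Event)].

```rust
use std::sync::{atomic::AtomicBool, mpsc, Arc};

use bevy::prelude::*;
use uuid::Uuid;

pub type ModelId = String;

/// Basic descriptive data reported by a loaded model.
#[derive(Debug, Clone)]
pub struct ModelMetadata {
    pub id: ModelId,
    pub context_length: usize,
}

/// Placeholder error type (ai::error); variants are illustrative.
#[derive(Debug, thiserror::Error)]
pub enum AiError {
    #[error("model load failed: {0}")]
    LoadFailed(String),
    #[error("generation cancelled")]
    Cancelled,
}

/// Common interface implemented by every LLM backend (e.g. LlamaCppModel).
pub trait LlmModel: Send + Sync {
    fn metadata(&self) -> &ModelMetadata;

    /// Generate tokens for `prompt`, pushing each one through `token_tx`
    /// and polling `cancel_flag` between tokens.
    fn stream_generate(
        &self,
        prompt: &str,
        token_tx: mpsc::Sender<Result<String, AiError>>,
        cancel_flag: Arc<AtomicBool>,
    ) -> Result<(), AiError>;
}

/// Request to generate text into an existing EditableText entity.
#[derive(Event)]
pub struct LlmGenerateRequest {
    pub request_id: Uuid,
    pub model_id: ModelId,
    pub prompt: String,
    pub target_yrs_text_entity: Entity,
}

/// One generated token, forwarded to the Yrs document on the main thread.
#[derive(Event)]
pub struct LlmTokenStreamEvent {
    pub request_id: Uuid,
    pub target_yrs_text_entity: Entity,
    pub token: String,
}

/// Cancel an in-flight generation by request id.
#[derive(Event)]
pub struct LlmCancelRequest {
    pub request_id: Uuid,
}
```

Keeping the trait separate from the events means LlamaCppModel can later be swapped for other backends without touching the GUI-facing event surface.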

3.2. Interaction with GUI Framework:

The AI module will interface with the existing Bevy-based GUI framework as follows:

  1. Request Initiation (GUI -> AI):
    • A user action in the GUI (e.g., clicking a "Generate" button associated with an EditableText entity, or a hotkey) triggers a GUI system.
    • This GUI system gathers the prompt (from the Yrs data of the target EditableText), the target Entity ID, and other parameters.
    • It then sends an LlmGenerateRequest Bevy event.
    • The GUI system can also update its state (e.g., show a loading spinner, add an AwaitingAiResponse component to the target entity).
  2. Asynchronous AI Processing:
    • The AiPlugin's handle_llm_generate_requests_system receives the request.
    • It uses the AiModelManager to get the specified LlmModel.
    • An asynchronous Bevy task is spawned to perform the generation via LlmModel::stream_generate (see the sketch after this list). This task includes:
      • An mpsc channel for sending generated tokens back.
      • An Arc<AtomicBool> cancellation flag, stored by the AiModelManager.
    • The LlamaCppModel's stream_generate implementation calls llama-cpp-2 in a loop, polling the cancellation flag after each token.
  3. Streaming Results (AI -> GUI via Yrs):
    • As the async task receives tokens from llama-cpp-2 (via the mpsc channel within stream_generate), it queues LlmTokenStreamEvent data using a shared, thread-safe queue (e.g., AsyncAiTaskOutputs resource).
    • A Bevy system (forward_async_ai_events_system) on the main thread drains this queue and sends the actual LlmTokenStreamEvent Bevy events.
    • The apply_llm_tokens_to_yrs_system (in AiPlugin) listens for LlmTokenStreamEvent.
      • It retrieves the YrsDocResource and the target yrs::TextRef (identified by event.target_yrs_text_entity from the text_map in YrsDocResource).
      • It appends the event.token to the yrs::TextRef.
      • Crucially, it then sends a YrsTextChanged { entity: event.target_yrs_text_entity } event.
  4. GUI Update (Reactive via Yrs):
    • Studio-Whip's existing text_layout_system (in gui_framework::plugins::core.rs) already listens for YrsTextChanged events.
    • Upon receiving this event, it re-layouts the text, and the custom Vulkan renderer displays the updated content in the next frame. This provides real-time streaming without direct AI-to-renderer calls.
  5. Cancellation (GUI -> AI -> Task):
    • User clicks a "Cancel" button.
    • GUI system sends LlmCancelRequest { request_id }.
    • handle_llm_cancel_requests_system (in AiPlugin) finds the request_id's Arc<AtomicBool> in AiModelManager and sets it to true.
    • The LlamaCppModel::stream_generate loop detects the flag and terminates, sending an LlmGenerationComplete event with a "Cancelled" status.
  6. Completion/Error Handling (AI -> GUI):
    • When generation finishes (normally, cancelled, or error), the async task queues an LlmGenerationComplete event.
    • forward_async_ai_events_system sends this Bevy event.
    • GUI systems listen for LlmGenerationComplete to update UI (hide spinner, show error messages, remove AwaitingAiResponse component).
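
The sketch below ties steps 2–4 together, reusing the types from the section 3.1 sketch and assuming a recent Bevy (EventReader::read, AsyncComputeTaskPool). AiModelManager::register_generation, the internal shape of AsyncAiTaskOutputs and YrsDocResource (a doc plus the text_map), and running the blocking llama.cpp loop on its own thread while the task drains the channel are illustrative choices, not settled API.

```rust
use std::sync::{atomic::AtomicBool, mpsc, Arc, Mutex};

use bevy::prelude::*;
use bevy::tasks::AsyncComputeTaskPool;
use yrs::{Text, Transact};

// AiModelManager, YrsDocResource, and YrsTextChanged are the project's existing
// types; their fields/methods as used here are assumptions.

/// Thread-safe queue the async task writes into and the main thread drains.
#[derive(Resource, Default)]
pub struct AsyncAiTaskOutputs(pub Arc<Mutex<Vec<LlmTokenStreamEvent>>>);

pub fn handle_llm_generate_requests_system(
    mut requests: EventReader<LlmGenerateRequest>,
    mut manager: ResMut<AiModelManager>,
    outputs: Res<AsyncAiTaskOutputs>,
) {
    for request in requests.read() {
        let Some(model) = manager.models.get(&request.model_id).cloned() else { continue };

        // Assumed helper: stores the flag in active_generations so that
        // handle_llm_cancel_requests_system can find it by request id.
        let cancel_flag = Arc::new(AtomicBool::new(false));
        manager.register_generation(request.request_id, cancel_flag.clone());

        let (prompt, target, request_id) =
            (request.prompt.clone(), request.target_yrs_text_entity, request.request_id);
        let queue = outputs.0.clone();

        AsyncComputeTaskPool::get()
            .spawn(async move {
                let (token_tx, token_rx) = mpsc::channel();
                // The blocking llama.cpp loop runs on its own thread; this task
                // drains tokens as they arrive and queues them for the main thread.
                let worker = std::thread::spawn(move || {
                    model.stream_generate(&prompt, token_tx, cancel_flag)
                });
                for result in token_rx {
                    if let Ok(token) = result {
                        queue.lock().unwrap().push(LlmTokenStreamEvent {
                            request_id,
                            target_yrs_text_entity: target,
                            token,
                        });
                    }
                }
                let _ = worker.join();
            })
            .detach();
    }
}

/// Main-thread system: drains the queue and emits real Bevy events.
pub fn forward_async_ai_events_system(
    outputs: Res<AsyncAiTaskOutputs>,
    mut writer: EventWriter<LlmTokenStreamEvent>,
) {
    for event in outputs.0.lock().unwrap().drain(..) {
        writer.send(event);
    }
}

/// Appends each streamed token to the target yrs::TextRef and notifies the GUI.
pub fn apply_llm_tokens_to_yrs_system(
    mut tokens: EventReader<LlmTokenStreamEvent>,
    yrs_doc: Res<YrsDocResource>,
    mut changed: EventWriter<YrsTextChanged>,
) {
    for event in tokens.read() {
        if let Some(text_ref) = yrs_doc.text_map.get(&event.target_yrs_text_entity) {
            let mut txn = yrs_doc.doc.transact_mut();
            text_ref.push(&mut txn, &event.token);
        }
        // The existing text_layout_system reacts to this and re-layouts the text.
        changed.send(YrsTextChanged { entity: event.target_yrs_text_entity });
    }
}
```

Because only apply_llm_tokens_to_yrs_system touches the Yrs document, and it runs on the main thread, no extra locking is needed around the CRDT itself.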

3.3. Reasoning Behind Design Decisions:

  • Modularity (ai module): Keeps AI concerns separate, facilitating future expansion to other AI types (image, audio) using similar patterns.
  • Bevy Events: Provides loose coupling between GUI and AI, essential for asynchronous operations and testability. Aligns with existing GUI framework patterns.
  • Yrs for Text Streaming: Leverages Studio-Whip's existing CRDT infrastructure for efficient, real-time, and potentially collaborative text updates. The AI module simply "pushes" data into Yrs, and the GUI reacts.
  • llama-cpp-2: Chosen for its active development, focus on staying current with llama.cpp, and direct C++ bindings suitable for Rust.
  • Asynchronous Tasks (AsyncComputeTaskPool): Prevents UI freezes during model loading and inference, crucial for good UX.
  • AtomicBool for Cancellation: A standard, lightweight mechanism for signalling cancellation to long-running tasks, ensuring responsiveness.
  • User Configuration (TOML): Simple, human-readable way for users to manage their local models.

4. Implementation Plan & Actionable Steps

The implementation will be phased. Each step should be testable.

Phase 1.0: AI Module Skeleton & Configuration

  1. Create Directory Structure:
    • Create src/ai/ with mod.rs, common.rs, error.rs, events.rs.
    • Create src/ai/llm/ with mod.rs, model.rs, config.rs.
    • Create src/ai/backends/ with mod.rs.
    • Create src/ai/backends/llama_cpp_2/ with mod.rs.
    • Create user/ai_models/ directory.
  2. Define Core Types & Events:
    • Implement structs/enums in ai::common.rs (ModelId, ModelType, InferenceDevice, ModelLoadConfig, ModelMetadata).
    • Implement AiError in ai::error.rs.
    • Implement event structs in ai::events.rs (all Llm* events, including LlmCancelRequest).
    • Implement LlmModelUserConfig in ai::llm::config.rs for TOML deserialization (see the example after this list).
  3. AiPlugin and AiModelManager (Basic Structure):
    • Create AiPlugin in ai::mod.rs.
    • Create AiModelManager resource in ai::manager.rs with empty HashMaps for models and active generations.
    • Implement the discover_and_request_model_loads_system to scan user/ai_models/ and send LlmLoadRequest events (initially, these requests won't be fully processed).
    • Add AiPlugin to main.rs.
  4. Testing:
    • Verify the plugin loads.
    • Create a sample model.toml file.
    • Verify discover_and_request_model_loads_system runs and sends LlmLoadRequest events (log the events).
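
A minimal sketch of what LlmModelUserConfig and a user/ai_models/*.toml file might look like, assuming the serde and toml crates; the field names (name, model_path, gpu_layers, context_length) and defaults are placeholders, not a final schema.

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct LlmModelUserConfig {
    /// Display name, also used as the ModelId.
    pub name: String,
    /// Path to the GGUF file, relative to user/ai_models/ or absolute.
    pub model_path: String,
    /// Number of layers to offload to the GPU (0 = CPU-only).
    #[serde(default)]
    pub gpu_layers: u32,
    /// Context window to request when creating the llama.cpp context.
    #[serde(default = "default_context_length")]
    pub context_length: u32,
}

fn default_context_length() -> u32 {
    4096
}

fn main() {
    // Example of what a user/ai_models/*.toml file might contain.
    let example = r#"
        name = "mistral-7b-instruct"
        model_path = "mistral-7b-instruct-q4_k_m.gguf"
        gpu_layers = 32
        context_length = 8192
    "#;
    let config: LlmModelUserConfig = toml::from_str(example).expect("valid model TOML");
    println!("{config:?}");
}
```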

Phase 1.1: llama-cpp-2 Backend Wrapper & Model Loading

  1. Add llama-cpp-2 Dependency:
    • Add llama-cpp-2 to Cargo.toml with appropriate feature flags for CPU and CUDA (e.g., features = ["cuda"] if building for CUDA).
    • Ensure git submodule update --init --recursive is run if llama-cpp-2 is a git submodule or if its dependencies require it.
  2. Implement ai::backends::llama_cpp_2::mod.rs (see the sketch after this list):
    • Write wrapper functions to:
      • Initialize llama.cpp backend.
      • Load a GGUF model using llama_cpp_2::LlamaModel::load_from_file based on ModelLoadConfig (path, gpu_layers).
      • Handle potential errors from llama-cpp-2 and convert them to AiError.
  3. Implement LlmModel Trait and LlamaCppModel:
    • Define the LlmModel trait in ai::llm::model.rs.
    • Implement LlamaCppModel struct holding the loaded llama_cpp_2::LlamaModel.
    • Implement the metadata and estimate_vram/estimate_ram methods (initially placeholders, or read from GGUF metadata if the API allows).
  4. Enhance AiModelManager and Loading Systems:
    • Implement handle_model_load_requests_system:
      • Spawn an async Bevy task.
      • Task uses the llama_cpp_2 wrapper to load the model.
      • Task sends results (model instance or error) back to the main thread (e.g., via AsyncAiTaskOutputs queue).
    • Implement process_model_load_results_system:
      • Drains the result queue.
      • If successful, creates Arc<LlamaCppModel> and stores it in AiModelManager.models.
      • Sends LlmLoadResult Bevy event.
  5. Testing:
    • Place a small GGUF model (e.g., a tiny test model or a small Mistral quant) and its TOML config in user/ai_models/.
    • Run the app. Verify (via logs and LlmLoadResult events) that the model loads successfully on CPU.
    • If CUDA is set up, test GPU offloading by setting gpu_layers in the TOML. Verify llama.cpp logs indicate GPU usage.
    • Test error handling for incorrect paths or corrupted GGUF files.
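
A hedged sketch of the loading wrapper from step 2. llama-cpp-2's module paths and parameter-builder methods differ between crate versions, so the LlamaModelParams calls below should be checked against the version pinned in Cargo.toml; the fields of ModelLoadConfig are assumed from the description above.

```rust
use std::path::Path;

use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::{params::LlamaModelParams, LlamaModel};

use crate::ai::common::ModelLoadConfig;
use crate::ai::error::AiError;

/// Load a GGUF model, optionally offloading `gpu_layers` layers to the GPU.
pub fn load_gguf_model(
    backend: &LlamaBackend,
    config: &ModelLoadConfig,
) -> Result<LlamaModel, AiError> {
    let mut params = LlamaModelParams::default();
    if config.gpu_layers > 0 {
        // Assumed builder method; only meaningful when built with the "cuda" feature.
        params = params.with_n_gpu_layers(config.gpu_layers);
    }
    LlamaModel::load_from_file(backend, Path::new(&config.model_path), &params)
        .map_err(|e| AiError::LoadFailed(format!("{}: {e}", config.model_path)))
}
```

The LlamaBackend handle (from LlamaBackend::init()) would likely be created once, for example when AiPlugin builds, and shared with every load and context.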

Phase 1.2: Text Generation Streaming & Cancellation

  1. Implement LlamaCppModel::stream_generate (see the control-flow sketch after this list):
    • Takes prompt, params, token_tx: mpsc::Sender, and cancel_flag: Arc<AtomicBool>.
    • Sets up a llama_cpp_2::LlamaContext.
    • Enters the llama.cpp token generation loop.
    • Inside the loop:
      • Poll cancel_flag. If true, break, send Err(AiError::Cancelled) via token_tx (or a separate completion channel), and return.
      • Get the next token string from llama-cpp-2.
      • Send Ok(token_string) via token_tx.
    • When the loop finishes (EOS or error), ensure token_tx is closed or a final completion message is sent.
  2. Implement handle_llm_generate_requests_system:
    • As described in section 3.2 (step 2), this system receives LlmGenerateRequest, sets up cancellation, and spawns the async task, which calls stream_generate and forwards tokens/completion via AsyncAiTaskOutputs.
  3. Implement handle_llm_cancel_requests_system:
    • Receives LlmCancelRequest, finds the Arc<AtomicBool> in AiModelManager.active_generations, and sets it to true.
  4. Implement forward_async_ai_events_system:
    • Drains AsyncAiTaskOutputs and sends LlmTokenStreamEvent and LlmGenerationComplete Bevy events.
  5. Implement apply_llm_tokens_to_yrs_system:
    • Receives LlmTokenStreamEvent.
    • Appends token to the target Yrs TextRef.
    • Sends YrsTextChanged event.
  6. GUI Integration (Basic):
    • Create a temporary Bevy system (or use a debug UI if you have one) that can:
      • Send an LlmGenerateRequest for a hardcoded prompt and target EditableText entity (the one created in setup_scene_ecs).
      • Send an LlmCancelRequest for that generation.
    • Add basic UI state components (AwaitingAiResponse) and systems to manage them based on LlmGenerateRequest and LlmGenerationComplete.
  7. Testing:
    • Trigger a generation. Verify text streams into the sample EditableText UI element.
    • Verify your text_layout_system and renderer update the display in real-time.
    • Trigger cancellation during generation. Verify generation stops promptly and the UI updates accordingly (e.g., "Cancelled" message, spinner stops).
    • Test generation completion (EOS token).
    • Test error conditions during inference (if possible to simulate).
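
The control-flow sketch below focuses on step 1's cancellation and channel plumbing. new_context and decode_next_token are hypothetical stand-ins for the real llama-cpp-2 tokenize/batch/decode/sample calls, which are elided; AiError is the placeholder enum from the section 3.1 sketch, and in practice this method would live in the impl of the LlmModel trait.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    mpsc, Arc,
};

use crate::ai::error::AiError;

/// Placeholder; the real struct wraps the loaded llama_cpp_2::LlamaModel.
pub struct LlamaCppModel;

/// Stand-in for per-generation llama.cpp context + sampler state.
pub struct GenContext;

pub enum DecodeStep {
    Token(String),
    EndOfStream,
}

impl LlamaCppModel {
    /// Hypothetical helper: tokenizes the prompt and creates a LlamaContext.
    fn new_context(&self, _prompt: &str) -> Result<GenContext, AiError> {
        Ok(GenContext)
    }

    /// Hypothetical helper standing in for the real batch/decode/sample step.
    fn decode_next_token(&self, _ctx: &mut GenContext) -> Result<DecodeStep, AiError> {
        Ok(DecodeStep::EndOfStream)
    }

    pub fn stream_generate(
        &self,
        prompt: &str,
        token_tx: mpsc::Sender<Result<String, AiError>>,
        cancel_flag: Arc<AtomicBool>,
    ) -> Result<(), AiError> {
        let mut ctx = self.new_context(prompt)?;

        loop {
            // Polled once per token: cancellation latency is at most one decode
            // step, which is what makes it feel effectively instant to the user.
            if cancel_flag.load(Ordering::Relaxed) {
                let _ = token_tx.send(Err(AiError::Cancelled));
                return Err(AiError::Cancelled);
            }

            match self.decode_next_token(&mut ctx)? {
                DecodeStep::Token(text) => {
                    // A closed receiver means the consumer went away; stop early.
                    if token_tx.send(Ok(text)).is_err() {
                        return Ok(());
                    }
                }
                // EOS: returning drops token_tx, which closes the channel and
                // lets the async task know the stream is finished.
                DecodeStep::EndOfStream => return Ok(()),
            }
        }
    }
}
```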

Phase 1.3: Refinement & Polish

  1. Error Handling & Logging:
    • Ensure all Result types are handled.
    • Use bevy_log macros (info!, warn!, error!) appropriately throughout the AI module.
    • Propagate meaningful errors to the user via LlmGenerationComplete or LlmLoadResult events, allowing the GUI to display them (see the sketch after this list).
  2. Code Cleanup & Documentation:
    • Add comments and documentation to new AI modules and systems.
    • Refactor for clarity and efficiency.
  3. Basic GUI Integration for Triggering:
    • Implement a simple button or hotkey within the main application to trigger generation on the currently focused EditableText (as outlined in section 3.2).
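
A small sketch of how a finished, cancelled, or failed generation could reach the logs and the GUI in step 1. The exact shape of LlmGenerationComplete and the GenerationStatus enum are assumptions consistent with the events listed in section 3.1.

```rust
use bevy::log::{error, info};
use bevy::prelude::*;
use uuid::Uuid;

#[derive(Debug, Clone)]
pub enum GenerationStatus {
    Finished,
    Cancelled,
    Failed(String),
}

#[derive(Event)]
pub struct LlmGenerationComplete {
    pub request_id: Uuid,
    pub status: GenerationStatus,
}

/// GUI-side listener: logs the outcome and updates UI state.
pub fn handle_generation_complete_system(mut events: EventReader<LlmGenerationComplete>) {
    for ev in events.read() {
        match &ev.status {
            GenerationStatus::Failed(msg) => error!("generation {} failed: {}", ev.request_id, msg),
            status => info!("generation {} ended: {:?}", ev.request_id, status),
        }
        // Here the GUI would also remove AwaitingAiResponse from the target
        // entity and hide any loading indicator.
    }
}
```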

5. Future Considerations (Post-Phase 1)

  • Concurrent LLM generations.
  • Vulkan compute backend for llama.cpp.
  • Integration of other AI modalities (Image, Audio), following similar architectural patterns.
  • Advanced prompt engineering UI.
  • More sophisticated model management UI within Studio-Whip.
  • Memory monitoring and dynamic unloading/loading of models based on usage and system resources.

This phased approach allows for incremental development and testing, ensuring each part of the system is functional before building upon it. The focus on leveraging existing Bevy and Yrs patterns should make the integration relatively smooth and maintainable.
