Teamhelper-Research-Lab/llmedge-examples
llmedge Examples

Comprehensive demonstration applications for the llmedge Android library, showcasing on-device language model inference, RAG pipelines, image generation, and video synthesis capabilities.

Main Library Repository: https://github.com/Aatricks/llmedge

Overview

This example application provides production-ready demonstrations of llmedge's core features. Each activity is designed to illustrate best practices for model loading, memory management, and efficient on-device inference.

Included Demonstrations

Language Model Inference

Local Asset Demo (LocalAssetDemoActivity.kt)

  • Demonstrates loading GGUF models bundled within the APK
  • Illustrates asset extraction to app-private storage
  • Shows both blocking and streaming inference patterns
  • Suitable for offline-first applications
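The extraction step above can be sketched as a small helper: native loaders need a real filesystem path, so a bundled asset is copied out of the APK once. `extractTo` is an illustrative helper under assumed usage, not an llmedge API.

```kotlin
import java.io.File
import java.io.InputStream

// Illustrative helper: copy a bundled asset stream into app-private storage
// exactly once, so the native loader can open it from a real path.
// On Android you would call it roughly as:
//   extractTo(context.assets.open("models/model.gguf"),
//             File(context.filesDir, "model.gguf"))
fun extractTo(input: InputStream, dest: File): File {
    if (!dest.exists()) {
        input.use { src ->
            dest.outputStream().use { out -> src.copyTo(out) }
        }
    }
    return dest
}
```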

Hugging Face Demo (HuggingFaceDemoActivity.kt)

  • Automated model download from Hugging Face Hub
  • Progress monitoring and cache management
  • Demonstrates proper error handling for network operations
  • Shows model reuse across application sessions

Retrieval-Augmented Generation

RAG Demo (RagActivity.kt)

  • Complete on-device RAG pipeline implementation
  • Document indexing with ONNX embeddings
  • Vector similarity search and context retrieval
  • Integration with SmolLM for answer generation
  • Demonstrates PDF parsing and text chunking strategies

Vision and Multimodal Processing

Image Text Extraction (ImageToTextActivity.kt)

  • Google ML Kit OCR integration
  • Batch image processing capabilities
  • Error handling for unsupported image formats
  • Demonstrates preprocessing for vision models

Vision Model Demo (LlavaVisionActivity.kt)

  • Vision-capable language model integration
  • Image-to-text description generation
  • Multimodal input preparation
  • Demonstrates vision model inference patterns

Generative Media

Image Generation (StableDiffusionActivity.kt)

  • Text-to-image synthesis using Stable Diffusion
  • LoRA Support: Toggle switch to apply Detail Tweaker LoRA, automatically downloaded from Hugging Face
  • EasyCache: Auto-enabled acceleration for supported models
  • Memory-aware configuration options
  • Progressive generation with cancellation support
  • Demonstrates VAE loading and tensor offloading strategies

Video Generation (VideoGenerationActivity.kt)

  • Text-to-video synthesis using Wan models
  • Multi-file model loading (main + VAE + T5XXL)
  • Device capability detection (12GB+ RAM required)
  • Frame-by-frame progress monitoring
  • Demonstrates proper resource cleanup
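The "12GB+ RAM" capability gate mentioned above can be sketched as a pure check; `hasEnoughRam` is a hypothetical helper, and the on-device wiring shown in the comment uses the standard `ActivityManager.MemoryInfo` API.

```kotlin
// Hypothetical capability check for the video-generation demo.
// On device you would feed it ActivityManager.MemoryInfo.totalMem:
//   val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
//   val info = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }
//   if (!hasEnoughRam(info.totalMem, requiredGb = 12)) { /* disable the demo */ }
fun hasEnoughRam(totalMemBytes: Long, requiredGb: Long): Boolean =
    totalMemBytes >= requiredGb * 1024L * 1024L * 1024L
```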

Speech Processing

Speech-to-Text (STT) (STTActivity.kt)

  • Whisper model download from Hugging Face
  • Audio recording and transcription
  • Real-time streaming transcription support
  • Timestamp and SRT generation

Text-to-Speech (TTS) (TTSActivity.kt)

  • Bark model download from Hugging Face via LLMEdgeManager
  • Text input for speech synthesis
  • Progress tracking during generation
  • Audio playback and WAV file saving
  • ARM-optimized native inference with OpenMP

System Requirements

Minimum Requirements

  • Android SDK 21+ (Lollipop)
  • 3GB RAM for basic LLM inference
  • 500MB free storage for model caching
  • 1GB+ free storage for speech models

Recommended Configuration

  • Android 11+ (API 30) for Vulkan acceleration
  • 8GB RAM for Stable Diffusion
  • 12GB+ RAM for video generation (Wan models)
  • 5GB free storage for video model pipeline

Speech Model Requirements

  • Whisper STT: 75MB-500MB depending on model size (tiny to small)
  • Bark TTS: 843MB for f16 models

Development Environment

  • Android SDK with NDK r27+
  • CMake 3.22+
  • Java 17+
  • Gradle 8.0+ (wrapper included)

Building the Application

Standard Build Process

From the repository root directory:

  1. Build the llmedge library:

     ./gradlew :llmedge:assembleRelease

  2. Copy the AAR to the examples project:

     cp llmedge/build/outputs/aar/llmedge-release.aar llmedge-examples/app/libs/llmedge-release.aar

  3. Build the example application:

     cd llmedge-examples
     ./gradlew :app:assembleDebug

  4. Install to device:

     ./gradlew :app:installDebug

Vulkan-Enabled Build

For GPU-accelerated inference on Android 11+ devices:

./gradlew :llmedge:assembleRelease \
  -Pandroid.jniCmakeArgs="-DGGML_VULKAN=ON -DSD_VULKAN=ON"

cp llmedge/build/outputs/aar/llmedge-release.aar llmedge-examples/app/libs/llmedge-release.aar

cd llmedge-examples
./gradlew :app:assembleDebug :app:installDebug

Note: Vulkan builds require devices with Vulkan 1.2 support (Android 11+).

Asset Configuration

Bundled GGUF Models

Place small GGUF models in app/src/main/assets/ for offline-first demos:

app/src/main/assets/
└── models/
    └── smolm2-360M-instruct.gguf

Recommended models for bundling:

  • SmolLM2-360M-Instruct (~200MB)
  • Qwen2-0.5B-Instruct (~300MB)
  • TinyLlama-1.1B (~600MB)

RAG Embeddings

The RAG demo requires ONNX embedding models:

app/src/main/assets/
└── embeddings/
    └── all-minilm-l6-v2/
        ├── model.onnx
        └── tokenizer.json

Download from: sentence-transformers/all-MiniLM-L6-v2 on Hugging Face

Runtime Model Cache

Models downloaded via Hugging Face are cached at:

<app_private_dir>/files/hf-models/<repo>/<revision>/<filename>

Cache persists across app restarts and is reused automatically.
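A quick way to see how much the cache holds is to walk the directory above; `cacheSizeMb` is an illustrative helper using only standard file APIs, and on Android the root would be `File(context.filesDir, "hf-models")`.

```kotlin
import java.io.File

// Illustrative cache inspection: sum the size of every file under the
// hf-models cache root described above, in whole megabytes.
fun cacheSizeMb(cacheRoot: File): Long {
    if (!cacheRoot.exists()) return 0L
    val totalBytes = cacheRoot.walkTopDown()
        .filter { it.isFile }
        .sumOf { it.length() }
    return totalBytes / (1024L * 1024L)
}
```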

Usage Examples

Basic LLM Inference

// Using the high-level Manager API
CoroutineScope(Dispatchers.IO).launch {
    val response = LLMEdgeManager.generateText(
        context = context,
        params = LLMEdgeManager.TextGenerationParams(
            prompt = "Explain quantum computing concisely.",
            modelId = "unsloth/Qwen3-0.6B-GGUF",
            modelFilename = "Qwen3-0.6B-Q4_K_M.gguf"
        )
    )
    
    withContext(Dispatchers.Main) {
        textView.text = response
    }
}

RAG Pipeline

// Access the underlying SmolLM instance from the manager for custom pipelines
val smol = LLMEdgeManager.getSmolLM(context)
val rag = RAGEngine(context, smol)

CoroutineScope(Dispatchers.IO).launch {
    rag.init()
    val chunks = rag.indexPdf(pdfUri)
    val answer = rag.ask("What are the main conclusions?")

    withContext(Dispatchers.Main) {
        resultView.text = answer
    }
}

Speech-to-Text (Whisper)

import io.aatricks.llmedge.LLMEdgeManager

CoroutineScope(Dispatchers.IO).launch {
    // Simple transcription
    val text = LLMEdgeManager.transcribeAudioToText(
        context = context,
        audioSamples = audioSamples  // 16kHz mono PCM float32
    )

    // Full transcription with timing
    val segments = LLMEdgeManager.transcribeAudio(
        context = context,
        params = LLMEdgeManager.TranscriptionParams(
            audioSamples = audioSamples,
            language = "en"
        )
    ) { progress ->
        Log.d("Whisper", "Progress: $progress%")
    }

    withContext(Dispatchers.Main) {
        segments.forEach { segment ->
            textView.append("[${segment.startTimeMs}ms] ${segment.text}\n")
        }
    }
}

Real-time Streaming Transcription

For live captioning from a microphone:

import io.aatricks.llmedge.LLMEdgeManager

class LiveCaptionActivity : AppCompatActivity() {
    private var transcriber: Whisper.StreamingTranscriber? = null

    fun startLiveCaptions() {
        lifecycleScope.launch(Dispatchers.IO) {
            // Create streaming transcriber with sliding window
            transcriber = LLMEdgeManager.createStreamingTranscriber(
                context = this@LiveCaptionActivity,
                params = LLMEdgeManager.StreamingTranscriptionParams(
                    stepMs = 3000,      // Process every 3 seconds
                    lengthMs = 10000,   // 10-second windows
                    language = "en",
                    useVad = true       // Skip silent audio
                )
            )

            // Collect transcription results
            transcriber?.start()?.collect { segment ->
                withContext(Dispatchers.Main) {
                    captionTextView.text = segment.text
                }
            }
        }
    }

    // Feed audio from microphone (called by AudioRecord callback)
    fun onAudioData(samples: FloatArray) {
        lifecycleScope.launch(Dispatchers.IO) {
            transcriber?.feedAudio(samples)
        }
    }

    fun stopLiveCaptions() {
        transcriber?.stop()
        LLMEdgeManager.stopStreamingTranscription()
    }
}
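The `onAudioData` callback above expects 16 kHz mono float samples. A minimal capture loop using Android's standard `AudioRecord` API might look like the following sketch; `captureLoop` and `samplesPerStep` are hypothetical helpers, not part of llmedge.

```kotlin
import android.annotation.SuppressLint
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder

// Number of samples covering one streaming step at the given rate.
fun samplesPerStep(sampleRate: Int, stepMs: Int): Int = sampleRate * stepMs / 1000

// Hypothetical capture loop feeding 16 kHz mono float32 audio to the transcriber.
@SuppressLint("MissingPermission") // RECORD_AUDIO must be granted by the caller
fun captureLoop(onAudioData: (FloatArray) -> Unit, isRunning: () -> Boolean) {
    val sampleRate = 16_000
    val bufferSamples = samplesPerStep(sampleRate, 3000) // one 3-second step
    val recorder = AudioRecord(
        MediaRecorder.AudioSource.MIC,
        sampleRate,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_FLOAT,
        bufferSamples * Float.SIZE_BYTES
    )
    recorder.startRecording()
    val buffer = FloatArray(bufferSamples)
    while (isRunning()) {
        val read = recorder.read(buffer, 0, buffer.size, AudioRecord.READ_BLOCKING)
        if (read > 0) onAudioData(buffer.copyOf(read))
    }
    recorder.stop()
    recorder.release()
}
```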

Text-to-Speech (Bark)

import io.aatricks.llmedge.LLMEdgeManager

CoroutineScope(Dispatchers.IO).launch {
    // Generate speech (model auto-downloads on first use)
    val audio = LLMEdgeManager.synthesizeSpeech(
        context = context,
        params = LLMEdgeManager.SpeechSynthesisParams(
            text = "Hello, world!",
            nThreads = 8  // Use more threads for faster generation
        )
    ) { step, progress ->
        Log.d("Bark", "${step.name}: $progress%")
    }

    // Or save directly to file
    val outputFile = File(context.cacheDir, "output.wav")
    LLMEdgeManager.synthesizeSpeechToFile(
        context = context,
        text = "Hello, world!",
        outputFile = outputFile
    )

    // Unload when done
    LLMEdgeManager.unloadSpeechModels()
}

Image Generation

// Generation is long-running: run it on a background coroutine
// and dispatch the result back to the Main thread for display
CoroutineScope(Dispatchers.IO).launch {
    val bitmap = LLMEdgeManager.generateImage(
        context = context,
        params = LLMEdgeManager.ImageGenerationParams(
            prompt = "serene mountain landscape, sunset",
            width = 512,
            height = 512,
            steps = 20
        )
    )

    withContext(Dispatchers.Main) {
        imageView.setImageBitmap(bitmap)
    }
}

Video Generation

// Automatic memory management and sequential loading;
// run on a background coroutine like the other examples
CoroutineScope(Dispatchers.IO).launch {
    val frames = LLMEdgeManager.generateVideo(
        context = context,
        params = LLMEdgeManager.VideoGenerationParams(
            prompt = "cat walking through garden",
            videoFrames = 8,
            width = 512,
            height = 512,
            steps = 20,
            cfgScale = 7.0f,
            flowShift = 3.0f,
            forceSequentialLoad = true // Safe for most devices
        )
    ) { status, current, total ->
        Log.d("VideoGen", "$status ($current/$total)")
    }
}

Performance Optimization

Memory Management

Monitor Memory Usage:

val snapshot = MemoryMetrics.snapshot(context)
Log.d("Memory", "Native PSS: ${snapshot.nativePssKb / 1024}MB")

Optimization Strategies:

  • Use quantized models (Q4_K_M) for lower memory footprint
  • Enable CPU offloading for large models
  • Close model instances when not in use
  • Process images/video in batches with intermediate cleanup

Thread Configuration

val params = SmolLM.InferenceParams(
    numThreads = Runtime.getRuntime().availableProcessors(),
    contextSize = 2048  // Adjust based on device RAM
)

Vulkan Acceleration

Verify Vulkan availability:

if (SmolLM.isVulkanEnabled()) {
    Log.i("Performance", "Vulkan backend active")
} else {
    Log.w("Performance", "Falling back to CPU backend")
}

Check logcat for initialization:

adb logcat -s SmolLM:* SmolSD:* | grep -i vulkan

Troubleshooting

Model Loading Failures

Symptoms: FileNotFoundException, IllegalStateException during load

Solutions:

  • Verify model file exists in expected location
  • Check available storage space
  • Ensure network connectivity for Hugging Face downloads
  • Validate model file integrity (not corrupted)

Out of Memory Errors

Symptoms: App crashes with OOM during inference or generation

Solutions:

  • Use smaller models or quantized variants
  • Reduce image/video resolution
  • Enable CPU offloading: offloadToCpu = true
  • Lower context window size
  • Close unused model instances

Slow Inference Performance

Symptoms: Generation takes excessive time per token/frame

Solutions:

  • Use quantized models (Q4_K_M, Q3_K_S)
  • Reduce inference steps (15-20 is usually sufficient)
  • Enable Vulkan on compatible devices
  • Adjust thread count to match device cores
  • Use smaller resolutions for media generation

Video Generation Failures

Symptoms: Crashes or errors when loading Wan models

Solutions:

  • Verify device has 12GB+ RAM
  • Ensure all three files downloaded (main + VAE + T5XXL)
  • Use explicit file paths (not modelId shorthand)
  • Check stable-diffusion.cpp logs in logcat
  • Verify sufficient storage for 6GB+ model files

Native Library Issues

Symptoms: UnsatisfiedLinkError, native crashes

Solutions:

  • Rebuild AAR and reinstall app
  • Verify NDK version matches (r27+)
  • Check device ABI compatibility
  • Inspect logcat for native stack traces
  • Clean build: ./gradlew clean

Speech Processing Issues

Symptoms: Whisper transcription crashing or producing garbled output

Solutions:

  • Ensure audio is 16kHz mono PCM float32 format
  • Use smaller models (tiny/base) for faster processing
  • Check that model file downloaded completely
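Converting captured 16-bit PCM audio to the float32 format Whisper expects is a common source of garbled output; a minimal conversion looks like this. `pcm16ToFloat` is a hypothetical helper, and resampling to 16 kHz is out of scope here.

```kotlin
// Sketch: convert 16-bit PCM samples (e.g. from AudioRecord or a WAV file)
// to normalized float32 in [-1.0, 1.0), as expected by Whisper.
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768.0f }
```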

Testing Infrastructure

Speech E2E Testing

Run speech tests via adb:

adb shell am instrument -w -e class com.example.llmedgeexample.SpeechE2ETest \
  com.example.llmedgeexample.test/androidx.test.runner.AndroidJUnitRunner

Headless E2E Testing

Run automated video generation tests:

adb shell am start -n com.example.llmedgeexample/.HeadlessVideoTestActivity

Monitor test execution:

adb logcat -s VideoE2E:*

Test results are logged to logcat with detailed timing and validation metrics.

Architecture Notes

Memory Architecture

  • Native models allocated via JNI in native heap
  • Dalvik heap used only for Java objects and bitmaps
  • Large file downloads use system DownloadManager
  • Tensor operations execute in native memory space

Threading Model

  • All model operations run on background threads (Dispatchers.IO)
  • UI updates dispatched to Main thread
  • Blocking calls avoided on UI thread
  • Coroutines used for structured concurrency

Resource Lifecycle

  • Models implement AutoCloseable for automatic cleanup
  • Native resources freed via close() method
  • File handles managed with try-with-resources pattern
  • Memory mapped files used for large model loading
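Because model wrappers implement AutoCloseable (as noted above), Kotlin's `use` extension releases native resources deterministically, even when inference throws. `FakeModel` below is a stand-in for a real model class such as SmolLM; its API is illustrative only.

```kotlin
// `FakeModel` stands in for a real AutoCloseable model wrapper.
class FakeModel : AutoCloseable {
    var closed = false
        private set
    fun infer(prompt: String): String = "echo: $prompt"
    override fun close() { closed = true } // real models free native memory here
}

// `use` calls close() when the block exits, normally or by exception.
fun runOnce(model: FakeModel, prompt: String): String =
    model.use { it.infer(prompt) }
```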

License

Apache 2.0 - See LICENSE file for details

Contributing

Contributions are welcome. Please review the main repository's contributing guidelines before submitting pull requests.
