Comprehensive demonstration applications for the llmedge Android library, showcasing on-device language model inference, RAG pipelines, image generation, and video synthesis capabilities.
Main Library Repository: https://github.com/Aatricks/llmedge
This example application provides production-ready demonstrations of llmedge's core features. Each activity is designed to illustrate best practices for model loading, memory management, and efficient on-device inference.
Local Asset Demo (LocalAssetDemoActivity.kt)
- Demonstrates loading GGUF models bundled within the APK
- Illustrates asset extraction to app-private storage
- Shows both blocking and streaming inference patterns
- Suitable for offline-first applications
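The extraction step above amounts to a plain stream copy into app-private storage. The sketch below is illustrative, not part of the llmedge API; in an activity the `InputStream` would come from `context.assets.open("models/smolm2-360M-instruct.gguf")`, but it is a parameter here so the helper stays framework-free:

```kotlin
import java.io.File
import java.io.InputStream

// Copy a model bundled in the APK out to app-private storage so native code
// can open it by file path. Hypothetical helper, not a llmedge API.
fun extractAsset(input: InputStream, destination: File): File {
    // Skip the copy if a previous run already extracted the file
    if (!destination.exists() || destination.length() == 0L) {
        destination.parentFile?.mkdirs()
        input.use { src ->
            destination.outputStream().use { dst -> src.copyTo(dst) }
        }
    }
    return destination
}
```

Checking the file's existence and size first means repeat launches pay no extraction cost.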
Hugging Face Demo (HuggingFaceDemoActivity.kt)
- Automated model download from Hugging Face Hub
- Progress monitoring and cache management
- Demonstrates proper error handling for network operations
- Shows model reuse across application sessions
RAG Demo (RagActivity.kt)
- Complete on-device RAG pipeline implementation
- Document indexing with ONNX embeddings
- Vector similarity search and context retrieval
- Integration with SmolLM for answer generation
- Demonstrates PDF parsing and text chunking strategies
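A sliding-window chunker illustrates the chunking idea: overlapping word windows keep context across chunk boundaries while staying under the embedding model's input limit. This is a self-contained sketch, not the library's internal implementation, and the `chunkSize`/`overlap` defaults are arbitrary:

```kotlin
// Split text into overlapping word-window chunks. Each chunk shares `overlap`
// words with its predecessor so sentences cut at a boundary still appear
// whole in at least one chunk.
fun chunkText(text: String, chunkSize: Int = 128, overlap: Int = 32): List<String> {
    require(overlap < chunkSize) { "overlap must be smaller than chunkSize" }
    val words = text.split(Regex("\\s+")).filter { it.isNotBlank() }
    if (words.isEmpty()) return emptyList()
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < words.size) {
        val end = minOf(start + chunkSize, words.size)
        chunks.add(words.subList(start, end).joinToString(" "))
        if (end == words.size) break
        start += chunkSize - overlap // advance by the stride, keeping the overlap
    }
    return chunks
}
```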
Image Text Extraction (ImageToTextActivity.kt)
- Google ML Kit OCR integration
- Batch image processing capabilities
- Error handling for unsupported image formats
- Demonstrates preprocessing for vision models
Vision Model Demo (LlavaVisionActivity.kt)
- Vision-capable language model integration
- Image-to-text description generation
- Multimodal input preparation
- Demonstrates vision model inference patterns
Image Generation (StableDiffusionActivity.kt)
- Text-to-image synthesis using Stable Diffusion
- LoRA Support: Toggle switch to apply Detail Tweaker LoRA, automatically downloaded from Hugging Face
- EasyCache: Auto-enabled acceleration for supported models
- Memory-aware configuration options
- Progressive generation with cancellation support
- Demonstrates VAE loading and tensor offloading strategies
Video Generation (VideoGenerationActivity.kt)
- Text-to-video synthesis using Wan models
- Multi-file model loading (main + VAE + T5XXL)
- Device capability detection (12GB+ RAM required)
- Frame-by-frame progress monitoring
- Demonstrates proper resource cleanup
Speech-to-Text (STT) (STTActivity.kt)
- Whisper model download from Hugging Face
- Audio recording and transcription
- Real-time streaming transcription support
- Timestamp and SRT generation
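SRT generation boils down to formatting segment times as `HH:MM:SS,mmm`. The sketch below assumes segments carry start/end milliseconds and text (matching the `startTimeMs` field shown elsewhere in this README); the helpers themselves are illustrative, not llmedge APIs:

```kotlin
// Format a millisecond offset in the SRT timestamp style HH:MM:SS,mmm
fun srtTimestamp(ms: Long): String {
    val h = ms / 3_600_000
    val m = (ms % 3_600_000) / 60_000
    val s = (ms % 60_000) / 1_000
    val millis = ms % 1_000
    return "%02d:%02d:%02d,%03d".format(h, m, s, millis)
}

// Render (startMs, endMs, text) segments as numbered SRT cues
fun toSrt(segments: List<Triple<Long, Long, String>>): String =
    segments.mapIndexed { i, (start, end, text) ->
        "${i + 1}\n${srtTimestamp(start)} --> ${srtTimestamp(end)}\n$text"
    }.joinToString("\n\n")
```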
Text-to-Speech (TTS) (TTSActivity.kt)
- Bark model download from Hugging Face via LLMEdgeManager
- Text input for speech synthesis
- Progress tracking during generation
- Audio playback and WAV file saving
- ARM-optimized native inference with OpenMP
- Android SDK 21+ (Lollipop)
- 3GB RAM for basic LLM inference
- 500MB free storage for model caching
- 1GB+ free storage for speech models
- Android 11+ (API 30) for Vulkan acceleration
- 8GB RAM for Stable Diffusion
- 12GB+ RAM for video generation (Wan models)
- 5GB free storage for video model pipeline
- Whisper STT: 75MB-500MB depending on model size (tiny to small)
- Bark TTS: 843MB for f16 models
- Android SDK with NDK r27+
- CMake 3.22+
- Java 17+
- Gradle 8.0+ (wrapper included)
From the repository root directory:
- Build the llmedge library:
```shell
./gradlew :llmedge:assembleRelease
```
- Copy the AAR to the examples project:
```shell
cp llmedge/build/outputs/aar/llmedge-release.aar llmedge-examples/app/libs/llmedge-release.aar
```
- Build the example application:
```shell
cd llmedge-examples
./gradlew :app:assembleDebug
```
- Install to device:
```shell
./gradlew :app:installDebug
```

For GPU-accelerated inference on Android 11+ devices:

```shell
./gradlew :llmedge:assembleRelease \
    -Pandroid.jniCmakeArgs="-DGGML_VULKAN=ON -DSD_VULKAN=ON"
cp llmedge/build/outputs/aar/llmedge-release.aar llmedge-examples/app/libs/llmedge-release.aar
cd llmedge-examples
./gradlew :app:assembleDebug :app:installDebug
```

Note: Vulkan builds require devices with Vulkan 1.2 support (Android 11+).
Place small GGUF models in app/src/main/assets/ for offline-first demos:
app/src/main/assets/
└── models/
└── smolm2-360M-instruct.gguf
Recommended models for bundling:
- SmolLM2-360M-Instruct (~200MB)
- Qwen2-0.5B-Instruct (~300MB)
- TinyLlama-1.1B (~600MB)
The RAG demo requires ONNX embedding models:
app/src/main/assets/
└── embeddings/
└── all-minilm-l6-v2/
├── model.onnx
└── tokenizer.json
Download from: sentence-transformers/all-MiniLM-L6-v2 on Hugging Face
Models downloaded via Hugging Face are cached at:
<app_private_dir>/files/hf-models/<repo>/<revision>/<filename>
Cache persists across app restarts and is reused automatically.
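A helper mirroring this layout can check whether a file is already cached before triggering a download. `cachedModelFile` and `isModelCached` are hypothetical names for this sketch; in an app, `baseDir` would be `context.filesDir`:

```kotlin
import java.io.File

// Resolve a model file under the documented cache layout:
// <baseDir>/hf-models/<repo>/<revision>/<filename>
fun cachedModelFile(baseDir: File, repo: String, revision: String, filename: String): File =
    File(baseDir, "hf-models/$repo/$revision/$filename")

// Treat zero-length files as absent so interrupted downloads are retried
fun isModelCached(baseDir: File, repo: String, revision: String, filename: String): Boolean =
    cachedModelFile(baseDir, repo, revision, filename).let { it.exists() && it.length() > 0 }
```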
```kotlin
// Using the high-level Manager API
CoroutineScope(Dispatchers.IO).launch {
    val response = LLMEdgeManager.generateText(
        context = context,
        params = LLMEdgeManager.TextGenerationParams(
            prompt = "Explain quantum computing concisely.",
            modelId = "unsloth/Qwen3-0.6B-GGUF",
            modelFilename = "Qwen3-0.6B-Q4_K_M.gguf"
        )
    )
    withContext(Dispatchers.Main) {
        textView.text = response
    }
}
```

```kotlin
// Access the underlying SmolLM instance from the manager for custom pipelines
val smol = LLMEdgeManager.getSmolLM(context)
val rag = RAGEngine(context, smol)

CoroutineScope(Dispatchers.IO).launch {
    rag.init()
    val chunks = rag.indexPdf(pdfUri)
    val answer = rag.ask("What are the main conclusions?")
    withContext(Dispatchers.Main) {
        resultView.text = answer
    }
}
```

```kotlin
import io.aatricks.llmedge.LLMEdgeManager

CoroutineScope(Dispatchers.IO).launch {
    // Simple transcription
    val text = LLMEdgeManager.transcribeAudioToText(
        context = context,
        audioSamples = audioSamples // 16kHz mono PCM float32
    )

    // Full transcription with timing
    val segments = LLMEdgeManager.transcribeAudio(
        context = context,
        params = LLMEdgeManager.TranscriptionParams(
            audioSamples = audioSamples,
            language = "en"
        )
    ) { progress ->
        Log.d("Whisper", "Progress: $progress%")
    }

    withContext(Dispatchers.Main) {
        segments.forEach { segment ->
            textView.append("[${segment.startTimeMs}ms] ${segment.text}\n")
        }
    }
}
```

For live captioning from a microphone:
```kotlin
import io.aatricks.llmedge.LLMEdgeManager

class LiveCaptionActivity : AppCompatActivity() {
    private var transcriber: Whisper.StreamingTranscriber? = null

    fun startLiveCaptions() {
        lifecycleScope.launch(Dispatchers.IO) {
            // Create streaming transcriber with sliding window
            transcriber = LLMEdgeManager.createStreamingTranscriber(
                context = this@LiveCaptionActivity,
                params = LLMEdgeManager.StreamingTranscriptionParams(
                    stepMs = 3000,    // Process every 3 seconds
                    lengthMs = 10000, // 10-second windows
                    language = "en",
                    useVad = true     // Skip silent audio
                )
            )

            // Collect transcription results
            transcriber?.start()?.collect { segment ->
                withContext(Dispatchers.Main) {
                    captionTextView.text = segment.text
                }
            }
        }
    }

    // Feed audio from microphone (called by AudioRecord callback)
    fun onAudioData(samples: FloatArray) {
        lifecycleScope.launch(Dispatchers.IO) {
            transcriber?.feedAudio(samples)
        }
    }

    fun stopLiveCaptions() {
        transcriber?.stop()
        LLMEdgeManager.stopStreamingTranscription()
    }
}
```

```kotlin
import io.aatricks.llmedge.LLMEdgeManager

CoroutineScope(Dispatchers.IO).launch {
    // Generate speech (model auto-downloads on first use)
    val audio = LLMEdgeManager.synthesizeSpeech(
        context = context,
        params = LLMEdgeManager.SpeechSynthesisParams(
            text = "Hello, world!",
            nThreads = 8 // Use more threads for faster generation
        )
    ) { step, progress ->
        Log.d("Bark", "${step.name}: $progress%")
    }

    // Or save directly to file
    val outputFile = File(context.cacheDir, "output.wav")
    LLMEdgeManager.synthesizeSpeechToFile(
        context = context,
        text = "Hello, world!",
        outputFile = outputFile
    )

    // Unload when done
    LLMEdgeManager.unloadSpeechModels()
}
```

```kotlin
val bitmap = LLMEdgeManager.generateImage(
    context = this,
    params = LLMEdgeManager.ImageGenerationParams(
        prompt = "serene mountain landscape, sunset",
        width = 512,
        height = 512,
        steps = 20
    )
)
imageView.setImageBitmap(bitmap)
```

```kotlin
// Automatic memory management and sequential loading
val frames = LLMEdgeManager.generateVideo(
    context = this,
    params = LLMEdgeManager.VideoGenerationParams(
        prompt = "cat walking through garden",
        videoFrames = 8,
        width = 512,
        height = 512,
        steps = 20,
        cfgScale = 7.0f,
        flowShift = 3.0f,
        forceSequentialLoad = true // Safe for most devices
    )
) { status, current, total ->
    Log.d("VideoGen", "$status")
}
```

Monitor Memory Usage:

```kotlin
val snapshot = MemoryMetrics.snapshot(context)
Log.d("Memory", "Native heap: ${snapshot.nativePssKb / 1024}MB")
```

Optimization Strategies:
- Use quantized models (Q4_K_M) for lower memory footprint
- Enable CPU offloading for large models
- Close model instances when not in use
- Process images/video in batches with intermediate cleanup
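As a rough illustration of memory-aware configuration, a context size can be picked from the RAM the device has to spare. The thresholds below are assumptions for this sketch, not values prescribed by the library; in an app the available-RAM figure would come from `ActivityManager.MemoryInfo`:

```kotlin
// Illustrative heuristic: larger context windows cost more KV-cache memory,
// so scale the window down on low-RAM devices. Thresholds are arbitrary.
fun suggestedContextSize(availRamMb: Long): Int = when {
    availRamMb >= 8_192 -> 8_192
    availRamMb >= 4_096 -> 4_096
    availRamMb >= 2_048 -> 2_048
    else -> 1_024
}
```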
```kotlin
val params = SmolLM.InferenceParams(
    numThreads = Runtime.getRuntime().availableProcessors(),
    contextSize = 2048 // Adjust based on device RAM
)
```

Verify Vulkan availability:

```kotlin
if (SmolLM.isVulkanEnabled()) {
    Log.i("Performance", "Vulkan backend active")
} else {
    Log.w("Performance", "Falling back to CPU backend")
}
```

Check logcat for initialization:

```shell
adb logcat -s SmolLM:* SmolSD:* | grep -i vulkan
```

Symptoms: FileNotFoundException, IllegalStateException during load
Solutions:
- Verify model file exists in expected location
- Check available storage space
- Ensure network connectivity for Hugging Face downloads
- Validate model file integrity (not corrupted)
Symptoms: App crashes with OOM during inference or generation
Solutions:
- Use smaller models or quantized variants
- Reduce image/video resolution
- Enable CPU offloading: `offloadToCpu = true`
- Lower context window size
- Close unused model instances
Symptoms: Generation takes excessive time per token/frame
Solutions:
- Use quantized models (Q4_K_M, Q3_K_S)
- Reduce inference steps (15-20 is usually sufficient)
- Enable Vulkan on compatible devices
- Adjust thread count to match device cores
- Use smaller resolutions for media generation
Symptoms: Crashes or errors when loading Wan models
Solutions:
- Verify device has 12GB+ RAM
- Ensure all three files downloaded (main + VAE + T5XXL)
- Use explicit file paths (not modelId shorthand)
- Check stable-diffusion.cpp logs in logcat
- Verify sufficient storage for 6GB+ model files
Symptoms: UnsatisfiedLinkError, native crashes
Solutions:
- Rebuild AAR and reinstall app
- Verify NDK version matches (r27+)
- Check device ABI compatibility
- Inspect logcat for native stack traces
- Clean build: `./gradlew clean`
Symptoms: Whisper transcription crashing or producing garbled output
Solutions:
- Ensure audio is 16kHz mono PCM float32 format
- Use smaller models (tiny/base) for faster processing
- Check that model file downloaded completely
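The first point usually means converting the signed 16-bit PCM that `AudioRecord` delivers into normalized floats. A minimal sketch of just the sample-format step (resampling to 16 kHz and downmixing to mono are assumed to happen upstream):

```kotlin
// Convert signed 16-bit PCM samples to float32 in [-1.0, 1.0), the sample
// format Whisper expects. Dividing by 32768 maps Short.MIN_VALUE to -1.0f.
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768.0f }
```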
Run speech tests via adb:
```shell
adb shell am instrument -w -e class com.example.llmedgeexample.SpeechE2ETest \
    com.example.llmedgeexample.test/androidx.test.runner.AndroidJUnitRunner
```

Run automated video generation tests:

```shell
adb shell am start -n com.example.llmedgeexample/.HeadlessVideoTestActivity
```

Monitor test execution:

```shell
adb logcat -s VideoE2E:*
```

Test results are logged to logcat with detailed timing and validation metrics.
- Native models allocated via JNI in native heap
- Dalvik heap used only for Java objects and bitmaps
- Large file downloads use system DownloadManager
- Tensor operations execute in native memory space
- All model operations run on background threads (Dispatchers.IO)
- UI updates dispatched to Main thread
- Blocking calls avoided on UI thread
- Coroutines used for structured concurrency
- Models implement `AutoCloseable` for automatic cleanup
- Native resources freed via `close()` method
- File handles managed with try-with-resources pattern
- Memory-mapped files used for large model loading
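The close-on-scope-exit pattern this list describes maps onto Kotlin's `use` block, which guarantees `close()` runs even if the body throws. The sketch below uses a stand-in class since it only demonstrates the pattern, not the library's own model types:

```kotlin
// Stand-in for a model holding native resources; llmedge models expose
// close() through AutoCloseable the same way.
class FakeModel : AutoCloseable {
    var closed = false
        private set
    fun infer(prompt: String): String = "echo: $prompt"
    override fun close() { closed = true }
}

// `use` closes the model on normal exit and on exception alike
fun runOnce(): String = FakeModel().use { it.infer("hi") }
```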
Apache 2.0 - See LICENSE file for details
Contributions are welcome. Please review the main repository's contributing guidelines before submitting pull requests.