Comprehensive demonstration applications for the llmedge Android library, showcasing on-device language model inference, RAG pipelines, image generation, and video synthesis capabilities.
Main Library Repository: https://github.com/Aatricks/llmedge
This example application provides production-ready demonstrations of llmedge's core features. Each activity is designed to illustrate best practices for model loading, memory management, and efficient on-device inference.
Local Asset Demo (LocalAssetDemoActivity.kt)
- Demonstrates loading GGUF models bundled within the APK
- Illustrates asset extraction to app-private storage
- Shows both blocking and streaming inference patterns
- Suitable for offline-first applications
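The extraction step above amounts to a plain stream copy into app-private storage. The sketch below is illustrative, not part of the llmedge API; in an activity the `InputStream` would come from `context.assets.open("models/smolm2-360M-instruct.gguf")`, but it is a parameter here so the helper stays framework-free:

```kotlin
import java.io.File
import java.io.InputStream

// Copy a model bundled in the APK out to app-private storage so native code
// can open it by file path. Hypothetical helper, not a llmedge API.
fun extractAsset(input: InputStream, destination: File): File {
    // Skip the copy if a previous run already extracted the file
    if (!destination.exists() || destination.length() == 0L) {
        destination.parentFile?.mkdirs()
        input.use { src ->
            destination.outputStream().use { dst -> src.copyTo(dst) }
        }
    }
    return destination
}
```

Checking the file's existence and size first means repeat launches pay no extraction cost.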
Hugging Face Demo (HuggingFaceDemoActivity.kt)
- Automated model download from Hugging Face Hub
- Progress monitoring and cache management
- Demonstrates proper error handling for network operations
- Shows model reuse across application sessions
RAG Demo (RagActivity.kt)
- Complete on-device RAG pipeline implementation
- Document indexing with ONNX embeddings
- Vector similarity search and context retrieval
- Integration with SmolLM for answer generation
- Demonstrates PDF parsing and text chunking strategies
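A sliding-window chunker illustrates the chunking idea: overlapping word windows keep context across chunk boundaries while staying under the embedding model's input limit. This is a self-contained sketch, not the library's internal implementation, and the `chunkSize`/`overlap` defaults are arbitrary:

```kotlin
// Split text into overlapping word-window chunks. Each chunk shares `overlap`
// words with its predecessor so sentences cut at a boundary still appear
// whole in at least one chunk.
fun chunkText(text: String, chunkSize: Int = 128, overlap: Int = 32): List<String> {
    require(overlap < chunkSize) { "overlap must be smaller than chunkSize" }
    val words = text.split(Regex("\\s+")).filter { it.isNotBlank() }
    if (words.isEmpty()) return emptyList()
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < words.size) {
        val end = minOf(start + chunkSize, words.size)
        chunks.add(words.subList(start, end).joinToString(" "))
        if (end == words.size) break
        start += chunkSize - overlap // advance by the stride, keeping the overlap
    }
    return chunks
}
```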
Image Text Extraction (ImageToTextActivity.kt)
- Google ML Kit OCR integration
- Batch image processing capabilities
- Error handling for unsupported image formats
- Demonstrates preprocessing for vision models
Vision Model Demo (LlavaVisionActivity.kt)
- Vision-capable language model integration
- Image-to-text description generation
- Multimodal input preparation
- Demonstrates vision model inference patterns
Image Generation (StableDiffusionActivity.kt)
- Text-to-image synthesis using Stable Diffusion
- LoRA Support: Toggle switch to apply Detail Tweaker LoRA, automatically downloaded from Hugging Face
- EasyCache: Auto-enabled acceleration for supported models
- Memory-aware configuration options
- Progressive generation with cancellation support
- Demonstrates VAE loading and tensor offloading strategies
Video Generation (VideoGenerationActivity.kt)
- Text-to-video synthesis using Wan models
- Multi-file model loading (main + VAE + T5XXL)
- Device capability detection (12GB+ RAM required)
- Frame-by-frame progress monitoring
- Demonstrates proper resource cleanup
Speech-to-Text (STT) (STTActivity.kt)
- Whisper model download from Hugging Face
- Audio recording and transcription
- Real-time streaming transcription support
- Timestamp and SRT generation
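SRT generation boils down to formatting segment times as `HH:MM:SS,mmm`. The sketch below assumes segments carry start/end milliseconds and text (matching the `startTimeMs` field shown elsewhere in this README); the helpers themselves are illustrative, not llmedge APIs:

```kotlin
// Format a millisecond offset in the SRT timestamp style HH:MM:SS,mmm
fun srtTimestamp(ms: Long): String {
    val h = ms / 3_600_000
    val m = (ms % 3_600_000) / 60_000
    val s = (ms % 60_000) / 1_000
    val millis = ms % 1_000
    return "%02d:%02d:%02d,%03d".format(h, m, s, millis)
}

// Render (startMs, endMs, text) segments as numbered SRT cues
fun toSrt(segments: List<Triple<Long, Long, String>>): String =
    segments.mapIndexed { i, (start, end, text) ->
        "${i + 1}\n${srtTimestamp(start)} --> ${srtTimestamp(end)}\n$text"
    }.joinToString("\n\n")
```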
Text-to-Speech (TTS) (TTSActivity.kt)
- Bark model download from Hugging Face via LLMEdgeManager
- Text input for speech synthesis
- Progress tracking during generation
- Audio playback and WAV file saving
- ARM-optimized native inference with OpenMP
- Android SDK 21+ (Lollipop)
- 3GB RAM for basic LLM inference
- 500MB free storage for model caching
- 1GB+ free storage for speech models
- Android 11+ (API 30) for Vulkan acceleration
- 8GB RAM for Stable Diffusion
- 12GB+ RAM for video generation (Wan models)
- 5GB free storage for video model pipeline
- Whisper STT: 75MB-500MB depending on model size (tiny to small)
- Bark TTS: 843MB for f16 models
- Android SDK with NDK r27+
- CMake 3.22+
- Java 17+
- Gradle 8.0+ (wrapper included)
From the repository root directory:
- Build the llmedge library:
```shell
./gradlew :llmedge:assembleRelease
```
- Copy the AAR to the examples project:
```shell
cp llmedge/build/outputs/aar/llmedge-release.aar llmedge-examples/app/libs/llmedge-release.aar
```
- Build the example application:
```shell
cd llmedge-examples
./gradlew :app:assembleDebug
```
- Install to device:
```shell
./gradlew :app:installDebug
```

For GPU-accelerated inference on Android 11+ devices:

```shell
./gradlew :llmedge:assembleRelease \
    -Pandroid.jniCmakeArgs="-DGGML_VULKAN=ON -DSD_VULKAN=ON"
cp llmedge/build/outputs/aar/llmedge-release.aar llmedge-examples/app/libs/llmedge-release.aar
cd llmedge-examples
./gradlew :app:assembleDebug :app:installDebug
```

Note: Vulkan builds require devices with Vulkan 1.2 support (Android 11+).
Place small GGUF models in app/src/main/assets/ for offline-first demos:
app/src/main/assets/
└── models/
└── smolm2-360M-instruct.gguf
Recommended models for bundling:
- SmolLM2-360M-Instruct (~200MB)
- Qwen2-0.5B-Instruct (~300MB)
- TinyLlama-1.1B (~600MB)
The RAG demo requires ONNX embedding models:
app/src/main/assets/
└── embeddings/
└── all-minilm-l6-v2/
├── model.onnx
└── tokenizer.json
Download from: sentence-transformers/all-MiniLM-L6-v2 on Hugging Face
Models downloaded via Hugging Face are cached at:
<app_private_dir>/files/hf-models/<repo>/<revision>/<filename>
Cache persists across app restarts and is reused automatically.
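A helper mirroring this layout can check whether a file is already cached before triggering a download. `cachedModelFile` and `isModelCached` are hypothetical names for this sketch; in an app, `baseDir` would be `context.filesDir`:

```kotlin
import java.io.File

// Resolve a model file under the documented cache layout:
// <baseDir>/hf-models/<repo>/<revision>/<filename>
fun cachedModelFile(baseDir: File, repo: String, revision: String, filename: String): File =
    File(baseDir, "hf-models/$repo/$revision/$filename")

// Treat zero-length files as absent so interrupted downloads are retried
fun isModelCached(baseDir: File, repo: String, revision: String, filename: String): Boolean =
    cachedModelFile(baseDir, repo, revision, filename).let { it.exists() && it.length() > 0 }
```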
```kotlin
// Using the high-level Manager API
CoroutineScope(Dispatchers.IO).launch {
    val response = LLMEdgeManager.generateText(
        context = context,
        params = LLMEdgeManager.TextGenerationParams(
            prompt = "Explain quantum computing concisely.",
            modelId = "unsloth/Qwen3-0.6B-GGUF",
            modelFilename = "Qwen3-0.6B-Q4_K_M.gguf"
        )
    )
    withContext(Dispatchers.Main) {
        textView.text = response
    }
}
```

```kotlin
// Access the underlying SmolLM instance from the manager for custom pipelines
val smol = LLMEdgeManager.getSmolLM(context)
val rag = RAGEngine(context, smol)

CoroutineScope(Dispatchers.IO).launch {
    rag.init()
    val chunks = rag.indexPdf(pdfUri)
    val answer = rag.ask("What are the main conclusions?")
    withContext(Dispatchers.Main) {
        resultView.text = answer
    }
}
```

```kotlin
import io.aatricks.llmedge.LLMEdgeManager

CoroutineScope(Dispatchers.IO).launch {
    // Simple transcription
    val text = LLMEdgeManager.transcribeAudioToText(
        context = context,
        audioSamples = audioSamples // 16kHz mono PCM float32
    )

    // Full transcription with timing
    val segments = LLMEdgeManager.transcribeAudio(
        context = context,
        params = LLMEdgeManager.TranscriptionParams(
            audioSamples = audioSamples,
            language = "en"
        )
    ) { progress ->
        Log.d("Whisper", "Progress: $progress%")
    }

    withContext(Dispatchers.Main) {
        segments.forEach { segment ->
            textView.append("[${segment.startTimeMs}ms] ${segment.text}\n")
        }
    }
}
```

For live captioning from a microphone:
```kotlin
import io.aatricks.llmedge.LLMEdgeManager

class LiveCaptionActivity : AppCompatActivity() {
    private var transcriber: Whisper.StreamingTranscriber? = null

    fun startLiveCaptions() {
        lifecycleScope.launch(Dispatchers.IO) {
            // Create streaming transcriber with sliding window
            transcriber = LLMEdgeManager.createStreamingTranscriber(
                context = this@LiveCaptionActivity,
                params = LLMEdgeManager.StreamingTranscriptionParams(
                    stepMs = 3000,    // Process every 3 seconds
                    lengthMs = 10000, // 10-second windows
                    language = "en",
                    useVad = true     // Skip silent audio
                )
            )

            // Collect transcription results
            transcriber?.start()?.collect { segment ->
                withContext(Dispatchers.Main) {
                    captionTextView.text = segment.text
                }
            }
        }
    }

    // Feed audio from microphone (called by AudioRecord callback)
    fun onAudioData(samples: FloatArray) {
        lifecycleScope.launch(Dispatchers.IO) {
            transcriber?.feedAudio(samples)
        }
    }

    fun stopLiveCaptions() {
        transcriber?.stop()
        LLMEdgeManager.stopStreamingTranscription()
    }
}
```

```kotlin
import io.aatricks.llmedge.LLMEdgeManager

CoroutineScope(Dispatchers.IO).launch {
    // Generate speech (model auto-downloads on first use)
    val audio = LLMEdgeManager.synthesizeSpeech(
        context = context,
        params = LLMEdgeManager.SpeechSynthesisParams(
            text = "Hello, world!",
            nThreads = 8 // Use more threads for faster generation
        )
    ) { step, progress ->
        Log.d("Bark", "${step.name}: $progress%")
    }

    // Or save directly to file
    val outputFile = File(context.cacheDir, "output.wav")
    LLMEdgeManager.synthesizeSpeechToFile(
        context = context,
        text = "Hello, world!",
        outputFile = outputFile
    )

    // Unload when done
    LLMEdgeManager.unloadSpeechModels()
}
```

```kotlin
val bitmap = LLMEdgeManager.generateImage(
    context = this,
    params = LLMEdgeManager.ImageGenerationParams(
        prompt = "serene mountain landscape, sunset",
        width = 512,
        height = 512,
        steps = 20
    )
)
imageView.setImageBitmap(bitmap)
```

```kotlin
// Automatic memory management and sequential loading
val frames = LLMEdgeManager.generateVideo(
    context = this,
    params = LLMEdgeManager.VideoGenerationParams(
        prompt = "cat walking through garden",
        videoFrames = 8,
        width = 512,
        height = 512,
        steps = 20,
        cfgScale = 7.0f,
        flowShift = 3.0f,
        forceSequentialLoad = true // Safe for most devices
    )
) { status, current, total ->
    Log.d("VideoGen", "$status")
}
```

Monitor Memory Usage:

```kotlin
val snapshot = MemoryMetrics.snapshot(context)
Log.d("Memory", "Native heap: ${snapshot.nativePssKb / 1024}MB")
```

Optimization Strategies:
- Use quantized models (Q4_K_M) for lower memory footprint
- Enable CPU offloading for large models
- Close model instances when not in use
- Process images/video in batches with intermediate cleanup
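As a rough illustration of memory-aware configuration, a context size can be picked from the RAM the device has to spare. The thresholds below are assumptions for this sketch, not values prescribed by the library; in an app the available-RAM figure would come from `ActivityManager.MemoryInfo`:

```kotlin
// Illustrative heuristic: larger context windows cost more KV-cache memory,
// so scale the window down on low-RAM devices. Thresholds are arbitrary.
fun suggestedContextSize(availRamMb: Long): Int = when {
    availRamMb >= 8_192 -> 8_192
    availRamMb >= 4_096 -> 4_096
    availRamMb >= 2_048 -> 2_048
    else -> 1_024
}
```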
```kotlin
val params = SmolLM.InferenceParams(
    numThreads = Runtime.getRuntime().availableProcessors(),
    contextSize = 2048 // Adjust based on device RAM
)
```

Verify Vulkan availability:

```kotlin
if (SmolLM.isVulkanEnabled()) {
    Log.i("Performance", "Vulkan backend active")
} else {
    Log.w("Performance", "Falling back to CPU backend")
}
```

Check logcat for initialization:

```shell
adb logcat -s SmolLM:* SmolSD:* | grep -i vulkan
```

Symptoms: FileNotFoundException, IllegalStateException during load
Solutions:
- Verify model file exists in expected location
- Check available storage space
- Ensure network connectivity for Hugging Face downloads
- Validate model file integrity (not corrupted)
Symptoms: App crashes with OOM during inference or generation
Solutions:
- Use smaller models or quantized variants
- Reduce image/video resolution
- Enable CPU offloading: `offloadToCpu = true`
- Lower context window size
- Close unused model instances
Symptoms: Generation takes excessive time per token/frame
Solutions:
- Use quantized models (Q4_K_M, Q3_K_S)
- Reduce inference steps (15-20 is usually sufficient)
- Enable Vulkan on compatible devices
- Adjust thread count to match device cores
- Use smaller resolutions for media generation
Symptoms: Crashes or errors when loading Wan models
Solutions:
- Verify device has 12GB+ RAM
- Ensure all three files downloaded (main + VAE + T5XXL)
- Use explicit file paths (not modelId shorthand)
- Check stable-diffusion.cpp logs in logcat
- Verify sufficient storage for 6GB+ model files
Symptoms: UnsatisfiedLinkError, native crashes
Solutions:
- Rebuild AAR and reinstall app
- Verify NDK version matches (r27+)
- Check device ABI compatibility
- Inspect logcat for native stack traces
- Clean build: `./gradlew clean`
Symptoms: Whisper transcription crashing or producing garbled output
Solutions:
- Ensure audio is 16kHz mono PCM float32 format
- Use smaller models (tiny/base) for faster processing
- Check that model file downloaded completely
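The first point usually means converting the signed 16-bit PCM that `AudioRecord` delivers into normalized floats. A minimal sketch of just the sample-format step (resampling to 16 kHz and downmixing to mono are assumed to happen upstream):

```kotlin
// Convert signed 16-bit PCM samples to float32 in [-1.0, 1.0), the sample
// format Whisper expects. Dividing by 32768 maps Short.MIN_VALUE to -1.0f.
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768.0f }
```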
Run speech tests via adb:
```shell
adb shell am instrument -w -e class com.example.llmedgeexample.SpeechE2ETest \
    com.example.llmedgeexample.test/androidx.test.runner.AndroidJUnitRunner
```

Run automated video generation tests:

```shell
adb shell am start -n com.example.llmedgeexample/.HeadlessVideoTestActivity
```

Monitor test execution:

```shell
adb logcat -s VideoE2E:*
```

Test results are logged to logcat with detailed timing and validation metrics.
- Native models allocated via JNI in native heap
- Dalvik heap used only for Java objects and bitmaps
- Large file downloads use system DownloadManager
- Tensor operations execute in native memory space
- All model operations run on background threads (Dispatchers.IO)
- UI updates dispatched to Main thread
- Blocking calls avoided on UI thread
- Coroutines used for structured concurrency
- Models implement `AutoCloseable` for automatic cleanup
- Native resources freed via `close()` method
- File handles managed with try-with-resources pattern
- Memory-mapped files used for large model loading
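The close-on-scope-exit pattern this list describes maps onto Kotlin's `use` block, which guarantees `close()` runs even if the body throws. The sketch below uses a stand-in class since it only demonstrates the pattern, not the library's own model types:

```kotlin
// Stand-in for a model holding native resources; llmedge models expose
// close() through AutoCloseable the same way.
class FakeModel : AutoCloseable {
    var closed = false
        private set
    fun infer(prompt: String): String = "echo: $prompt"
    override fun close() { closed = true }
}

// `use` closes the model on normal exit and on exception alike
fun runOnce(): String = FakeModel().use { it.infer("hi") }
```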
Apache 2.0 - See LICENSE file for details
Contributions are welcome. Please review the main repository's contributing guidelines before submitting pull requests.