VLM Framework - Project Structure Documentation

🏗️ Architecture Overview

The VLM Framework is an extensible, production-ready system for working with Vision-Language Models. It provides a unified interface for different VLMs while maintaining flexibility for future extensions.

🎯 Design Principles

  • Extensibility: Easy to add new VLMs beyond Qwen-Image
  • Modularity: Clean separation of concerns with abstract interfaces
  • Performance: Optimized for GPU/CPU usage with memory management
  • Usability: Simple API for both beginners and advanced users
  • Configuration-Driven: YAML-based configuration system

📁 Directory Structure

vlm/
├── vlm_framework/                 # Core framework package
│   ├── __init__.py               # Package exports and version info
│   ├── core/                     # Core framework components
│   │   ├── __init__.py          
│   │   ├── base_model.py        # Abstract base class for all models
│   │   └── model_factory.py     # Factory pattern for model creation
│   ├── models/                   # Model implementations
│   │   ├── __init__.py
│   │   ├── qwen_image.py        # Qwen-Image text-to-image model
│   │   └── qwen_image_edit.py   # Qwen-Image-Edit editing model
│   └── utils/                    # Utility modules
│       ├── __init__.py
│       ├── config_loader.py     # Configuration management
│       └── device_utils.py      # Device/memory management
├── assets/                       # Demo images and resources
│   └── [various image files]
├── main.py                      # Main testing and demo script
├── config.yaml                 # Framework configuration
├── requirements.txt            # Python dependencies
├── CLAUDE.md                   # Claude Code guidance
├── PROJECT_STRUCTURE.md        # This file
└── README.md                   # Original Qwen-Image documentation

🧩 Core Components

1. Base Model Interface (core/base_model.py)

Purpose: Defines the contract that all VLM models must implement.

Key Classes:

  • BaseVLMModel: Abstract base class with common interface
  • GenerationResult: Container for model outputs with metadata
  • ModelCapabilities: Describes what each model can do

Key Methods:

# Model lifecycle
load_model(device, dtype) -> bool
unload_model() -> bool

# Core generation interface
generate(**kwargs) -> GenerationResult

# Introspection
capabilities -> ModelCapabilities
model_type -> str
is_loaded() -> bool
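
A minimal sketch of the contract, assuming standard abc usage (abridged; the actual definitions live in core/base_model.py):

from abc import ABC, abstractmethod

class BaseVLMModel(ABC):
    @abstractmethod
    def load_model(self, device=None, dtype=None) -> bool:
        """Load weights onto the target device; return True on success."""

    @abstractmethod
    def unload_model(self) -> bool:
        """Release weights and free device memory."""

    @abstractmethod
    def generate(self, **kwargs) -> "GenerationResult":
        """Run inference and return outputs plus metadata."""

    @property
    @abstractmethod
    def capabilities(self) -> "ModelCapabilities":
        """Describe what this model supports."""

    def is_loaded(self) -> bool:
        # Assumes implementations store their pipeline on self._pipeline
        return getattr(self, "_pipeline", None) is not None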

2. Model Factory (core/model_factory.py)

Purpose: Centralized model creation and management using Factory pattern.

Features:

  • Automatic model registration system
  • Singleton model instances with caching
  • Model discovery and introspection
  • Memory management for multiple models

Usage:

# Get a model instance
factory = ModelFactory()
model = factory.get_or_create_model('qwen_image')

# List available models
models = factory.get_available_models()
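
Internally, registration and caching can reduce to two dictionaries; a simplified sketch of the pattern (illustrative, not the actual implementation):

class ModelFactory:
    _registry = {}    # model name -> model class
    _instances = {}   # model name -> cached singleton instance

    @classmethod
    def register_model(cls, name, model_cls):
        cls._registry[name] = model_cls

    def get_or_create_model(self, name):
        # Reuse the cached instance so weights are loaded at most once
        if name not in self._instances:
            self._instances[name] = self._registry[name]()
        return self._instances[name]

    def get_available_models(self):
        return list(self._registry)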

3. Configuration System (utils/config_loader.py)

Purpose: YAML-based configuration management with validation.

Features:

  • Hierarchical configuration (framework → models → specific settings)
  • Default configuration generation
  • Runtime configuration updates
  • Validation and error handling

Configuration Structure:

framework:          # Framework-level settings
models:            # Model-specific configurations
  qwen_image:      # Qwen-Image settings
    model_name: "Qwen/Qwen-Image"
    default_params: {...}
    aspect_ratios: {...}
hardware:          # Hardware optimization settings
output:            # Output formatting settings
logging:           # Logging configuration
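
At its core, loading reduces to yaml.safe_load plus hierarchical lookup; a minimal sketch (the real ConfigLoader adds validation and default generation on top):

import yaml

def load_config(path="config.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)  # safe_load: no arbitrary object construction
    # Hierarchical lookup: framework -> models -> model-specific settings
    model_cfg = config["models"]["qwen_image"]
    return config, model_cfg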

4. Device Management (utils/device_utils.py)

Purpose: Intelligent device selection and memory optimization.

Features:

  • Auto-detection of optimal device (CUDA/MPS/CPU)
  • Memory requirement checking
  • Performance optimization settings
  • Memory cache management
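
The CUDA → MPS → CPU fallback can be sketched in a few lines of PyTorch (illustrative; the real device_utils.py also checks memory requirements):

import torch

def select_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Silicon
        return torch.device("mps")
    return torch.device("cpu")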

🤖 Model Implementations

Qwen-Image (models/qwen_image.py)

Capabilities:

  • ✅ Text-to-image generation
  • ✅ Multiple aspect ratios
  • ✅ Language-specific prompt enhancement
  • ✅ LoRA support
  • ✅ Batch processing

Key Features:

  • Magic prompt enhancement for English/Chinese
  • Predefined aspect ratio templates
  • Comprehensive parameter validation
  • Memory-optimized inference
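
A hypothetical end-to-end call; the keyword names mirror the CLI flags in main.py, and the GenerationResult fields are assumptions:

from vlm_framework.core.model_factory import ModelFactory

factory = ModelFactory()
model = factory.get_or_create_model('qwen_image')
model.load_model()  # device/dtype selected automatically

result = model.generate(
    prompt="A beautiful sunset over mountains",
    aspect_ratio="16:9",  # resolved via the predefined templates
)
result.images[0].save("sunset.png")  # assumes results expose a list of PIL images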

Qwen-Image-Edit (models/qwen_image_edit.py)

Capabilities:

  • ✅ Image editing with text prompts
  • ✅ Style transfer
  • ✅ Object addition/removal
  • ✅ Background replacement
  • ✅ Text editing within images

Key Features:

  • Convenience methods for common editing tasks
  • Chained editing workflow support
  • Strength control for edit intensity
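
A hypothetical chained-editing session; the image/strength keyword names and result fields are assumptions:

from PIL import Image
from vlm_framework.core.model_factory import ModelFactory

factory = ModelFactory()
model = factory.get_or_create_model('qwen_image_edit')
model.load_model()

# Chained editing: feed each result back in as the next input
step1 = model.generate(image=Image.open("input.jpg"),
                       prompt="Change to winter scene",
                       strength=0.8)  # higher strength changes more of the image
step2 = model.generate(image=step1.images[0],
                       prompt="Add falling snow",
                       strength=0.5)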

🚀 Main Testing Script (main.py)

Purpose: Comprehensive testing and demonstration interface.

Modes:

  1. Command Line: Direct operations via CLI arguments
  2. Interactive: Shell-like interface for experimentation
  3. Benchmark: Performance testing and metrics
  4. System Info: Hardware and configuration display

Usage Examples:

# List available models
python main.py --list-models

# Generate image
python main.py --model qwen_image --text-to-image \
  --prompt "A beautiful sunset over mountains" \
  --aspect-ratio 16:9

# Edit image
python main.py --model qwen_image_edit --edit-image \
  --image input.jpg --prompt "Change to winter scene"

# Interactive mode
python main.py --interactive

# Benchmark model
python main.py --benchmark --model qwen_image

🔧 Extension Guide

Adding a New Model

  1. Create Model Class:
# vlm_framework/models/new_model.py
from ..core.base_model import BaseVLMModel, GenerationResult, ModelCapabilities

class NewModel(BaseVLMModel):
    @property
    def capabilities(self):
        return ModelCapabilities(text_to_image=True)

    def load_model(self, device=None, dtype=None):
        # Model loading logic
        pass

    def generate(self, **kwargs):
        # Generation logic; should return a GenerationResult
        pass
  2. Add Configuration:
# config.yaml
models:
  new_model:
    model_name: "path/to/model"
    model_type: "text_to_image"
    default_params:
      steps: 50
  3. Register Model:
# Update vlm_framework/core/model_factory.py
from ..models.new_model import NewModel
ModelFactory.register_model('new_model', NewModel)

Supported Model Types

The framework is designed to support various model types:

  • text_to_image: Generate images from text prompts
  • image_edit: Edit existing images with text instructions
  • image_to_image: Transform images with style/content changes
  • inpainting: Fill masked regions in images
  • outpainting: Extend image boundaries
  • controlnet: Conditional generation with control signals
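
These types map naturally onto ModelCapabilities flags; a hypothetical dataclass sketch, consistent with the ModelCapabilities(text_to_image=True) usage above:

from dataclasses import dataclass

@dataclass
class ModelCapabilities:
    text_to_image: bool = False
    image_edit: bool = False
    image_to_image: bool = False
    inpainting: bool = False
    outpainting: bool = False
    controlnet: bool = False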

Future Extension Examples

Stable Diffusion:

from diffusers import StableDiffusionPipeline

class StableDiffusionModel(BaseVLMModel):
    def load_model(self, device=None, dtype=None):
        self.pipeline = StableDiffusionPipeline.from_pretrained(
            self.model_name, torch_dtype=self.dtype
        )
        # Additional setup...

DALL-E API:

from openai import OpenAI

class DALLEModel(BaseVLMModel):
    def load_model(self, device=None, dtype=None):
        # API-based model, no local weights to load
        self.api_client = OpenAI(api_key=self.config['api_key'])

    def generate(self, prompt, **kwargs):
        # API call implementation
        pass

📊 Performance Considerations

Memory Management

  • Attention Slicing: Reduces memory usage for high-resolution generation
  • CPU Offloading: Moves model components to CPU when not in use
  • Model Unloading: Explicit memory cleanup when switching models
  • Batch Size Optimization: Dynamic batch sizing based on available memory
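
With a diffusers-backed pipeline, these map onto a few standard calls; a sketch (the model ID comes from config.yaml):

import gc
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image",
                                         torch_dtype=torch.bfloat16)
pipe.enable_attention_slicing()    # trade some speed for lower peak VRAM
pipe.enable_model_cpu_offload()    # keep idle components on the CPU

# Explicit unload when switching models
del pipe
gc.collect()
torch.cuda.empty_cache()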

Device Optimization

  • Automatic Device Selection: CUDA → MPS → CPU fallback
  • Mixed Precision: Uses bfloat16/float16 on GPU for faster inference
  • Compilation: torch.compile optimization when available
  • Memory Monitoring: Real-time memory usage tracking
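
Continuing the pipeline sketch above, precision and compilation can be chosen per device (which submodule to compile is pipeline-specific; illustrative only):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

if hasattr(torch, "compile") and device == "cuda":
    # 'pipe' is the diffusers pipeline from the previous sketch
    pipe.transformer = torch.compile(pipe.transformer)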

🧪 Testing Strategy

Unit Tests (Planned)

  • Model loading/unloading
  • Parameter validation
  • Configuration parsing
  • Device selection logic
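
Since these are planned, here is a hypothetical pytest sketch of what the factory tests might look like (the exception type is an assumption):

import pytest
from vlm_framework.core.model_factory import ModelFactory

def test_unknown_model_is_rejected():
    factory = ModelFactory()
    with pytest.raises(KeyError):
        factory.get_or_create_model("no_such_model")

def test_available_models_are_listed():
    factory = ModelFactory()
    assert "qwen_image" in factory.get_available_models()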

Integration Tests (Planned)

  • End-to-end generation workflows
  • Multi-model switching
  • Memory management under load
  • Configuration edge cases

Performance Tests

  • Benchmark script included in main.py
  • Memory usage profiling
  • Generation speed metrics
  • Device-specific optimization validation

🔐 Security Considerations

Model Security

  • No automatic code execution from model configurations
  • Sanitized file path handling
  • Safe YAML loading (yaml.safe_load)
  • Input validation for all user parameters

API Security (Future)

  • Rate limiting for API endpoints
  • Authentication for model access
  • Secure API key handling
  • Input sanitization for prompts

📈 Monitoring and Logging

Logging Levels

  • DEBUG: Detailed execution traces
  • INFO: Model operations and status
  • WARNING: Performance/compatibility issues
  • ERROR: Operation failures
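
Wiring these levels up needs nothing beyond the standard library; a minimal sketch (the level would normally come from the logging section of config.yaml):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("vlm_framework")
logger.info("Model qwen_image loaded")  # INFO: model operations and status
logger.warning("bfloat16 unsupported, falling back to float32")  # WARNING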

Metrics Collection

  • Generation timing
  • Memory usage patterns
  • Error rates by model
  • Device utilization

Output Management

  • Structured metadata with all images
  • Generation parameter recording
  • Automatic output directory creation
  • File naming with timestamps
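
A minimal sketch of timestamped, metadata-carrying output (the helper name and file layout are illustrative):

import json
from datetime import datetime
from pathlib import Path

def save_output(image, params, out_dir="outputs"):
    Path(out_dir).mkdir(parents=True, exist_ok=True)  # auto-create directory
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    image.save(f"{out_dir}/{stamp}.png")
    # Record the generation parameters alongside the image
    with open(f"{out_dir}/{stamp}.json", "w", encoding="utf-8") as f:
        json.dump(params, f, indent=2)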
