VLM Framework - Project Structure Documentation

🏗️ Architecture Overview

The VLM Framework is an extensible, production-ready system for working with Vision-Language Models. It provides a unified interface for different VLMs while maintaining flexibility for future extensions.

🎯 Design Principles

  • Extensibility: Easy to add new VLMs beyond Qwen-Image
  • Modularity: Clean separation of concerns with abstract interfaces
  • Performance: Optimized for GPU/CPU usage with memory management
  • Usability: Simple API for both beginners and advanced users
  • Configuration-Driven: YAML-based configuration system

📁 Directory Structure

vlm/
├── vlm_framework/                 # Core framework package
│   ├── __init__.py               # Package exports and version info
│   ├── core/                     # Core framework components
│   │   ├── __init__.py          
│   │   ├── base_model.py        # Abstract base class for all models
│   │   └── model_factory.py     # Factory pattern for model creation
│   ├── models/                   # Model implementations
│   │   ├── __init__.py
│   │   ├── qwen_image.py        # Qwen-Image text-to-image model
│   │   └── qwen_image_edit.py   # Qwen-Image-Edit editing model
│   └── utils/                    # Utility modules
│       ├── __init__.py
│       ├── config_loader.py     # Configuration management
│       └── device_utils.py      # Device/memory management
├── assets/                       # Demo images and resources
│   └── [various image files]
├── main.py                      # Main testing and demo script
├── config.yaml                 # Framework configuration
├── requirements.txt            # Python dependencies
├── CLAUDE.md                   # Claude Code guidance
├── PROJECT_STRUCTURE.md        # This file
└── README.md                   # Original Qwen-Image documentation

🧩 Core Components

1. Base Model Interface (core/base_model.py)

Purpose: Defines the contract that all VLM models must implement.

Key Classes:

  • BaseVLMModel: Abstract base class with common interface
  • GenerationResult: Container for model outputs with metadata
  • ModelCapabilities: Describes what each model can do

Key Methods:

# Model lifecycle
load_model(device, dtype) -> bool
unload_model() -> bool

# Core generation interface
generate(**kwargs) -> GenerationResult

# Introspection
capabilities -> ModelCapabilities
model_type -> str
is_loaded() -> bool
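
A minimal sketch of the contract, assuming standard abc usage (abridged; the actual definitions live in core/base_model.py):

from abc import ABC, abstractmethod

class BaseVLMModel(ABC):
    @abstractmethod
    def load_model(self, device=None, dtype=None) -> bool:
        """Load weights onto the target device; return True on success."""

    @abstractmethod
    def unload_model(self) -> bool:
        """Release weights and free device memory."""

    @abstractmethod
    def generate(self, **kwargs) -> "GenerationResult":
        """Run inference and return outputs plus metadata."""

    @property
    @abstractmethod
    def capabilities(self) -> "ModelCapabilities":
        """Describe what this model supports."""

    def is_loaded(self) -> bool:
        # Assumes implementations store their pipeline on self._pipeline
        return getattr(self, "_pipeline", None) is not None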

2. Model Factory (core/model_factory.py)

Purpose: Centralized model creation and management using Factory pattern.

Features:

  • Automatic model registration system
  • Singleton model instances with caching
  • Model discovery and introspection
  • Memory management for multiple models

Usage:

# Get a model instance
factory = ModelFactory()
model = factory.get_or_create_model('qwen_image')

# List available models
models = factory.get_available_models()
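
Internally, registration and caching can reduce to two dictionaries; a simplified sketch of the pattern (illustrative, not the actual implementation):

class ModelFactory:
    _registry = {}    # model name -> model class
    _instances = {}   # model name -> cached singleton instance

    @classmethod
    def register_model(cls, name, model_cls):
        cls._registry[name] = model_cls

    def get_or_create_model(self, name):
        # Reuse the cached instance so weights are loaded at most once
        if name not in self._instances:
            self._instances[name] = self._registry[name]()
        return self._instances[name]

    def get_available_models(self):
        return list(self._registry)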

3. Configuration System (utils/config_loader.py)

Purpose: YAML-based configuration management with validation.

Features:

  • Hierarchical configuration (framework → models → specific settings)
  • Default configuration generation
  • Runtime configuration updates
  • Validation and error handling

Configuration Structure:

framework:          # Framework-level settings
models:            # Model-specific configurations
  qwen_image:      # Qwen-Image settings
    model_name: "Qwen/Qwen-Image"
    default_params: {...}
    aspect_ratios: {...}
hardware:          # Hardware optimization settings
output:            # Output formatting settings
logging:           # Logging configuration
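
At its core, loading reduces to yaml.safe_load plus hierarchical lookup; a minimal sketch (the real ConfigLoader adds validation and default generation on top):

import yaml

def load_config(path="config.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)  # safe_load: no arbitrary object construction
    # Hierarchical lookup: framework -> models -> model-specific settings
    model_cfg = config["models"]["qwen_image"]
    return config, model_cfg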

4. Device Management (utils/device_utils.py)

Purpose: Intelligent device selection and memory optimization.

Features:

  • Auto-detection of optimal device (CUDA/MPS/CPU)
  • Memory requirement checking
  • Performance optimization settings
  • Memory cache management
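
The CUDA → MPS → CPU fallback can be sketched in a few lines of PyTorch (illustrative; the real device_utils.py also checks memory requirements):

import torch

def select_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Silicon
        return torch.device("mps")
    return torch.device("cpu")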

🤖 Model Implementations

Qwen-Image (models/qwen_image.py)

Capabilities:

  • ✅ Text-to-image generation
  • ✅ Multiple aspect ratios
  • ✅ Language-specific prompt enhancement
  • ✅ LoRA support
  • ✅ Batch processing

Key Features:

  • Magic prompt enhancement for English/Chinese
  • Predefined aspect ratio templates
  • Comprehensive parameter validation
  • Memory-optimized inference
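
A hypothetical end-to-end call; the keyword names mirror the CLI flags in main.py, and the GenerationResult fields are assumptions:

from vlm_framework.core.model_factory import ModelFactory

factory = ModelFactory()
model = factory.get_or_create_model('qwen_image')
model.load_model()  # device/dtype selected automatically

result = model.generate(
    prompt="A beautiful sunset over mountains",
    aspect_ratio="16:9",  # resolved via the predefined templates
)
result.images[0].save("sunset.png")  # assumes results expose a list of PIL images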

Qwen-Image-Edit (models/qwen_image_edit.py)

Capabilities:

  • ✅ Image editing with text prompts
  • ✅ Style transfer
  • ✅ Object addition/removal
  • ✅ Background replacement
  • ✅ Text editing within images

Key Features:

  • Convenience methods for common editing tasks
  • Chained editing workflow support
  • Strength control for edit intensity
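
A hypothetical chained-editing session; the image/strength keyword names and result fields are assumptions:

from PIL import Image
from vlm_framework.core.model_factory import ModelFactory

factory = ModelFactory()
model = factory.get_or_create_model('qwen_image_edit')
model.load_model()

# Chained editing: feed each result back in as the next input
step1 = model.generate(image=Image.open("input.jpg"),
                       prompt="Change to winter scene",
                       strength=0.8)  # higher strength changes more of the image
step2 = model.generate(image=step1.images[0],
                       prompt="Add falling snow",
                       strength=0.5)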

🚀 Main Testing Script (main.py)

Purpose: Comprehensive testing and demonstration interface.

Modes:

  1. Command Line: Direct operations via CLI arguments
  2. Interactive: Shell-like interface for experimentation
  3. Benchmark: Performance testing and metrics
  4. System Info: Hardware and configuration display

Usage Examples:

# List available models
python main.py --list-models

# Generate image
python main.py --model qwen_image --text-to-image \
  --prompt "A beautiful sunset over mountains" \
  --aspect-ratio 16:9

# Edit image
python main.py --model qwen_image_edit --edit-image \
  --image input.jpg --prompt "Change to winter scene"

# Interactive mode
python main.py --interactive

# Benchmark model
python main.py --benchmark --model qwen_image

🔧 Extension Guide

Adding a New Model

  1. Create Model Class:
# vlm_framework/models/new_model.py
from ..core.base_model import BaseVLMModel, GenerationResult, ModelCapabilities

class NewModel(BaseVLMModel):
    @property
    def capabilities(self):
        return ModelCapabilities(text_to_image=True)

    def load_model(self, device=None, dtype=None):
        # Model loading logic
        pass

    def generate(self, **kwargs):
        # Generation logic; should return a GenerationResult
        pass
  2. Add Configuration:
# config.yaml
models:
  new_model:
    model_name: "path/to/model"
    model_type: "text_to_image"
    default_params:
      steps: 50
  3. Register Model:
# Update vlm_framework/core/model_factory.py
from ..models.new_model import NewModel
ModelFactory.register_model('new_model', NewModel)

Supported Model Types

The framework is designed to support various model types:

  • text_to_image: Generate images from text prompts
  • image_edit: Edit existing images with text instructions
  • image_to_image: Transform images with style/content changes
  • inpainting: Fill masked regions in images
  • outpainting: Extend image boundaries
  • controlnet: Conditional generation with control signals
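
These types map naturally onto ModelCapabilities flags; a hypothetical dataclass sketch, consistent with the ModelCapabilities(text_to_image=True) usage above:

from dataclasses import dataclass

@dataclass
class ModelCapabilities:
    text_to_image: bool = False
    image_edit: bool = False
    image_to_image: bool = False
    inpainting: bool = False
    outpainting: bool = False
    controlnet: bool = False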

Future Extension Examples

Stable Diffusion:

from diffusers import StableDiffusionPipeline

class StableDiffusionModel(BaseVLMModel):
    def load_model(self, device=None, dtype=None):
        self.pipeline = StableDiffusionPipeline.from_pretrained(
            self.model_name, torch_dtype=self.dtype
        )
        # Additional setup...

DALL-E API:

from openai import OpenAI

class DALLEModel(BaseVLMModel):
    def load_model(self, device=None, dtype=None):
        # API-based model, no local weights to load
        self.api_client = OpenAI(api_key=self.config['api_key'])

    def generate(self, prompt, **kwargs):
        # API call implementation
        pass

📊 Performance Considerations

Memory Management

  • Attention Slicing: Reduces memory usage for high-resolution generation
  • CPU Offloading: Moves model components to CPU when not in use
  • Model Unloading: Explicit memory cleanup when switching models
  • Batch Size Optimization: Dynamic batch sizing based on available memory
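
With a diffusers-backed pipeline, these map onto a few standard calls; a sketch (the model ID comes from config.yaml):

import gc
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image",
                                         torch_dtype=torch.bfloat16)
pipe.enable_attention_slicing()    # trade some speed for lower peak VRAM
pipe.enable_model_cpu_offload()    # keep idle components on the CPU

# Explicit unload when switching models
del pipe
gc.collect()
torch.cuda.empty_cache()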

Device Optimization

  • Automatic Device Selection: CUDA → MPS → CPU fallback
  • Mixed Precision: Uses bfloat16/float16 on GPU for faster inference
  • Compilation: torch.compile optimization when available
  • Memory Monitoring: Real-time memory usage tracking
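
Continuing the pipeline sketch above, precision and compilation can be chosen per device (which submodule to compile is pipeline-specific; illustrative only):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

if hasattr(torch, "compile") and device == "cuda":
    # 'pipe' is the diffusers pipeline from the previous sketch
    pipe.transformer = torch.compile(pipe.transformer)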

🧪 Testing Strategy

Unit Tests (Planned)

  • Model loading/unloading
  • Parameter validation
  • Configuration parsing
  • Device selection logic
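
Since these are planned, here is a hypothetical pytest sketch of what the factory tests might look like (the exception type is an assumption):

import pytest
from vlm_framework.core.model_factory import ModelFactory

def test_unknown_model_is_rejected():
    factory = ModelFactory()
    with pytest.raises(KeyError):
        factory.get_or_create_model("no_such_model")

def test_available_models_are_listed():
    factory = ModelFactory()
    assert "qwen_image" in factory.get_available_models()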

Integration Tests (Planned)

  • End-to-end generation workflows
  • Multi-model switching
  • Memory management under load
  • Configuration edge cases

Performance Tests

  • Benchmark script included in main.py
  • Memory usage profiling
  • Generation speed metrics
  • Device-specific optimization validation

🔐 Security Considerations

Model Security

  • No automatic code execution from model configurations
  • Sanitized file path handling
  • Safe YAML loading (yaml.safe_load)
  • Input validation for all user parameters

API Security (Future)

  • Rate limiting for API endpoints
  • Authentication for model access
  • Secure API key handling
  • Input sanitization for prompts

📈 Monitoring and Logging

Logging Levels

  • DEBUG: Detailed execution traces
  • INFO: Model operations and status
  • WARNING: Performance/compatibility issues
  • ERROR: Operation failures
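
Wiring these levels up needs nothing beyond the standard library; a minimal sketch (the level would normally come from the logging section of config.yaml):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("vlm_framework")
logger.info("Model qwen_image loaded")  # INFO: model operations and status
logger.warning("bfloat16 unsupported, falling back to float32")  # WARNING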

Metrics Collection

  • Generation timing
  • Memory usage patterns
  • Error rates by model
  • Device utilization

Output Management

  • Structured metadata with all images
  • Generation parameter recording
  • Automatic output directory creation
  • File naming with timestamps
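
A minimal sketch of timestamped, metadata-carrying output (the helper name and file layout are illustrative):

import json
from datetime import datetime
from pathlib import Path

def save_output(image, params, out_dir="outputs"):
    Path(out_dir).mkdir(parents=True, exist_ok=True)  # auto-create directory
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    image.save(f"{out_dir}/{stamp}.png")
    # Record the generation parameters alongside the image
    with open(f"{out_dir}/{stamp}.json", "w", encoding="utf-8") as f:
        json.dump(params, f, indent=2)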
