The VLM Framework is an extensible, production-ready system for working with Vision-Language Models (VLMs). It provides a unified interface across different models while remaining flexible for future extensions.
- Extensibility: Easy to add new VLM models beyond Qwen-Image
- Modularity: Clean separation of concerns with abstract interfaces
- Performance: Optimized for GPU/CPU usage with memory management
- Usability: Simple API for both beginners and advanced users
- Configuration-Driven: YAML-based configuration system
```
vlm/
├── vlm_framework/              # Core framework package
│   ├── __init__.py             # Package exports and version info
│   ├── core/                   # Core framework components
│   │   ├── __init__.py
│   │   ├── base_model.py       # Abstract base class for all models
│   │   └── model_factory.py    # Factory pattern for model creation
│   ├── models/                 # Model implementations
│   │   ├── __init__.py
│   │   ├── qwen_image.py       # Qwen-Image text-to-image model
│   │   └── qwen_image_edit.py  # Qwen-Image-Edit editing model
│   └── utils/                  # Utility modules
│       ├── __init__.py
│       ├── config_loader.py    # Configuration management
│       └── device_utils.py     # Device/memory management
├── assets/                     # Demo images and resources
│   └── [various image files]
├── main.py                     # Main testing and demo script
├── config.yaml                 # Framework configuration
├── requirements.txt            # Python dependencies
├── CLAUDE.md                   # Claude Code guidance
├── PROJECT_STRUCTURE.md        # This file
└── README.md                   # Original Qwen-Image documentation
```
Purpose: `core/base_model.py` defines the contract that all VLM models must implement.
Key Classes:
- `BaseVLMModel`: Abstract base class with common interface
- `GenerationResult`: Container for model outputs with metadata
- `ModelCapabilities`: Describes what each model can do
Key Methods:
```python
# Model lifecycle
load_model(device, dtype) -> bool
unload_model() -> bool

# Core generation interface
generate(**kwargs) -> GenerationResult

# Introspection
capabilities -> ModelCapabilities
model_type -> str
is_loaded() -> bool
```
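As a hedged sketch of a typical lifecycle against this contract (the keyword arguments and the `GenerationResult` field accessed below are assumptions; `ModelFactory` is covered in the next section):

```python
import torch
from vlm_framework.core.model_factory import ModelFactory

factory = ModelFactory()
model = factory.get_or_create_model('qwen_image')

if model.load_model(device='cuda', dtype=torch.bfloat16):
    result = model.generate(prompt="A calm lake at dawn")
    print(result.metadata)  # generation parameters, timing, etc. (assumed field)
    model.unload_model()    # release memory before switching models
```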
Purpose: `core/model_factory.py` centralizes model creation and management via the Factory pattern.
Features:
- Automatic model registration system
- Singleton model instances with caching
- Model discovery and introspection
- Memory management for multiple models
Usage:
```python
# Get a model instance
factory = ModelFactory()
model = factory.get_or_create_model('qwen_image')

# List available models
models = factory.get_available_models()
```
Purpose: `utils/config_loader.py` provides YAML-based configuration management with validation.
Features:
- Hierarchical configuration (framework → models → specific settings)
- Default configuration generation
- Runtime configuration updates
- Validation and error handling
Configuration Structure:
```yaml
framework:          # Framework-level settings
models:             # Model-specific configurations
  qwen_image:       # Qwen-Image settings
    model_name: "Qwen/Qwen-Image"
    default_params: {...}
    aspect_ratios: {...}
hardware:           # Hardware optimization settings
output:             # Output formatting settings
logging:            # Logging configuration
```
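As a hedged sketch of how this structure might be consumed directly (the framework's own `ConfigLoader` presumably wraps this; `yaml.safe_load` matches the security notes later in this document, and the `steps` key is illustrative):

```python
import yaml

# Load the hierarchical configuration; safe_load never executes embedded code.
with open('config.yaml') as f:
    config = yaml.safe_load(f)

# Walk framework -> models -> model-specific settings.
qwen_cfg = config['models']['qwen_image']
print(qwen_cfg['model_name'])                        # "Qwen/Qwen-Image"
steps = qwen_cfg['default_params'].get('steps', 50)  # fall back to a default
```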
Purpose: `utils/device_utils.py` handles intelligent device selection and memory optimization.
Features (sketched after this list):
- Auto-detection of optimal device (CUDA/MPS/CPU)
- Memory requirement checking
- Performance optimization settings
- Memory cache management
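The selection logic can be sketched with standard PyTorch checks (the helper name here is illustrative, not the framework's actual API):

```python
import torch

def auto_select_device() -> str:
    # Prefer CUDA, then Apple Silicon (MPS), then CPU, matching the
    # CUDA -> MPS -> CPU fallback described under Performance below.
    if torch.cuda.is_available():
        return 'cuda'
    if torch.backends.mps.is_available():
        return 'mps'
    return 'cpu'
```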
Capabilities of `models/qwen_image.py` (Qwen-Image):
- ✅ Text-to-image generation
- ✅ Multiple aspect ratios
- ✅ Language-specific prompt enhancement
- ✅ LoRA support
- ✅ Batch processing
Key Features (example after this list):
- Magic prompt enhancement for English/Chinese
- Predefined aspect ratio templates
- Comprehensive parameter validation
- Memory-optimized inference
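A hedged example of what generation might look like through the factory (the `aspect_ratio` and `num_inference_steps` keyword names are assumptions based on the features above):

```python
from vlm_framework.core.model_factory import ModelFactory

factory = ModelFactory()
model = factory.get_or_create_model('qwen_image')
model.load_model()  # device auto-detected when omitted (assumed default)

result = model.generate(
    prompt="A misty mountain village at sunrise",  # magic prompt enhancement applies
    aspect_ratio="16:9",                           # predefined template
    num_inference_steps=50,                        # validated before inference
)
```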
Capabilities of `models/qwen_image_edit.py` (Qwen-Image-Edit):
- ✅ Image editing with text prompts
- ✅ Style transfer
- ✅ Object addition/removal
- ✅ Background replacement
- ✅ Text editing within images
Key Features (example after this list):
- Convenience methods for common editing tasks
- Chained editing workflow support
- Strength control for edit intensity
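A sketch of a chained editing workflow (the `image` and `strength` keyword names, and the `image` field on the result, are assumptions reflecting the features above):

```python
from vlm_framework.core.model_factory import ModelFactory

factory = ModelFactory()
editor = factory.get_or_create_model('qwen_image_edit')
editor.load_model()

# Chain edits by feeding each result's image back into the next call.
step1 = editor.generate(image="input.jpg", prompt="Replace the background with a beach")
step2 = editor.generate(image=step1.image, prompt="Add a red umbrella", strength=0.6)
```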
Purpose: `main.py` is the comprehensive testing and demonstration interface.
Modes:
- Command Line: Direct operations via CLI arguments
- Interactive: Shell-like interface for experimentation
- Benchmark: Performance testing and metrics
- System Info: Hardware and configuration display
Usage Examples:
```bash
# List available models
python main.py --list-models

# Generate image
python main.py --model qwen_image --text-to-image \
    --prompt "A beautiful sunset over mountains" \
    --aspect-ratio 16:9

# Edit image
python main.py --model qwen_image_edit --edit-image \
    --image input.jpg --prompt "Change to winter scene"

# Interactive mode
python main.py --interactive

# Benchmark model
python main.py --benchmark --model qwen_image
```
Adding a new model takes three steps.

1. Create the model class:

```python
# vlm_framework/models/new_model.py
from ..core.base_model import BaseVLMModel, ModelCapabilities

class NewModel(BaseVLMModel):
    @property
    def capabilities(self):
        return ModelCapabilities(text_to_image=True)

    def load_model(self, device=None, dtype=None):
        # Model loading logic; return True on success
        pass

    def generate(self, **kwargs):
        # Generation logic; return a GenerationResult
        pass
```
2. Add configuration:

```yaml
# config.yaml
models:
  new_model:
    model_name: "path/to/model"
    model_type: "text_to_image"
    default_params:
      steps: 50
```
3. Register the model:

```python
# Update vlm_framework/core/model_factory.py
from ..models.new_model import NewModel

ModelFactory.register_model('new_model', NewModel)
```
The framework is designed to support various model types (see the capabilities sketch after this list):
- text_to_image: Generate images from text prompts
- image_edit: Edit existing images with text instructions
- image_to_image: Transform images with style/content changes
- inpainting: Fill masked regions in images
- outpainting: Extend image boundaries
- controlnet: Conditional generation with control signals
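These types map naturally onto `ModelCapabilities` flags. A hedged sketch of declaring a multi-skill model (only `text_to_image` is confirmed by the registration example above; the other field names are assumptions):

```python
from vlm_framework.core.base_model import ModelCapabilities

caps = ModelCapabilities(
    text_to_image=True,
    image_edit=True,   # assumed flag name
    inpainting=True,   # assumed flag name
)

if caps.inpainting:
    print("model can fill masked regions")
```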
Example integrations with other model families:

Stable Diffusion:

```python
from diffusers import StableDiffusionPipeline

class StableDiffusionModel(BaseVLMModel):
    def load_model(self, device=None, dtype=None):
        self.pipeline = StableDiffusionPipeline.from_pretrained(
            self.model_name, torch_dtype=self.dtype
        )
        # Additional setup...
```
DALL-E API:

```python
from openai import OpenAI

class DALLEModel(BaseVLMModel):
    def load_model(self, device=None, dtype=None):
        # API-based model, no local weights to load
        self.api_client = OpenAI(api_key=self.config['api_key'])

    def generate(self, prompt, **kwargs):
        # API call implementation
        pass
```
Memory optimization strategies (see the sketch after this list):
- Attention Slicing: Reduces memory usage for high-resolution generation
- CPU Offloading: Moves model components to CPU when not in use
- Model Unloading: Explicit memory cleanup when switching models
- Batch Size Optimization: Dynamic batch sizing based on available memory
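For diffusers-based models, the first three strategies correspond to standard pipeline calls; a minimal sketch (the checkpoint name is illustrative, and CPU offloading requires the accelerate package):

```python
import gc
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

pipe.enable_attention_slicing()    # lower peak memory at high resolutions
pipe.enable_model_cpu_offload()    # keep idle components on the CPU

# Explicit unloading when switching models:
del pipe
gc.collect()
torch.cuda.empty_cache()
```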
Performance features (sketched after this list):
- Automatic Device Selection: CUDA → MPS → CPU fallback
- Mixed Precision: Uses bfloat16/float16 on GPU for faster inference
- Compilation: torch.compile optimization when available
- Memory Monitoring: Real-time memory usage tracking
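A sketch of the precision and compilation choices (compiling the UNet specifically is an assumption; the framework may compile other components):

```python
import torch
from diffusers import StableDiffusionPipeline

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Half precision only pays off on GPU; CPU inference stays in float32.
dtype = torch.bfloat16 if device == 'cuda' else torch.float32

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
).to(device)

# torch.compile is optional and version-dependent, so guard the call.
if hasattr(torch, 'compile'):
    pipe.unet = torch.compile(pipe.unet)
```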
Unit testing targets:
- Model loading/unloading
- Parameter validation
- Configuration parsing
- Device selection logic

Integration testing targets:
- End-to-end generation workflows
- Multi-model switching
- Memory management under load
- Configuration edge cases

Performance testing (see the sketch after this list):
- Benchmark script included in main.py
- Memory usage profiling
- Generation speed metrics
- Device-specific optimization validation
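The benchmark mode in main.py is the entry point for these measurements; as a sketch of the kind of loop it could run (function and key names are illustrative):

```python
import time
import torch

def benchmark(model, prompt: str, runs: int = 3) -> dict:
    # Warm-up run excludes one-time costs such as compilation.
    model.generate(prompt=prompt)
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()

    start = time.perf_counter()
    for _ in range(runs):
        model.generate(prompt=prompt)
    per_image = (time.perf_counter() - start) / runs

    peak_mb = torch.cuda.max_memory_allocated() / 1e6 if torch.cuda.is_available() else 0.0
    return {"seconds_per_image": per_image, "peak_memory_mb": peak_mb}
```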
Current safeguards:
- No automatic code execution from model configurations
- Sanitized file path handling
- Safe YAML loading (yaml.safe_load)
- Input validation for all user parameters

Production considerations:
- Rate limiting for API endpoints
- Authentication for model access
- Secure API key handling
- Input sanitization for prompts
Log levels (setup sketch after this list):
- DEBUG: Detailed execution traces
- INFO: Model operations and status
- WARNING: Performance/compatibility issues
- ERROR: Operation failures
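A minimal setup that maps onto these levels (the logger name is an assumption):

```python
import logging

logging.basicConfig(
    level=logging.INFO,  # raise to DEBUG for detailed execution traces
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger("vlm_framework")
logger.info("model loaded")            # model operations and status
logger.warning("falling back to CPU")  # performance/compatibility issue
```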
Metrics tracked:
- Generation timing
- Memory usage patterns
- Error rates by model
- Device utilization
Output management (sketched after this list):
- Structured metadata saved with every image
- Generation parameter recording
- Automatic creation of output directories
- Timestamped file naming
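A sketch of how outputs and their metadata might be written (the directory layout and helper name are assumptions; a PIL-style `save` method is assumed on the image):

```python
import json
from datetime import datetime
from pathlib import Path

def save_output(image, params: dict, out_dir: str = "outputs") -> Path:
    # Create the output directory on demand; name files by timestamp.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    image_path = out / f"{stamp}.png"
    image.save(image_path)

    # Record the generation parameters alongside the image.
    (out / f"{stamp}.json").write_text(json.dumps(params, indent=2))
    return image_path
```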