PaddleMIX is a large multi-modal development toolkit based on PaddlePaddle. It aggregates image, text, and video modalities and covers a wide range of multi-modal tasks, including vision-language pre-training, text-to-image, and text-to-video generation. It provides an out-of-the-box development experience while supporting flexible customization, helping developers explore general artificial intelligence.
2024.04.17
- PPDiffusers released version 0.24.0, adding support for Sora-related technologies such as DiT, as well as video generation models such as SVD.
2023.10.7
- Published PaddleMIX version 1.0
- Added distributed training capability for image-text pre-training models; BLIP-2 now supports training at the hundred-billion-parameter scale.
- Added the cross-modal application pipeline AppFlow, which supports automatic annotation, image editing, sound-to-image, and 11 other cross-modal applications with a single click.
- PPDiffusers released version 0.19.3, introducing SDXL and related tasks.
2023.7.31
- Published PaddleMIX version 0.1
- The PaddleMIX large multi-modal model development toolkit was released for the first time, integrating the PPDiffusers multi-modal diffusion model toolbox and broadly supporting PaddleNLP large language models.
- Added 12 new large multi-modal models, including EVA-CLIP, BLIP-2, miniGPT-4, Stable Diffusion, and ControlNet.
- Rich Multi-Modal Functionality: covers image-text pre-training, text-to-image generation, and multi-modal visual tasks, enabling diverse functions such as image editing, image description, and data annotation.
- Simplified Development Experience: a unified model development interface facilitates efficient custom model development and feature implementation.
- Efficient Training and Inference Workflow: a streamlined end-to-end process for training and inference, with industry-leading performance for key models such as BLIP-2 and Stable Diffusion.
- Support for Ultra-Large-Scale Training: capable of image-text pre-training at the hundred-billion-parameter scale, and of training text-to-image base models at the ten-billion-parameter scale.
- Video Demo
PaddleMix.mp4
- Environment Dependencies
pip install -r requirements.txt
Detailed installation tutorials for PaddlePaddle
Note: some models in ppdiffusers require CUDA 11.2 or higher. If your local machine does not meet this requirement, we recommend using AI Studio for model training and inference tasks.
If you wish to train and run inference with bf16, please use a GPU that supports bf16, such as the A100.
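Whether a GPU can run bf16 comes down to its CUDA compute capability: bf16 tensor operations require sm_80 or newer (the A100 is sm_80). The helper below is a minimal, standard-library-only sketch of that check; on a live PaddlePaddle install, the capability tuple it takes can be queried with `paddle.device.cuda.get_device_capability()`.

```python
def supports_bf16(capability):
    """True if a GPU's (major, minor) CUDA compute capability can run bf16.

    bf16 requires compute capability 8.0 or higher (e.g. A100 is sm_80).
    On a live PaddlePaddle install, the tuple can be obtained with
    paddle.device.cuda.get_device_capability().
    """
    major, minor = capability
    return (major, minor) >= (8, 0)

print(supports_bf16((8, 0)))  # A100 -> True
print(supports_bf16((7, 5)))  # T4   -> False
```

If this check fails on your hardware, fall back to fp16 or fp32 training, or move the workload to AI Studio as suggested above.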
- Manual Installation
git clone https://github.com/PaddlePaddle/PaddleMIX
cd PaddleMIX
pip install -e .
# Install ppdiffusers
cd ppdiffusers
pip install -e .
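After both `pip install -e .` steps, a quick sanity check is to confirm that the packages resolve on the current Python path. This sketch uses only the standard library; the package names `paddlemix` and `ppdiffusers` correspond to the two editable installs above.

```python
import importlib.util

def installed(pkg: str) -> bool:
    """Return True if `pkg` can be found on the current Python path."""
    return importlib.util.find_spec(pkg) is not None

# After the editable installs above, both should report "OK".
for pkg in ("paddlemix", "ppdiffusers"):
    print(pkg, "OK" if installed(pkg) else "missing")
```

If either package reports "missing", re-run the corresponding `pip install -e .` from its directory and check that it completed without errors.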
- Artistic Style QR Code Model
Try it out: https://aistudio.baidu.com/community/app/1339
- Image Mixing
Try it out: https://aistudio.baidu.com/community/app/1340
Supported model categories: Multi-modal Pre-training | Diffusion-based Models
For more information on additional model capabilities, please refer to the Model Capability Matrix.
This repository is licensed under the Apache 2.0 License.