
🚀 SEED-Voken: A Series of Powerful Visual Tokenizers

The project aims to provide advanced visual tokenizers for autoregressive visual generation and currently supports the following methods:

Open-MAGVIT2: An Open-source Project Toward Democratizing Auto-Regressive Visual Generation
Zhuoyan Luo*, Fengyuan Shi*, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan
ARC Lab Tencent PCG, Tsinghua University, Nanjing University
📚Open-MAGVIT2.md

@article{luo2024open,
  title={Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation},
  author={Luo, Zhuoyan and Shi, Fengyuan and Ge, Yixiao and Yang, Yujiu and Wang, Limin and Shan, Ying},
  journal={arXiv preprint arXiv:2409.04410},
  year={2024}
}

IBQ: Taming Scalable Visual Tokenizer for Autoregressive Image Generation
Fengyuan Shi*, Zhuoyan Luo*, Yixiao Ge, Yujiu Yang, Ying Shan, Limin Wang
Nanjing University, Tsinghua University, ARC Lab Tencent PCG
📚IBQ.md

@article{shi2024taming,
  title={Taming Scalable Visual Tokenizer for Autoregressive Image Generation},
  author={Shi, Fengyuan and Luo, Zhuoyan and Ge, Yixiao and Yang, Yujiu and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2412.02692},
  year={2024}
}

📰 News

  • [2025.02.14] 🔥🔥🔥 The pretrained version of the IBQ visual tokenizers, which achieves SOTA performance with a high code dimension, is released.
  • [2025.02.09] We release the Open-MAGVIT2 video tokenizers, which achieve SOTA performance compared to OmniTokenizer, LARP, and SweetTokenizer.
  • [2025.01.21] Open-MAGVIT2 tokenizers (codebook sizes of 16384 and 262144) for text-conditional image generation are now released! They are pretrained on large-scale image-text datasets and achieve SOTA performance compared to LlamaGen, Show-o, and Cosmos.
  • [2024.11.26] We are excited to release IBQ, a series of scalable visual tokenizers, which achieve a large-scale codebook (2^18) with high dimension (256) and high utilization.
  • [2024.09.09] We release an improved version of Open-MAGVIT2 tokenizer and a family of auto-regressive models ranging from 300M to 1.5B.
  • [2024.06.17] We release the training code of the Open-MAGVIT2 tokenizer and checkpoints for different resolutions, achieving state-of-the-art performance (0.39 rFID for 8x downsampling) compared to VQGAN, MaskGIT, and recent TiTok, LlamaGen, and OmniTokenizer.

📖 Implementations

Our codebase supports both NPU and GPU for training and inference. All experiments were conducted using the Ascend 910B for training, and we validated our models on the V100. The observed performance between the two platforms is nearly identical.
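Since the code runs on both back ends, a natural pattern is to pick the accelerator at runtime. The sketch below is only an illustration (not part of this repository) and assumes the torch_npu plugin, which exposes the `npu` device after import.

```python
# Illustrative sketch (not from this repo): select NPU if available, else GPU, else CPU.
import torch

try:
    import torch_npu  # Ascend NPU plugin; adds the `npu` device to torch after import
    has_npu = torch.npu.is_available()
except ImportError:
    has_npu = False

if has_npu:
    device = torch.device("npu")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"Running on {device}")
```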

🛠️ Installation

GPU

  • Env: We have tested on Python 3.8.8 and CUDA 11.8 (other versions may also be fine).
  • Dependencies: pip install -r requirements.txt

NPU

Image Version
  • Env: Python 3.9.16 and CANN 8.0.T13
  • Main Dependencies: torch=2.1.0+cpu + torch-npu=2.1.0.post3-20240523 + Lightning
Video Version
  • Env: Python 3.9.16 and CANN 8.0.T62
  • Main Dependencies: torch=2.1.0+cpu + torch-npu=2.1.0.post10.dev20241128 + Lightning

Other Dependencies: see requirements.txt

Datasets

  • Image Dataset

We use ImageNet2012 as our image dataset; a minimal loading sketch follows the directory layout below.

imagenet
└── train/
    ├── n01440764
        ├── n01440764_10026.JPEG
        ├── n01440764_10027.JPEG
        ├── ...
    ├── n01443537
    ├── ...
└── val/
    ├── ...
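As an illustration of how this layout maps onto class-labeled samples, the hedged sketch below uses torchvision's ImageFolder; the repository's actual data pipeline may differ.

```python
# Minimal sketch (assumption, not the repo's loader): the layout above is the
# standard ImageFolder structure, so torchvision can index it directly.
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("imagenet/train", transform=transform)
val_set = datasets.ImageFolder("imagenet/val", transform=transform)
print(len(train_set.classes), "classes,", len(train_set), "training images")
```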
  • Video Dataset

We use UCF-101 as our video dataset.

UCF101
└── train/
    ├── class_0
        ├── video_1.mp4
        ├── video_2.mp4
        ├── ...
    ├── class_1
    ├── class_2
└── val/
    ├── ...

For the preparation of UCF-101, please refer to VideoGPT.
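For orientation only, here is a hedged sketch of how the class-folder layout above could be indexed into (video path, label) pairs; the actual preprocessing follows VideoGPT and may differ.

```python
# Illustrative only: index the UCF101 layout above into (path, label) pairs.
from pathlib import Path

root = Path("UCF101/train")
class_dirs = sorted(p for p in root.iterdir() if p.is_dir())
class_to_idx = {d.name: i for i, d in enumerate(class_dirs)}

samples = [
    (str(video), class_to_idx[d.name])
    for d in class_dirs
    for video in sorted(d.glob("*.mp4"))
]
print(f"{len(class_dirs)} classes, {len(samples)} clips")
```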

  • Text2Image Datasets

We recommend organizing the data in the following webdataset tar format.

data
└── LAION_COCO/
    ├── webdataset
        ├── 1.tar
        ├── 2.tar
        ├── 3.tar
        ├── ...
└── CC12M/
    ├── webdataset
        ├── 1.tar
        ├── 2.tar
        ├── 3.tar
        ├── ...

Before pretraining, the sample.json and filter_keys.json of each dataset should be prepared. Please refer to src/Open_MAGVIT2/data/prepare_pretrain.py.
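The sketch below shows how such tar shards are typically iterated with the webdataset library; the shard pattern and per-sample keys ("jpg", "txt") are assumptions for illustration, and the repository's loader may use different keys and filtering.

```python
# Hedged sketch: iterate webdataset tar shards of image-text pairs.
# The shard pattern and per-sample keys are assumptions, not the repo's config.
import webdataset as wds

dataset = (
    wds.WebDataset("data/LAION_COCO/webdataset/{1..3}.tar")
    .decode("pil")                 # decode images to PIL
    .to_tuple("jpg", "txt")        # (image, caption) per sample
)

for image, caption in dataset:
    print(image.size, caption[:50])
    break
```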

⚡ Training & Evaluation

The training and evaluation scripts are in Open-MAGVIT2.md and IBQ.md.

❤️ Acknowledgement

We thank Lijun Yu for his encouraging discussions. Our work draws heavily on VQGAN and MAGVIT, and we also refer to LlamaGen, VAR, RQVAE, VideoGPT, and OmniTokenizer. Thanks for their wonderful work.