This repository provides a clean, from-scratch implementation of DreamBooth with several training modes. DreamBooth is a technique for fine-tuning text-to-image diffusion models (e.g., Stable Diffusion) on a specific subject using only a few reference images. The code has a simple CLI for training and inference and is easy to extend and customize.
The implemented DreamBooth supports three modes:
- Fine-tune only the diffusion model (UNet component)
- Fine-tune the UNet component alongside the entire text encoder
- Fine-tune the UNet component alongside a straightforward implementation of Textual Inversion, which adds a single special token to the tokenizer and trains only the corresponding row in the embedding table of the text encoder (see the sketch after the notes below)
Each mode has its own pros and cons. Based on experiments:
- Mode (1) works well for rigid objects such as teapots and maintains good diversity and accuracy, since the text encoder is left untouched.
- Mode (2) performs better for more complex and deformable subjects such as humans, dogs, and toys.
- Mode (3) was implemented out of curiosity by combining DreamBooth and Textual Inversion techniques. Experiments showed that it does not preserve subject identity.
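As a rough illustration of how the three modes differ in which parameters are trained, here is a minimal sketch assuming a standard diffusers/transformers Stable Diffusion setup; the checkpoint id, the `mode` variable, the learning rate, and the `placeholder_token` name are illustrative assumptions, not the repository's actual API:

```python
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

# Assumption: any Stable Diffusion 1.x checkpoint in diffusers format.
model_id = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")

mode = 3                       # 1, 2, or 3, as described above (illustrative)
placeholder_token = "<q5xv>"   # the special token used in the sample prompts

unet.requires_grad_(True)            # the UNet is trained in every mode
text_encoder.requires_grad_(False)   # mode 1: leave the text encoder untouched

if mode == 2:
    # Mode 2: train the whole text encoder alongside the UNet.
    text_encoder.requires_grad_(True)
elif mode == 3:
    # Mode 3: add one special token and train only its row of the embedding table.
    tokenizer.add_tokens(placeholder_token)
    text_encoder.resize_token_embeddings(len(tokenizer))
    token_id = tokenizer.convert_tokens_to_ids(placeholder_token)

    embeddings = text_encoder.get_input_embeddings()
    embeddings.weight.requires_grad_(True)

    # Zero the gradient of every embedding row except the new token's,
    # so the optimizer effectively updates only that single row.
    def _mask_grad(grad):
        mask = torch.zeros_like(grad)
        mask[token_id] = 1.0
        return grad * mask

    embeddings.weight.register_hook(_mask_grad)

trainable = [p for p in unet.parameters() if p.requires_grad]
trainable += [p for p in text_encoder.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-6)  # learning rate is illustrative
```

The gradient mask is the usual Textual Inversion trick: the whole embedding matrix is marked trainable, but only the new token's row ever receives a non-zero gradient, so the rest of the text encoder stays effectively frozen.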
```bash
make venv
source .venv/bin/activate
make install-gpu
```
```bash
make install-dev
```

If you want to change the version of CUDA-enabled torch (currently CUDA 12.8), you can modify the `install-gpu` section of the `Makefile`.
```bash
make check-gpu
```

Check supported commands and their options:

```bash
dreambooth --help
dreambooth train --help
dreambooth infer --help
```
For examples of how to use `dreambooth train` and `dreambooth infer`, as well as some generated images, please see the `sample_runs` folder. There you can also find links to the images used in the experiments. If you want to train on your own data, you can follow the same structure and setup used in the samples.
Experiments show that choosing good hyperparameters is essential for achieving high-quality results. Additionally, it is recommended to generate multiple samples for each prompt to increase the probability of obtaining a good output. Finally, a well-crafted and expressive prompt is highly important. Very short or overly complicated prompts usually do not perform nearly as well as clear, descriptive ones.
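As a rough illustration of generating several candidates per prompt, here is a minimal diffusers sketch; it assumes the fine-tuned model has been exported in diffusers format to a hypothetical `output/teapot` directory and uses the generic `StableDiffusionPipeline` rather than the repository's own `dreambooth infer` command:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumption: the fine-tuned weights live in diffusers format at "output/teapot".
pipe = StableDiffusionPipeline.from_pretrained(
    "output/teapot", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of <q5xv> teapot in snow"  # prompt taken from the samples table
images = pipe(prompt, num_images_per_prompt=4, guidance_scale=7.5).images
for i, image in enumerate(images):
    image.save(f"teapot_snow_{i}.png")
```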
- For the commands used for training and inference, see `sample_runs/dog`.
- For the commands used for training and inference, see `sample_runs/red_cartoon`.
- For the commands used for training and inference, see `sample_runs/monster_toy`.
- For the commands used for training and inference, see `sample_runs/teapot`:

| Item 1 | Item 2 | Item 3 |
|---|---|---|
| ![]() a photo of <q5xv> teapot in snow. | ![]() a photo of <q5xv> teapot with flower design. | ![]() a <q5xv> teapot made of glass. |
- For the commands used for training and inference, see `sample_runs/face1`.
Released under the MIT License. For pretrained models or datasets, please check their respective licenses.
- Ruiz, Nataniel, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. "Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22500-22510. 2023.
- Gal, Rinon, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. "An image is worth one word: Personalizing text-to-image generation using textual inversion." arXiv preprint arXiv:2208.01618 (2022).