Image Synthesis Project: Customized Storybooks
A pipeline that customizes pretrained foundation models with 3-5 user-provided images and a text prompt to produce a visual narrative.
Recent advancements in generative models and LLMs, such as Stable Diffusion and ChatGPT, have ushered in a new wave of excitement for AI. We want to leverage this technology to enable applications that, up until recently, might have seemed like magic. We will focus on creating illustrations for a chosen story, where the main character(s) of the plot can be customized to the user's liking. By using user input to inform the story's title and characters, our system allows people to see themselves in the stories they read and to explore their own imagination. Our goal is to obtain the best results possible, which means leveraging pretrained foundation models as much as we can and exploring SOTA image editing methods for customization. The result will be a cohesive, compelling illustrated story where anyone can be the protagonist.
Our end product is a live demo that can generate a custom illustrated storybook. The inputs will be a story title and seed images for character customization. The output will be generated illustrations that tell a beautiful, coherently styled story. If time permits, we can implement other extensions, such as multiple character customization and style customization.
Run without the illustrator; the story is saved to output/jack_and_the_beanstalk/story.txt:
python tut.py --title "Jack and the Beanstalk" --orig_name Jack --config_path config/story-only.yml
Run with an uncustomized illustrator; the story is saved to output/little_red_riding_hood/story.txt:
python tut.py --title "Little Red Riding Hood" --orig_name "Little Red Riding Hood" --config_path config/openjourney.yml
Change Jack to Simon, but with a generic orig_name, a specific custom_name, and a generic character description:
python illustrator.py \
--orig_object "boy" \
--orig_name "Jack" \
--custom_name Simonstasta \
--custom_img_dir "sample_images/simon_512" \
--prompts_path output/jack_and_the_beanstalk/story.txt \
--config_path "config/text-encoder-mixed-precision-prior-preserve.yml" \
use trained customization model to generate images
python illustrator.py \
--orig_name "Jack" \
--custom_name Simonstasta \
--prompts_path output/jack_and_the_beanstalk/story.txt \
--config_path "config/text-encoder-mixed-precision-prior-preserve.yml"
--skip_training
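Run the full end-to-end pipeline (story generation, character customization, and illustration) in one command: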
python tut.py \
--title "Jack and the Beanstalk" \
--orig_object "boy" \
--orig_name "Jack" \
--custom_name Simonstasta \
--custom_img_dir "sample_images/simon_512" \
--config_path "config/text-encoder-mixed-precision-prior-preserve.yml"
We will rely on ChatGPT or open-source alternatives such as OPT [6] for story generation, using prompts such as: "Summarize the plot of 'Lord of the Rings' in ten sentences. Tell it to me like a children's story." This outputs ten sentences that can be used as prompts in the text-to-image generation step.
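As a rough illustration, the sketch below shows how the story sentences could be produced with an open-source model through the Hugging Face transformers library; the checkpoint name and decoding settings are placeholders rather than final choices.

import re
from transformers import pipeline

# Placeholder checkpoint; any instruction-capable LM could be swapped in.
generator = pipeline("text-generation", model="facebook/opt-1.3b")

prompt = (
    'Summarize the plot of "Lord of the Rings" in ten sentences. '
    "Tell it to me like a children's story.\n"
)
out = generator(prompt, max_new_tokens=300, do_sample=True, temperature=0.8)
story = out[0]["generated_text"][len(prompt):]

# One sentence per storybook page, each later used as a text-to-image prompt.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", story) if s.strip()]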
We will use the strongest open-source text-to-image model, Stable Diffusion [4], to generate images from text prompts that describe the story plot.
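For example, a single page could be rendered with the diffusers library roughly as follows; the model ID, prompt, and sampling settings are illustrative only.

import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; any Stable Diffusion variant could be used.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "Jack climbs a giant beanstalk into the clouds, children's book illustration"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("page_01.png")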
We will utilize existing techniques such as DreamBooth [5], Textual Inversion [2], Custom Diffusion for multiple concepts [3], and InstructPix2Pix [1]. We will experiment with each method to see what works best for our intended usage. Below we share some initial hypotheses based on the literature. DreamBooth and Custom Diffusion are particularly useful for character customization: both fine-tune the model to generate images that are not only similar to the reference subject but also adapt to new contexts, and thus can most accurately depict characters provided by the user. On the other hand, image-guided Textual Inversion tries to reconstruct the input from its prior distribution. Objects or backgrounds of the scene may benefit from the resulting diversity of Textual Inversion, but for inserting main characters with recognizable faces, it may not work as well as the methods mentioned above. Another model worth considering for image customization is InstructPix2Pix [1], which can create edited images even when a full description isn't provided: users supply a real image and only a simple editing instruction to get the corresponding result.
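To make the instruction-based option concrete, the sketch below edits a reference photo with the public InstructPix2Pix checkpoint via diffusers; the input filename, instruction text, and guidance values are assumptions for illustration only.

import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Hypothetical filename inside the sample image directory used above.
source = Image.open("sample_images/simon_512/photo_0.png").convert("RGB")
edited = pipe(
    "turn him into a watercolor storybook hero",
    image=source,
    num_inference_steps=30,
    image_guidance_scale=1.5,
).images[0]
edited.save("edited_page.png")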
We want to ensure that the overall style of images in the storybook stays consistent. This could be approached in three ways, from easiest to most difficult: (1) instruct our text-to-image model to generate in a particular style, forcing consistency across pages; (2) generate each image using the first generated image, or a user-provided image, as a reference style image; or (3) train a style-consistency loss over all of the images.
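Approach (1) is straightforward to prototype: append a shared style suffix to every per-page prompt before generation, as in the sketch below (the suffix wording and example prompts are just placeholders).

# Shared style suffix appended to every page prompt to keep a consistent look.
STYLE_SUFFIX = ", children's storybook illustration, soft watercolor, warm colors"

def stylize(prompts):
    """Return page prompts with the shared style suffix appended."""
    return [p.rstrip(". ") + STYLE_SUFFIX for p in prompts]

pages = stylize([
    "Jack trades the family cow for a handful of magic beans",
    "A giant beanstalk grows outside Jack's window overnight",
])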
The overall goal of this project is to create a pipeline from pretrained foundation models, customized with 3-5 images and a text prompt given by the user, that outputs a visual narrative. This is possible because each of the image generation algorithms we consider (DreamBooth, Textual Inversion, and Custom Diffusion) requires only a handful of images. We will gather four sets of images for development: photos of each of the three authors and a dog.
Some practical concerns include the computing power required by each customization model and the inference time required during the demo. To handle the latter, we can process requests asynchronously and email the user the finished storybook once generation completes.
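One way to realize that asynchronous flow is a single-worker queue plus an email notification, sketched below; the SMTP host, sender address, and the train_and_illustrate stand-in (representing the tut.py pipeline above) are hypothetical.

import smtplib
from concurrent.futures import ThreadPoolExecutor
from email.message import EmailMessage

executor = ThreadPoolExecutor(max_workers=1)  # one GPU-bound job at a time

def train_and_illustrate(request):
    """Hypothetical stand-in for the tut.py customization + illustration pipeline."""
    return "output/" + request["title"].lower().replace(" ", "_")

def notify(address, storybook_dir):
    # Email the user once their storybook has been generated.
    msg = EmailMessage()
    msg["Subject"] = "Your storybook is ready"
    msg["From"] = "storybook@example.com"  # placeholder sender
    msg["To"] = address
    msg.set_content(f"Your generated pages are in {storybook_dir}.")
    with smtplib.SMTP("localhost") as smtp:  # placeholder SMTP host
        smtp.send_message(msg)

def submit(request):
    # Kick off training/illustration in the background and notify when done.
    future = executor.submit(train_and_illustrate, request)
    future.add_done_callback(lambda f: notify(request["email"], f.result()))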
- [1] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to Follow Image Editing Instructions. 2023. arXiv: 2211.09800 [cs.CV].
- [2] Rinon Gal et al. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. 2022. DOI: 10.48550/ARXIV.2208.01618. URL: https://arxiv.org/abs/2208.01618.
- [3] Nupur Kumari et al. "Multi-Concept Customization of Text-to-Image Diffusion". In: CVPR. 2023.
- [4] Robin Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models". In: CoRR abs/2112.10752 (2021). arXiv: 2112.10752. URL: https://arxiv.org/abs/2112.10752.
- [5] Nataniel Ruiz et al. "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation". 2022. arXiv: 2208.12242 [cs.CV].
- [6] Susan Zhang et al. OPT: Open Pre-trained Transformer Language Models. 2022. arXiv: 2205.01068 [cs.CL].