Wenjie Chen, Zhangkai Ni, Hanli Wang
Text-to-image generation aims to synthesize photo-realistic images from textual descriptions. Existing methods typically align images with the corresponding texts in a joint semantic space. However, the modality gap in the joint semantic space leads to misalignment. Meanwhile, the limited receptive field of convolutional neural networks causes structural distortions in the generated images. In this work, a structure-aware generative adversarial network (SaGAN) is proposed to (1) semantically align multimodal features in the joint semantic space in a learnable manner, and (2) improve the structure and contour of generated images via the designed content-invariant negative samples. Compared with state-of-the-art models, experimental results show that SaGAN achieves improvements of over 30.1% and 8.2% in terms of FID on the CUB and COCO datasets, respectively.
The pipeline of the proposed SaGAN for text-to-image generation is shown in Fig. 1. We adapt the unconditional StyleGAN into a conditional generative model by combining CLIP with StyleGAN. First, a parameter-shared semantic perspective extraction (SPE) module is introduced to mitigate the modality gap: the semantic similarity between multimodal features is computed from a specific perspective to improve the accuracy of semantic alignment. Second, a structure-aware negative data augmentation (SNDA) strategy is adopted to prompt the model to focus on the structure and contour of the image. A content-invariant geometric transformation is designed to produce distorted images as an additional source of negative samples; these images are consistent with the real images in content and style but distorted in structure and contour.
Fig. 1. Overview of the proposed SaGAN.
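The exact SPE architecture is not spelled out here, so the following is only a minimal sketch of how a parameter-shared SPE module could sit between the CLIP features and the alignment loss: a single projector (layer sizes and names are illustrative assumptions, not the design used in SaGAN) transforms both the text and image embeddings before cosine similarity is measured, so the alignment "perspective" is learned jointly for both modalities.

```python
# Minimal sketch of a parameter-shared SPE module in front of a contrastive
# alignment loss. The projector architecture and the temperature value are
# illustrative assumptions, not the exact design used in SaGAN.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSPE(nn.Module):
    """One projector applied to BOTH text and image CLIP embeddings,
    so similarity is measured from a common, learnable perspective."""

    def __init__(self, clip_dim=512, perspective_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, clip_dim),
            nn.ReLU(inplace=True),
            nn.Linear(clip_dim, perspective_dim),
        )

    def forward(self, txt_feat, img_feat):
        # The same parameters transform both modalities (parameter sharing),
        # which is what distinguishes shared SPE from unshared SPE in Table 3.
        t = F.normalize(self.proj(txt_feat), dim=-1)
        v = F.normalize(self.proj(img_feat), dim=-1)
        return t @ v.t()  # pairwise cosine similarities


def contrastive_alignment_loss(sim, temperature=0.5):
    # Symmetric InfoNCE-style loss over the batch similarity matrix.
    labels = torch.arange(sim.size(0), device=sim.device)
    logits = sim / temperature
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```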
As shown in Fig. 2, thin-plate-spline interpolation is utilized to reposition the pixels of the real image according to the movement from the source points to the destination points. By setting grid points, we can distort the image structure at different densities and scales. The augmented images xaug, which preserve the content and style of the real images, force the network to focus on structure and contour.
Fig. 2. The content-invariant geometric transformation.
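For reference, below is a minimal sketch of such a content-invariant warp, assuming a SciPy-based thin-plate-spline backward mapping; the grid density and jitter scale are illustrative choices, not the exact values used for SNDA.

```python
# Sketch: jitter sparse grid control points and use a thin-plate-spline
# interpolant to map every output pixel back to a source coordinate, producing
# a structurally distorted but content/style-preserving negative sample.
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates


def tps_distort(image, grid=4, max_shift=0.06, seed=None):
    """image: HxWxC float array in [0, 1]; returns a TPS-warped copy."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]

    # Source control points on a regular grid (pixel coordinates, (y, x)).
    ys = np.linspace(0, h - 1, grid)
    xs = np.linspace(0, w - 1, grid)
    src = np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1).reshape(-1, 2)

    # Destination points: randomly jittered copies of the source points.
    jitter = rng.uniform(-max_shift, max_shift, src.shape) * np.array([h, w])
    dst = src + jitter

    # Backward mapping: fit TPS from destination -> source so that every
    # output pixel knows where to sample in the original image.
    tps = RBFInterpolator(dst, src, kernel="thin_plate_spline")

    # Evaluate the mapping on the dense pixel grid and resample each channel.
    grid_y, grid_x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    query = np.stack([grid_y.ravel(), grid_x.ravel()], axis=-1)
    coords = tps(query).T.reshape(2, h, w)
    warped = np.stack(
        [map_coordinates(image[..., c], coords, order=1, mode="reflect")
         for c in range(image.shape[-1])], axis=-1)
    return warped
```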
We compare SaGAN with state-of-the-art text-to-image methods; the comparison results are shown in Table 1. The remarkable performance on both datasets demonstrates the superiority of SaGAN. We also conduct extensive ablation experiments to verify the impact of the proposed SNDA and SPE; the results are given in Table 2. In Table 3, we compare shared SPE with unshared SPE, where the latter uses two separate SPE modules for the text and image features. In Table 4, we compare our SNDA with other negative data augmentation strategies.
Table 1. Comparison with state-of-the-art methods on the CUB and COCO datasets; the top-3 performances are highlighted in red, blue, and black bold, respectively.
Table 2. Ablation study on CUB to investigate the effectiveness of our proposed modules.
Table 3. Ablation study with different SPE settings on CUB.
Table 4. Comparison with other negative data augmentation strategies on CUB.
Fig. 3 shows qualitative comparisons between our SaGAN and Lafite, where the images are generated conditioned on descriptions from the CUB dataset. A closer look at the image details reveals that the images generated by SaGAN have more vivid details.
Fig. 3. Qualitative comparison between SaGAN and Lafite.
The implementation is based on stylegan2-ada-pytorch and CLIP; the required packages can be found via the links.
python dataset_tool.py --source=./path_to_some_dataset/ --dest=./datasets/some_dataset.zip --width=256 --height=256 --transform=center-crop
The files under ./path_to_some_dataset/ should be organized as follows (a minimal pairing sketch is given after the listing):
./path_to_some_dataset/
├ 1.png
├ 1.txt
├ 2.png
├ 2.txt
├ ...
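One simple way to produce this pairing, assuming the captions are available as a {filename: caption} mapping (the file names and captions below are hypothetical), is:

```python
# Hypothetical helper: write each caption next to its image as a .txt file,
# matching the layout expected by dataset_tool.py above.
from pathlib import Path

captions = {
    "1.png": "this bird has a red crown and a short beak.",
    "2.png": "a small yellow bird with black wings.",
}

root = Path("./path_to_some_dataset")
for image_name, caption in captions.items():
    (root / image_name).with_suffix(".txt").write_text(caption)
```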
More details on data preparation, as well as commonly used datasets that have already been processed (with CLIP ViT-B/32), can be found in Lafite.
These hyper-parameters are used for CUB. Please tune itd, itc, and gamma on other datasets, as they may be sensitive to the dataset.
python train.py --gpus=4 --outdir=./outputs/ --temp=0.5 --itd=5 --itc=10 --gamma=10 --mirror=1 --data=./datasets/birds_train_clip.zip --test_data=./datasets/birds_test_clip.zip --mixing_prob=0.0
Calculating metrics:
python calc_metrics.py --network=./some_pre-trained_models.pkl --metrics=fid50k_full,is50k --data=./training_data.zip --test_data=./testing_data.zip
To generate images with pre-trained models, you can use ./generate.ipynb.
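If you prefer a script over the notebook, the sketch below shows the standard stylegan2-ada-pytorch loading path together with CLIP text encoding. The exact way the text feature is passed to the conditional generator is defined in generate.ipynb, so the final conditioning call is left commented out as an illustrative assumption only.

```python
# Sketch: load a pre-trained pickle and encode a caption with CLIP.
# Follow generate.ipynb for the exact conditioning interface used by SaGAN.
import torch
import clip
import dnnlib
import legacy

device = torch.device("cuda")

# Load the generator the same way stylegan2-ada-pytorch does.
with dnnlib.util.open_url("./some_pre-trained_models.pkl") as f:
    G = legacy.load_network_pkl(f)["G_ema"].to(device).eval()

# Encode the text description with CLIP ViT-B/32.
clip_model, _ = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["this bird has a red crown and a white belly"]).to(device)
with torch.no_grad():
    txt_feat = clip_model.encode_text(tokens).float()
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

z = torch.randn(1, G.z_dim, device=device)
# Hypothetical conditioning call; see generate.ipynb for the actual arguments.
# img = G(z, c=None, fts=txt_feat, noise_mode="const")
```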
Meanwhile, you can view examples of negative sample augmentation in snda.ipynb.
This implementation is based on stylegan2-ada-pytorch and Lafite.
Please cite the following paper if you find this work useful:
Wenjie Chen, Zhangkai Ni, and Hanli Wang, Structure-aware Generative Adversarial Network for Text-to-image Generation, IEEE International Conference on Image Processing (ICIP'23), accepted, 2023.