Generate car images from segmentation maps using the Carvana dataset.
This model was trained to generate a synthetic image of a car from a segmentation map using a pix2pixHD conditional generative adversarial network (CGAN).
For this project we use a single-car image segmentation dataset, Carvana Image Masking (PNG): https://www.kaggle.com/datasets/ipythonx/carvana-image-masking-png.
The entire dataset contains 5088 image-mask pairs, and each image measures 448x320 pixels. Since this model is intended for initial testing, we used a subset of 701 pairs for quicker fine-tuning. We created an imageDatastore and a pixelLabelDatastore to store the images and the pixel label images, respectively.
We defined the class names and pixel label IDs for the two classes, car and background, in the Carvana dataset using the helper function define2ClassesAndPixelLabelIDs. Additionally, we generated a standard colormap for the dataset with the helper function 2ColorMap. Both helper functions are included as supporting files in the example.
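A minimal sketch of this data setup, assuming the dataset is unpacked into local image and mask folders and that the PNG masks encode the car as 255 and the background as 0 (the folder paths and label IDs are assumptions; the actual class definitions live in the helper functions):

```matlab
% Define the two classes and their assumed pixel label IDs.
classNames = ["car" "background"];
labelIDs   = [255 0];   % assumed mask encoding: car = 255, background = 0

% Create datastores for the images and the pixel label images.
imds = imageDatastore("carvana/images");
pxds = pixelLabelDatastore("carvana/masks_png", classNames, labelIDs);
```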
We partitioned the data into training and test sets using the helper function partitionForPix2PixHD.
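The helper encapsulates the split; the stand-in below shows the idea (the 90/10 ratio and the use of subset are assumptions, not the helper's actual logic):

```matlab
% Shuffle and split the 701 pairs into training and test sets.
rng(0)                                  % reproducible shuffle
numPairs = numel(imds.Files);
idx = randperm(numPairs);
numTrain = round(0.9*numPairs);         % assumed 90/10 split

imdsTrain = subset(imds, idx(1:numTrain));
pxdsTrain = subset(pxds, idx(1:numTrain));
imdsTest  = subset(imds, idx(numTrain+1:end));
pxdsTest  = subset(pxds, idx(numTrain+1:end));
```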
We performed the following preprocessing steps to augment the data for training (a sketch of these steps follows the list):
- Scaled the ground truth data to the range [-1, 1], matching the range of the final tanhLayer in the generator network.
- Resized the images and labels to the network's output size of 576-by-768 pixels, using bicubic interpolation for images and nearest-neighbor interpolation for labels.
- Converted the single-channel segmentation map into a one-hot encoded segmentation map (one channel per class) using the onehotencode function.
- Randomly applied horizontal flipping to the image and pixel label pairs to augment the dataset.
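A hedged per-pair preprocessing sketch; the function name and the 50% flip probability are assumptions, and the label input is assumed to be a categorical pixel label image as read from the pixelLabelDatastore:

```matlab
function [oneHotSegMap, targetImage] = preprocessPair(image, label)
% Preprocess one image/label pair for pix2pixHD training.
networkSize = [576 768];

% Resize: bicubic for images, nearest-neighbor for categorical labels.
image = imresize(image, networkSize, "bicubic");
label = imresize(label, networkSize, "nearest");

% Random horizontal flip applied jointly to the pair.
if rand > 0.5
    image = flip(image, 2);
    label = flip(label, 2);
end

% One-hot encode the label along the channel (third) dimension.
oneHotSegMap = onehotencode(label, 3);

% Scale the ground truth image to [-1, 1] to match the tanh output.
targetImage = 2*im2single(image) - 1;
end
```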
We define a pix2pixHD generator network that converts a depth-wise one-hot encoded segmentation map (number of channels equal to the number of classes in the segmentation) into a scene image. The output maintains the same height and width as the input segmentation map.
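A construction sketch, assuming the pix2pixHDGlobalGenerator function used in the MathWorks example [4] and a two-class input:

```matlab
% Generator input: one-hot segmentation map, one channel per class.
numClasses = 2;
generatorInputSize = [576 768 numClasses];

lgraphGenerator = pix2pixHDGlobalGenerator(generatorInputSize);
dlnetGenerator  = dlnetwork(lgraphGenerator);
```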
We define PatchGAN discriminator networks to classify an input image as either real (1) or fake (0). In this example, we use two multiscale discriminators operating at different input scales: one at the original image size and another at half the image size. The input to each discriminator is the depth-wise concatenation of the one-hot encoded segmentation map and the image being classified, so the total number of input channels equals the number of segmentation classes plus the number of color channels in the image.
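A sketch of the two discriminators, assuming the patchGANDiscriminator function as used in the MathWorks example [4]:

```matlab
% Two PatchGAN discriminators at full and half resolution. Input depth
% is numClasses + 3 color channels.
numImageChannels = 3;
discriminatorChannels = numClasses + numImageChannels;

lgraphScale1 = patchGANDiscriminator([576 768 discriminatorChannels]);
lgraphScale2 = patchGANDiscriminator([288 384 discriminatorChannels]);

dlnetDiscriminatorScale1 = dlnetwork(lgraphScale1);
dlnetDiscriminatorScale2 = dlnetwork(lgraphScale2);
```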
The overall generator loss is a weighted sum of three losses, where λ1, λ2, and λ3 are the weight factors for the adversarial loss, feature matching loss, and perceptual loss, respectively:
- GeneratorLoss = λ1 · lossAdversarialGenerator + λ2 · lossFeatureMatching + λ3 · lossPerceptual
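The weighted sum in code; the λ values shown are the pix2pixHD paper defaults [1], not settings confirmed by this project:

```matlab
function loss = generatorLoss(lossAdversarial, lossFeatureMatching, lossPerceptual)
% Weighted generator loss; lambda values are assumed defaults.
lambda1 = 1;    % adversarial loss weight
lambda2 = 10;   % feature matching loss weight
lambda3 = 10;   % perceptual loss weight
loss = lambda1*lossAdversarial + lambda2*lossFeatureMatching + lambda3*lossPerceptual;
end
```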
The discriminator's objective is to accurately differentiate between ground truth images and generated images. Its loss function consists of two components:
- The squared difference between a vector of ones and the discriminator's predictions for real images.
- The squared difference between a vector of zeros and the discriminator's predictions for generated images.
- DiscriminatorLoss = (1 − Ŷreal)² + (0 − Ŷgenerated)²
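A least-squares sketch of this loss; the function and variable names are illustrative:

```matlab
function loss = discriminatorLoss(predReal, predGenerated)
% Least-squares discriminator loss: real targets are 1, fake targets 0.
lossReal      = mean((1 - predReal).^2, "all");
lossGenerated = mean((0 - predGenerated).^2, "all");
loss = lossReal + lossGenerated;
end
```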
We configure the Adam optimizer and train the model for 28 epochs, applying the same settings to both the generator and discriminator networks (a training-loop sketch follows the list):
- We set the learning rate to 0.0002.
- We initialize the trailing average gradient and the trailing average squared gradient with an empty array [].
- We use a gradient decay factor of 0.5 and a squared gradient decay factor of 0.999.
- We specify a mini-batch size of 1 for training.
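A hedged training-loop excerpt using adamupdate with these settings; the variable names are illustrative, the gradient computation is elided, and the discriminator update mirrors the generator's:

```matlab
learnRate = 0.0002;
gradientDecayFactor = 0.5;
squaredGradientDecayFactor = 0.999;

% Trailing averages start empty, as specified above.
trailingAvgGenerator   = [];
trailingAvgSqGenerator = [];
iteration = 0;

for epoch = 1:28
    % ... loop over mini-batches of size 1; compute generatorGradients
    % from the model loss via dlfeval ...
    iteration = iteration + 1;
    [dlnetGenerator, trailingAvgGenerator, trailingAvgSqGenerator] = adamupdate( ...
        dlnetGenerator, generatorGradients, trailingAvgGenerator, ...
        trailingAvgSqGenerator, iteration, learnRate, ...
        gradientDecayFactor, squaredGradientDecayFactor);
    % The discriminator update mirrors this call with its own state.
end
```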
We generated images for the first and third test images.
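A minimal inference sketch, assuming the trained dlnetGenerator and a preprocessed one-hot segmentation map (oneHotSegMap) from a test pair:

```matlab
% Format the one-hot segmentation map as a dlarray and run the generator.
dlSegMap = dlarray(single(oneHotSegMap), "SSCB");
dlGenerated = predict(dlnetGenerator, dlSegMap);

% Map the tanh output from [-1, 1] back to [0, 1] for display.
generatedImage = (extractdata(dlGenerated) + 1) / 2;
imshow(generatedImage)
```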
[1] Wang, Ting-Chun, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs." In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8798–8807, 2018. https://doi.org/10.1109/CVPR.2018.00917.
[2] Brostow, Gabriel J., Julien Fauqueur, and Roberto Cipolla. "Semantic Object Classes in Video: A High-Definition Ground Truth Database." Pattern Recognition Letters 30, no. 2 (2009): 88-97.
[3] Carvana Image Masking (PNG). https://www.kaggle.com/datasets/ipythonx/carvana-image-masking-png.
[4] MathWorks example "Generate Image from Segmentation Map Using Deep Learning".
