Knowing the dominant color of an image is important for many problems, such as aesthetics analysis, computer vision tasks, and content-based image retrieval. The aim of this project is to develop an approach for automatically recognizing the dominant color of an image using convolutional neural networks (CNNs).
- Dataset Information
- Data Analysis
- Conclusions from Data Analysis
- Problem Solution
- Model Architecture
- Experiments
- Error Analysis
- Conclusions
- Acknowledgements
To train the dominant-color detection model, tulip images from the House Plant Species dataset were used. This dataset contains a variety of plant photos captured with diverse color variations, backgrounds, lighting conditions, and shooting angles. The original dataset was intended for classification, so an adequate way to generate the dominant-color labels had to be found. Due to the small size of the dataset, the training set was doubled with aggressive image augmentation. The images were resized to 224x224 pixels using the default bilinear interpolation.
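A minimal sketch of this preprocessing step, assuming torchvision; the exact augmentation parameters are illustrative, not the project's actual configuration. Geometric transforms are a natural choice here because, unlike color jitter, they do not invalidate the dominant-color label.

```python
from torchvision import transforms

# Illustrative augmentation mix (assumed, not the project's exact settings).
# Geometric transforms are used because they leave the colors untouched
# and therefore keep the dominant-color label valid.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=30),
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.ToTensor(),  # also scales pixel values to [0, 1]
])

# Evaluation images are only resized; torchvision's Resize uses
# bilinear interpolation by default.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```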
The data analysis covers several methods for describing the images and several methods for generating dominant colors.
In this step, contrast was calculated by converting each image to grayscale and taking the standard deviation of its pixel intensities. Colorfulness was determined using the method described by Hasler and Süsstrunk. The mean and standard deviation of both values were then computed across the dataset: the average contrast is about 55 with a standard deviation of 12.17, and the average colorfulness is about 70 with a standard deviation of 25.68.
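The two statistics can be computed as follows (a sketch assuming OpenCV and NumPy):

```python
import cv2
import numpy as np

def contrast(image_bgr):
    """RMS contrast: standard deviation of the grayscale intensities."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return float(gray.std())

def colorfulness(image_bgr):
    """Colorfulness metric of Hasler and Suesstrunk."""
    b, g, r = cv2.split(image_bgr.astype(np.float64))
    rg = r - g                    # red-green opponent channel
    yb = 0.5 * (r + g) - b        # yellow-blue opponent channel
    std_root = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean_root = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return float(std_root + 0.3 * mean_root)
```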
The average entropy and its standard deviation were also calculated: the average entropy is about 6.48, with a standard deviation of 0.54.
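Entropy here is the Shannon entropy of the grayscale intensity histogram, sketched below. Since an 8-bit image has at most 8 bits of entropy, an average of 6.48 indicates a wide spread of intensities.

```python
import cv2
import numpy as np

def entropy(image_bgr):
    """Shannon entropy of the grayscale intensity histogram, in bits."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                  # skip empty bins to avoid log2(0)
    return float(-(p * np.log2(p)).sum())
```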
A cumulative distribution graph was drawn for several images from the dataset, and its average values at the first three quartiles, along with their standard deviations, were calculated. The dynamic-range contrast and the skewness of this function were also computed. The average at the first quartile is around 0.21, at the second around 0.54, and at the third around 0.79. The average dynamic-range contrast is about 0.99, and the average skewness is about 0.09.
Figure 1: An example of cumulative image distribution function.
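One plausible reading of these statistics is sketched below; the exact definitions used in the project are an assumption here (the quartile values are taken as the normalized CDF evaluated at 1/4, 1/2, and 3/4 of the intensity range).

```python
import cv2
import numpy as np
from scipy.stats import skew

def cdf_statistics(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum() / hist.sum()        # normalized CDF over [0, 255]
    return {
        "quartiles": cdf[[64, 128, 192]],   # CDF at 1/4, 1/2, 3/4 of range
        "dr_contrast": (int(gray.max()) - int(gray.min())) / 255.0,
        "skewness": float(skew(cdf)),       # asymmetry of the CDF curve
    }
```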
The mean and median were calculated for each of the three image channels. The three per-channel values were then combined into a single color that could serve as the dominant one. The process was repeated for three color spaces: RGB, HSV, and LAB.
Figure 2: Mean and median colors in RGB space.
Figure 3: Mean and median colors in HSV space.
Figure 4: Mean and median colors in LAB space.
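A sketch of this computation, assuming OpenCV; `tulip.jpg` is a hypothetical example image:

```python
import cv2
import numpy as np

def mean_median_colors(image_bgr, conversion=None):
    """Per-channel mean and median, optionally in another color space."""
    img = cv2.cvtColor(image_bgr, conversion) if conversion is not None else image_bgr
    pixels = img.reshape(-1, 3).astype(np.float64)
    return pixels.mean(axis=0), np.median(pixels, axis=0)

img = cv2.imread("tulip.jpg")   # hypothetical example image
mean_rgb, median_rgb = mean_median_colors(img, cv2.COLOR_BGR2RGB)
mean_hsv, median_hsv = mean_median_colors(img, cv2.COLOR_BGR2HSV)
mean_lab, median_lab = mean_median_colors(img, cv2.COLOR_BGR2LAB)
```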
KMeans clustering with 5 clusters was performed to detect the dominant color. The image was first downscaled by a factor of 10 to speed up the computation, and after clustering, the cluster containing the most pixels was selected. The process was repeated for the RGB, HSV, and LAB spaces; the number of clusters was determined empirically.
Figure 5: Clustering in RGB space. Visualization of the dominant cluster.
Figure 6: Clustering in HSV space. Visualization of the dominant cluster.
Figure 7: Clustering in LAB space. Visualization of the dominant cluster.
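A sketch of the clustering step, assuming scikit-learn's KMeans; shown here for LAB, with the color-space conversion passed as a parameter:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dominant_color_kmeans(image_bgr, conversion=cv2.COLOR_BGR2LAB, k=5):
    # Downscale by a factor of 10 to speed up clustering.
    small = cv2.resize(image_bgr, None, fx=0.1, fy=0.1)
    pixels = cv2.cvtColor(small, conversion).reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    largest = np.bincount(km.labels_).argmax()   # cluster with most pixels
    return km.cluster_centers_[largest]          # its centroid = dominant color
```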
The maximum of the histogram was determined separately for each of the three HSV channels, and the resulting values were combined into a dominant color.
Figure 8: The maximum value of the HSV histogram. Visualization of the maximum value.
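A sketch of the histogram-peak method. Note that the three peaks are computed independently, so they may come from different pixels, which is exactly the weakness discussed in the conclusions below:

```python
import cv2
import numpy as np

def dominant_color_hsv_hist(image_bgr):
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    # Most frequent value of each channel, computed independently
    # (OpenCV stores hue in [0, 179] for 8-bit images).
    h = np.bincount(hsv[..., 0].ravel(), minlength=180).argmax()
    s = np.bincount(hsv[..., 1].ravel(), minlength=256).argmax()
    v = np.bincount(hsv[..., 2].ravel(), minlength=256).argmax()
    return np.array([h, s, v], dtype=np.uint8)
```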
A dataset was generated with dominant colors obtained using KMeans clustering in LAB space. These colors are displayed on an interactive 3D graph.
Figure 9: 3D graph of dominant colors.
2D graphs of the dominant colors are shown for all three pairwise combinations of the R, G, and B channels.
Figures 10, 11, 12: 2D graphs of dominant colors.
From the aforementioned methods of data analysis, the following conclusions were drawn:
- The dataset consists mostly of very colorful images, as well as images with a large number of colors, as indicated by high values of colorfulness and entropy.
- The averages of the cumulative distribution function at the first three quartiles indicate that, on average, these images contain a similar number of dark and light pixels.
- The mean and median yield decently good dominant colors in RGB space. In HSV space, the same can be said only of the medians.
- The KMeans clustering algorithm with five clusters performed very well in all three spaces, perhaps best in LAB space, giving very meaningful dominant colors. Its success can be attributed to the fact that, unlike the other methods, it considers all channels at once: grouping pixels by their position in color space is intuitively the best way to determine the dominant color.
- Determining the dominant color with the HSV histogram method presents a couple of problems. Taking only the per-channel maxima of hue, saturation, and value yields a color that does not necessarily represent the most dominant color in the image, since the three maxima may come from different pixels. Often a color with, for example, lower saturation dominates instead.
- As tulips come in various colors, the 3D diagram of generated dominant colors contains shades of blue, pink, purple, yellow, red, and other flower colors. Green comes from the stems, other plants, or grass in the background. White, and often black, dominates images with solid-color backgrounds, while brown and variations of red and orange come from the soil. The most dominant colors are noticeably concentrated on the diagonals of the 2D graphs, i.e. they have roughly equal channel values and are therefore near-neutral tones; this means that background colors are very often chosen as dominant.
The proposed solution to this problem is a convolutional neural network (CNN) trained on dominant colors generated by KMeans clustering in the LAB color space. The following two metrics are tracked:
- Loss function, MAE (Mean Absolute Error): the average absolute difference across the three channels. Less sensitive to large model errors.
- MSE (Mean Squared Error): the average squared difference across the three channels. More sensitive to large model errors.
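In PyTorch terms, a minimal sketch with random tensors standing in for real predictions:

```python
import torch
import torch.nn as nn

pred = torch.rand(8, 3)   # predicted colors, channels in [0, 1]
true = torch.rand(8, 3)   # ground-truth dominant colors

mae = nn.L1Loss()(pred, true)    # used as the training loss
mse = nn.MSELoss()(pred, true)   # penalizes large errors more heavily
```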
Figure 13: Final model architecture.
A basic CNN architecture was created to establish a baseline. It consists of two convolutional layers, the first with 5x5 and the second with 3x3 filters, each followed by a ReLU activation and max pooling. Two fully connected layers were added: the first with ReLU activation and the second linear, producing the output color. The model was trained with the Adam optimizer and a learning rate of 0.001, achieving an MAE of 0.1284 and an MSE of 0.0291 on the test dataset.
Figure 14: Validation MAE of Basic CNN over time.
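A minimal PyTorch sketch of the baseline; the channel counts and hidden-layer size are assumptions, only the overall layout follows the description above:

```python
import torch.nn as nn

class BasicCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            # 224x224 input -> 54x54 feature maps after the two blocks
            nn.Linear(32 * 54 * 54, 128), nn.ReLU(),
            nn.Linear(128, 3),   # linear output: one value per color channel
        )

    def forward(self, x):
        return self.head(self.features(x))
```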
The ResNet18 architecture was modified to output a color instead of class probabilities, with a sigmoid function introduced at the output. The default pretrained weights were used, ensuring the most recent ResNet weight improvements, and the layers up to the second residual block were frozen. The model achieved an MAE of 0.1233 and an MSE of 0.0293 on the test set.
Figure 15: Validation MAE of ResNet18 over time.
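A sketch of this setup with torchvision, assuming that "up to the second residual block" means the stem plus `layer1` and `layer2`:

```python
import torch.nn as nn
from torchvision import models

# Load ResNet18 with the latest default pretrained weights.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the classification head with a 3-value color regressor;
# the sigmoid keeps each channel inside [0, 1].
model.fc = nn.Sequential(nn.Linear(model.fc.in_features, 3), nn.Sigmoid())

# Freeze the stem and the first two residual stages.
for module in (model.conv1, model.bn1, model.layer1, model.layer2):
    for p in module.parameters():
        p.requires_grad = False
```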
A single Squeeze-and-Excitation (SE) block was added on top of the residual blocks of ResNet. SE layers improve feature discrimination without requiring a large dataset, which fits this scenario well. This modification achieved an MAE of 0.1067 and an MSE of 0.0237 on the test dataset.
Figure 16: Validation MAE of ResNet18 with a single Squeeze-and-Excitation block over time.
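A standard SE block as described by Hu et al.; the reduction ratio of 16 is an assumption:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block (Hu et al.); reduction ratio assumed."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * weights    # reweight each channel by its learned score
```

With the ResNet18 backbone from the previous sketch, one plausible placement is on top of the last residual stage (512 channels), e.g. `model.avgpool = nn.Sequential(SEBlock(512), model.avgpool)`.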
A second Squeeze-and-Excitation block was introduced on top of the previous modification, along with two dropout layers: one between the SE blocks and one after the second SE block. This model scored an MAE of 0.1055 and an MSE of 0.0231 on the test dataset.
Figure 17: Validation MAE of ResNet18 with two Squeeze-and-Excitation blocks and dropout over time.
6.5 Training the entire (unfrozen) ResNet18 with two Squeeze-and-Excitation blocks and dropout using the AdamW optimizer and learning-rate reduction
The previous model was trained further, now with all layers unfrozen, a lower learning rate, and the AdamW optimizer, which applies weight decay. The ReduceLROnPlateau callback was used to lower the learning rate whenever the model failed to improve for three consecutive epochs. The final model achieved an MAE of 0.0990 and an MSE of 0.0193 on the test set.
Figure 18: Validation MAE of the final trained model over time on the logarithmic scale.
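A sketch of this final stage; the learning rate, weight decay, epoch count, and the helper functions `train_one_epoch` and `evaluate` are all hypothetical:

```python
import torch

# `model` is the ResNet18 variant from the previous sketches.
for p in model.parameters():
    p.requires_grad = True          # unfreeze every layer

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3)

for epoch in range(30):             # epoch count is an assumption
    train_one_epoch(model, optimizer)   # hypothetical training helper
    val_mae = evaluate(model)           # hypothetical validation helper
    scheduler.step(val_mae)   # cut the LR after 3 epochs without improvement
```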
| Model | MAE | MSE |
|---|---|---|
| Basic CNN | 0.1284 | 0.0291 |
| ResNet18 | 0.1233 | 0.0293 |
| ResNet18 + 1 SE Block | 0.1067 | 0.0237 |
| ResNet18 + 2 SE Blocks + Dropout | 0.1055 | 0.0231 |
| ResNet18 + 2 SE Blocks + Dropout, Unfrozen | 0.0990 | 0.0193 |
Table 1: Summary of the results across all experiments.
Test predictions for all images were plotted against the true labels, and the following two distinctive failure cases were noticed:
The most common issue occurs with images containing multiple distinguishable colors, where the model tends to predict a dominant color resembling a mix of the most prominent colors in the image. The model would most likely benefit from more training images of this kind.
Figure 19: Example of an error on an image containing multiple prominent colors.
The model sometimes predicts a secondary dominant color instead of the labeled one. This usually happens when two colors both appear dominant. Such predictions are not necessarily wrong, since more than one color could legitimately be labeled as dominant.
Figure 20: Example of an error on an image containing two dominant colors.
The following key conclusions were drawn:
- Adding SE blocks enhanced the model's ability to focus on important channel-wise information, leading to a noticeable improvement in MAE and MSE. This suggests that channel-wise attention mechanisms are effective for tasks involving global features like dominant color detection.
- Introducing dropout helped reduce overfitting, particularly in the model with two SE blocks. This regularization technique ensured better generalization and further reduced errors.
- While more complex architectures performed better, the improvements might have been constrained by the dataset size. This emphasizes the need for a balance between architecture complexity and available data, potentially justifying the success of techniques like pre-trained weights and attention mechanisms.
- Challenges remain with multiple prominent color images, suggesting avenues for future research. Expanding the dataset to include more of these cases would most likely significantly improve model performance.
- House Plant Species dataset on Kaggle
- Measuring colourfulness in natural images by David Hasler and Sabine Süsstrunk
- Computing image “colorfulness” with OpenCV and Python by Adrian Rosebrock
- A Complete Guide to Picture Complexity Assessment Using Entropy by unimatrixz.com
- Image Enhancement with Python by Sandaruwan Herath
- Deep Residual Learning for Image Recognition by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- Squeeze-and-Excitation Networks by Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Enhua Wu