Goal 1: Automatically label each pixel as either person or background.

Goal 2: Automatically label each pixel as one of:
- background
- skin
- hair
- t-shirt
- shoes
- pants
- dress
- Convolutional neural networks
- Edge detection
- Conditional random fields
- Super-pixels
- Region Adjacency Graphs
- K-means clustering
reference:
Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar Zaiane, and Martin Jagersand, "U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection", 2020; https://arxiv.org/pdf/2005.09007.pdf
https://github.com/xuebinqin/U-2-Net
U-2-Net uses a two-level nested U-structure, which lets the network go deeper and keep high-resolution feature maps without a large increase in memory or computation. At the lower level, each stage is a residual U-block (RSU) that extracts intra-stage multi-scale features while preserving feature map resolution; at the upper level, the stages are arranged in a U-Net-like configuration, with each stage populated by an RSU block.
To train the model, the suggested loss function is a multi binary cross-entropy fusion loss: a weighted sum of the binary cross-entropy losses of each side output plus the loss of the final fused output.
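A minimal sketch of this loss, assuming each output is already a sigmoid probability map (the paper sets all the weights to 1; any other values below would be assumptions):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def multi_bce_fusion_loss(side_outputs, fused_output, target, weights=None):
    # side_outputs: list of per-stage probability maps, fused_output: final fused map,
    # target: binary ground-truth mask; all tensors of shape (batch, 1, H, W).
    outputs = list(side_outputs) + [fused_output]
    weights = weights or [1.0] * len(outputs)          # equal weights, as in the paper
    return sum(w * bce(out, target) for w, out in zip(weights, outputs))
```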
I used the same train-valid-test split as suggested; however, I randomized the image ordering and used a fixed seed. I trained with a batch size of 12 images and validated with a batch size of 1. I found it useful to apply random transformations to the training images and observed higher test accuracy when doing so. I used the Adam optimizer with a learning rate of 0.001 and saved a snapshot of the model every 2000 epochs.
The training time for this model was very long, and the model was very inaccurate unless it was trained for many epochs. Because making new iterations and testing became so slow, I decided not to move forward with this model.
I used a combination of mean IoU and global pixel-wise accuracy to evaluate this model. I evaluated the model on a random set of 240 images from the dataset (40% of the dataset).
- Mean IoU: ~96%
- Global pixel-wise accuracy: ~99%
I only evaluated the model on the person/background pixel labeling task since I did not move forward using this model.
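For reference, here is a minimal sketch of how these two metrics can be computed from integer label maps; my actual evaluation code may differ in details such as how classes absent from an image are handled.

```python
import numpy as np

def mean_iou_and_pixel_acc(pred, target, num_classes):
    # pred, target: integer label maps of the same shape (H, W)
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:          # class absent from both maps: skip it
            continue
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    pixel_acc = (pred == target).mean()
    return float(np.mean(ious)), float(pixel_acc)
```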
The following predictions were made with a model I trained for 14,000 epochs.
*(prediction images)*
I added a post-processing step for this model that thresholds the pixel values; without it, you can see blurry spots where the outputs are fused.
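A minimal sketch of that thresholding step, assuming the fused output is a single-channel probability map (the cutoff value is an assumption):

```python
import numpy as np

def threshold_mask(prob_map, cutoff=0.5):
    # Normalize the fused U-2-Net output to [0, 1], then snap every pixel to 0 or 1
    # so the blurry transition values around the mask boundary disappear.
    prob_map = (prob_map - prob_map.min()) / (prob_map.max() - prob_map.min() + 1e-8)
    return (prob_map > cutoff).astype(np.uint8)
```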
*(prediction images)*
Convolutional neural networks are not a one-step solution to solving problems. Architecture details have an impact on many parts of the design, including the training step and any post-processing that needs to be done.
Long training times and the lack of pre-trained weights to load were hard to get around. Training took so long that I decided not to train a multi-label model, and because this model was complex and difficult to train, I did not go past a trial stage with it.
The researchers who created U-2-Net also created a smaller version that takes up less space and may be faster to train; that may be worth looking into if I explore this model again. I would also spend more time training a U-2-Net model to make multi-label pixel predictions and work on caching pre-trained weights to speed up training.
reference:
Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam, "Rethinking Atrous Convolution for Semantic Image Segmentation", 2017; https://arxiv.org/pdf/1706.05587.pdf
https://pytorch.org/hub/pytorch_vision_deeplabv3_resnet101
https://github.com/pytorch/vision
https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py
The main architecture consists of a ResNet backbone, followed by the ASPP module and a decoder module. The ASPP module captures multi-scale information, while the decoder module refines the features and predicts the final segmentation mask.
The ResNet backbone of the DeepLabv3 model consists of multiple layers of bottleneck blocks, with each layer gradually reducing the spatial dimensions while increasing the number of channels, allowing the model to extract important image features.
Then comes the ASPP module. What separates DeepLabv3 is its use of atrous spatial pyramid pooling (ASPP): multiple parallel dilated (atrous) convolutions with different dilation rates applied to the feature maps produced by the encoder part of the network. An atrous convolution is a standard convolution with gaps inserted between the kernel elements, which enlarges its receptive field without adding parameters. The goal of ASPP is therefore to enlarge the receptive field of the network without increasing the number of parameters or further downsampling the features. This allows the DeepLab models I've coded to capture both local details and global context: the parallel dilated convolutions analyze the image at multiple scales and aggregate information from different receptive fields, making the predictions more precise, more robust, and (hopefully) more scale invariant.
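As a rough illustration, a simplified ASPP block could look like the sketch below; the channel counts and dilation rates are assumptions, and this is not the exact torchvision implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Parallel 1x1 and dilated 3x3 convolutions plus a global-pooling branch,
    fused by a 1x1 projection, in the spirit of DeepLabv3's ASPP."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)]
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
               for r in rates]
        )
        self.image_pool = nn.Sequential(        # image-level (global context) branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        size = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=size,
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```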
Finally, the decoder module refines the features produced by the ASPP module and makes the final pixel-labeling prediction.
To train the model, the suggested loss function is cross-entropy loss. I also tested a weighted cross-entropy loss with different weights for each class, as well as a Dice loss. I had good results by giving the background class a lower weight compared to the other classes.
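A minimal sketch of the class-weighted cross-entropy setup; the weight values are illustrative, the only point being that the background class (index 0) gets a lower weight than the others.

```python
import torch
import torch.nn as nn

# 7 classes: background, skin, hair, t-shirt, shoes, pants, dress
class_weights = torch.tensor([0.4, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])  # assumed values
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 7, 256, 256)           # model output: (batch, classes, H, W)
target = torch.randint(0, 7, (8, 256, 256))    # per-pixel ground-truth class indices
loss = criterion(logits, target)
```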
I used the same train-valid-test split as suggested; however, I randomized the image ordering and used a fixed seed. I trained with a batch size of 8 images and used a batch size of 8 for validation as well.
Training this model was fast, about 7 seconds per epoch, and the model was quite accurate even when trained for only a few epochs. I observed that validation loss reached its minimum around epoch 20. By loading the model's ResNet-101 backbone with pre-trained weights, I was able to speed up training and increase test accuracy.
When training this model, I used the Adam optimizer with a fairly low learning rate of 0.00005 to 0.0001. I also used a learning rate scheduler that lowers the learning rate when progress plateaus, so the learning rate changes less frequently.
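Putting the last two paragraphs together, a sketch of the training setup; it assumes a recent torchvision with weight enums, and the plateau scheduler with its factor/patience values is my reading of the schedule described above, not a confirmed detail.

```python
import torch
from torchvision import models

# DeepLabv3 with an ImageNet-pretrained ResNet-101 backbone and a fresh 7-class head
model = models.segmentation.deeplabv3_resnet101(
    weights=None,                                             # no pretrained seg. head
    weights_backbone=models.ResNet101_Weights.IMAGENET1K_V1,  # pretrained backbone
    num_classes=7,
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # lr in the 5e-5 to 1e-4 range
# Assumption: a plateau-based scheduler; factor and patience are illustrative values.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)
# after each epoch: scheduler.step(validation_loss)
```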
I tried applying different transformations (random horizontal flip, random scale, random crop) to the training images but did not observe higher test accuracies when I did so.
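For completeness, a sketch of how those augmentations can be applied jointly to an image tensor and its label mask; the parameter ranges are illustrative.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def joint_transform(image, mask, base_size=512):
    # image: float tensor (C, H, W); mask: integer tensor (1, H, W).
    if random.random() < 0.5:                         # random horizontal flip
        image, mask = TF.hflip(image), TF.hflip(mask)
    scale = random.uniform(1.0, 1.5)                  # random scale (kept >= crop size)
    size = [int(base_size * scale), int(base_size * scale)]
    image = TF.resize(image, size)
    mask = TF.resize(mask, size, interpolation=InterpolationMode.NEAREST)
    top = random.randint(0, size[0] - base_size)      # random crop back to base_size
    left = random.randint(0, size[1] - base_size)
    image = TF.crop(image, top, left, base_size, base_size)
    mask = TF.crop(mask, top, left, base_size, base_size)
    return image, mask
```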
I used a combination of mean IoU and global pixel-wise accuracy to evaluate this model. I evaluated the model on a random set of 240 images from the dataset (40% of the dataset).

person/background pixel labeling:
- Mean IoU: ~95.7%
- Global pixel-wise accuracy: ~98.5%

fashion pixel labeling:
- Mean IoU: ~54.2%
- Global pixel-wise accuracy: ~94.2%
The following predictions were made with a model I trained for 25 epochs.
*(prediction images)*
I realized that the labels on the training set were not perfect. Some had gaps where there weren't supposed to be any, and others had similar small disparities. I also found some ambiguous ground-truth fashion labels. For example, a skirt in image 154 that has the texture and shape of the bottom of a dress is labeled as pants in the ground truth. Any type of suit or pullover, even ones that look like shirts, is labeled as background. In image 568, there are other people fairly close in the background. There are many odd cases like these in the dataset, and I thought about using a different dataset entirely. I learned that the training data you use is a very important factor in how the trained model behaves.
Using this model, there were sometimes small regions that were misclassified, so I wrote an additional post-processing step to remove small connected regions.
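A sketch of that clean-up step using scikit-image; the minimum region size is an assumption.

```python
import numpy as np
from skimage import morphology

def remove_small_regions(label_map, min_size=500, background=0):
    # For every non-background class, drop connected regions smaller than min_size
    # pixels and re-assign those pixels to the background class.
    cleaned = label_map.copy()
    for c in np.unique(label_map):
        if c == background:
            continue
        mask = label_map == c
        kept = morphology.remove_small_objects(mask, min_size=min_size)
        cleaned[mask & ~kept] = background
    return cleaned
```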
The shapes are not exact for some of the predictions, and the neural network is not as conservative in some of the regions as it should be.
Another way to explore improving predictions is through conditional random fields; see the section below for how I use them to improve predictions. I could also try tuning the learning process by trying different optimizers and more loss functions (for example, using the Jaccard index / IoU as a loss function).
reference:
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation", 2018; https://arxiv.org/pdf/1802.02611.pdf
DeepLabv3+ primarily introduces an encoder-decoder architecture, which makes the overall design easier to follow. The model is split into two stages: an encoder, and a decoder that reuses low-level features to recover detail. Similar to DeepLabv3, the model aims to capture the important semantic features needed to learn how to properly label the pixels of an image.
The main architecture consists of a ResNet backbone followed by the ASPP module. The ASPP output is then upsampled and concatenated in the decoder with low-level features taken from the backbone, refined, and upscaled again to arrive at the final prediction.
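A simplified sketch of that decoder step; the channel sizes are assumptions rather than the exact values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepLabV3PlusDecoder(nn.Module):
    """Project low-level backbone features, concatenate them with upsampled ASPP
    features, refine, and predict per-pixel logits (DeepLabv3+-style decoder sketch)."""
    def __init__(self, low_ch=256, aspp_ch=256, num_classes=7):
        super().__init__()
        self.project = nn.Conv2d(low_ch, 48, 1, bias=False)
        self.refine = nn.Sequential(
            nn.Conv2d(48 + aspp_ch, 256, 3, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, low_level_feats, aspp_feats, out_size):
        low = self.project(low_level_feats)
        aspp = F.interpolate(aspp_feats, size=low.shape[-2:],
                             mode="bilinear", align_corners=False)
        logits = self.refine(torch.cat([low, aspp], dim=1))
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)
```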
I use the same training configuration as I do for DeepLabv3, as described above. However, this model took longer to train, with each epoch taking about 37 seconds (versus about 7 for DeepLabv3).
I used a combination of mean IoU and global pixel-wise accuracy to evaluate this model. I evaluated the model on a random set of 240 images from the dataset (40% of the dataset).

person/background pixel labeling:
- Mean IoU: ~94.8%
- Global pixel-wise accuracy: ~98.1%

fashion pixel labeling:
- Mean IoU: ~51.5%
- Global pixel-wise accuracy: ~93.4%
The following predictions were made with a model I trained for 25 epochs.
*(prediction images)*
reference:
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", 2017; https://arxiv.org/pdf/1606.00915.pdf
I used a library that implements a CRF as a PyTorch neural-net layer for image segmentation. I believe there are newer, more powerful libraries; however, the C++ build process for those libraries was broken. I used the same training method as I did for the DeepLab models but added the CRF module as a post-processing step.
The permutohedral module was broken for some of the newer, more accurate implementations.
I may want to implement the permutohedral module myself as a useful utility for pixel labeling tasks such as this.
reference:
Pedro F. Felzenszwalb, Daniel P. Huttenlocher, "Efficient Graph-Based Image Segmentation", 2004; https://cs.brown.edu/people/pfelzens/papers/seg-ijcv.pdf
My proposed design is as follows: remove the background from the image using a CNN, then segment the image into superpixels using Felzenszwalb's segmentation algorithm, then reduce the number of segments by grouping similar superpixels with a region adjacency graph or K-means, and finally classify the groups. I used the DeepLabv3 model I built above to classify each superpixel as the most predominant label predicted within it.
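A rough sketch of the superpixel labeling stage (it skips the RAG/K-means grouping step); the Felzenszwalb parameters are illustrative.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def superpixel_labels(image, cnn_pred, scale=100, sigma=0.8, min_size=50):
    # image: (H, W, 3) RGB array with the background already removed;
    # cnn_pred: (H, W) integer label map predicted by DeepLabv3.
    segments = felzenszwalb(image, scale=scale, sigma=sigma, min_size=min_size)
    out = np.zeros_like(cnn_pred)
    for seg_id in np.unique(segments):
        region = segments == seg_id
        # label the whole superpixel with the most common CNN prediction inside it
        out[region] = np.bincount(cnn_pred[region]).argmax()
    return out
```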
My initial idea for the end stage of my architecture was to build a dataset and separate classifiers for superpixels, classifying each superpixel separately as its own image. However, I was unable to create good groupings of superpixels using only color and adjacency techniques, and the dataset of fashion item images became unreasonably large.
When reducing the number of segments, the color palette was also reduced, and details were lost in certain parts of the image. It is important to be selective about the information you choose to discard.
One potential challenge I noticed was that misclassified superpixels sometimes took away from the detail, and in those scenarios the CNN was making better predictions on its own.
In future iterations I would research different algorithms to create better groupings of superpixels and define more features on each superpixel, which could help both during training and when grouping the fashion items. I included an example notebook showing how pose detection could give some insight into classifying superpixels; this may be something to look into in the future.
I found a useful paper that may help me implement more superpixel features for further iterations of the project: https://www.cs.toronto.edu/~urtasun/publications/simo_et_al_accv14.pdf
Install the requirements:
`pip install -r requirements.txt`

To run the web app:
`python app.py`

Or try the notebooks:
- The main implementation is in `mynotebook.ipynb`
- Super pixel extensions are in `superpixel.ipynb`
- Testing out pose detection is in `mypose.ipynb`
- My other trial code for U-2-Net ... is in `/trials`
dataset adapted from:
Kota Yamaguchi, M Hadi Kiapour, Luis E Ortiz, Tamara L Berg, "Parsing Clothing in Fashion Photographs", CVPR 2012 http://vision.is.tohoku.ac.jp/~kyamagu/research/clothing_parsing/
additional references:
Ihor Shylo, "Improving Performance of Image Segmentation with Conditional Random Fields (CRF)", 2020 https://medium.com/@ihor.shylo/improving-performance-of-image-segmentation-with-conditional-random-fields-crf-8b93f7db396c
https://github.com/jfzhang95/pytorch-deeplab-xception
https://github.com/Mr-TalhaIlyas/Conditional-Random-Fields-CRF