The aim of this project is to automate the task of assigning an item's category, based on a photo and description of the item for sale, submitted by a seller to an e-commerce marketplace.
The volume of items is currently very small, and each item's category is assigned manually by the seller, which makes it unreliable.
The automation of this task is necessary:
- To improve the user experience of sellers (facilitate the posting of new articles)
- To improve the user experience of buyers (facilitate the search for products)
- To scale up to millions of items.
This is project 6 for the Master in Data Science (in French, BAC+5) from OpenClassrooms.
The project demonstrates the feasibility of automatically grouping products of the same category, through:
- pre-processing product descriptions and images
- extraction of features, from the processed data or its embedding within a model
- dimension reduction techniques
- clustering, confirmed by similarity between real categories and clusters.
- visualization of clusters of products
To run the notebooks, the dataset must be placed in a DATA_FOLDER ('data/raw'). Python libraries are listed in requirements.txt. Each notebook also includes a list of its own requirements, and a procedure for pip install of any missing libraries.
Data: a first dataset (~330 MB) of 1050 articles, each with a photo and an associated description: the link to download
Python libraries: numpy, pandas, matplotlib, seaborn, scikit-learn, tensorflow, yellowbrick
- text: nltk, gensim, transformers, tensorflow_hub, tensorflow_text, wordcloud
- images: pillow, opencv-contrib-python, tensorflow, plotly, kaleido, pydot, graphviz
Notes: Files are in French. As requested for the project, the Jupyter notebooks have not been "cleaned up": the focus is on practising techniques for pre-processing, setting up, tuning, visualising and evaluating text/image machine learning and deep learning algorithms.
Custom functions created in this project for data pre-processing, statistical analysis and data visualization are encapsulated within each notebook, to avoid importing and versioning custom libraries. If GitHub takes too long to render a notebook, open https://nbviewer.org/ and paste the notebook's GitHub URL.
- P6_01_text_nlp.ipynb: Text classification techniques.
- P6_02_image_classification.ipynb: Image classification, plus classification when combined with text features.
- P6_03_support.pdf: Presentation and conclusion
Note: The quality of pre-processing of images and text descriptions has a huge impact on the performance of the models.
Unsupervised, semi-supervised and supervised classification techniques were used for product categorization:
- based only on product text descriptions
- based only on product images
- combining features extracted from both text and images
Text classification (Natural Language Processing) was undertaken using:
- Bag of Words (BoW): word count and TF-IDF vectorization, with n-grams
- Topic Modelling using Latent Dirichlet Allocation
- Word Embedding using Word2Vec pre-trained models
- Word Embedding using deep learning (contextual embeddings from pre-trained models: BERT via HuggingFace transformers, Universal Sentence Encoder)
- Keras (supervised) word embedding: train-test split, demonstrating overfitting of the training data.
Image classification (Computer Vision) was performed using:
- image feature extraction: bag of visual features (SIFT, ORB); visual feature vectors
- supervised training on simple Convolutional Neural Networks (CNN)
- semi-supervised classification on pretrained VGG16 (ImageNet, 1000 features)
- unsupervised classification (on VGG16 features minus 2 layers)
- supervised transfer learning, with fine-tuning
- regularization through the use of image augmentation and dropout layers
Combined text and image features were used to improve the final product categorization.
classification | description |
---|---|
unsupervised | K-means clustering after feature selection and dimension reduction, selecting the number of categories which provides the most distinct clusters |
semi-supervised | the number of clusters was fixed (K=7) |
supervised | the (labelled) categorised data was split into train, test and validation sets, to learn the features of each category. |
Supervised classification was conducted using neural network models:
- shallow neural networks were created to quickly test the impact of pre-processing of text and images, and regularization mechanisms
- deep neural networks were trained on the best pre-processing models
- Cleaning ("stop phrases", tokenization, stemming, stop words identified by low IDF)
- Lemmatization (removes context, so excluded from sentence embedding models)
- Bag of Words: word count and TF-IDF vectorization (Term Frequency–Inverse Document Frequency)
- Tuning with use of n-grams and regex patterns
To identify the most suitable category names, (semi-supervised) topic modelling was applied to the Bag-of-Words features. The TF-IDF vectorization provided a good correlation between the discovered topics and the existing 7 categories.
- Topic visualization with word clouds
The extracted features (for example, word frequencies) were reduced by principal component analysis (PCA), keeping 99% of explained variance, before applying t-distributed stochastic neighbor embedding (t-SNE) to reduce to two dimensions.
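A minimal sketch of this two-stage reduction with scikit-learn is shown below. The random feature matrix and the t-SNE settings (`perplexity`, `init`) are assumptions for illustration; only the 99% explained-variance threshold and the 2-dimensional output come from the description above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical feature matrix: 150 items x 50 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(150, 50))

# A float n_components keeps the smallest number of components
# whose cumulative explained variance reaches 99%.
pca = PCA(n_components=0.99, svd_solver="full")
reduced = pca.fit_transform(features)

# t-SNE then projects the PCA output onto 2 dimensions for plotting and clustering.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(reduced)

print(reduced.shape, embedding.shape)
```

Running PCA first removes low-variance directions and speeds up t-SNE, which scales poorly with the number of input dimensions.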
K-means clustering was applied to identify the clusters, for a number of clusters ranging from 4 to 12.
Automatic classification works best when the categories are clearly separated, as indicated by:
- an elbow in the distortion score
- a high silhouette score
- a low Davies-Bouldin score
Unsupervised classification produced the most clearly separated clusters with 7 categories.
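The search over the number of clusters can be sketched as follows. The synthetic blobs stand in for the 2-D t-SNE coordinates and are an assumption, as is using `inertia_` (within-cluster sum of squares) as the distortion score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic 2-D points standing in for the t-SNE embedding (hypothetical data).
X, _ = make_blobs(n_samples=700, centers=7, cluster_std=0.6, random_state=0)

scores = {}
for k in range(4, 13):  # number of clusters from 4 to 12
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        "distortion": km.inertia_,  # look for an elbow
        "silhouette": silhouette_score(X, km.labels_),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, km.labels_),  # lower is better
    }

# One simple selection rule: the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print(best_k, scores[best_k])
```

In practice the three scores are read together; they do not always agree, so the final choice of the number of clusters remains a judgment call.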
The performance of each model was evaluated by the multicategory confusion matrix, from which we can calculate, for each category:
- precision
- recall
- accuracy
These can be summarised in the classification report, and visualised in a Sankey diagram.
- Adjusted Rand Index (ARI): a measure of similarity between predicted and actual categories
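The evaluation metrics above are all available in scikit-learn. The sketch below uses made-up labels (the category names are hypothetical) to show how the confusion matrix, the per-category precision and recall, the classification report and the adjusted Rand index (ARI) relate to each other.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    classification_report,
    confusion_matrix,
)

# Hypothetical actual and predicted categories for six products.
y_true = ["toys", "toys", "watches", "watches", "kitchen", "kitchen"]
y_pred = ["toys", "watches", "watches", "watches", "kitchen", "kitchen"]
labels = ["kitchen", "toys", "watches"]

# Rows are actual categories, columns are predicted categories.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-category metrics derived from the matrix.
precision = cm.diagonal() / cm.sum(axis=0)  # correct / predicted as category
recall = cm.diagonal() / cm.sum(axis=1)  # correct / actually in category

# The classification report summarises the same quantities, plus the f1-score.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)

# ARI compares two partitions; it does not depend on how the labels are named.
ari = adjusted_rand_score(y_true, y_pred)

print(cm)
print(report)
print(round(ari, 3))
```

Because ARI ignores label names, it can be applied directly to K-means cluster ids without first mapping clusters to categories.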
Word embedding using Word2Vec is based on skip-grams: words found close together sequentially tend to be closely related, and so will have similar feature vectors. Clustering of word vectors (after dimensionality reduction by PCA and t-SNE) gives the following most frequent words, coloured by cluster:
BERT (Bidirectional Encoder Representations from Transformers) and USE models were tested by supplying unlemmatized descriptions to the pretrained models.
Although these are deep learning models that take time to compute the embeddings, the results were less impressive than those of the simpler text models.
- This may be because the product descriptions are mostly not sentences, but are often generated from key-value pairs of product characteristics. With embedding models, keys such as {color, length, width, height, quantity, ...} may add noise rather than context. By contrast, these words have little weight in TF-IDF vectorization.
TensorFlow was used to test supervised classification (data split: 80% train, 20% test), improving the results to close to 90% accuracy on the test set after 10 epochs. However, these models overfit to the 7 categories, and are unlikely to be useful for new product categories.
Based solely on text descriptions, clustering on TF-IDF features gave the best similarity with the labelled categories.
Images were adjusted for:
- exposure
- histogram equalization
- noise filters
- colour/greyscale
- resize
- normalization of values to between -1 and 1
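Three of these adjustments (greyscale conversion, histogram equalization and normalization to the range -1 to 1) can be illustrated with NumPy alone. The notebooks use pillow and OpenCV for this; the helper functions below are hypothetical stand-ins written to keep the sketch self-contained, and the random image is made-up data.

```python
import numpy as np


def to_greyscale(rgb):
    """Convert an RGB uint8 image to greyscale using ITU-R BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.round(rgb[..., :3] @ weights).astype(np.uint8)


def equalize_histogram(gray):
    """Spread the intensities of an 8-bit greyscale image over the 0-255 range."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # cumulative count at the darkest pixel value
    denom = max(cdf[-1] - cdf_min, 1)  # avoid dividing by zero on flat images
    lut = np.round((cdf - cdf_min) / denom * 255).clip(0, 255).astype(np.uint8)
    return lut[gray]  # look-up table applied to every pixel


def normalize(img):
    """Scale 8-bit pixel values to floats between -1 and 1."""
    return img.astype(np.float32) / 127.5 - 1.0


# Low-contrast random image standing in for a product photo (hypothetical data).
rng = np.random.default_rng(0)
rgb = rng.integers(60, 120, size=(32, 32, 3), dtype=np.uint8)

gray = to_greyscale(rgb)
equalized = equalize_histogram(gray)
scaled = normalize(equalized)
print(gray.min(), gray.max(), equalized.min(), equalized.max())
```

After equalization the low-contrast input spans the full 8-bit range, which is the effect sought before feature extraction.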
- SIFT (Scale-Invariant Feature Transform)
- ORB (Oriented FAST and Rotated BRIEF)
Clustering of products on these features, after dimension reduction via PCA/t-SNE, did not produce clearly separated clusters.
A simple convolutional neural network composed of 2 convolution layers (with max pooling), a dropout layer for regularization, a flattening layer and 2 dense layers was used to quickly test pre-processing pipelines and to evaluate the effect of regularization (~1 million parameters, training times of a few seconds).
For this particular problem, better results were obtained from deep CNN models pretrained on millions of images.
The VGG-16 deep convolutional neural network (2014) was used in this project. However, it can easily be replaced by other models such as ResNet (2015), Inception-V3 (2015) or EfficientNet (2019).
TensorFlow provides these deep learning models, pre-trained for 1000 categories using the ImageNet dataset (14 million labelled images).
The VGG-16 pretrained model (ImageNet weights) was used to predict the probability of each image belonging to each of the 1000 ImageNet categories. These 1000 features were reduced in dimension by PCA followed by t-SNE, using the same procedure as for text classification. The result was an ARI score of 0.38, corresponding to an accuracy of around 60%.
To improve classification, the last two layers were removed, leaving 4096 underlying features instead of the 1000 categories. Applying dimension reduction and K-means clustering resulted in an ARI score of 0.53, equivalent to an accuracy of around 70%.
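Reporting an accuracy for a clustering requires mapping each cluster to a category first. This README does not say how that mapping was done; one common approach, sketched below as an assumption, is to pick the one-to-one assignment that maximises the number of matching items (Hungarian algorithm, via `scipy.optimize.linear_sum_assignment`). The function name and the toy labels are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix


def cluster_accuracy(y_true, clusters):
    """Accuracy after the best one-to-one mapping of clusters to categories.

    Assumes categories and cluster ids are both integers in 0..K-1.
    """
    cm = confusion_matrix(y_true, clusters)
    rows, cols = linear_sum_assignment(cm, maximize=True)
    return cm[rows, cols].sum() / cm.sum()


# Toy example: 3 categories, cluster ids numbered differently from categories.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
clusters = np.array([2, 2, 1, 0, 0, 0, 1, 1, 0])

acc = cluster_accuracy(y_true, clusters)
ari = adjusted_rand_score(y_true, clusters)
print(round(acc, 3), round(ari, 3))
```

ARI is corrected for chance agreement, so it is typically lower than the mapped accuracy for the same clustering, as in the figures quoted above.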
- a simple convolution network was trained on the images.
- overfitting was observed, so image augmentation and a dropout layer were added
- the dense layers were removed and replaced with a flattening layer and new dense layers, along with a final softmax function to choose between the 7 categories.
- the convolution layers were kept and their pre-trained weights were frozen to avoid losing the pretrained image features
- fine-tuning was applied by adjusting only the weights in the new dense and softmax layers, whilst keeping the pre-trained weights in the convolutional layers frozen
- categorical crossentropy was used as the loss function
- the Adam optimization algorithm (an extension of stochastic gradient descent) was used for fast optimization
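To make the final softmax layer and the categorical cross-entropy loss concrete, here is a small NumPy sketch of both formulas. It is an illustration of the definitions, not the Keras implementation used in the notebooks; the logits are made-up values for 7 categories.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 for each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def categorical_crossentropy(y_onehot, probs, eps=1e-12):
    """Mean over samples of -sum(y * log(p)) across the categories."""
    return float(-(y_onehot * np.log(probs + eps)).sum(axis=1).mean())


# Two samples, 7 categories: one confident prediction, one uniform prediction.
logits = np.array([
    [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])
y_onehot = np.eye(7)[[0, 3]]  # true categories: 0 and 3

probs = softmax(logits)
loss_confident = categorical_crossentropy(y_onehot[:1], probs[:1])
loss_uniform = categorical_crossentropy(y_onehot[1:], probs[1:])
print(loss_confident, loss_uniform)
```

A uniform prediction over 7 categories gives a loss of ln(7); training drives the loss below this baseline by pushing probability mass towards the true category.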
- feature extraction using SIFT and ORB was not very successful
- unsupervised classification using the features obtained after removing the last 2 layers of VGG-16 gave the best results for classification based solely on the images
type | model | ARI score |
---|---|---|
semi-supervised | SIFT | 0.05 |
semi-supervised | ORB | 0.04 |
unsupervised | VGG-16 pretrained (1000 features) | 0.38 |
unsupervised | VGG-16 pretrained, last 2 layers removed (4096 features) | 0.53 |
supervised | Transfer Learning | 0.45 |
supervised | Transfer Learning after fine tuning | 0.50 |
The best results were obtained by combining the best text features with the best image features, resulting in an accuracy of 84%.
- This can probably be improved by using more recent deep learning models for word embedding and image transfer learning.
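This README does not detail how the text and image features were combined. One simple option, shown here purely as an assumption, is to standardize each feature block and concatenate them column-wise, so that neither block dominates the distances used by PCA, t-SNE and K-means. The function name and the random matrices are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def combine_features(text_features, image_features):
    """Standardize each feature block, then concatenate them column-wise."""
    text_scaled = StandardScaler().fit_transform(text_features)
    image_scaled = StandardScaler().fit_transform(image_features)
    return np.hstack([text_scaled, image_scaled])


# Hypothetical features for 100 products, on very different scales.
rng = np.random.default_rng(0)
text_features = rng.random((100, 20))  # e.g. reduced TF-IDF features
image_features = rng.normal(size=(100, 10)) * 50  # e.g. reduced CNN features

combined = combine_features(text_features, image_features)
print(combined.shape)
```

The combined matrix can then go through the same dimension reduction and clustering steps as the text-only and image-only features.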
The images are visualised on the t-SNE axes for the final model (text and image features combined):
- Remove as much noise as possible from the text descriptions (for example, stop phrases) and from the images (equalization and noise filters): pre-processing has a major impact on performance
- Replace VGG-16 with more recent pre-trained deep learning models that are faster, more efficient and more accurate
- Try out different dense layer and regularization mechanisms
- Adjust the learning rate during fine-tuning of the transfer learning model
- Add the text features extracted by TF-IDF as inputs to the deep learning model, alongside the image features extracted by convolution layers, before fine-tuning the weights of the final dense layers and the softmax layer.
- Alternatively, fine tune a Keras word embedding model, then extract the features from one of the final layers as input to the image dense layers
text classification, natural language processing (NLP)
- text pre-processing : stop phrases, tokenization, stop words, lemmatization
- text feature extraction : bag of words (Count, TF-IDF vectorization, n-grams, regex patterns)
- topic modelling : LDA – Latent Dirichlet Allocation
- topic visualization : wordClouds
- word vectors : Word2Vec, skip-grams
- word embedding : contextual skip-grams, deep learning, LSTM neural networks, BERT, HuggingFace transformers, Universal Sentence Encoder
- Keras word embedding : train-test split, overfitting, variance-bias, regularization, validation set
image classification, computer vision (CV)
- image pre-processing : resize, colour/greyscale, exposure, equalization, noise filters, squarify, normalization
- image feature extraction : bag of visual features (SIFT, ORB), visual feature vectors
- convolutional neural networks (CNN) : VGG16 pretrained (ImageNet), semi-supervised (1000 features), unsupervised (features minus 2 layers)
- supervised image classification : transfer learning, fine-tuning
- deep learning : pooling layers, dense layers, activation layers (ReLU, softmax)
- regularization : image augmentation, dropout layers
dimensionality reduction
- PCA, t-SNE
K-means clustering
- silhouette score, distortion, intra/inter cluster variance
- cluster similarity, adjusted rand index (ARI), multiclass confusion matrix
- precision, recall, f1-score, classification report, sankey diagrams
- Preprocess text data to obtain a usable dataset for Natural Language Processing
- Unsupervised text classification and topic modelling techniques
- Preprocess image data to obtain a usable dataset for Computer Vision
- Implement dimension reduction techniques
- Represent large-scale data graphically