The current case study revolves around tag prediction for Stack Overflow questions on programming topics. We utilise the StackSample Kaggle dataset, which represents approximately 10% of the Stack Overflow Q&A corpus. More specifically, we only use the following files:
Questions.csv
- A unique identifier of the user that created each question.
- A unique identifier of the question itself.
- Creation and closing datetimes corresponding to each question.
- The cumulative reaction score of each question (zero, positive or negative).
- The title and main body of each question.
Tags.csv
- A unique identifier for each question.
- One or more associated tags.
We utilise a diverse range of NLP tools and models, spanning from traditional ML models to advanced neural networks such as BERT fine-tuned for this specific task. Finally, we bring all the results together in order to compare them from a metrics and efficiency perspective.
All three notebooks below can be configured with a desired number M, which represents the number of Top Tag Combinations, and retrieve the respective subset of the dataset for experimental purposes.
Due to time and resource limitations, we knew from the beginning of this project that we had to retain a proper subset of the original dataset. After merging the two CSV files, we looked for insights on the Tags via plots and statistics, which led us to experiment with keeping only the top N tags. After some trial and error, we decided (for future ease) to keep the top M tag combinations instead. Note that we kept only the questions with a positive cumulative score, as we believe these questions usually provide valuable insights and solutions, and therefore higher-quality data.
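A minimal sketch of this subsetting (assuming the standard StackSample column names `Id`, `Score` and `Tag`; the exact notebook code may differ):

```python
import pandas as pd

M = 50  # number of Top Tag Combinations to keep

questions = pd.read_csv("Questions.csv", encoding="latin-1")
tags = pd.read_csv("Tags.csv", encoding="latin-1")

# Attach the list of tags to each question, keep only positively scored ones
tags_per_q = tags.groupby("Id")["Tag"].apply(list).to_frame("Tags")
df = questions.merge(tags_per_q, left_on="Id", right_index=True)
df = df[df["Score"] > 0]

# One key per tag combination, then keep the M most frequent combinations
combo = df["Tags"].apply(lambda t: "|".join(sorted(t)))
top_m = combo.value_counts().nlargest(M).index
subset = df[combo.isin(top_m)]
```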
The biggest time allocation of this project was by far EDA, and one of its subgoals, preprocessing. We put a strong focus on manual exploration of many examples in order to ensure the pipeline's stability. Words that are also tag names were treated specially, because they carry a strong weight for prediction. In addition, most of the tags are programming languages, packages, versions, systems and so on, so we were very cautious with punctuation (C#, C++, .NET). The preprocessing steps are listed below; a minimal code sketch follows the list.
- Lowercase
- Cleaning
- Removal of noise from tag words
- Removal of HTML tags
- Removal of IPv4 addresses
- Removal of URLs
- Removal of redundant symbols (spaces, newline etc.)
- Fix Contractions
- Tokenization (NLTK) on words that are not tags
- Punctuation Removal (except for '.-#+')
- Removal of Stopwords
- Lemmatization
- Removal of punctuation-only tokens
- Removal of digit-only tokens
- Joining
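A minimal code sketch of the pipeline above, assuming the NLTK data (`punkt`, `stopwords`, `wordnet`) is downloaded and using the `contractions` package for contraction expansion; the notebook's actual implementation may differ in details:

```python
import re
import string
import contractions
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
KEEP = set(".-#+")  # punctuation that matters for tags like c#, c++, .net

def preprocess(text: str, tag_vocab: set) -> str:
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                      # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)        # URLs
    text = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", " ", text)  # IPv4 addresses
    text = contractions.fix(text)                             # "don't" -> "do not"
    text = re.sub(r"\s+", " ", text).strip()                  # redundant whitespace

    tokens = []
    for word in text.split():
        if word in tag_vocab:                 # tag words (c#, c++, .net) kept as-is
            tokens.append(word)
            continue
        for tok in word_tokenize(word):       # NLTK tokenization for non-tag words
            tok = "".join(ch for ch in tok if ch not in set(string.punctuation) - KEEP)
            if not tok or tok in STOPWORDS:
                continue
            tok = LEMMATIZER.lemmatize(tok)
            if all(ch in string.punctuation for ch in tok) or tok.isdigit():
                continue                      # punctuation-only / digit-only tokens
            tokens.append(tok)
    return " ".join(tokens)                   # joining back into a single string
```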
After a single run of the EDA.ipynb notebook with the chosen M value, a preprocessed.csv is produced, which contains the data for the top M tag combinations, ready to be used afterwards in the model notebooks. In this way the code is more generic, and we can store different subsets of the original dataset as we prefer.
Baseline_Model.ipynb
notebook defines the M number in order to retrieve whichever preprocessed version of the dataset the user prefers.
Our problem is multilabel, so a question may have more than one Tag, and this is a tricky point we need to handle when splitting the original dataset properly. We implemented a function, used in both model notebooks, whose goal is to split the data into (train, test) or (train, val, test) with configurable sizes. Through this manual function we ensure that our final sets are balanced across all possible Tag Combinations.
Tip
To split the data properly and obtain well-distributed train, val and test sets, try converting the multilabel problem to a single-label one by computing all unique label combinations and stratifying on them. With this logic we can be fairly sure that the sets are balanced and contain a ratio-proportional number of examples per combination, giving the model sufficient exposure to all cases.
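A minimal sketch of this splitting logic (the function and column names are illustrative, and it assumes every tag combination occurs at least twice so that stratification is possible):

```python
from sklearn.model_selection import train_test_split

def split_multilabel(df, tag_col='Tags', test_size=0.2, val_size=0.1, seed=42):
    # Collapse each row's tag list into a single combination key, e.g. "c#|asp.net",
    # and stratify on it so every combination is proportionally represented.
    combo = df[tag_col].apply(lambda tags: '|'.join(sorted(tags)))
    train_df, test_df = train_test_split(
        df, test_size=test_size, stratify=combo, random_state=seed)
    if not val_size:
        return train_df, test_df
    combo_train = combo.loc[train_df.index]
    train_df, val_df = train_test_split(
        train_df, test_size=val_size / (1 - test_size),  # expressed as fraction of the whole dataset
        stratify=combo_train, random_state=seed)
    return train_df, val_df, test_df
```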
- TF-IDF (Term Frequency-Inverse Document Frequency) was used for text feature extraction with max_features=20000, also considering uni-, bi- and trigrams (a minimal sketch of the full setup follows this list).
- MultiLabelBinarizer was used to convert tags to binary vector representation.
- Multiple scikit-learn models were tried, but the most competitive performance came from Linear SVC.
- Hamming Loss, Micro-F1 and Macro-F1 were the metrics chosen for evaluation.
- KFold Cross-Validation for the best model, ensuring that its performance is consistent across different subsets of the data.
- Classification Report for each Tag.
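A minimal sketch of this baseline setup (the column names and the `OneVsRestClassifier` wrapper around Linear SVC are illustrative choices, not necessarily the notebook's exact code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import hamming_loss, f1_score, classification_report

# Text -> TF-IDF features (word uni/bi/trigrams), tags -> binary indicator matrix
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 3))
mlb = MultiLabelBinarizer()

X_train = vectorizer.fit_transform(train_df['Text'])
X_test = vectorizer.transform(test_df['Text'])
y_train = mlb.fit_transform(train_df['Tags'])
y_test = mlb.transform(test_df['Tags'])

# One binary Linear SVC per tag
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Hamming Loss:', hamming_loss(y_test, y_pred))
print('Micro-F1:', f1_score(y_test, y_pred, average='micro'))
print('Macro-F1:', f1_score(y_test, y_pred, average='macro'))
print(classification_report(y_test, y_pred, target_names=mlb.classes_))
```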
LLM_Model.ipynb
notebook, following a similar thought process, defines the M number in order to retrieve whichever preprocessed version of the dataset the user prefers.
A series of experiments took place in Google Colab, where all notebooks were run, using a GPU for BERT due to its heavy architecture. As Devlin et al. propose for fine-tuning tasks, we experimented on each data subset with batch sizes {16, 32} and a small number of epochs: 3 (GPU constraint).
Note
The following two code blocks produce equivalent models; we use the first one for manual experimentation.
```python
import torch
import transformers
from transformers import BertForSequenceClassification

# Custom wrapper: pretrained BERT encoder + dropout + linear classification head
class BERTModel(torch.nn.Module):
    def __init__(self, num_labels):
        super(BERTModel, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained('bert-base-uncased', return_dict=False)
        self.l2 = torch.nn.Dropout(0.1)
        self.l3 = torch.nn.Linear(768, num_labels)  # 768 = hidden size of bert-base

    def forward(self, ids, mask):
        _, output_1 = self.l1(ids, attention_mask=mask)  # pooled [CLS] representation
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output  # raw logits, one per tag

model = BERTModel(len(unique_tags))
```

```python
# Equivalent off-the-shelf HuggingFace model (BERT + dropout + classification head)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(unique_tags))
```
- BERT ('bert-base-uncased') was chosen as the architecture, and a PyTorch model was constructed from BERT with a trainable classification layer on top.
- MultiLabelBinarizer was used to convert tags to binary vector representation.
- PyTorch Dataset & DataLoader made feeding the data into the model smooth (see the sketch after this list).
- BCEWithLogitsLoss was used as the loss function; it is well suited for multilabel problems, as it calculates the loss for each label independently.
- PyTorch fine-tuning and the respective learning curve.
- Hamming Loss, Micro-F1 and Macro-F1 were the metrics chosen for evaluation.
- Classification Report for each Tag.
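A minimal sketch of the Dataset and of a single training step with BCEWithLogitsLoss, compatible with the custom `BERTModel` above; tokenizer settings, maximum length and the optimizer are assumptions rather than the notebook's exact values:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer

class TagDataset(Dataset):
    """Wraps preprocessed texts and binarized tag vectors for BERT."""
    def __init__(self, texts, labels, max_len=256):
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.texts, self.labels, self.max_len = texts, labels, max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(self.texts[idx], truncation=True, padding='max_length',
                             max_length=self.max_len, return_tensors='pt')
        return {'ids': enc['input_ids'].squeeze(0),
                'mask': enc['attention_mask'].squeeze(0),
                'labels': torch.tensor(self.labels[idx], dtype=torch.float)}

loss_fn = torch.nn.BCEWithLogitsLoss()  # independent binary cross-entropy per tag

def train_step(model, batch, optimizer, device):
    """One optimisation step on a single batch from the DataLoader."""
    optimizer.zero_grad()
    logits = model(batch['ids'].to(device), batch['mask'].to(device))
    loss = loss_fn(logits, batch['labels'].to(device))
    loss.backward()
    optimizer.step()
    return loss.item()

# usage: loader = DataLoader(TagDataset(train_texts, y_train), batch_size=16, shuffle=True)
```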
```
.
├── ...
├── notebooks
│   ├── Tag_Combinations_M          # M: Top Tag Combinations (different dataset subsets)
│   ├── EDA_M.ipynb
│   ├── Baseline_Model_M.ipynb
│   ├── LLM_Model_M_16.ipynb        # Batch Size: 16
│   ├── LLM_Model_M_32.ipynb        # Batch Size: 32
├── src                             # Python code of the notebooks
│   ├── EDA.py
│   ├── Baseline_Model.py
│   ├── LLM_Model.py
├── images                          # Images needed
│   ├── stackoverflow.png
├── README.md
├── ...
```
| Questions | Tag Combinations | Unique Tags | SVC Hamming Loss | SVC Micro-F1 | SVC Macro-F1 | BERT Hamming Loss | BERT Micro-F1 | BERT Macro-F1 | BERT Epoch GPU (Batch Size) |
|---|---|---|---|---|---|---|---|---|---|
| 33,374 | 20 | 16 | 0.03 | 0.83 | 0.81 | 0.02 | 0.86 | 0.85 | 12 (32) |
| 42,369 | 35 | 28 | 0.02 | 0.80 | 0.79 | 0.01 | 0.85 | 0.85 | 16 (16) |
| 48,505 | 50 | 38 | 0.01 | 0.79 | 0.77 | 0.01 | 0.84 | 0.81 | 18 (16) |
| 62,118 | 100 | 74 | 0.01 | 0.78 | 0.70 | 0.01 | 0.82 | 0.70 | 21 (16) |
| 70,474 | 150 | 104 | 0.01 | 0.76 | 0.68 | 0.01 | 0.80 | 0.66 | 26 (16) |
| 76,766 | 200 | 133 | RAM crash | – | – | – | – | – | – |
Unfortunately, during the experiments we faced RAM issues with the 200 Top Tag Combinations subset, so we had to limit ourselves to smaller subsets of the original dataset.
- The baseline model is a Linear Support Vector Machine and, besides its pretty decent performance in all of our experiments, it was also very time efficient, training in less than a minute. Looking at the broader picture, this traditional ML model achieves, in a matter of seconds, very satisfying results based on our metrics. For example, if we focus on the row with 38 Unique Tags, Linear SVC recorded a Micro-F1 of 0.79. A significant note is that the training time of Linear SVC is not directly affected as the number of examples increases, so from a computational perspective it is the absolute winner.
- For a more sophisticated model we opted for the HuggingFace 'bert-base-uncased' BERT model with a classification head added on top. Our choice is justified by the fact that we have a multilabel classification problem and BERT is very popular for text classification. In all experiments, BERT consistently outperforms Linear SVC in terms of Hamming Loss, Micro-F1 and Macro-F1, indicating better accuracy in predicting the labels. However, fine-tuning BERT required extensive GPU usage: the larger the dataset, the more minutes each epoch needed.
All in all, it can be concluded that BERT generally outperforms Linear SVC for this classification task. However, it's essential to consider factors such as computational resources and model complexity when choosing between the two approaches.
- With more time, even a few more epochs could lead to slightly better results.
- Allocate more time to preprocessing and to further exploration of the data.
- Extensive hyperparameter tuning (e.g. learning rate, optimizer, dropout).
- Conduct experiments with Sentence Transformers for embeddings.