The current case study revolves around tag prediction for Stack Overflow questions on programming topics. We utilise the StackSample Kaggle dataset, which represents approximately 10% of the Stack Overflow Q&A corpus. More specifically, we only use the following files:
Questions.csv
- A unique identifier of the user that created each question.
- A unique identifier of the question itself.
- Creation and closing datetimes corresponding to each question.
- The cumulative reaction score of each question (zero, positive or negative).
- The title and main body of each question.
Tags.csv
- A unique identifier for each question.
- One or more associated tags.
We utilise a diverse range of NLP tools and models, spanning from traditional ML models to advanced neural networks such as BERT fine-tuned for this specific task. Finally, we bring all the results together in order to compare them from a metrics and efficiency perspective.
All three notebooks below can be configured with a desired number M, which represents the number of Top Tag Combinations, and retrieve the respective subset of the dataset for experimental purposes.
Due to time and resource limitations, we knew from the beginning of this project that we had to retain a proper subset of the original dataset. After merging the two CSV files, we looked for insights on the Tags via plots and statistics, which led us to experiment with keeping only the top N tags. After some trial and error, we decided (for future ease) to keep the top M tag combinations instead. Note that we kept only the questions with a positive cumulative score, as we believe these questions usually provide valuable insights and solutions, and therefore higher-quality data.
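A minimal sketch of this subsetting (assuming the standard StackSample column names `Id`, `Score` and `Tag`; the exact notebook code may differ):

```python
import pandas as pd

M = 50  # number of Top Tag Combinations to keep

questions = pd.read_csv("Questions.csv", encoding="latin-1")
tags = pd.read_csv("Tags.csv", encoding="latin-1")

# Attach the list of tags to each question, keep only positively scored ones
tags_per_q = tags.groupby("Id")["Tag"].apply(list).to_frame("Tags")
df = questions.merge(tags_per_q, left_on="Id", right_index=True)
df = df[df["Score"] > 0]

# One key per tag combination, then keep the M most frequent combinations
combo = df["Tags"].apply(lambda t: "|".join(sorted(t)))
top_m = combo.value_counts().nlargest(M).index
subset = df[combo.isin(top_m)]
```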
The biggest time allocation of this project was by far EDA, and one of its subgoals, preprocessing. We put a strong focus on manual exploration of many examples in order to ensure the pipeline's stability. Words that are also tag names were treated specially, because they carry a strong weight for prediction. In addition, most of the tags are programming languages, packages, versions, systems and so on, so we were very cautious with punctuation (C#, C++, .NET). The preprocessing steps are listed below; a minimal code sketch follows the list.
- Lowercase
- Cleaning
- Removal of noise from tag words
- Removal of HTML tags
- Removal of IPv4 addresses
- Removal of URLs
- Removal of redundant symbols (spaces, newline etc.)
- Fix Contractions
- Tokenization (NLTK) on words that are not tags
- Punctuation Removal (except for '.-#+')
- Removal of Stopwords
- Lemmatization
- Removal of punctuation-only tokens
- Removal of digit-only tokens
- Joining
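A minimal code sketch of the pipeline above, assuming the NLTK data (`punkt`, `stopwords`, `wordnet`) is downloaded and using the `contractions` package for contraction expansion; the notebook's actual implementation may differ in details:

```python
import re
import string
import contractions
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
KEEP = set(".-#+")  # punctuation that matters for tags like c#, c++, .net

def preprocess(text: str, tag_vocab: set) -> str:
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                      # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)        # URLs
    text = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", " ", text)  # IPv4 addresses
    text = contractions.fix(text)                             # "don't" -> "do not"
    text = re.sub(r"\s+", " ", text).strip()                  # redundant whitespace

    tokens = []
    for word in text.split():
        if word in tag_vocab:                 # tag words (c#, c++, .net) kept as-is
            tokens.append(word)
            continue
        for tok in word_tokenize(word):       # NLTK tokenization for non-tag words
            tok = "".join(ch for ch in tok if ch not in set(string.punctuation) - KEEP)
            if not tok or tok in STOPWORDS:
                continue
            tok = LEMMATIZER.lemmatize(tok)
            if all(ch in string.punctuation for ch in tok) or tok.isdigit():
                continue                      # punctuation-only / digit-only tokens
            tokens.append(tok)
    return " ".join(tokens)                   # joining back into a single string
```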
After a single run of the EDA.ipynb notebook with the chosen M value, a preprocessed.csv is produced, which contains the data for the top M tag combinations, ready to be used afterwards in the model notebooks. In this way the code is more generic, and we can store different subsets of the original dataset as we prefer.
Baseline_Model.ipynb
notebook defines the M number in order to retrieve whichever preprocessed version of the dataset the user prefers.
Our problem is multilabel, so a question may have more than one Tag, and this is a tricky point we need to handle when splitting the original dataset properly. We implemented a function, used in both model notebooks, whose goal is to split the data into (train, test) or (train, val, test) with configurable sizes. Through this manual function we ensure that our final sets are balanced across all possible Tag Combinations.
Tip
To split the data properly and obtain well-distributed train, val and test sets, try converting the multilabel problem to a single-label one by computing all unique label combinations and stratifying on them. With this logic we can be fairly sure that the sets are balanced and contain a ratio-proportional number of examples per combination, giving the model sufficient exposure to all cases.
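A minimal sketch of this splitting logic (the function and column names are illustrative, and it assumes every tag combination occurs at least twice so that stratification is possible):

```python
from sklearn.model_selection import train_test_split

def split_multilabel(df, tag_col='Tags', test_size=0.2, val_size=0.1, seed=42):
    # Collapse each row's tag list into a single combination key, e.g. "c#|asp.net",
    # and stratify on it so every combination is proportionally represented.
    combo = df[tag_col].apply(lambda tags: '|'.join(sorted(tags)))
    train_df, test_df = train_test_split(
        df, test_size=test_size, stratify=combo, random_state=seed)
    if not val_size:
        return train_df, test_df
    combo_train = combo.loc[train_df.index]
    train_df, val_df = train_test_split(
        train_df, test_size=val_size / (1 - test_size),  # expressed as fraction of the whole dataset
        stratify=combo_train, random_state=seed)
    return train_df, val_df, test_df
```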
- TF-IDF (Term Frequency-Inverse Document Frequency) was used for text feature extraction with max_features=20000, also considering uni-, bi- and trigrams (a minimal sketch of the full setup follows this list).
- MultiLabelBinarizer was used to convert tags to binary vector representation.
- Multiple scikit-learn models were tried, but the most competitive performance came from Linear SVC.
- Hamming Loss, Micro-F1 and Macro-F1 were the metrics chosen for evaluation.
- KFold Cross-Validation for the best model, ensuring that its performance is consistent across different subsets of the data.
- Classification Report for each Tag.
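A minimal sketch of this baseline setup (the column names and the `OneVsRestClassifier` wrapper around Linear SVC are illustrative choices, not necessarily the notebook's exact code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import hamming_loss, f1_score, classification_report

# Text -> TF-IDF features (word uni/bi/trigrams), tags -> binary indicator matrix
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 3))
mlb = MultiLabelBinarizer()

X_train = vectorizer.fit_transform(train_df['Text'])
X_test = vectorizer.transform(test_df['Text'])
y_train = mlb.fit_transform(train_df['Tags'])
y_test = mlb.transform(test_df['Tags'])

# One binary Linear SVC per tag
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Hamming Loss:', hamming_loss(y_test, y_pred))
print('Micro-F1:', f1_score(y_test, y_pred, average='micro'))
print('Macro-F1:', f1_score(y_test, y_pred, average='macro'))
print(classification_report(y_test, y_pred, target_names=mlb.classes_))
```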
LLM_Model.ipynb
notebook, following a similar thought process, defines the M number in order to retrieve whichever preprocessed version of the dataset the user prefers.
A series of experiments took place in Google Colab, where all notebooks were run, using a GPU for BERT due to its heavy architecture. As Devlin et al. propose for fine-tuning tasks, we experimented on each data subset with batch sizes {16, 32} and a small number of epochs: 3 (GPU constraint).
Note
The following two code blocks produce equivalent models; we use the first one for manual experimentation.
```python
import torch
import transformers
from transformers import BertForSequenceClassification

# Custom wrapper: pretrained BERT encoder + dropout + linear classification head
class BERTModel(torch.nn.Module):
    def __init__(self, num_labels):
        super(BERTModel, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained('bert-base-uncased', return_dict=False)
        self.l2 = torch.nn.Dropout(0.1)
        self.l3 = torch.nn.Linear(768, num_labels)  # 768 = hidden size of bert-base

    def forward(self, ids, mask):
        _, output_1 = self.l1(ids, attention_mask=mask)  # pooled [CLS] representation
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output  # raw logits, one per tag

model = BERTModel(len(unique_tags))
```

```python
# Equivalent off-the-shelf HuggingFace model (BERT + dropout + classification head)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(unique_tags))
```
- BERT ('bert-base-uncased') was chosen as the architecture, and a PyTorch model was constructed from BERT with a trainable classification layer on top.
- MultiLabelBinarizer was used to convert tags to binary vector representation.
- PyTorch Dataset & DataLoader made feeding the data into the model smooth (see the sketch after this list).
- BCEWithLogitsLoss was used as the loss function; it is well suited for multilabel problems, as it calculates the loss for each label independently.
- PyTorch fine-tuning and the respective learning curve.
- Hamming Loss, Micro-F1 and Macro-F1 were the metrics chosen for evaluation.
- Classification Report for each Tag.
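A minimal sketch of the Dataset and of a single training step with BCEWithLogitsLoss, compatible with the custom `BERTModel` above; tokenizer settings, maximum length and the optimizer are assumptions rather than the notebook's exact values:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer

class TagDataset(Dataset):
    """Wraps preprocessed texts and binarized tag vectors for BERT."""
    def __init__(self, texts, labels, max_len=256):
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.texts, self.labels, self.max_len = texts, labels, max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(self.texts[idx], truncation=True, padding='max_length',
                             max_length=self.max_len, return_tensors='pt')
        return {'ids': enc['input_ids'].squeeze(0),
                'mask': enc['attention_mask'].squeeze(0),
                'labels': torch.tensor(self.labels[idx], dtype=torch.float)}

loss_fn = torch.nn.BCEWithLogitsLoss()  # independent binary cross-entropy per tag

def train_step(model, batch, optimizer, device):
    """One optimisation step on a single batch from the DataLoader."""
    optimizer.zero_grad()
    logits = model(batch['ids'].to(device), batch['mask'].to(device))
    loss = loss_fn(logits, batch['labels'].to(device))
    loss.backward()
    optimizer.step()
    return loss.item()

# usage: loader = DataLoader(TagDataset(train_texts, y_train), batch_size=16, shuffle=True)
```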
```
.
├── ...
├── notebooks
│   ├── Tag_Combinations_M          # M: Top Tag Combinations (different dataset subsets)
│   ├── EDA_M.ipynb
│   ├── Baseline_Model_M.ipynb
│   ├── LLM_Model_M_16.ipynb        # Batch Size: 16
│   ├── LLM_Model_M_32.ipynb        # Batch Size: 32
├── src                             # Python code of the notebooks
│   ├── EDA.py
│   ├── Baseline_Model.py
│   ├── LLM_Model.py
├── images                          # Images needed
│   ├── stackoverflow.png
├── README.md
├── ...
```
| Questions | Tag Combinations | Unique Tags | SVC Hamming Loss | SVC Micro-F1 | SVC Macro-F1 | BERT Hamming Loss | BERT Micro-F1 | BERT Macro-F1 | BERT Epoch GPU (Batch Size) |
|---|---|---|---|---|---|---|---|---|---|
| 33,374 | 20 | 16 | 0.03 | 0.83 | 0.81 | 0.02 | 0.86 | 0.85 | 12 (32) |
| 42,369 | 35 | 28 | 0.02 | 0.80 | 0.79 | 0.01 | 0.85 | 0.85 | 16 (16) |
| 48,505 | 50 | 38 | 0.01 | 0.79 | 0.77 | 0.01 | 0.84 | 0.81 | 18 (16) |
| 62,118 | 100 | 74 | 0.01 | 0.78 | 0.70 | 0.01 | 0.82 | 0.70 | 21 (16) |
| 70,474 | 150 | 104 | 0.01 | 0.76 | 0.68 | 0.01 | 0.80 | 0.66 | 26 (16) |
| 76,766 | 200 | 133 | RAM crash | – | – | – | – | – | – |
Unfortunately, during the experiments we faced RAM issues with the 200 Top Tag Combinations subset, so we had to limit ourselves to smaller subsets of the original dataset.
- The baseline model is a Linear Support Vector Machine and, besides its pretty decent performance in all of our experiments, it was also very time efficient, training in less than a minute. Looking at the broader picture, this traditional ML model achieves, in a matter of seconds, very satisfying results based on our metrics. For example, if we focus on the row with 38 Unique Tags, Linear SVC recorded a Micro-F1 of 0.79. A significant note is that the training time of Linear SVC is not directly affected as the number of examples increases, so from a computational perspective it is the absolute winner.
- For a more sophisticated model we opted for the HuggingFace 'bert-base-uncased' BERT model with a classification head added on top. Our choice is justified by the fact that we have a multilabel classification problem and BERT is very popular for text classification. In all experiments, BERT consistently outperforms Linear SVC in terms of Hamming Loss, Micro-F1 and Macro-F1, indicating better accuracy in predicting the labels. However, fine-tuning BERT required extensive GPU usage: the larger the dataset, the more minutes each epoch needed.
All in all, it can be concluded that BERT generally outperforms Linear SVC for this classification task. However, it's essential to consider factors such as computational resources and model complexity when choosing between the two approaches.
- With more time, even a few more epochs could lead to slightly better results.
- Allocate more time to preprocessing and to further exploration of the data.
- Extensive hyperparameter tuning (e.g. learning rate, optimizer, dropout).
- Conduct experiments with Sentence Transformers for embeddings.