This project evaluates different NLP approaches (rule-based, unsupervised, and supervised machine learning) across three core text mining tasks: sentiment analysis using VADER and SVM, topic classification using LDA, and named entity recognition using BERT and spaCy.


felixfaruix/Multi-Task-NLP-Evaluation


Text Mining for AI - NLP Project

Project Description

This project demonstrates practical applications of natural language processing (NLP) on public datasets from Hugging Face. We explore three main NLP tasks at the sentence level: sentiment analysis, topic classification, and named entity recognition and classification (NERC). The analysis uses models and tools explored during the Text Mining for AI course at Vrije Universiteit Amsterdam.

Our goal was to evaluate approaches ranging from rule-based systems to state-of-the-art machine learning models, and to understand their strengths and limitations in real-world text mining applications.

Project Report

📄 See the PDF in the repository for the full poster.


Research Objectives

  1. Compare rule-based vs. machine learning approaches for sentiment analysis
  2. Evaluate unsupervised topic modeling effectiveness on sentence-level classification
  3. Assess the performance of pretrained NER models on domain-specific text
  4. Understand the challenges of sentence-level text classification versus document-level analysis

Methodology & Implementation

1. Sentiment Analysis

We implemented and compared three approaches for sentiment polarity classification:

VADER (Rule-based)

  • Version 1: Original implementation using compound scores

    • Positive if compound > 0, Negative if < 0, Neutral if = 0
    • Result: Poor performance, with 0% recall for the neutral class
  • Version 2: Optimized thresholds

    • Positive if compound ≥ 0.05, Negative if ≤ -0.05, Neutral otherwise
    • Result: Slight improvement but still limited accuracy
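The two threshold schemes can be sketched as plain functions over VADER's compound score (the score itself would come from vaderSentiment's `SentimentIntensityAnalyzer`, not shown here):

```python
# Version 1: the sign of the compound score decides the label.
def vader_label_v1(compound: float) -> str:
    if compound > 0:
        return "positive"
    if compound < 0:
        return "negative"
    return "neutral"

# Version 2: a dead zone around zero absorbs weakly polar scores.
def vader_label_v2(compound: float, threshold: float = 0.05) -> str:
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

A borderline score such as 0.02 is "positive" under Version 1 but "neutral" under Version 2, which is exactly what the widened thresholds were meant to achieve.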

SVM (Supervised ML)

  • Pipeline: TF-IDF vectorization → Linear SVM classifier
  • Training Data: 27,000 labeled tweets from HuggingFace
  • Result: Significantly outperformed VADER approaches
```
# Key Results
SVM Performance:
- Positive: Precision 0.67, Recall 0.67, F1 0.67
- Neutral:  Precision 0.43, Recall 0.50, F1 0.46
- Negative: Precision 0.60, Recall 0.50, F1 0.55
```
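A minimal sketch of the TF-IDF → Linear SVM pipeline with scikit-learn; the toy corpus below is a stand-in for the ~27,000 labeled tweets (the actual training data and any tuned hyperparameters are not shown here):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for the labeled tweet dataset.
texts = [
    "I love this, absolutely great",
    "this is fine I guess",
    "terrible, I hate it so much",
    "what a wonderful day",
    "it exists, nothing special",
    "worst experience ever, awful",
]
labels = ["positive", "neutral", "negative",
          "positive", "neutral", "negative"]

# TF-IDF vectorization feeding a linear SVM classifier.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

preds = clf.predict(["awful and terrible", "great and wonderful"])
```

The pipeline object keeps vectorizer and classifier coupled, so the same TF-IDF vocabulary fitted on the training tweets is reused at prediction time.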

2. Topic Classification

Implemented unsupervised Latent Dirichlet Allocation (LDA) for topic discovery:

Data Processing

Dataset: BBC News All-Time dataset

Preprocessing:

  • SpaCy for tokenization and lemmatization
  • Removed stopwords, punctuation, numbers
  • Filtered domain-specific noise (media sources, dates, locations)
  • Excluded pronouns, determiners, prepositions, auxiliaries

Model Configuration

```
LDA Parameters:
- Number of topics: 10
- Passes: 10
- Random state: 42
- Corpus: Gensim dictionary with filtered extremes
```

Results

  • Accuracy: 15% (macro-averaged)
  • Challenge: LDA discovered coherent topics but failed to align with gold standard labels
  • Finding: Topics showed clear separation (sports, politics, multimedia), but a model optimized for document-length input struggled with sentence-level classification

3. Named Entity Recognition

Two approaches for entity extraction:

Approach 1: Fine-tuned BERT

  • Training: OntoNotes 5.0 English dataset
  • Method: Token-level classification with IOB2 tagging
  • Entities: PERSON, ORG, LOC, DATE, TIME
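Token-level classification assigns each token an IOB2 tag: B- opens an entity, I- continues it, O marks tokens outside any entity. A minimal illustration of the scheme (the span-to-tag helper is hypothetical, not the project's training code):

```python
def spans_to_iob2(n_tokens: int, spans: list[tuple[int, int, str]]) -> list[str]:
    """Convert entity spans (start, end, label), with end-exclusive
    token indices, into a flat list of IOB2 tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# "Barack Obama visited Amsterdam": PERSON over tokens 0-1, LOC at 3.
tags = spans_to_iob2(4, [(0, 2, "PERSON"), (3, 4, "LOC")])
# tags == ["B-PERSON", "I-PERSON", "O", "B-LOC"]
```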

Approach 2: Pretrained spaCy

  • Model: en_core_web_trf
  • Mapping: 4-tag schema (PER, ORG, LOC, MISC)
  • Evaluation: Exact match criteria
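Collapsing spaCy's OntoNotes-style labels onto the 4-tag schema requires a mapping table; the entries below are plausible assumptions for illustration, not the project's exact mapping:

```python
# Hypothetical mapping from OntoNotes-style labels (as emitted by
# en_core_web_trf) to the 4-tag schema; unlisted labels fall back
# to MISC.
TO_4TAG = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
}

def map_label(label: str) -> str:
    """Project a fine-grained entity label onto PER/ORG/LOC/MISC."""
    return TO_4TAG.get(label, "MISC")
```

Granularity decisions like folding GPE into LOC are one source of the type-misalignment noted in the results.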

Results: Both approaches showed limited performance (F1 ≈ 0.04) due to:

  • Domain mismatch between training and test data
  • Differences in tokenization and span boundaries
  • Entity type granularity misalignment
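Under the exact-match criterion, a one-token boundary difference turns a near-hit into both a false positive and a false negative, which helps explain the very low scores. A minimal sketch (hypothetical helper; entities as (start, end, label) triples):

```python
def exact_match_prf(gold: set, pred: set) -> tuple[float, float, float]:
    """Precision/recall/F1 where a predicted entity counts as correct
    only if its (start, end, label) triple matches gold exactly."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 2, "PER"), (3, 4, "LOC")}
pred = {(0, 2, "PER"), (3, 5, "LOC")}   # off-by-one span boundary
# -> precision 0.5, recall 0.5, F1 0.5
```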

Key Findings & Analysis

Sentiment Analysis Insights

The supervised SVM model clearly outperformed the rule-based VADER approach. The results demonstrate VADER's limitations on domain-specific text, with the SVM showing more consistent performance despite not achieving outstanding results.

Topic Classification Analysis

LDA successfully identified distinct topics but failed to align with sentence-level classes. The unsupervised nature and document-length optimization were major limitations. The pyLDAvis visualization showed that news articles tend to follow consistent linguistic patterns, with specialized content (tech, science) clearly separated.

NERC Performance

While the pretrained model recognized relevant entities including multi-token names, differences in span boundaries and domain-specific vocabulary led to many mismatches. Results showed precision of 0.048, recall of 0.034, and F1-score of 0.040.

Contributors & Work Division

| Team Member | Contributions |
| --- | --- |
| Felice Faruolo | Topic classification (preprocessing + LDA implementation), dataset and analysis, visualizations |
| Max Schuringa | VADER implementation & analysis, dataset preparation, methodology documentation |
| Sofia Vida | VADER + SVM implementation, SVM analysis, layout and formatting |
| Alejandra Pampillon | NERC implementation & analysis, all NER components |

References

  1. OntoNotes 5.0 English Dataset - HuggingFace
  2. Tweet Sentiment Extraction - Kaggle SemEval-2018
  3. BBC News All Time Dataset - HuggingFace RealTimeData
