This project demonstrates practical applications of natural language processing (NLP) on public datasets from Hugging Face. We explore three main NLP tasks at the sentence level: sentiment analysis, topic classification, and named entity recognition and classification (NERC). The analysis is performed using models and tools explored during the Text Mining for AI course at Vrije Universiteit Amsterdam.
Our goal was to evaluate different approaches - from rule-based systems to state-of-the-art machine learning models - and understand their strengths and limitations in real-world text mining applications.
📄 See the PDF in the repository for the full poster
- Compare rule-based vs. machine learning approaches for sentiment analysis
- Evaluate unsupervised topic modeling effectiveness on sentence-level classification
- Assess the performance of pretrained NER models on domain-specific text
- Understand the challenges of sentence-level text classification versus document-level analysis
We implemented and compared three approaches for sentiment polarity classification:
- Version 1: Original implementation using compound scores
  - Positive if compound > 0, Negative if < 0, Neutral if = 0
  - Result: Poor performance, with 0% recall for the neutral class
- Version 2: Optimized thresholds
  - Positive if compound ≥ 0.05, Negative if ≤ -0.05, Neutral otherwise
  - Result: Slight improvement, but accuracy remained limited
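The two threshold rules above can be sketched as plain functions over VADER's compound score (computing the score itself with `SentimentIntensityAnalyzer` is omitted to keep the sketch dependency-free):

```python
def label_v1(compound: float) -> str:
    """Version 1: sign of the compound score; neutral only at exactly 0."""
    if compound > 0:
        return "positive"
    if compound < 0:
        return "negative"
    return "neutral"


def label_v2(compound: float, t: float = 0.05) -> str:
    """Version 2: a +/-0.05 dead zone widens the neutral band."""
    if compound >= t:
        return "positive"
    if compound <= -t:
        return "negative"
    return "neutral"
```

The wider neutral band is why Version 2 recovers some neutral recall: a weakly positive score like 0.02 is positive under Version 1 but neutral under Version 2.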
- Pipeline: TF-IDF vectorization → Linear SVM classifier
- Training Data: 27,000 labeled tweets from HuggingFace
- Result: Significantly outperformed VADER approaches
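The TF-IDF → Linear SVM pipeline can be sketched with scikit-learn; the toy texts and the n-gram range below are illustrative stand-ins, not the project's actual 27,000-tweet training set or tuned parameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the labeled tweet corpus.
texts = [
    "I absolutely love this phone",
    "what a fantastic day",
    "this is the worst service ever",
    "I hate waiting in line",
    "the package arrived on tuesday",
    "the meeting is at noon",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # n-gram range is an assumption
    ("svm", LinearSVC()),
])
clf.fit(texts, labels)
pred = clf.predict(["I love this service"])[0]
```

Wrapping vectorizer and classifier in a `Pipeline` keeps the TF-IDF vocabulary tied to the training split, so the same object can be applied directly to unseen text.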
# Key Results
SVM Performance:
- Positive: Precision 0.67, Recall 0.67, F1 0.67
- Neutral: Precision 0.43, Recall 0.50, F1 0.46
- Negative: Precision 0.60, Recall 0.50, F1 0.55

Implemented unsupervised Latent Dirichlet Allocation (LDA) for topic discovery:
Dataset: BBC News All-Time dataset
Preprocessing:
- SpaCy for tokenization and lemmatization
- Removed stopwords, punctuation, numbers
- Filtered domain-specific noise (media sources, dates, locations)
- Excluded pronouns, determiners, prepositions, auxiliaries
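The filtering steps above can be sketched as a single pass over spaCy output; the sketch assumes each token arrives as a `(lemma, pos, is_stop)` triple, and the domain-noise filter (media sources, dates, locations) is omitted for brevity:

```python
# POS tags excluded per the preprocessing rules above (spaCy tag names).
EXCLUDED_POS = {"PRON", "DET", "ADP", "AUX", "PUNCT", "NUM"}


def clean_tokens(tokens):
    """Keep lowercase lemmas of content words; drop stopwords,
    punctuation, numbers, and excluded POS categories."""
    return [
        lemma.lower()
        for lemma, pos, is_stop in tokens
        if not is_stop and pos not in EXCLUDED_POS and lemma.isalpha()
    ]
```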
LDA Parameters:
- Number of topics: 10
- Passes: 10
- Random state: 42
- Corpus: Gensim dictionary with filtered extremes

- Accuracy: 15% (macro-averaged)
- Challenge: LDA discovered coherent topics but failed to align with gold standard labels
- Finding: Topics showed clear separation (sports, politics, multimedia) but document-level model struggled with sentence-level classification
Two approaches for entity extraction:
- Training: OntoNotes 5.0 English dataset
- Method: Token-level classification with IOB2 tagging
- Entities: PERSON, ORG, LOC, DATE, TIME
- Model: en_core_web_trf
- Mapping: 4-tag schema (PER, ORG, LOC, MISC)
- Evaluation: Exact match criteria
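The mapping and evaluation steps can be sketched as two small functions; the exact spaCy-label-to-4-tag mapping below is an assumption for illustration, and entities are represented as `(start, end, tag)` triples:

```python
def map_label(spacy_label: str) -> str:
    """Collapse spaCy entity labels onto the 4-tag schema (PER, ORG, LOC, MISC).
    The mapping shown is an assumed example, not the project's exact table."""
    return {"PERSON": "PER", "ORG": "ORG", "GPE": "LOC", "LOC": "LOC"}.get(
        spacy_label, "MISC"
    )


def exact_match_scores(gold, pred):
    """Precision/recall/F1 where a prediction counts only if its
    (start, end, tag) triple exactly matches a gold entity."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because matching is exact, a prediction that is off by a single token at a span boundary scores zero, which is one reason the reported F1 is so low.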
Results: Both approaches showed limited performance (F1 ≈ 0.04) due to:
- Domain mismatch between training and test data
- Differences in tokenization and span boundaries
- Entity type granularity misalignment
The supervised SVM model clearly outperformed the rule-based VADER baseline. The results demonstrate VADER's limitations on domain-specific text; the SVM was more consistent, though its absolute scores remained modest.
LDA successfully identified distinct topics but failed to align with the sentence-level classes. Its unsupervised nature and its optimization for document-length texts were major limitations. The pyLDAvis visualization showed that news articles tend to follow consistent linguistic patterns, with specialized content (tech, science) clearly separated.
While the pretrained model recognized relevant entities including multi-token names, differences in span boundaries and domain-specific vocabulary led to many mismatches. Results showed precision of 0.048, recall of 0.034, and F1-score of 0.040.
| Team Member | Contributions |
|---|---|
| Felice Faruolo | Topic Classification (preprocessing + LDA implementation), Dataset and analysis, Visualizations |
| Max Schuringa | VADER implementation & analysis, Dataset preparation, Methodology documentation |
| Sofia Vida | VADER + SVM implementation, SVM analysis, Layout and formatting |
| Alejandra Pampillon | NERC implementation & analysis, All NER components |
- OntoNotes 5.0 English Dataset - HuggingFace
- Tweet Sentiment Extraction - Kaggle SemEval-2018
- BBC News All Time Dataset - HuggingFace RealTimeData