This project demonstrates practical applications of natural language processing (NLP) on public datasets from Hugging Face. We explore three main NLP tasks at the sentence level: sentiment analysis, topic classification, and named entity recognition and classification (NERC). The analysis is performed using models and tools explored during the Text Mining for AI course at Vrije Universiteit Amsterdam.
Our goal was to evaluate different approaches - from rule-based systems to state-of-the-art machine learning models - and understand their strengths and limitations in real-world text mining applications.
📄 See the PDF in the repository for the full poster
- Compare rule-based vs. machine learning approaches for sentiment analysis
- Evaluate unsupervised topic modeling effectiveness on sentence-level classification
- Assess the performance of pretrained NER models on domain-specific text
- Understand the challenges of sentence-level text classification versus document-level analysis
We implemented and compared three approaches for sentiment polarity classification:
- Version 1: Original implementation using compound scores
  - Positive if compound > 0, Negative if < 0, Neutral if = 0
  - Result: Poor performance, with 0% recall for the neutral class
- Version 2: Optimized thresholds
  - Positive if compound ≥ 0.05, Negative if ≤ -0.05, Neutral otherwise
  - Result: Slight improvement, but accuracy remained limited
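The two threshold rules above can be sketched as plain functions over VADER's compound score (computing the score itself with `SentimentIntensityAnalyzer` is omitted to keep the sketch dependency-free):

```python
def label_v1(compound: float) -> str:
    """Version 1: sign of the compound score; neutral only at exactly 0."""
    if compound > 0:
        return "positive"
    if compound < 0:
        return "negative"
    return "neutral"


def label_v2(compound: float, t: float = 0.05) -> str:
    """Version 2: a +/-0.05 dead zone widens the neutral band."""
    if compound >= t:
        return "positive"
    if compound <= -t:
        return "negative"
    return "neutral"
```

The wider neutral band is why Version 2 recovers some neutral recall: a weakly positive score like 0.02 is positive under Version 1 but neutral under Version 2.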
- Pipeline: TF-IDF vectorization → Linear SVM classifier
- Training Data: 27,000 labeled tweets from HuggingFace
- Result: Significantly outperformed VADER approaches
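The TF-IDF → Linear SVM pipeline can be sketched with scikit-learn; the toy texts and the n-gram range below are illustrative stand-ins, not the project's actual 27,000-tweet training set or tuned parameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the labeled tweet corpus.
texts = [
    "I absolutely love this phone",
    "what a fantastic day",
    "this is the worst service ever",
    "I hate waiting in line",
    "the package arrived on tuesday",
    "the meeting is at noon",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # n-gram range is an assumption
    ("svm", LinearSVC()),
])
clf.fit(texts, labels)
pred = clf.predict(["I love this service"])[0]
```

Wrapping vectorizer and classifier in a `Pipeline` keeps the TF-IDF vocabulary tied to the training split, so the same object can be applied directly to unseen text.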
# Key Results
SVM Performance:
- Positive: Precision 0.67, Recall 0.67, F1 0.67
- Neutral: Precision 0.43, Recall 0.50, F1 0.46
- Negative: Precision 0.60, Recall 0.50, F1 0.55

Implemented unsupervised Latent Dirichlet Allocation (LDA) for topic discovery:
Dataset: BBC News All-Time dataset
Preprocessing:
- SpaCy for tokenization and lemmatization
- Removed stopwords, punctuation, numbers
- Filtered domain-specific noise (media sources, dates, locations)
- Excluded pronouns, determiners, prepositions, auxiliaries
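The filtering steps above can be sketched as a single pass over spaCy output; the sketch assumes each token arrives as a `(lemma, pos, is_stop)` triple, and the domain-noise filter (media sources, dates, locations) is omitted for brevity:

```python
# POS tags excluded per the preprocessing rules above (spaCy tag names).
EXCLUDED_POS = {"PRON", "DET", "ADP", "AUX", "PUNCT", "NUM"}


def clean_tokens(tokens):
    """Keep lowercase lemmas of content words; drop stopwords,
    punctuation, numbers, and excluded POS categories."""
    return [
        lemma.lower()
        for lemma, pos, is_stop in tokens
        if not is_stop and pos not in EXCLUDED_POS and lemma.isalpha()
    ]
```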
LDA Parameters:
- Number of topics: 10
- Passes: 10
- Random state: 42
- Corpus: Gensim dictionary with filtered extremes

- Accuracy: 15% (macro-averaged)
- Challenge: LDA discovered coherent topics but failed to align with gold standard labels
- Finding: Topics showed clear separation (sports, politics, multimedia) but document-level model struggled with sentence-level classification
Two approaches for entity extraction:
- Training: OntoNotes 5.0 English dataset
- Method: Token-level classification with IOB2 tagging
- Entities: PERSON, ORG, LOC, DATE, TIME
- Model: en_core_web_trf
- Mapping: 4-tag schema (PER, ORG, LOC, MISC)
- Evaluation: Exact match criteria
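The mapping and evaluation steps can be sketched as two small functions; the exact spaCy-label-to-4-tag mapping below is an assumption for illustration, and entities are represented as `(start, end, tag)` triples:

```python
def map_label(spacy_label: str) -> str:
    """Collapse spaCy entity labels onto the 4-tag schema (PER, ORG, LOC, MISC).
    The mapping shown is an assumed example, not the project's exact table."""
    return {"PERSON": "PER", "ORG": "ORG", "GPE": "LOC", "LOC": "LOC"}.get(
        spacy_label, "MISC"
    )


def exact_match_scores(gold, pred):
    """Precision/recall/F1 where a prediction counts only if its
    (start, end, tag) triple exactly matches a gold entity."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because matching is exact, a prediction that is off by a single token at a span boundary scores zero, which is one reason the reported F1 is so low.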
Results: Both approaches showed limited performance (F1 ≈ 0.04) due to:
- Domain mismatch between training and test data
- Differences in tokenization and span boundaries
- Entity type granularity misalignment
The supervised SVM model clearly outperformed the rule-based VADER baseline. The results demonstrate VADER's limitations on domain-specific text; the SVM was more consistent, though its absolute scores remained modest.
LDA successfully identified distinct topics but failed to align with the sentence-level classes. Its unsupervised nature and its optimization for document-length texts were major limitations. The pyLDAvis visualization showed that news articles tend to follow consistent linguistic patterns, with specialized content (tech, science) clearly separated.
While the pretrained model recognized relevant entities including multi-token names, differences in span boundaries and domain-specific vocabulary led to many mismatches. Results showed precision of 0.048, recall of 0.034, and F1-score of 0.040.
| Team Member | Contributions |
|---|---|
| Felice Faruolo | Topic Classification (preprocessing + LDA implementation), Dataset and analysis, Visualizations |
| Max Schuringa | VADER implementation & analysis, Dataset preparation, Methodology documentation |
| Sofia Vida | VADER + SVM implementation, SVM analysis, Layout and formatting |
| Alejandra Pampillon | NERC implementation & analysis, All NER components |
- OntoNotes 5.0 English Dataset - HuggingFace
- Tweet Sentiment Extraction - Kaggle SemEval-2018
- BBC News All Time Dataset - HuggingFace RealTimeData