TruthGuard is a Python NLP project that classifies COVID-19 news articles, separating evidence-based reporting from conspiracy theories. It’s built to help combat misinformation and promote reliable information during the pandemic.
source for text: Reuters
- Word2Vec model: for generating meaningful word embeddings
- scikit-learn: for training various machine learning models
- spaCy: for advanced text processing
- Pandas & Matplotlib: for data manipulation and visualization
- Chart.js: for visualizing prediction data
- Regular expressions: for cleaning and preparing the text data
- Beautiful Soup: for parsing the HTML of scraped web pages
- Newspaper3k: for extracting the full text of news articles
I started by using MediaBiasFactCheck to identify websites labeled as pro-science or conspiracy-themed. To gather data, I built a custom scraper with Beautiful Soup that pulled metadata from the latest COVID-19 articles on these sites.
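As a rough illustration of the metadata step, here is a minimal sketch that parses an inline HTML snippet with Beautiful Soup. The markup, the `KEYWORDS` tuple, and the `extract_article_metadata` helper are all hypothetical; each real site needs its own selectors, and the project's actual scraper may look quite different.

```python
from bs4 import BeautifulSoup

# Hypothetical listing-page markup; real sites will need their own selectors.
SAMPLE_HTML = """
<html><body>
  <article>
    <h2><a href="/news/covid-vaccine-trial">Vaccine trial reports results</a></h2>
    <time datetime="2021-03-01">March 1, 2021</time>
  </article>
  <article>
    <h2><a href="/news/local-sports">Local team wins final</a></h2>
    <time datetime="2021-03-02">March 2, 2021</time>
  </article>
</body></html>
"""

# Assumed keyword filter for spotting COVID-19 headlines.
KEYWORDS = ("covid", "coronavirus", "vaccine", "pandemic")


def extract_article_metadata(html, label):
    """Return title, URL, date, and site label for COVID-related headlines."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article"):
        link = article.find("a", href=True)
        if link is None:
            continue
        title = link.get_text(strip=True)
        # Skip headlines that mention none of the keywords.
        if not any(keyword in title.lower() for keyword in KEYWORDS):
            continue
        time_tag = article.find("time")
        rows.append({
            "title": title,
            "url": link["href"],
            "date": time_tag.get("datetime") if time_tag else None,
            "label": label,
        })
    return rows


metadata = extract_article_metadata(SAMPLE_HTML, label="pro-science")
```

Carrying the site-level label along with each row keeps the later classification step simple, since every article inherits the label of the site it came from.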
Using Newspaper3k, I retrieved the full text of relevant articles. The data was then cleaned with spaCy and regular expressions: dates, links, and stop words were removed, and lemmatization was applied to produce a dataset better suited to analysis.
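The regular-expression part of that cleaning can be sketched with the standard library alone. The patterns below and the tiny `STOP_WORDS` set are assumptions made for illustration. The project used spaCy for stop words and lemmatization, and neither is reproduced here.

```python
import re

# Illustrative patterns; the project's actual expressions are not shown in this README.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Numeric dates such as 03/01/2021 or 2021-03-01 (a small subset of date formats).
DATE_RE = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")

# Tiny stand-in list; spaCy's stop-word list would be used in practice.
STOP_WORDS = {"the", "a", "an", "on", "of", "and", "to", "in", "is", "at", "was"}


def clean_text(text):
    """Lowercase, strip links, dates, and punctuation, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = DATE_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = clean_text("On 03/01/2021 the study was published at https://example.com/study.")
# tokens is ["study", "published"]
```

Links are removed before punctuation so that a URL is dropped as a whole rather than being broken into stray word fragments.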
To capture the semantic meaning of each article, I applied the pre-trained Google News Word2Vec model (300-dimensional vectors) to generate embeddings from the cleaned article text.
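One common way to turn per-word vectors into a single article vector is to average them. This README does not say which pooling the project used, so treat the averaging below as an assumption. The `toy_vectors` dictionary is a 3-dimensional stand-in for the 300-dimensional Google News model, and `document_vector` is a hypothetical helper; it relies only on `token in vectors` and `vectors[token]`, which a loaded gensim `KeyedVectors` object also supports.

```python
import numpy as np


def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(found, axis=0)


# Toy 3-dimensional lookup standing in for the pre-trained model.
toy_vectors = {
    "vaccine": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "trial": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

# "unknownword" is out of vocabulary and is skipped.
doc_vec = document_vector(["vaccine", "trial", "unknownword"], toy_vectors, dim=3)
```

Returning a zero vector for articles with no known words keeps every row the same length, so the result can be stacked directly into a feature matrix.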
Finally, I split the dataset into training and test sets and trained several scikit-learn models: Logistic Regression, Support Vector Machine, Linear Discriminant Analysis, Naive Bayes, and Decision Tree Classifier. Each model was evaluated to identify the most effective approach for classifying articles.
TruthGuard shows how NLP techniques and machine learning can be combined to help separate reliable reporting from misinformation.
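The model comparison step described above can be sketched as follows. The data here is synthetic (generated with `make_classification`) rather than the real article embeddings. The split ratio, hyperparameters, and the choice of `GaussianNB` as the Naive Bayes variant are all assumptions, and accuracy is used only as a simple example metric.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the article embeddings (X) and site labels (y).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    # GaussianNB accepts the negative, real-valued features that embeddings contain.
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

best_model = max(scores, key=scores.get)
```

On an imbalanced dataset, precision, recall, or F1 from `sklearn.metrics` would likely be more informative than accuracy alone.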