Transforming Text Data Into Insights: Analyzing Current Research Trends

Unsupervised Machine Learning

General Info

Project Overview

Scientific research changes constantly, with new trends and topics of interest emerging on a regular basis. Identifying these trends is crucial for scientific institutions and enterprises that need to set their strategic direction. This project aims to understand current trends in science by analyzing scholarly publications with Natural Language Processing (NLP) techniques. It is based on the arXiv dataset, which can be downloaded at https://www.kaggle.com/datasets/Cornell-University/arxiv/data.
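As a minimal sketch of how the dataset might be read, the snippet below streams records one line at a time. It assumes the Kaggle download is a JSON Lines file (one JSON object per line) with fields such as `id`, `title`, `abstract`, `categories`, and `update_date`. The function name `iter_arxiv_records` and the file name in the usage comment are illustrative assumptions, not taken from this repository's code.

```python
import json


def iter_arxiv_records(lines, fields=("id", "title", "abstract", "categories", "update_date")):
    """Yield one dict per non-empty JSON line, keeping only the requested fields.

    `lines` can be any iterable of strings, for example an open file handle.
    The default field names are assumptions about the dataset layout.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Missing fields become None instead of raising KeyError.
        yield {field: record.get(field) for field in fields}


# Usage with a file (the path is an assumption):
# with open("arxiv-metadata-oai-snapshot.json", encoding="utf-8") as handle:
#     for paper in iter_arxiv_records(handle):
#         print(paper["title"])
```

Streaming line by line avoids loading the whole file into memory at once, which helps when the metadata file is large.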

Project Objective and Methods

This project's primary goal is to use textual data analysis to extract insightful information and spot trends in scientific papers. To do this, the project applies NLP algorithms to the arXiv dataset. The key steps are as follows:

Clustering Research Articles: Using a k-Means clustering model, the project groups research articles with similar content. It employs Principal Component Analysis (PCA) for dimensionality reduction.
Topic Modeling Analysis: Term Frequency-Inverse Document Frequency (TF-IDF) is used to identify overarching subjects within the clusters.
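
The steps above could be sketched roughly as follows with scikit-learn. This is an illustrative outline under stated assumptions, not the repository's actual code: the function name `cluster_abstracts`, the ordering (TF-IDF vectors, then PCA, then k-Means), the parameter values, and the toy texts are all assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_abstracts(texts, n_clusters=4, n_components=2, random_state=42):
    """Vectorize texts with TF-IDF, reduce them with PCA, and cluster with k-Means."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(texts)  # sparse matrix, one row per document

    # PCA is given a dense array here, which is fine for a small sample
    # but may need a different approach for the full dataset.
    pca = PCA(n_components=n_components, random_state=random_state)
    reduced = pca.fit_transform(tfidf.toarray())

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(reduced)
    return labels, reduced


# Toy input (hypothetical, not from the arXiv dataset).
texts = [
    "neural network training deep learning",
    "deep learning neural network optimization",
    "galaxy star formation telescope survey",
    "telescope survey galaxy cluster observation",
]
labels, reduced = cluster_abstracts(texts, n_clusters=2)
print(labels)
```

Reducing the TF-IDF vectors before clustering keeps k-Means working in a low-dimensional space, which also makes the clusters easy to plot.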

The code for this project was developed as part of a university project for B.Sc. Data Science, with a focus on Unsupervised Learning and Feature Engineering.

Results

The project successfully identified four clusters of research articles and extracted the top 20 keywords for each cluster, providing insights into overarching themes and trends within the scientific publications.

Key Skills Learned

Textual Data Analysis and NLP: The project involved analyzing and processing textual data from scientific publications, including techniques like TF-IDF and clustering.
Machine Learning: Implementing unsupervised learning techniques such as k-Means clustering and PCA.
Data Visualization: Creating visual representations of data using libraries like Matplotlib, Seaborn, and Yellowbrick.
Data Cleaning and Preprocessing: Preparing and cleaning data for analysis, including lemmatization.

Installation

Requirements:

Make sure you have Python 3.7+ installed on your computer. You can download the latest version of Python from https://www.python.org/downloads/.

Required packages:

matplotlib==3.6.2
pandas==1.5.2
requests==2.28.1
beautifulsoup4==4.11.1
nltk==3.7
numpy==1.23.5
spacy==3.5.3
scikit-learn==1.2.0
yellowbrick==1.5
seaborn==0.12.2
wordcloud==1.9.2

Kathrin-92/Unsupervised-ML-Trends-in-Science-DLBDSMLUSL01