maruf9921/arxiv-nlp-topic-modeling

arXiv NLP Topic Modeling

An end-to-end R pipeline that collects Computer Science research papers from the arXiv API, preprocesses their abstracts using NLP techniques, and discovers latent topics via Latent Dirichlet Allocation (LDA).


Overview

This project automates the full lifecycle of topic modeling on arXiv papers:

  1. Data Collection — Queries arXiv's API for Computer Science (cs.*) papers across a configurable date range, using chunked date windows and retry logic to ensure completeness.
  2. Text Preprocessing — Cleans raw abstracts through lowercasing, contraction expansion, punctuation removal, stopword filtering, and lemmatization.
  3. Topic Modeling — Trains a 10-topic LDA model (Gibbs sampling) on the preprocessed corpus and produces term distributions, document-topic assignments, and word cloud visualizations.

Pipeline Architecture

arXiv API
    │
    ▼
data_collection_api.r        ──►  arxiv_cs_complete(2024-2025).csv
    │
    ▼
text_data_preprocessing.r    ──►  token_text_arxiv_preprocessed.csv
    │
    ▼
modeling_topic_lda.r         ──►  lda_topic_terms.csv
                                  lda_document_topics_labeled.csv
                                  wordclouds_all_topics.pdf
                                  lda_model.rds / dtm.rds / corpus.rds

Project Structure

arxiv-nlp-topic-modeling/
├── data_collection_api.r          # Step 1: arXiv API data collection
├── text_data_preprocessing.r      # Step 2: NLP text cleaning & tokenization
├── modeling_topic_lda.r           # Step 3: LDA topic modeling & visualization
└── README.md

Discovered Topics

The LDA model identifies 10 latent topics from the Computer Science literature:

| # | Topic Label |
|---|---|
| 1 | Algorithms & Graph Theory |
| 2 | User Studies & HCI |
| 3 | NLP & Large Language Models |
| 4 | Network Systems & Performance |
| 5 | Machine Learning |
| 6 | Deep Learning & Neural Networks |
| 7 | Scientific Research & Applications |
| 8 | Control Systems & Robotics |
| 9 | Mathematical Optimization |
| 10 | Computer Vision & Multimedia |

Requirements

  • R >= 4.0
  • The following R packages:

| Stage | Packages |
|---|---|
| Data Collection | aRxiv, dplyr, lubridate, purrr, readr, stringr |
| Text Preprocessing | dplyr, stringr, textclean, textstem, tm, tokenizers |
| Topic Modeling | tm, topicmodels, tidytext, broom, dplyr, readr, ggplot2, scales, tidyr, stringr, slam, reshape2, wordcloud, RColorBrewer |

Installation

Install all required packages from an R console:

```r
install.packages(c(
  "aRxiv", "dplyr", "lubridate", "purrr", "readr", "stringr",
  "textclean", "textstem", "tm", "tokenizers",
  "topicmodels", "tidytext", "broom", "ggplot2", "scales",
  "tidyr", "slam", "reshape2", "wordcloud", "RColorBrewer"
))
```
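To avoid reinstalling packages that are already present, a check along these lines can be run first. This is a minimal sketch in base R; the names `required_pkgs` and `missing_pkgs` are illustrative and do not come from the repository's scripts.

```r
# Packages listed in the Requirements section
required_pkgs <- c(
  "aRxiv", "dplyr", "lubridate", "purrr", "readr", "stringr",
  "textclean", "textstem", "tm", "tokenizers",
  "topicmodels", "tidytext", "broom", "ggplot2", "scales",
  "tidyr", "slam", "reshape2", "wordcloud", "RColorBrewer"
)

# Keep only the packages that are not installed yet
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install only what is missing
} else {
  message("All required packages are installed.")
}
```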

Usage

Run each script in order:

Step 1 — Collect Data

```r
source("data_collection_api.r")
```

Collects all cs.* arXiv papers from 2025-01-01 to the current date in 2-month chunks. Saves incremental backups and writes the final dataset to arxiv_cs_complete(2024-2025).csv.

Key parameters inside collect_full_arxiv():

| Parameter | Default | Description |
|---|---|---|
| `start_date` | `"2025-01-01"` | Start of the collection window |
| `end_date` | `Sys.Date()` | End of the collection window |
| `output_file` | `"arxiv_cs_complete(2024-2025).csv"` | Output CSV filename |
| `max_retries` | `5` | Max retry attempts per date chunk |
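The chunking and retry behavior described for this step can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in `data_collection_api.r`: the helpers `make_date_chunks()` and `with_retries()`, the `fetch_fun` argument, and the `wait_seconds` parameter are all hypothetical names introduced for illustration, and the actual arXiv query is not reproduced.

```r
# Split a date range into consecutive windows of `months` months.
# Hypothetical helper; assumes start_date falls on the first of a month.
make_date_chunks <- function(start_date, end_date, months = 2) {
  start_date <- as.Date(start_date)
  end_date <- as.Date(end_date)
  starts <- seq(start_date, end_date, by = paste(months, "months"))
  # Each window ends the day before the next one starts; the last ends at end_date
  ends <- c(starts[-1] - 1, end_date)
  data.frame(start = starts, end = ends)
}

# Call fetch_fun(start, end) up to max_retries times, pausing between attempts.
# Hypothetical helper; fetch_fun stands in for whatever queries the API.
with_retries <- function(fetch_fun, start, end, max_retries = 5, wait_seconds = 1) {
  for (attempt in seq_len(max_retries)) {
    result <- tryCatch(fetch_fun(start, end), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (attempt < max_retries) Sys.sleep(wait_seconds)
  }
  stop("All ", max_retries, " attempts failed for ",
       format(start), " to ", format(end))
}

chunks <- make_date_chunks("2025-01-01", "2025-06-30")
print(chunks)
```

Each row of `chunks` could then be passed to `with_retries()`, so that a transient failure in one window does not abort the whole collection run.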

Step 2 — Preprocess Text

```r
source("text_data_preprocessing.r")
```

Reads the collected CSV, cleans abstracts, tokenizes, removes stopwords, and lemmatizes. Outputs the processed corpus to token_text_arxiv_preprocessed.csv.
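As a rough illustration of these cleaning steps, the base R sketch below applies them to a single sentence. It is not the repository's implementation: `clean_abstract()` and `toy_stopwords` are hypothetical, the stopword list is deliberately tiny, only one contraction is expanded by hand, and lemmatization is left out (the script presumably uses textclean and textstem for those parts, given the package list).

```r
# Tiny illustrative stopword list; a full list such as tm::stopwords("en")
# would be used in practice (assumption).
toy_stopwords <- c("a", "an", "the", "we", "of", "for", "and", "is", "do", "not", "on")

clean_abstract <- function(text, stopwords = toy_stopwords) {
  text <- tolower(text)
  # Expand one contraction by hand; textclean::replace_contraction() covers many more
  text <- gsub("don't", "do not", text, fixed = TRUE)
  # Replace punctuation and digits with spaces, then split on whitespace
  text <- gsub("[^a-z ]", " ", text)
  tokens <- strsplit(trimws(text), " +")[[1]]
  tokens <- tokens[!tokens %in% stopwords]
  # Lemmatization is omitted; textstem::lemmatize_words(tokens) would go here
  tokens
}

clean_abstract("We propose a Transformer-based model; we don't use RNNs.")
# "propose" "transformer" "based" "model" "use" "rnns"
```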

Step 3 — Run Topic Modeling

```r
source("modeling_topic_lda.r")
```

Builds a Document-Term Matrix, trains LDA with Gibbs sampling (k=10, burnin=1000, iter=2000), and generates all outputs.
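The sketch below shows what such a training call typically looks like with the tm and topicmodels packages. It is an illustration on a toy corpus, not the contents of `modeling_topic_lda.r`: the documents, the seed, and the reduced `k`, `burnin`, and `iter` values are assumptions chosen so the example finishes quickly.

```r
# Sketch only: runs when tm and topicmodels are installed.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("topicmodels", quietly = TRUE)) {
  library(tm)
  library(topicmodels)

  # Toy documents standing in for preprocessed abstracts
  docs <- c(
    "graph algorithm complexity bound graph",
    "neural network training image network",
    "algorithm bound proof graph complexity",
    "image training neural deep network"
  )

  corpus <- VCorpus(VectorSource(docs))
  dtm <- DocumentTermMatrix(corpus)

  # k = 2 for the toy corpus; the README's settings are
  # k = 10, burnin = 1000, iter = 2000
  lda_model <- LDA(
    dtm, k = 2, method = "Gibbs",
    control = list(seed = 42, burnin = 100, iter = 200)
  )

  # Top terms per topic and per-document topic probabilities (gamma)
  print(terms(lda_model, 3))
  gamma <- posterior(lda_model)$topics
  print(round(gamma, 2))
}
```

Each row of `gamma` sums to 1, giving one probability per topic for every document.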


Outputs

| File | Description |
|---|---|
| `arxiv_cs_complete(2024-2025).csv` | Raw collected arXiv paper metadata |
| `token_text_arxiv_preprocessed.csv` | Cleaned & tokenized abstracts |
| `lda_topic_terms.csv` | Top 10 terms per topic with beta scores |
| `lda_document_topics_labeled.csv` | Per-document topic probability (gamma) |
| `wordclouds_all_topics.pdf` | Word cloud for the full corpus |
| `topic_wordclouds_all_terms.pdf` | Individual word clouds per topic |
| `lda_model.rds` | Serialized LDA model object |
| `dtm.rds` | Serialized Document-Term Matrix |
| `corpus.rds` | Serialized VCorpus object |
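One common use of the gamma output is to assign each document its most probable topic. The base R sketch below does this on a toy table. The column layout of `lda_document_topics_labeled.csv` is not documented here, so the long `document`, `topic`, `gamma` format is an assumption, and `topic_labels`, `doc_topics`, and `dominant` are illustrative names.

```r
# Topic labels from the Discovered Topics table (first three shown for brevity)
topic_labels <- c(
  "1" = "Algorithms & Graph Theory",
  "2" = "User Studies & HCI",
  "3" = "NLP & Large Language Models"
)

# Toy stand-in for the gamma table; the long layout is an assumption
doc_topics <- data.frame(
  document = c("d1", "d1", "d1", "d2", "d2", "d2"),
  topic    = c(1, 2, 3, 1, 2, 3),
  gamma    = c(0.7, 0.2, 0.1, 0.1, 0.3, 0.6)
)

# For each document, keep the row with the highest gamma
by_doc <- split(doc_topics, doc_topics$document)
dominant <- do.call(rbind, lapply(by_doc, function(d) d[which.max(d$gamma), ]))
dominant$label <- unname(topic_labels[as.character(dominant$topic)])
print(dominant)
```

The serialized objects can be restored with `readRDS()`, for example `readRDS("lda_model.rds")`, so the model does not have to be retrained to inspect it.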
