arXiv NLP Topic Modeling

An end-to-end R pipeline that collects Computer Science research papers from the arXiv API, preprocesses their abstracts using NLP techniques, and discovers latent topics via Latent Dirichlet Allocation (LDA).

Overview

This project automates the full lifecycle of topic modeling on arXiv papers:

1. Data Collection: queries arXiv's API for Computer Science (cs.*) papers across a configurable date range, using chunked date windows and per-chunk retries to reduce the chance of missing papers.
2. Text Preprocessing: cleans raw abstracts through lowercasing, contraction expansion, punctuation removal, stopword filtering, and lemmatization.
3. Topic Modeling: trains a 10-topic LDA model (Gibbs sampling) on the preprocessed corpus and produces term distributions, document-topic assignments, and word cloud visualizations.

Pipeline Architecture

arXiv API
    │
    ▼
data_collection_api.r        ──►  arxiv_cs_complete(2024-2025).csv
    │
    ▼
text_data_preprocessing.r    ──►  token_text_arxiv_preprocessed.csv
    │
    ▼
modeling_topic_lda.r         ──►  lda_topic_terms.csv
                                  lda_document_topics_labeled.csv
                                  wordclouds_all_topics.pdf
                                  lda_model.rds / dtm.rds / corpus.rds

Project Structure

arxiv-nlp-topic-modeling/
├── data_collection_api.r          # Step 1: arXiv API data collection
├── text_data_preprocessing.r      # Step 2: NLP text cleaning & tokenization
├── modeling_topic_lda.r           # Step 3: LDA topic modeling & visualization
└── README.md

Discovered Topics

The LDA model identifies 10 latent topics from the Computer Science literature:

| # | Topic Label |
|---|-------------|
| 1 | Algorithms & Graph Theory |
| 2 | User Studies & HCI |
| 3 | NLP & Large Language Models |
| 4 | Network Systems & Performance |
| 5 | Machine Learning |
| 6 | Deep Learning & Neural Networks |
| 7 | Scientific Research & Applications |
| 8 | Control Systems & Robotics |
| 9 | Mathematical Optimization |
| 10 | Computer Vision & Multimedia |

Requirements

R >= 4.0
The following R packages:

| Stage | Packages |
|-------|----------|
| Data Collection | `aRxiv`, `dplyr`, `lubridate`, `purrr`, `readr`, `stringr` |
| Text Preprocessing | `dplyr`, `stringr`, `textclean`, `textstem`, `tm`, `tokenizers` |
| Topic Modeling | `tm`, `topicmodels`, `tidytext`, `broom`, `dplyr`, `readr`, `ggplot2`, `scales`, `tidyr`, `stringr`, `slam`, `reshape2`, `wordcloud`, `RColorBrewer` |

Installation

Install all required packages from an R console:

install.packages(c(
  "aRxiv", "dplyr", "lubridate", "purrr", "readr", "stringr",
  "textclean", "textstem", "tm", "tokenizers",
  "topicmodels", "tidytext", "broom", "ggplot2", "scales",
  "tidyr", "slam", "reshape2", "wordcloud", "RColorBrewer"
))
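To see which of these packages are still missing before installing anything, a check along the following lines may help. It is a minimal sketch: `missing_packages` is a hypothetical helper, not part of the repository, and the install call is left commented out so that running the snippet only reports.

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Package list copied from the Requirements section.
required <- c(
  "aRxiv", "dplyr", "lubridate", "purrr", "readr", "stringr",
  "textclean", "textstem", "tm", "tokenizers",
  "topicmodels", "tidytext", "broom", "ggplot2", "scales",
  "tidyr", "slam", "reshape2", "wordcloud", "RColorBrewer"
)

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only the missing ones
} else {
  message("All required packages are installed.")
}
```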

Usage

Run each script in order:

Step 1 — Collect Data

source("data_collection_api.r")

Collects all cs.* arXiv papers from 2025-01-01 to the current date in 2-month chunks. Saves incremental backups and writes the final dataset to arxiv_cs_complete(2024-2025).csv.

Key parameters inside collect_full_arxiv():

| Parameter | Default | Description |
|-----------|---------|-------------|
| `start_date` | `"2025-01-01"` | Start of the collection window |
| `end_date` | `Sys.Date()` | End of the collection window |
| `output_file` | `"arxiv_cs_complete(2024-2025).csv"` | Output CSV filename |
| `max_retries` | `5` | Max retry attempts per date chunk |
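The chunking and retry behaviour described above can be illustrated in base R. This is a sketch under stated assumptions, not the code in `data_collection_api.r`: `make_date_chunks`, `with_retry`, and `flaky_fetch` are hypothetical names, and the simulated fetch stands in for a real per-chunk arXiv query (for example one built on `aRxiv::arxiv_search`).

```r
# Split [start_date, end_date] into consecutive windows of `months` months.
# Assumes start_date falls on a day that exists in every month (e.g. the 1st).
make_date_chunks <- function(start_date, end_date, months = 2) {
  start_date <- as.Date(start_date)
  end_date <- as.Date(end_date)
  starts <- seq(start_date, end_date, by = paste(months, "months"))
  ends <- c(starts[-1] - 1, end_date)  # each window ends the day before the next starts
  data.frame(start = starts, end = ends)
}

# Call `fetch()` up to `max_retries` times, waiting longer after each failure.
with_retry <- function(fetch, max_retries = 5, wait_seconds = 1) {
  for (attempt in seq_len(max_retries)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_retries, conditionMessage(result)))
    if (attempt < max_retries) Sys.sleep(wait_seconds * attempt)
  }
  stop("All ", max_retries, " attempts failed")
}

chunks <- make_date_chunks("2025-01-01", "2025-06-15")
print(chunks)

# Simulated fetch that fails twice, then succeeds (stand-in for an API call).
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated network error")
  "ok"
}
result <- with_retry(flaky_fetch, max_retries = 5, wait_seconds = 0)
```

Keeping each window short limits how much work is lost when one request fails, since only that chunk needs to be retried.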

Step 2 — Preprocess Text

source("text_data_preprocessing.r")

Reads the collected CSV, cleans abstracts, tokenizes, removes stopwords, and lemmatizes. Outputs the processed corpus to token_text_arxiv_preprocessed.csv.
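The order of the cleaning steps matters: contractions have to be expanded before punctuation is stripped, otherwise the apostrophes are already gone. The sketch below shows that order in base R only. It is illustrative rather than the script's actual code: `preprocess_abstract`, `toy_stopwords`, and `toy_contractions` are hypothetical, the word lists are deliberately tiny, and lemmatization is left out (the project lists `textstem` for that step).

```r
# Tiny illustrative lists; assumptions, not the lists used by the script.
toy_stopwords <- c("a", "an", "the", "of", "for", "and", "we", "do", "not",
                   "is", "are", "to", "in", "on")
toy_contractions <- c("don't" = "do not", "it's" = "it is", "we're" = "we are")

preprocess_abstract <- function(text) {
  text <- tolower(text)
  # Expand contractions while the apostrophes are still present.
  for (pattern in names(toy_contractions)) {
    text <- gsub(pattern, toy_contractions[[pattern]], text, fixed = TRUE)
  }
  # Replace punctuation with spaces, then split on whitespace.
  text <- gsub("[[:punct:]]+", " ", text)
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  # Drop stopwords and any empty strings.
  tokens[!tokens %in% toy_stopwords & nzchar(tokens)]
}

tokens <- preprocess_abstract("We don't need labels: the model learns topics.")
print(tokens)
```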

Step 3 — Run Topic Modeling

source("modeling_topic_lda.r")

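Builds a Document-Term Matrix, trains LDA with Gibbs sampling (k=10, burnin=1000, iter=2000), and writes the topic-term table, the document-topic table, the word cloud PDFs, and the serialized model objects listed under Outputs.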
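As a rough illustration of this step, the following sketch builds a Document-Term Matrix with `tm` and fits a Gibbs-sampled LDA with `topicmodels`. The wrapper `fit_lda`, the toy documents, and the `seed` value are assumptions for demonstration; the defaults mirror the parameters stated above, while the toy run uses two topics and shorter chains so it finishes quickly.

```r
# Illustrative wrapper; defaults follow the documented k, burnin, and iter.
fit_lda <- function(dtm, k = 10, burnin = 1000, iter = 2000, seed = 42) {
  topicmodels::LDA(
    dtm,
    k = k,
    method = "Gibbs",
    control = list(burnin = burnin, iter = iter, seed = seed)
  )
}

# Run the toy example only when the packages from Requirements are installed.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("topicmodels", quietly = TRUE)) {
  docs <- c(
    "graph algorithm shortest path graph",
    "neural network training gradient network",
    "graph path algorithm complexity",
    "gradient descent neural training"
  )
  corpus <- tm::VCorpus(tm::VectorSource(docs))
  dtm <- tm::DocumentTermMatrix(corpus)

  # Two topics and short chains keep the toy run fast.
  toy_model <- fit_lda(dtm, k = 2, burnin = 100, iter = 200)

  # Top 3 terms per topic.
  print(topicmodels::terms(toy_model, 3))

  # Document-topic probabilities (one row per document, rows sum to 1).
  doc_topics <- topicmodels::posterior(toy_model)$topics
  print(round(doc_topics, 2))
}
```

Fixing `seed` makes a Gibbs run repeatable, which helps when comparing topic labels across runs.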

Outputs

| File | Description |
|------|-------------|
| `arxiv_cs_complete(2024-2025).csv` | Raw collected arXiv paper metadata |
| `token_text_arxiv_preprocessed.csv` | Cleaned & tokenized abstracts |
| `lda_topic_terms.csv` | Top 10 terms per topic with beta scores |
| `lda_document_topics_labeled.csv` | Per-document topic probability (gamma) |
| `wordclouds_all_topics.pdf` | Word cloud for the full corpus |
| `topic_wordclouds_all_terms.pdf` | Individual word clouds per topic |
| `lda_model.rds` | Serialized LDA model object |
| `dtm.rds` | Serialized Document-Term Matrix |
| `corpus.rds` | Serialized VCorpus object |
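Per-document gamma values are often reduced to a single dominant topic for reporting. The sketch below shows one way to do that on a toy matrix, using the topic labels from the Discovered Topics table. The matrix and the names `doc_topic_gamma` and `labeled` are made up for illustration, and the column layout of `lda_document_topics_labeled.csv` is not assumed here.

```r
# Topic labels in topic-number order, copied from the Discovered Topics table.
topic_labels <- c(
  "Algorithms & Graph Theory", "User Studies & HCI",
  "NLP & Large Language Models", "Network Systems & Performance",
  "Machine Learning", "Deep Learning & Neural Networks",
  "Scientific Research & Applications", "Control Systems & Robotics",
  "Mathematical Optimization", "Computer Vision & Multimedia"
)

# Toy gamma matrix: 3 documents x 10 topics, each row sums to 1.
doc_topic_gamma <- matrix(0.05, nrow = 3, ncol = 10)
doc_topic_gamma[1, 3] <- 0.55   # document 1 is mostly topic 3
doc_topic_gamma[2, 10] <- 0.55  # document 2 is mostly topic 10
doc_topic_gamma[3, 1] <- 0.55   # document 3 is mostly topic 1

# Index of the highest-probability topic in each row.
dominant <- max.col(doc_topic_gamma, ties.method = "first")

labeled <- data.frame(
  document = seq_len(nrow(doc_topic_gamma)),
  topic = dominant,
  label = topic_labels[dominant],
  gamma = doc_topic_gamma[cbind(seq_len(nrow(doc_topic_gamma)), dominant)]
)
print(labeled)
```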
