An end-to-end R pipeline that collects Computer Science research papers from the arXiv API, preprocesses their abstracts using NLP techniques, and discovers latent topics via Latent Dirichlet Allocation (LDA).
- Overview
- Pipeline Architecture
- Project Structure
- Discovered Topics
- Requirements
- Installation
- Usage
- Outputs
This project automates the full lifecycle of topic modeling on arXiv papers:
- Data Collection: Queries arXiv's API for Computer Science (`cs.*`) papers across a configurable date range, using chunked date windows and retry logic to ensure completeness.
- Text Preprocessing: Cleans raw abstracts through lowercasing, contraction expansion, punctuation removal, stopword filtering, and lemmatization.
- Topic Modeling: Trains a 10-topic LDA model (Gibbs sampling) on the preprocessed corpus and produces term distributions, document-topic assignments, and word cloud visualizations.
arXiv API
│
▼
data_collection_api.r ──► arxiv_cs_complete(2024-2025).csv
│
▼
text_data_preprocessing.r ──► token_text_arxiv_preprocessed.csv
│
▼
modeling_topic_lda.r ──► lda_topic_terms.csv
lda_document_topics_labeled.csv
wordclouds_all_topics.pdf
lda_model.rds / dtm.rds / corpus.rds
arxiv-nlp-topic-modeling/
├── data_collection_api.r # Step 1: arXiv API data collection
├── text_data_preprocessing.r # Step 2: NLP text cleaning & tokenization
├── modeling_topic_lda.r # Step 3: LDA topic modeling & visualization
└── README.md
The LDA model identifies 10 latent topics from the Computer Science literature:
| # | Topic Label |
|---|---|
| 1 | Algorithms & Graph Theory |
| 2 | User Studies & HCI |
| 3 | NLP & Large Language Models |
| 4 | Network Systems & Performance |
| 5 | Machine Learning |
| 6 | Deep Learning & Neural Networks |
| 7 | Scientific Research & Applications |
| 8 | Control Systems & Robotics |
| 9 | Mathematical Optimization |
| 10 | Computer Vision & Multimedia |
- R >= 4.0
- The following R packages:
| Stage | Packages |
|---|---|
| Data Collection | aRxiv, dplyr, lubridate, purrr, readr, stringr |
| Text Preprocessing | dplyr, stringr, textclean, textstem, tm, tokenizers |
| Topic Modeling | tm, topicmodels, tidytext, broom, dplyr, readr, ggplot2, scales, tidyr, stringr, slam, reshape2, wordcloud, RColorBrewer |
Install all required packages from an R console:
install.packages(c(
"aRxiv", "dplyr", "lubridate", "purrr", "readr", "stringr",
"textclean", "textstem", "tm", "tokenizers",
"topicmodels", "tidytext", "broom", "ggplot2", "scales",
"tidyr", "slam", "reshape2", "wordcloud", "RColorBrewer"
))

Run each script in order:
source("data_collection_api.r")

Collects all cs.* arXiv papers from 2025-01-01 to the current date in 2-month chunks.
Saves incremental backups and writes the final dataset to arxiv_cs_complete(2024-2025).csv.
Key parameters inside collect_full_arxiv():
| Parameter | Default | Description |
|---|---|---|
| `start_date` | `"2025-01-01"` | Start of the collection window |
| `end_date` | `Sys.Date()` | End of the collection window |
| `output_file` | `"arxiv_cs_complete(2024-2025).csv"` | Output CSV filename |
| `max_retries` | `5` | Max retry attempts per date chunk |
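
The chunked date windows and per-chunk retries described above can be sketched in base R. This is a minimal illustration, not the code in `data_collection_api.r`: the helper names `make_date_chunks()` and `with_retries()`, the `wait_seconds` argument, and the stand-in fetch function are all hypothetical, and the real script presumably issues its requests through the `aRxiv` package.

```r
# Hypothetical helper: split [start_date, end_date] into consecutive
# windows of `months` months (2 by default, as described above).
make_date_chunks <- function(start_date, end_date, months = 2) {
  start_date <- as.Date(start_date)
  end_date   <- as.Date(end_date)
  starts <- seq(start_date, end_date, by = paste(months, "months"))
  # Each window ends the day before the next one starts; the last
  # window ends at end_date.
  ends <- c(starts[-1] - 1, end_date)
  data.frame(chunk_start = starts, chunk_end = ends)
}

# Hypothetical helper: call fetch() up to max_retries times, waiting a
# little longer after each failure, and stop if every attempt fails.
with_retries <- function(fetch, max_retries = 5, wait_seconds = 1) {
  for (attempt in seq_len(max_retries)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_retries, conditionMessage(result)))
    if (attempt < max_retries) {
      Sys.sleep(wait_seconds * attempt)
    }
  }
  stop("All retry attempts failed", call. = FALSE)
}

# Usage: iterate over the windows, wrapping each request in with_retries().
chunks <- make_date_chunks("2025-01-01", "2025-06-15")
for (i in seq_len(nrow(chunks))) {
  window <- chunks[i, ]
  # Stand-in for the real API request for this window.
  n <- with_retries(function() as.integer(window$chunk_end - window$chunk_start) + 1L)
  message(format(window$chunk_start), " to ", format(window$chunk_end),
          ": ", n, " days")
}
```

Splitting the range into windows keeps each request small, and wrapping only the per-window call in the retry loop means a transient failure repeats one window rather than the whole collection.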
source("text_data_preprocessing.r")

Reads the collected CSV, cleans abstracts, tokenizes, removes stopwords, and lemmatizes.
Outputs the processed corpus to token_text_arxiv_preprocessed.csv.
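
The cleaning steps can be sketched for a single abstract as follows. This is a simplified illustration rather than the contents of `text_data_preprocessing.r`: the function name `clean_abstract()` and the short `fallback_stopwords` list are hypothetical, and the calls to `textclean` and `textstem` are only used when those packages are installed.

```r
# Tiny illustrative stopword list (an assumption for this sketch);
# tm::stopwords("en") would be a more complete choice.
fallback_stopwords <- c("a", "an", "and", "the", "of", "to", "in", "we",
                        "is", "are", "for", "on", "this", "that", "not", "do")

clean_abstract <- function(text, stopwords = fallback_stopwords) {
  # 1. Lowercase.
  text <- tolower(text)
  # 2. Expand contractions before punctuation removal, so apostrophes
  #    are still present (skipped if textclean is not installed).
  if (requireNamespace("textclean", quietly = TRUE)) {
    text <- textclean::replace_contraction(text)
  }
  # 3. Replace punctuation with spaces.
  text <- gsub("[[:punct:]]+", " ", text)
  # 4. Tokenize on whitespace and drop stopwords.
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stopwords)]
  # 5. Lemmatize (skipped if textstem is not installed).
  if (requireNamespace("textstem", quietly = TRUE)) {
    tokens <- textstem::lemmatize_words(tokens)
  }
  tokens
}

# Usage on one example sentence.
print(clean_abstract("We propose THE Graph-Based models, for papers!"))
```

The order matters: contractions are expanded while apostrophes still exist, and stopwords are removed before lemmatization so the lemmatizer only sees tokens that will be kept.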
source("modeling_topic_lda.r")

Builds a Document-Term Matrix, trains LDA with Gibbs sampling (k=10, burnin=1000, iter=2000), and generates all outputs.
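
The modeling step can be illustrated on a toy corpus with the `tm` and `topicmodels` packages listed in the requirements. The four documents, `k = 2`, the seed, and the shortened `burnin`/`iter` values are assumptions chosen so the example runs in a moment; the script itself uses the settings stated above.

```r
library(tm)
library(topicmodels)

# Toy stand-in for the preprocessed abstracts (illustrative only).
docs <- c(
  "graph algorithm shortest path graph search",
  "neural network training deep learning network",
  "graph theory algorithm complexity bound",
  "deep neural network image classification learning"
)

# Build a Document-Term Matrix from the corpus.
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Fit LDA with Gibbs sampling; small k and short chains for the toy data.
lda_model <- LDA(
  dtm,
  k = 2,
  method = "Gibbs",
  control = list(seed = 42, burnin = 100, iter = 200)
)

# Top terms per topic (one column per topic).
top_terms <- terms(lda_model, 3)
print(top_terms)

# Per-document topic probabilities (gamma) and the most likely topic.
gamma <- posterior(lda_model)$topics
dominant_topic <- apply(gamma, 1, which.max)
print(round(gamma, 3))
print(dominant_topic)
```

Each row of `gamma` sums to 1, so the largest entry in a row gives that document's most probable topic.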
| File | Description |
|---|---|
| `arxiv_cs_complete(2024-2025).csv` | Raw collected arXiv paper metadata |
| `token_text_arxiv_preprocessed.csv` | Cleaned & tokenized abstracts |
| `lda_topic_terms.csv` | Top 10 terms per topic with beta scores |
| `lda_document_topics_labeled.csv` | Per-document topic probability (gamma) |
| `wordclouds_all_topics.pdf` | Word cloud for the full corpus |
| `topic_wordclouds_all_terms.pdf` | Individual word clouds per topic |
| `lda_model.rds` | Serialized LDA model object |
| `dtm.rds` | Serialized Document-Term Matrix |
| `corpus.rds` | Serialized VCorpus object |
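
The `.rds` files can be loaded back into a later R session with `readRDS()`, which avoids retraining. The sketch below assumes the artifacts were written with `saveRDS()`; the helper `load_artifact()` is hypothetical, and the round trip uses a stand-in list in a temporary directory rather than a real model.

```r
# Hypothetical helper: fail with a clear message if the file is missing.
load_artifact <- function(path) {
  if (!file.exists(path)) {
    stop("Artifact not found: ", path, call. = FALSE)
  }
  readRDS(path)
}

# Round-trip demo with a stand-in object (not a real LDA model).
tmp_path <- file.path(tempdir(), "lda_model.rds")
saveRDS(list(k = 10, method = "Gibbs"), tmp_path)

restored <- load_artifact(tmp_path)
str(restored)
```

In the project directory, the same helper would be called as `load_artifact("lda_model.rds")`, with `topicmodels` loaded beforehand so the restored model's methods are available.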