Zotero and arXiv Recommendation System

-- Project Status: [ Active ]

Project Intro/Objective

This project contains code to analyse my Zotero library, which holds the journal articles I have been collecting for many years. The objective is to build a scraping and recommendation system on top of arXiv that finds and classifies new papers and surfaces research relevant to my interests. It can be viewed as an improvement over the basic scraper and keyword highlighter I previously developed: arxiv_scanner_flask.

Methods Used

Data Analysis
Machine Learning
Data Visualization
Predictive Modeling
Content-based recommendation system
Natural Language Processing
Web scraping

Technologies

Python
pandas, scikit-learn, NumPy

Project Description

This project covers several core aspects of data science and serves as a training project in bringing a useful product to production. The main steps are:

ETL: Merge my Zotero library with a random sample of arXiv papers (the two sources use different schemas).
Analysis: Analyze the dataset of arXiv articles (whether or not they are in my library) to identify trends in topics and authors.
Encoding: Encode text-based features, a technique that was new to me and that I used this project to learn.
Recommender system: Using the sparsely encoded title, author list and category list of each article, I built and compared several cosine similarity matrices, which then served as recommendation matrices. This is a simple yet effective system.
Classifier: I built an unsupervised topic clustering system using the summary column and non-negative matrix factorization. Each author is then assigned a list of their most frequent topics, which is used as that author's encoding. Similarity between authors therefore reflects similarity of interests rather than textual similarity.
Packaging: This project can be run as a standalone script with any arXiv identifier. The program first pulls the article from arXiv, then runs the similarity pipeline and returns recommendations.
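
The recommender step can be illustrated with a minimal sketch. It is not the code from this repository: the toy data, the column names, the passthrough helper, the recommend function and the equal-weight average of the three similarity matrices are all assumptions made for illustration. It encodes titles with TF-IDF, encodes author and category lists as sparse binary vectors, and ranks papers by cosine similarity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the merged Zotero/arXiv table (hypothetical values).
papers = pd.DataFrame({
    "title": [
        "Holographic superconductors in anti-de Sitter space",
        "Holographic superconductors with lattices",
        "Deep learning for galaxy classification",
        "Graph neural networks for galaxy morphology",
    ],
    "authors": [
        ["A. Smith", "B. Jones"],
        ["A. Smith"],
        ["C. Lee"],
        ["C. Lee", "D. Kim"],
    ],
    "categories": [
        ["hep-th"],
        ["hep-th", "cond-mat.str-el"],
        ["astro-ph.GA"],
        ["astro-ph.GA", "cs.LG"],
    ],
})


def passthrough(tokens):
    """Use an already-split list (authors, categories) as the features."""
    return tokens


# One sparse encoding per feature: TF-IDF for free text, binary for lists.
title_matrix = TfidfVectorizer(stop_words="english").fit_transform(papers["title"])
author_matrix = CountVectorizer(analyzer=passthrough, binary=True).fit_transform(
    papers["authors"]
)
category_matrix = CountVectorizer(analyzer=passthrough, binary=True).fit_transform(
    papers["categories"]
)

# One cosine similarity matrix per feature, combined with equal weights
# (the weighting is an assumption; other combinations are possible).
similarity = np.mean(
    [cosine_similarity(m) for m in (title_matrix, author_matrix, category_matrix)],
    axis=0,
)


def recommend(index, n=2):
    """Return the row indices of the n papers most similar to papers.iloc[index]."""
    scores = similarity[index].copy()
    scores[index] = -np.inf  # never recommend the query paper itself
    top = np.argsort(scores)[::-1][:n]
    return [int(i) for i in top]
```

Calling recommend(0, n=1) on this toy table returns the other holography paper, since it shares title words, an author and a category with the query. Keeping one matrix per feature makes it easy to compare them individually before deciding how to combine them.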

Getting Started

The data merging and arXiv random sampling are covered in this notebook. The proof of concept for the recommender system and the classifier can be found in this notebook. The improvement with target encoding for authors can be found in this notebook.

To see how to use this code, run python3 main.py --help or any of the commands below:

# Get recommendations for an article
python3 main.py 2303.17685

# Change the number of recommendations
python3 main.py 2303.17685 -n 30

# Use the target encoding instead of basic encoding for authors (slower)
python3 main.py 2303.17685 --encode_topics
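
The topic-based author encoding selected by --encode_topics is described under Project Description. A minimal sketch of the idea, assuming toy summaries and author lists, follows. The variable names, the number of topics and the top_topics helper are hypothetical and are not taken from this repository. It factorizes TF-IDF vectors of the summaries with non-negative matrix factorization, assigns each paper its dominant topic, and counts topics per author.

```python
from collections import Counter, defaultdict

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy summaries and author lists (hypothetical values).
summaries = [
    "holographic superconductors and black hole conductivity",
    "black hole horizons and holographic conductivity",
    "holographic transport near black hole horizons",
    "neural network classification of galaxy images",
    "galaxy morphology from neural network images",
]
authors = [
    ["A. Smith"],
    ["A. Smith", "B. Jones"],
    ["B. Jones"],
    ["C. Lee"],
    ["C. Lee"],
]

# Sparse TF-IDF encoding of the summaries.
tfidf = TfidfVectorizer(stop_words="english")
summary_matrix = tfidf.fit_transform(summaries)

# Factorize into (papers x topics) and (topics x words) non-negative matrices.
# Two topics is an assumption that suits this toy data only.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
paper_topic_weights = nmf.fit_transform(summary_matrix)

# Each paper is labelled with its highest-weight topic.
paper_topic = paper_topic_weights.argmax(axis=1)

# Count how often each topic appears among an author's papers.
author_topics = defaultdict(Counter)
for names, topic in zip(authors, paper_topic):
    for name in names:
        author_topics[name][int(topic)] += 1


def top_topics(name, k=1):
    """Return the k most frequent topic labels for an author."""
    return [topic for topic, _ in author_topics[name].most_common(k)]
```

With this encoding, two authors who never share a paper can still be similar if their papers fall into the same topics, which is the behaviour the classifier step aims for.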

NicolasChagnet/arxiv-recommendations
