Skip to content

Latest commit

 

History

History
25 lines (17 loc) · 2.03 KB

README.md

File metadata and controls

25 lines (17 loc) · 2.03 KB

Faculty Research in Animal Science

Overview
This is a project for my Data & Web Technologies for Data Analysis class. Skills learned include data wrangling using pandas, plotting using seaborn, word clouds, natural language processing using scikitlearn and TfidfVectorizer, knn regression, knn classification, and kmeans clustering.

I used the scholarly package to get publication titles, years, and number of citations from Google Scholar for animal science professors at 5 universities in the US. One goal was to see whether research topics vary a lot by university. For this, I made word clouds. Another goal was to make predictions based on the text of the titles.

Getting the data

  • Notebook: data_acquisition.ipynb
  • Output dataset: data.txt

In this notebook, I used the scholarly.search_author() function from scholarly package to get publication data for animal science professors at 5 universities. At the end, I also discuss the pros and cons of the data extraction method that I used.

Exploring and Cleaning the Data

  • Notebook: data_exploration.ipynb
  • Input dataset: data.txt
  • Output dataset: cleaned.txt

This notebook is my favorite! In this notebook, I found some non-English publication titles, and checked for and removed erroneous observations and profiles. I also did some summary statistics including counting the number of publications per professor and exploring the relationship between age and number of publications. I also made word clouds to see what topics are popular at each university.

Modeling

  • Notebook: statistics.ipynb
  • Input dataset: cleaned.txt

In this notebook, I wanted to see what kinds of things I could predict using the text of the titles. One thing I tried was predicting the number of citations based on the title. For this, I used knn regression and that didn't give a very effective prediction. I then used a knn classifier to predict the author based on the title. I also used a separate knn classifier to predict the school based on the title.