PubScraper

Program 1: TheScraper.py

-Searches PubMed for all publications from a given year for the affiliation “University of Massachusetts Amherst”

-For each hit, it finds the DOI in the citation and saves the citation including title, authors, etc.

-For each hit/DOI, it visits the doi.org site to view the manuscript

-It then scrolls the browser through the manuscript and takes screenshots

-It then stitches the screenshots together and does OCR to convert it to plain text

-It then does this for the remaining papers from that year
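
As a rough illustration of the first three steps, here is a minimal sketch. It assumes the search goes through the NCBI E-utilities `esearch` endpoint; TheScraper.py itself may drive PubMed through the browser instead. The function names, the `retmax` default, the DOI regular expression, and the sample DOI are placeholders for this example and are not taken from the script.

```python
import re
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; the real script may differ).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def build_search_url(year, affiliation="University of Massachusetts Amherst", retmax=200):
    """Build a PubMed search URL for one affiliation and one publication year."""
    term = f'"{affiliation}"[Affiliation] AND {year}[pdat]'
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_doi(citation):
    """Return the first DOI found in a citation string, or None if there is none."""
    match = DOI_PATTERN.search(citation)
    if match is None:
        return None
    # Citations often end the DOI with sentence punctuation; strip it.
    return match.group(0).rstrip(".,;)")


def doi_to_url(doi):
    """Turn a DOI into the doi.org link that resolves to the manuscript."""
    return f"https://doi.org/{doi}"
```

Calling `build_search_url(2020)` gives the URL to request for one year, and `doi_to_url(extract_doi(citation))` gives the page the browser would visit for each hit. The scrolling, screenshot, stitching, and OCR steps depend on the browser and OCR tools in use and are left out of this sketch.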

Program 2: TheSearcher.py

-This takes input from an internal webserver that is “user facing” (currently just a VM on a box in my office).

-It collects submitter name, year to search, and keyword list (this is the critical one and I would offer suggestions for end user input).

-It also accepts a CSV file that contains, in my case, 2 columns listing every trainee who was a facility user and their advisor, covering a timeframe of 2 years from the start of the search criteria.

-It then matches the names from the CSV against the authors in each citation, and the keywords against the raw text.

-I then apply a weighting to the scores, and it outputs a list ranked by probability; each entry includes the entire citation and a hyperlink to the paper.
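
The matching and weighting steps could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the dictionary keys (`citation`, `authors`, `text`), matching on last names as whole words, the 0.6/0.4 weights, and the way the two scores are combined are all hypothetical, and the result is a simple weighted score rather than a calibrated probability. The weighting in TheSearcher.py may be quite different.

```python
import re


def count_hits(terms, text):
    """Count how many terms occur in text as whole words, ignoring case."""
    hits = 0
    for term in terms:
        if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
            hits += 1
    return hits


def score_paper(paper, last_names, keywords, name_weight=0.6, keyword_weight=0.4):
    """Combine a name match and a keyword match into one weighted score.

    The weights are hypothetical placeholders, not the values used in the script.
    """
    # Name score: 1.0 if any facility user or advisor appears among the authors.
    name_score = 1.0 if count_hits(last_names, paper["authors"]) > 0 else 0.0
    # Keyword score: fraction of the submitted keywords found in the OCR text.
    keyword_score = count_hits(keywords, paper["text"]) / max(len(keywords), 1)
    return name_weight * name_score + keyword_weight * keyword_score


def rank_papers(papers, last_names, keywords):
    """Return (score, paper) pairs sorted from most to least likely match."""
    scored = [(score_paper(p, last_names, keywords), p) for p in papers]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Each `paper` here is a dictionary holding the saved citation, its author string, and the OCR text from the scraper. Treating any single name match as a full name score keeps a long user list from diluting the result, which is one possible design choice among several.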

Folders and files

1b-DOI_text_files_2020-just-a-handful
1-TheScraper.py
1a-DOI_outputs.txt
2-TheSearcher.py
2a-input_data.txt
2b-input_users.csv
2c-graded-outputs.csv
README.md

UMassCDS/PubScraper

Folders and files

Latest commit

History

Repository files navigation

PubScraper

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages