A Python tool to perform systematic literature reviews by searching Web of Science (WoS), extracting useful information (see the list below), and saving it in tables (.csv). The code performs dynamic web scraping of WoS, leveraged by `selenium`, and static parsing with `beautifulsoup4`.
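In a nutshell, the workflow follows the usual selenium-plus-beautifulsoup4 pattern sketched below. This is only a generic illustration of that pattern, not the tool's internal code (the URL is just a placeholder):

```python
from selenium import webdriver
from bs4 import BeautifulSoup

# Render the page in a real browser so JavaScript-driven content is loaded...
driver = webdriver.Firefox()
driver.get("https://www.webofscience.com/wos/woscc/basic-search")
html = driver.page_source
driver.quit()

# ...then hand the static snapshot over to beautifulsoup4 for parsing.
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text())
```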
This tool extracts the following data:
- First Author
- Date of publication (month and year)
- Journal
- Abstract
- Link on WoS
- Link to full text
- DOI
- Keywords and Plus-keywords
- Corresponding address
To run this tool, you'll need to set up a Python environment and install the necessary packages. You can do so by following these steps:
- Make sure you have Firefox installed
- Clone this repository
- Open the Anaconda prompt and create a new environment with `conda create --name <env_name> python=3.9`, or use virtualenv (for Linux machines) with `python3 -m venv /path/to/new/virtual/environment`
- `cd` (change directory) into this repository and install the necessary packages (listed in `requirements.txt`): `pip install -r requirements.txt`
- Once your environment is ready, you can either:
  - Configure your IDE (e.g., PyCharm) to use the created environment, or
  - Activate the environment and run the code through the Anaconda Prompt. To do so, don't forget to activate the environment with `conda activate <env_name>`, then `cd` into the directory containing the script that calls the scraper functions and run it with `python <script_name.py>`. (A quick optional check of the environment is sketched right after this list.)
- For setting up your search on Web of Science, follow the example provided in the `/example/` folder
- The code will first open a Firefox window and then keep opening new windows, one for each pagination resulting from the web search. You can close them all manually (except for the last-opened window, which is being dynamically scraped) or suppress their opening with the arguments of the function `scroll_and_click_showmore` of the module `setup_page`.
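As the quick check mentioned above (optional, and not part of the tool), you can verify from the activated environment that the main dependencies are importable:

```python
# Confirm the scraping and parsing dependencies are installed and print their versions.
import selenium, bs4, pandas
print(selenium.__version__, bs4.__version__, pandas.__version__)
```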
Currently, this tool is not yet packaged (no `pip install`). Thus, to call the modules, you can use the `/example/` folder as a template for your project. Go to `main.py` and adapt the code as follows:
```python
import pandas as pd

from wos_scraper import setup_page
from wos_scraper import parse_soup
from wos_scraper import parse_papers

# Provide the link to the search, for instance:
search = 'https://www.webofscience.com/wos/woscc/summary/ff7d7f65-1ac6-4213-b788-f3caf673d7fd-6c336e02/relevance/1'

# Get and save the html of each pagination from 1 to 49.
# This line saves the html files corresponding to each page
# (from 1 to 49 in this case) in the same folder.
html_list, html_save_files = setup_page.get_html_through_paginations(search, range(1, 50))

# Parse the htmls and produce tables with author, title, abstract, etc.
# Each html file is parsed and its table is saved as a .csv in the current folder.
for f in html_save_files:
    with open(f, 'r', encoding='utf-8') as fh:
        htmlfile = fh.read()
    df = parse_soup.parse_html_get_table(htmlfile)
    df.to_csv('df-{}.csv'.format(f))

# Get further specific details of the papers (DOI, keywords, plus-keywords,
# research areas, corresponding address) from their WoS links:
for f in html_save_files:
    df = pd.read_csv('df-{}.csv'.format(f))
    df_new = parse_papers.parse_papers_from_urls(df, column='wos_link')
    df_new.to_csv('df-appended-{}.csv'.format(f))
```
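If you prefer a single table at the end, the per-page tables can be concatenated with pandas. This is an optional convenience step, not part of the tool; it only assumes the `df-appended-*.csv` naming used above:

```python
import glob
import pandas as pd

# Combine all appended per-page tables into one master table.
frames = [pd.read_csv(f) for f in sorted(glob.glob('df-appended-*.csv'))]
pd.concat(frames, ignore_index=True).to_csv('df-all.csv', index=False)
```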
Try re-running the code two or three times if it fails at first: the cookie prompt shown by Web of Science may take some time to appear depending on your internet speed. This is currently a limitation of the tool and is under investigation. The code automatically scrolls and clicks on the button "Show More", which opens the abstract and thus allows the subsequent parsing of the html. However, the WoS server often notices this systematic behavior and blocks the automated clicking.
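For context, the scroll-and-click pattern in selenium generally looks like the sketch below. This is a generic illustration, not the tool's actual implementation; the URL and the button XPath are assumptions:

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("https://www.webofscience.com/wos/woscc/basic-search")  # placeholder URL

# Wait until a "Show More"-like button is clickable (the XPath is hypothetical).
button = WebDriverWait(driver, 30).until(
    EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'Show More')]"))
)

# Scroll the button into view, click it, and give the page time to react.
driver.execute_script("arguments[0].scrollIntoView();", button)
button.click()
time.sleep(2)
```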
Web scraping is an elegant way to extract publicly available data that would otherwise have to be collected manually and take an eternity. However, any scraping method needs sleep times throughout the code (the well-known `time.sleep()`) in order to interact safely with the host's server. If multiple requests are sent simultaneously, the host's server can be compromised and the scraping might damage its functioning. So don't find it strange that the scraping takes hours; this is how it is supposed to happen.
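Purely for illustration (the tool already handles its own delays), polite pacing between requests typically looks like this; the URLs and timings are arbitrary examples:

```python
import random
import time

urls = ['https://example.com/page1', 'https://example.com/page2']
for url in urls:
    # ... fetch and parse the page here ...
    # Pause a few seconds (with some jitter) before the next request
    # so the host server is never flooded.
    time.sleep(5 + random.uniform(0, 3))
```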
If you were running the code to loop through n pages (`range(1, n+1)`) and the program terminates at a certain page p, simply restart the function `setup_page.get_html_through_paginations()` setting the argument `pags` to `range(p, n+1)`. That will continue the web scraping from where it stopped, and you can then parse the htmls with `parse_soup.parse_html_get_table()`.
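For example, if a 49-page run stopped at page 20, a restart could look like the sketch below (the page numbers are illustrative):

```python
from wos_scraper import setup_page, parse_soup

search = 'https://www.webofscience.com/wos/woscc/summary/<your-search-id>/relevance/1'  # your results URL

# Resume scraping from page 20 up to and including page 49.
html_list, html_save_files = setup_page.get_html_through_paginations(search, pags=range(20, 50))

# Parse the newly saved htmls as before.
for f in html_save_files:
    with open(f, 'r', encoding='utf-8') as fh:
        df = parse_soup.parse_html_get_table(fh.read())
    df.to_csv('df-{}.csv'.format(f))
```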
Please submit your issues, possible improvements, and bugs by opening an issue. Answers should not take more than a day.