A neat script to generate a customizable wordcloud by scraping abstracts on Pubmed using a user-defined query.
The idea is to use smart advanced Pubmed queries to obtain a wordcloud showing the most frequent words appearing in abstracts.
See wordcloud_settings.json for the settings used to generate the output below.
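To make "most frequent words" concrete, here is a minimal sketch of the counting idea using only the standard library. It is an illustration, not the script's actual code: the wordcloud library does its own tokenizing and stop-word filtering internally, and the `top_words` helper, the tiny `STOPWORDS` set, and the sample abstracts are all made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "for", "with", "on"}


def top_words(abstracts, n=5):
    """Return the n most frequent non-stop-words across all abstracts."""
    counts = Counter()
    for text in abstracts:
        # Lowercase and keep alphabetic runs only.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


# Made-up abstracts standing in for text fetched from Pubmed.
abstracts = [
    "Autophagy regulates tumor growth in mice.",
    "Tumor cells depend on autophagy for survival.",
    "Inhibition of autophagy slows tumor progression.",
]
most_frequent = top_words(abstracts, n=2)
```

In a wordcloud, words with higher counts like these are simply drawn larger.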
- Biopython: for accessing Pubmed and fetching publications
- PIL: for reading mask file and saving final output
- Wordcloud by Andreas Mueller: for the generation of the actual wordcloud
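As a rough sketch of how these three libraries could fit together (not necessarily how wordcloud_from_input.py does it), the functions below search Pubmed with Biopython's `Entrez` module, read an optional mask with PIL, and render the cloud with `WordCloud`. The function names and the `wordcloud_kwargs` helper are hypothetical, and numpy is assumed to be available because the wordcloud package depends on it.

```python
def fetch_abstracts(email, query, max_papers=300):
    """Search Pubmed and return a list of abstract texts (needs network access)."""
    from Bio import Entrez  # Biopython

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_papers)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return []

    handle = Entrez.efetch(db="pubmed", id=",".join(ids), retmode="xml")
    records = Entrez.read(handle)
    handle.close()

    abstracts = []
    for article in records["PubmedArticle"]:
        abstract = article["MedlineCitation"]["Article"].get("Abstract")
        if abstract:  # some records have no abstract
            abstracts.append(" ".join(str(part) for part in abstract["AbstractText"]))
    return abstracts


def wordcloud_kwargs(background=None, colormap="viridis", mask=None):
    """Hypothetical helper: turn the settings into WordCloud keyword arguments."""
    kwargs = {"background_color": background, "colormap": colormap}
    # A transparent background requires RGBA mode together with background_color=None.
    kwargs["mode"] = "RGBA" if background is None else "RGB"
    if mask is not None:
        kwargs["mask"] = mask
    return kwargs


def make_wordcloud(abstracts, out_file="wordcloud.png", mask_file=None,
                   background=None, colormap="viridis"):
    """Render the abstracts as a wordcloud and save it as a png."""
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    mask = np.array(Image.open(mask_file)) if mask_file else None
    cloud = WordCloud(**wordcloud_kwargs(background, colormap, mask))
    cloud.generate(" ".join(abstracts))
    cloud.to_file(out_file)


# Example call (commented out because it contacts NCBI):
# make_wordcloud(fetch_abstracts("you@example.org", "autophagy[Title]", 50))
```

The imports sit inside the functions only so that the pure helper can be tried without the libraries installed; a real script would normally import them at the top.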
Marco Dalla Vecchia
- Install conda
- Create conda environment
$ conda env create -f requirements.yml
- Activate environment
$ conda activate biopython-wordcloud-env
- Run the script and follow the instructions
If you want to use a mask image or an existing json config file, make sure they are in the same folder as the python script
$ python wordcloud_from_input.py
I was planning to make a single-file executable for Windows, but I don't know how to handle the dependencies yet.
The script is designed to ask the user for the most important information and settings needed to create the wordcloud.
The script will ask for the following:
- Is there a json config file already? If yes, it will generate a wordcloud purely based on those configurations. If not, proceed.
- Email address → checked for a valid format
- Query → this is a Pubmed advanced query; use the online tool to find the desired query, then copy/paste it here
- Background color → this is the color used for the background of the generated wordcloud. Defaults to transparent.
- Colormap → this is a valid matplotlib colormap used to color the text of the wordcloud. Check the webpage and type in the name. Defaults to viridis.
- Maximum number of fetched publications → this is the max number of papers fetched from Pubmed to create the wordcloud. Defaults to 300.
- Name of mask file → this is the name of the black and white mask image which can optionally be used to give the wordcloud a custom shape. Defaults to no mask.
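The prompts above boil down to two small steps: validating the email format and falling back to the documented defaults when an answer is left blank. The sketch below shows one way this might look. The setting key names, the `resolve_settings` helper, and the simple email pattern are assumptions for illustration; the actual script may check things differently.

```python
import re

# Defaults as documented above; the key names themselves are hypothetical.
DEFAULTS = {
    "background": None,   # None stands for a transparent background
    "colormap": "viridis",
    "max_papers": 300,
    "mask_file": None,    # no mask
}

# Loose format check: something@something.something, without spaces.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid_email(address):
    """Return True if the address looks like an email (format check only)."""
    return EMAIL_PATTERN.fullmatch(address) is not None


def resolve_settings(answers):
    """Merge user answers over the defaults, ignoring blank answers."""
    settings = dict(DEFAULTS)
    for key, value in answers.items():
        if value is not None and str(value).strip() != "":
            settings[key] = value
    settings["max_papers"] = int(settings["max_papers"])
    return settings


# Example: the user typed a query and a paper limit, and left the rest blank.
settings = resolve_settings(
    {"email": "jane@example.org", "query": "autophagy[Title]",
     "colormap": "", "max_papers": "50"}
)
```

Here `settings["colormap"]` falls back to `"viridis"`, while `settings["max_papers"]` becomes the integer `50`.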
- papers.txt → this will contain DOI info, title, authors and abstract of found papers from Pubmed. Only the abstract texts will be used for the creation of the wordcloud.
- wordcloud_settings.json → json file containing the settings used in the creation of the last wordcloud. It can be reused.
- wordcloud.png → final output (png format to allow for transparency)
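Reusing wordcloud_settings.json amounts to a plain JSON round trip. A minimal sketch, assuming the settings are kept in a flat dictionary (the function names and keys are hypothetical and may not match the script):

```python
import json
from pathlib import Path


def save_settings(settings, path="wordcloud_settings.json"):
    """Write the settings used for the last wordcloud to a json file."""
    Path(path).write_text(json.dumps(settings, indent=2), encoding="utf-8")


def load_settings(path="wordcloud_settings.json"):
    """Return the saved settings, or None if there is no config file yet."""
    config = Path(path)
    if not config.exists():
        return None
    return json.loads(config.read_text(encoding="utf-8"))
```

When `load_settings()` returns `None`, the script would fall back to asking the questions interactively; otherwise the loaded dictionary can drive the wordcloud directly.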
This script can easily be adapted to other circumstances, and the wordcloud settings can be further controlled by changing the code directly. Feel free to suggest possible improvements!