Graphic Visualization of trends in Bioinformatics
- Usage of bioinformatic tools and techniques (e.g. sequence alignment, genome assembly, metagenomics)
- Relationship of tool development within analytical pipelines
- Geographic hotspots for development of bioinformatic tools and techniques
- Diana Lin
- Emma Garlock
- Jasmine Lai
- Raissa Philibert
- Morgana Xu
- Lucia Darrow
- Swapna Menon
- Shannon Lo
- Elliot YKF
- Web scraping using the `fulltext` package and processing the data with the `pubchunks` package. Code for this can be found in `R/Fulltext_Workflow.Rmd`.
  - We will be looking for: `doi`, `Year of Publication`, `Journal Name`, `publisher`, `author`, `research institution`, `title`, `abstract`.
  - The keywords used for searching were determined by terms in the EDAM ontology database and previous knowledge of the field.
  - The keywords have been set up in `raw-data/SearchTerms.csv`. We are working on optimizing the `R/Fulltext_Workflow.Rmd` file to loop through this CSV.
  - Due to access issues, we are only looking at papers in `plos`.
- Visualization of trends in the data with ggplot and other R visualization packages
These are the R Packages needed:
fulltext
pubchunks
tidyverse
magrittr
dplyr
purrr
here
future
lubridate
stringr
maps
viridis
rgeos
sf
ggmap
maptools
igraph
ggraph
tm
gganimate
data.table
textrank
udpipe
tidytext
ggplot2
plotly
googleVis
ggrepel
egg
grid
ggalluvial
widyr
readr
tidygraph
These are the Python (`python3`) packages needed:
googlemaps
pandas
- A geocoding API key from Google is also required
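
As a rough illustration of how the geocoding step could look with the two Python packages listed above, here is a minimal sketch. It is not the project's actual script: the function names, the `institution`/`lat`/`lng` column names, and the choice to keep only the first match are all assumptions made for this example. The lookup is passed in as a callable so the helper can be tried without an API key.

```python
from typing import Callable, Dict, Iterable, List


def geocode_institutions(
    names: Iterable[str],
    geocode: Callable[[str], list],
) -> List[Dict]:
    """Look up each institution name and return one row per name.

    `geocode` is any callable shaped like googlemaps.Client.geocode:
    it takes an address string and returns a list of result dicts.
    """
    rows = []
    for name in names:
        results = geocode(name)
        if results:
            # Assumption: the first result is the one we want.
            location = results[0]["geometry"]["location"]
            lat, lng = location["lat"], location["lng"]
        else:
            # Leave the coordinates empty when nothing is found.
            lat, lng = None, None
        rows.append({"institution": name, "lat": lat, "lng": lng})
    return rows


def geocode_with_google(names: Iterable[str], api_key: str):
    """Hypothetical wrapper: run the helper against Google and return a DataFrame."""
    # Imported here so the helper above works without these packages installed.
    import googlemaps
    import pandas as pd

    client = googlemaps.Client(key=api_key)
    return pd.DataFrame(geocode_institutions(names, client.geocode))
```

Calling `geocode_with_google(["University of British Columbia"], api_key="...")` would issue one request per name, so a long institution list counts against the API quota.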
To webscrape, run `database_parallel2.R` in the `R` directory after modifying the `topic` and output filename based on the names in `raw-data/SearchTerms.csv`.
Three different types of information will be plotted:
- General summary plots of topic coverage over time and topic coverage based on journal.
- Analysis of bigrams (pairs of adjacent words) used in publications to find common phrases and their relationships. We are also looking to connect authors to their subject matter using a Sankey diagram.
- Geographic mapping of institutions involved in bioinformatics research. The file used to generate these visualizations is `R/visualization_map.Rmd`.
  - Dynamic heatmaps coloured by the number of active institutions in each area. The maps provide a global view, as well as specific USA and European maps due to the high number of results in those areas. Hovering over a country or state shows the number of institutions, and hovering over a point (where shown) identifies the institution.
  - An interactive world map showing the most used keywords in each country. Hovering over a country shows the topic names.
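
To make the bigram analysis mentioned above concrete, the sketch below counts adjacent word pairs across a set of texts. The project's own analysis is done in R; this Python version only illustrates the idea. The function name, the tokenization rule (lowercase runs of letters and digits), and the stop-word handling are assumptions for the example.

```python
import re
from collections import Counter


def count_bigrams(texts, stop_words=frozenset()):
    """Count adjacent word pairs across texts, skipping pairs with a stop word."""
    counts = Counter()
    for text in texts:
        # Simple tokenization: lowercase runs of letters and digits.
        # Hyphenated terms such as "RNA-seq" are split into two words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for first, second in zip(words, words[1:]):
            if first in stop_words or second in stop_words:
                continue
            counts[(first, second)] += 1
    return counts


abstracts = ["Genome assembly of the genome", "genome assembly tools"]
bigrams = count_bigrams(abstracts, stop_words={"of", "the"})
print(bigrams.most_common(1))  # [(('genome', 'assembly'), 2)]
```

The resulting pair counts are the kind of edge list a network plot of related phrases can be built from.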