Trends in Bioinformatics

Proof of Concept

Project Goals

Graphical visualization of trends in bioinformatics:

Usage of bioinformatic tools and techniques (e.g. sequence alignment, genome assembly, metagenomics)
Relationship of tool development within analytical pipelines
Geographic hotspots for development of bioinformatic tools and techniques

Team

Approach

Web scraping with the fulltext package and processing of the retrieved data with the pubchunks package. The code for this can be found in R/Fulltext_Workflow.Rmd.
- We collect the following fields: DOI, year of publication, journal name, publisher, author, research institution, title, abstract.
- The search keywords were determined from terms in the EDAM ontology database and from previous knowledge of the field.
- The keywords are stored in raw-data/SearchTerms.csv. We are working on optimizing the R/Fulltext_Workflow.Rmd file to loop through this CSV.
- Due to access issues, we are only looking at papers published in PLOS.
Visualization of trends in the data with ggplot and other R visualization packages.

Pre-requisites

These are the R Packages needed:

fulltext
pubchunks
tidyverse
magrittr
dplyr
purrr
here
future
lubridate
stringr
maps
viridis
rgeos
sf
ggmap
maptools
igraph
ggraph
tm
gganimate
data.table
textrank
udpipe
tidytext
ggplot2
plotly
googleVis
ggrepel
egg
grid
ggalluvial
widyr
readr
tidygraph

These are the Python (python3) packages needed:

googlemaps
pandas
A Google Geocoding API key is also required.
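
As a rough illustration of how the googlemaps package and the API key fit together, the sketch below geocodes a list of institution names. It is not the project's actual script: the function names, the output keys (institution, lat, lng) and the "YOUR_API_KEY" placeholder are assumptions made for this example. Only googlemaps.Client and its geocode method come from the library itself.

```python
from typing import Callable, Iterable, List


def make_google_geocoder(api_key: str) -> Callable[[str], list]:
    """Return a function that geocodes one address with the googlemaps client."""
    # Imported lazily so the rest of the sketch can run without the package.
    import googlemaps

    client = googlemaps.Client(key=api_key)
    return client.geocode


def geocode_institutions(
    names: Iterable[str], geocode: Callable[[str], list]
) -> List[dict]:
    """Look up each distinct institution name once and collect its coordinates.

    `geocode` is any callable that returns a list of results shaped like the
    Google Geocoding API response (hypothetical helper, not project code).
    """
    rows = []
    for name in dict.fromkeys(names):  # drop duplicates, keep first-seen order
        results = geocode(name)
        if results:
            location = results[0]["geometry"]["location"]
            lat, lng = location["lat"], location["lng"]
        else:
            # No match: keep the row so missing institutions stay visible.
            lat, lng = None, None
        rows.append({"institution": name, "lat": lat, "lng": lng})
    return rows
```

With a real key, the call would look like geocode_institutions(names, make_google_geocoder("YOUR_API_KEY")), and the resulting list of dictionaries can be passed directly to pandas.DataFrame for further processing.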

Usage

Webscraping

To scrape publications, run database_parallel2.R in the R directory after setting the topic and output filename to match the names in raw-data/SearchTerms.csv.

Visualization

Three different types of information will be plotted:

General summary plots of topic coverage over time and topic coverage based on journal.
Analysis of bigrams (combinations of two words) used in publications to see common phrases and their relationships. We also plan to connect authors and their subject matter using a Sankey diagram.
Geographic mapping of institutions involved in bioinformatic research. The visualizations are generated by R/visualization_map.Rmd.
- Dynamic heatmaps coloured by the number of active institutions in the area. The maps provide a global view, as well as specific USA and European maps due to the high number of results in those areas. Hovering over the countries/states shows the number of institutions, and hovering over the points (where shown) identifies the institutions.
- An interactive world map showing the most used keywords in each country. Hovering over the countries shows the topic names.
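
To make the bigram analysis mentioned above more concrete, here is a minimal sketch of counting bigrams across a set of abstracts. It is an illustration only, not the project's implementation (which lives in the R code): the tokenizer, the small stopword list and the function name count_bigrams are assumptions made for this example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the project's list).
STOPWORDS = frozenset(
    {"the", "of", "and", "a", "an", "in", "to", "for", "with", "on", "is", "are"}
)


def count_bigrams(texts, stopwords=STOPWORDS):
    """Count adjacent word pairs, skipping pairs that contain a stopword."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on anything that is not a letter or digit.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for first, second in zip(tokens, tokens[1:]):
            if first in stopwords or second in stopwords:
                continue
            counts[(first, second)] += 1
    return counts
```

Calling count_bigrams(abstracts).most_common(20) would list the twenty most frequent pairs. Each pair can then serve as an edge between two words, weighted by its count, when drawing a word network.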
