This repository has been archived by the owner on Dec 6, 2022. It is now read-only.

biodb package

An R package for connecting to chemical and biological databases.

Introduction

biodb is a framework for developing database connectors. It is delivered with some non-remote connectors (for CSV files or SQLite databases), but the main interest of the package is to ease the development of your own connectors. Some connectors are already available in other packages (e.g.: biodbChebi, biodbHmdb, biodbKegg, biodbLipidmaps, biodbUniprot) on GitHub. For now, the targeted databases are the ones that store molecules, proteins, lipids and MS spectra. However, other types of databases (NMR databases for instance) could also be targeted.

With biodb you can:

  • Define your own database connector.
  • Access entries by accession number and let biodb download them for you.
  • Take advantage of the cache system, which saves the results of all sent requests for you. If you send the same request again, the cached result will be used instead of contacting the database. The cache system can be disabled.
  • Download a downloadable database locally and access its entries by accession number without contacting the remote server.
  • Rely on biodb to access the database correctly, respecting the published access policy (i.e.: not sending too many requests). biodb uses a special class for scheduling requests on each database.
  • Switch from one database to another easily (provided they offer the same type of information), without changing a line in your code. This is because entries are populated with values found in the database, always using the same keys.
  • Search for MS and MSMS spectra by peaks in mass spectra databases.
  • Export any database into a CSV file or record it into an SQLite file.
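
To make the caching behaviour listed above more concrete, here is a minimal sketch in base R. It is not biodb's implementation: `makeCachedFetcher()`, `fakeFetch()` and the request string are hypothetical names chosen for illustration. It only shows the idea of storing the result of each request and reusing it when the same request is sent again, with a switch to disable the cache.

```r
# Conceptual sketch only: NOT biodb's actual cache code.
# makeCachedFetcher() wraps any fetch function with an in-memory cache.
makeCachedFetcher <- function(fetch, enabled = TRUE) {
  cache <- new.env(parent = emptyenv())  # request -> stored result
  function(request) {
    if (enabled && exists(request, envir = cache, inherits = FALSE)) {
      return(get(request, envir = cache))  # cache hit: no remote call
    }
    result <- fetch(request)               # cache miss: contact the database
    if (enabled) assign(request, result, envir = cache)
    result
  }
}

# Hypothetical stand-in for a remote call; it counts how often it is used.
nCalls <- 0
fakeFetch <- function(request) {
  nCalls <<- nCalls + 1
  paste("content for", request)
}

cachedFetch <- makeCachedFetcher(fakeFetch)
first  <- cachedFetch("entry/1234")
second <- cachedFetch("entry/1234")  # served from the cache
```

After these two calls `fakeFetch()` has been run only once. Creating the fetcher with `enabled = FALSE` would send every request to `fakeFetch()`.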

Installation

Install the latest stable version using Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install('biodb')

Installing from GitHub

You can install the latest development version of biodb from GitHub:

install.packages('devtools')
devtools::install_github('pkrog/biodb', dependencies=TRUE)

Installing extension packages

Alongside biodb you can install R extension packages that use biodb to implement connectors to online databases, such as biodbChebi, biodbHmdb, biodbKegg, biodbLipidmaps and biodbUniprot.

Installation of one of those extension packages can be done with the following command (replace 'biodbKegg' with the name of the wanted package):

devtools::install_github('pkrog/biodbKegg', dependencies=TRUE)

Installation with Bioconda

biodb is part of Bioconda, so you can install it using Conda. This also means that it can be installed automatically in Galaxy, for a tool, if the Conda system is enabled.

Databases and fields accessible with biodb

The biodb package contains the following in-house database connectors:

  • Compound CSV File (an in-house database stored inside a CSV file).
  • Mass CSV File (an in-house database stored inside a CSV file).
  • Mass SQLite (an in-house database stored inside an SQLite file).

Here are some of the fields accessible through the retrieved entries (more fields are defined in extension packages):

  • Chemical formula.
  • InChI.
  • InChI Key.
  • SMILES.
  • Common names and IUPAC names.
  • Charge.
  • Average mass.
  • Monoisotopic mass.
  • Molecular mass.
  • MS device.
  • MS Level.
  • MS mode.
  • MS precursor M/Z.
  • MS precursor annotation.
  • Peaks' M/Z values.
  • Peaks' intensities.
  • Peaks' relative intensities.
  • Attributions of peaks.
  • Compositions of peaks.
  • Peak table.
  • Chromatographic column name.
  • Chromatographic column length.
  • Chromatographic column diameter.
  • Chromatographic solvent.
  • Chromatographic retention time.
  • Chromatographic retention time unit.

Examples

Getting entries from a remote database

Here is an example of how to retrieve entries from the ChEBI database and convert them into a data frame (you must first install both the biodb and biodbChebi packages):

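# Create a biodb instance.
bdb <- biodb::newInst()
# Get a connector to ChEBI.
chebi <- bdb$getFactory()$createConn('chebi')
# Retrieve three entries by accession number.
entries <- chebi$getEntry(c('2528', '7799', '15440'))
# Convert the entries into a data frame.
bdb$entriesToDataframe(entries)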
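
The idea of turning a list of entries into a data frame can be sketched in base R without any database access. This is only an illustration under stated assumptions: entries are modelled as plain named lists, `entriesToDf()` is a hypothetical helper, the accessions and values are invented, and biodb's own `entriesToDataframe()` may handle missing or multi-valued fields differently.

```r
# Conceptual sketch only: entries are modelled as plain named lists.
# The accessions and field values below are invented for illustration.
entries <- list(
  list(accession = "A1", name = "compound one", formula = "C2H6O"),
  list(accession = "A2", name = "compound two"),          # no formula
  list(accession = "A3", formula = "H2O", charge = 0L)    # no name
)

# Build one row per entry, using the union of all field names as columns.
# Fields that an entry does not define are filled with NA.
entriesToDf <- function(entries) {
  fields <- unique(unlist(lapply(entries, names)))
  rows <- lapply(entries, function(e) {
    values <- lapply(fields, function(f) if (is.null(e[[f]])) NA else e[[f]])
    names(values) <- fields
    as.data.frame(values, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

df <- entriesToDf(entries)
print(df)
```

Because every entry uses the same field names as keys, the columns line up regardless of which entry a value comes from.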

Searching for a compound

All compound databases (ChEBI, Compound CSV File, KEGG Compound, ...) can be searched for compounds using the same function. Once you have your connector instance, you just have to call searchCompound() on it:

myconn$searchCompound(name='phosphate')

The function will return a character vector containing all identifiers of matching entries.

It is also possible to search by mass, choosing the mass field you want (if this particular mass field is handled by the database):

myconn$searchCompound(mass=230.02, mass.field='monoisotopic.mass', mass.tol=0.01)

Searching by both name and mass is also possible.

myconn$searchCompound(name='phosphate', mass=230.02, mass.field='monoisotopic.mass', mass.tol=0.01)
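
What a combined name and mass search selects can be illustrated with a small base R sketch. It is not biodb's search code: `searchSketch()` and the compound table are hypothetical, the masses are invented numbers, and two assumptions are made here: the name is matched as a substring, and the tolerance is an absolute ("plain") window around the requested mass.

```r
# Conceptual sketch only: NOT biodb's search code.
# The compound table is invented; masses are illustrative numbers.
compounds <- data.frame(
  accession         = c("C1", "C2", "C3", "C4"),
  name              = c("foo phosphate", "bar phosphate",
                        "baz sulfate", "qux phosphate"),
  monoisotopic.mass = c(230.019, 230.045, 230.021, 229.980),
  stringsAsFactors  = FALSE
)

# Return accessions whose name contains `name` (if given) and whose mass
# lies within mass +/- mass.tol (if given), assuming a plain tolerance.
searchSketch <- function(db, name = NULL, mass = NULL, mass.tol = 0,
                         mass.field = "monoisotopic.mass") {
  keep <- rep(TRUE, nrow(db))
  if (!is.null(name))
    keep <- keep & grepl(name, db$name, fixed = TRUE)
  if (!is.null(mass))
    keep <- keep & abs(db[[mass.field]] - mass) <= mass.tol
  db$accession[keep]
}

byMass <- searchSketch(compounds, mass = 230.02, mass.tol = 0.01)
both   <- searchSketch(compounds, name = "phosphate",
                       mass = 230.02, mass.tol = 0.01)
```

With this invented table, the mass-only search keeps C1 and C3, and adding the name criterion narrows the result to C1.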

Searching for a mass spectrum

All mass spectra databases (Mass CSV File and Mass SQLite) can be searched for mass spectra using the same function searchMsEntries():

myconn$searchMsEntries(mz.min=40, mz.max=41)

The function will return a character vector containing all identifiers of matching entries (i.e.: spectra containing at least one peak inside this M/Z range).
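
The selection rule described above (at least one peak inside the M/Z range) can be sketched in base R. This is an illustration only: spectra are modelled as named numeric vectors of invented peak values, `searchMzRange()` is a hypothetical helper, and treating both bounds as inclusive is an assumption.

```r
# Conceptual sketch only: NOT biodb's search code.
# Each spectrum is a numeric vector of peak M/Z values (invented data).
spectra <- list(
  S1 = c(40.5, 120.2, 300.1),
  S2 = c(39.9, 41.2),
  S3 = c(40.0, 41.0),
  S4 = c(88.8)
)

# Keep spectra having at least one peak with mz.min <= M/Z <= mz.max.
searchMzRange <- function(spectra, mz.min, mz.max) {
  hit <- vapply(spectra,
                function(mz) any(mz >= mz.min & mz <= mz.max),
                logical(1))
  names(spectra)[hit]
}

ids <- searchMzRange(spectra, mz.min = 40, mz.max = 41)
```

Here S2 is not returned: its peaks lie on either side of the range, but none falls inside it.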

Annotating a mass spectrum

Annotating a mass spectrum can be done either using a mass spectra database or a compound database.

When using a mass spectra database, the function to call is searchMsPeaks():

myMassConn$searchMsPeaks(myInputDataFrame, mz.tol=0.1, mz.tol.unit='plain', ms.mode='pos')

It returns a new data frame containing the annotations.

When using a compound database, the function to call is annotateMzValues():

myCompoundConn$annotateMzValues(myInputDataFrame, mz.tol=0.1, mz.tol.unit='plain', ms.mode='neg')

It returns a new data frame containing the annotations.
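
The matching step behind such an annotation can be sketched in base R. This is not biodb's algorithm: `tolWindow()`, `annotateSketch()`, the column names and the reference peaks are all hypothetical. The sketch assumes that a 'plain' tolerance is an absolute M/Z difference, and it shows a 'ppm' (parts per million) unit as an assumed alternative that scales with the M/Z value.

```r
# Conceptual sketch only: NOT biodb's annotation code.
# Reference peaks and accessions are invented for illustration.
reference <- data.frame(
  accession = c("R1", "R2", "R3"),
  peak.mz   = c(100.00, 100.08, 250.50),
  stringsAsFactors = FALSE
)
input <- data.frame(mz = c(100.05, 250.45, 400.00))

# Convert a tolerance into an absolute M/Z window around `mz`.
# 'plain' is used as is; 'ppm' is relative to the M/Z value.
tolWindow <- function(mz, mz.tol, mz.tol.unit = c("plain", "ppm")) {
  mz.tol.unit <- match.arg(mz.tol.unit)
  if (mz.tol.unit == "ppm") mz * mz.tol * 1e-6 else mz.tol
}

# For each input M/Z, list every reference peak inside the window.
# Inputs without any match are kept, with NA annotations.
annotateSketch <- function(input, reference, mz.tol, mz.tol.unit = "plain") {
  rows <- lapply(input$mz, function(mz) {
    hit <- abs(reference$peak.mz - mz) <= tolWindow(mz, mz.tol, mz.tol.unit)
    if (!any(hit))
      return(data.frame(mz = mz, accession = NA_character_,
                        peak.mz = NA_real_, stringsAsFactors = FALSE))
    data.frame(mz = mz, reference[hit, , drop = FALSE],
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

annotated <- annotateSketch(input, reference, mz.tol = 0.1,
                            mz.tol.unit = "plain")
print(annotated)
```

An input value may match several reference peaks, in which case it appears on several rows of the result.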

Defining a new field

Defining a new field for a database is done in two steps, using definitions written inside a YAML file.

First we define the new field. Here we define the ChEBI database field for stars indicator (quality curation indicator):

fields:
  n_stars:
    description: The ChEBI example stars indicator.
    class: integer

Then we define the parsing expression to use in ChEBI connector in order to parse the field's value:

databases:
  chebi:
    parsing.expr:
      n_stars: //chebi:return/chebi:entityStar

We now just have to load the YAML definition file into biodb (in extension packages, this is done automatically):

mybiodb$loadDefinitions('my_definitions.yml')
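
As a hedged sketch of the whole procedure, the R code below writes the two definition blocks shown above into a single YAML file. Putting both blocks in the same file, and the temporary file location, are assumptions made for this example. The biodb calls are left commented out because they need the biodb and biodbChebi packages and network access.

```r
# Write the field definition and the parsing expression into one YAML file.
defFile <- file.path(tempdir(), "my_definitions.yml")
writeLines(c(
  "fields:",
  "  n_stars:",
  "    description: The ChEBI example stars indicator.",
  "    class: integer",
  "",
  "databases:",
  "  chebi:",
  "    parsing.expr:",
  "      n_stars: //chebi:return/chebi:entityStar"
), defFile)

# Check what was written before loading it.
cat(readLines(defFile), sep = "\n")

# Loading the file and reading the new field (not run here):
# mybiodb <- biodb::newInst()
# mybiodb$loadDefinitions(defFile)
# chebi <- mybiodb$getFactory()$createConn('chebi')
# entry <- chebi$getEntry('2528')
# entry$getFieldValue('n_stars')
# mybiodb$terminate()
```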

Parsing may be more complex for some fields or databases. In that case it is possible to write specific code in the database entry class for parsing these fields.

Defining a new connector

Defining a new connector is done by writing two RC classes and a YAML definition:

  • An RC class for the connector, named MyDatabaseConn.R.
  • An RC class for the entry, named MyDatabaseEntry.R.
  • A definition YAML file containing metadata about the new connector, like:
      • The URLs (main URL, web service base URL, etc.) for a remote database.
      • The timing for querying a remote database (maximum number of requests per second).
      • The name.
      • The parsing expressions used for parsing the entry fields.
      • The type of content retrieved from the database when downloading an entry (plain text, XML, HTML, JSON, ...).
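
The request timing mentioned in the list above (a maximum number of requests per period) can be illustrated with a small base R throttle. This is a conceptual sketch, not biodb's scheduler class: `makeThrottle()` and its `n`, `t`, `now` and `wait` parameters are hypothetical. The clock and the sleep function are injectable so that the example runs instantly with a fake clock.

```r
# Conceptual sketch only: NOT biodb's scheduler class.
# makeThrottle() returns a function to call before each request; it waits
# whenever `n` requests have already been sent in the last `t` seconds.
makeThrottle <- function(n = 3, t = 1, now = Sys.time, wait = Sys.sleep) {
  sent <- numeric(0)  # times of recent requests, in seconds
  function() {
    repeat {
      current <- as.numeric(now())
      sent <<- sent[current - sent < t]   # forget requests older than t
      if (length(sent) < n) break
      wait(t - (current - min(sent)))     # wait until the oldest one expires
    }
    sent <<- c(sent, current)
    invisible(current)
  }
}

# Usage with a fake clock: waiting just advances the clock.
clock <- 0
waited <- numeric(0)
throttle <- makeThrottle(
  n = 2, t = 1,
  now  = function() clock,
  wait = function(s) { waited <<- c(waited, s); clock <<- clock + s }
)
throttle()  # request 1 at time 0
throttle()  # request 2 at time 0
throttle()  # request 3 has to wait 1 second
```

With the default `now` and `wait` arguments the same function would sleep for real between requests.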

For a good starting example of defining a new remote connector, see biodbChebi, the ChEBI extension for biodb, at https://github.com/pkrog/biodbChebi.

Using the extension generator

biodb provides a set of classes and methods to generate the skeleton of a new repository for a new connector. The easiest way to use this feature is through the function biodb::genNewExtPkg(). Here is a commented example that creates a new repository for a new connector to the Foo remote database:

biodb::genNewExtPkg(
  path      = 'the/path/to/biodbFoo', # The repository folder.
# pkgName   = 'myName',       # By default the last folder of `path` is used
                              # so you do not need to modify it.
  email     = 'your@e.mail',  # The author's email.
  dbName    = 'foo.db',       # The connector name that will be used by biodb.
  dbTitle   = 'Foo database', # A short description of the connector's database.
# pkgLicense = '...',         # The generated license is always AGPL-3.
  firstname = 'Your firstname',
  lastname  = 'Your lastname',
  connType  = 'compound',     # Use 'mass' for an MS database or 'plain' for any
                              # other type. Run `biodb::getConnTypes()` to get a
                              # full list of all available types.
  entryType = 'txt',          # Other possible types are: 'plain', 'csv',
                              # 'html', 'json', 'list', 'sdf' and 'xml'.
                              # Run `biodb::getEntryTypes()` to get a full list
                              # of all available types.
  editable  = FALSE,          # If the database is editable in memory.
  writable  = FALSE,          # If the database is writable on disk (like a CSV
                              # file).
  remote    = TRUE,           # If the database is accessed through web protocol
                              # like HTTPS, as opposed to a local database stored
                              # inside an SQLite file or a CSV file.
  downloadable = FALSE,       # Set it to TRUE for a remote database that allows
                              # the download of its full content (e.g.: through
                              # the download of a zip file).
  makefile     = TRUE,        # Generate a Makefile, useful for maintenance
                              # on UNIX/Linux systems.
  rcpp         = FALSE,       # If set to TRUE, the package will be configured
                              # to use Rcpp and skeleton files will be generated
                              # with examples and test examples.
# vignetteName = '...',       # By default the vignette name will be the package
                              # name.
  githubRepos  = 'id/repos'   # The repository URL on GitHub (e.g.:
                              # 'pkrog/biodbChebi'). 
)

Documentation

Once in R, you can get an introduction to the package with:

?biodb

Then each class has its own documentation. For instance, to get help about the BiodbFactory class:

?biodb::BiodbFactory

Several vignettes are also available. To get a list of them run:

vignette(package='biodb')

To open a vignette in a browser, use its name:

vignette('new_connector', package='biodb')

Contributing

If you wish to contribute to the biodb package, you first need to create an account on GitHub. You can then either ask to become a contributor or fork the project and submit a pull request.

Debugging, enhancement or creation of a database connector or an entry parser are of course most welcome.