
IMPC annotation pipeline

Hamed Ha edited this page Mar 22, 2021 · 14 revisions

The annotation pipeline of the International Mouse Phenotyping Consortium (IMPC) is a data assignment project that associates phenotypic observations with genetic modifications. This document describes the steps taken to select the best Mammalian Phenotype (MP) term for a genetic modification in mice when a statistically significant change from the baseline (typically at a significance level of 0.0001) is observed.

Annotation pipeline and the analysis framework

The IMPC annotation pipeline (IMPC-AP) assigns MP terms to statistically significant genetic effects. At the IMPC, the genetic effect is determined by one of three statistical analysis frameworks that are implemented in the IMPC statistical pipeline through the OpenStats software. Below, the annotation pipeline is broken down by the type of input data and the analysis framework.

Annotation indexer

The annotation pipeline in the IMPC requires a reference (lookup) table that summarizes the available MP terms for each IMPC parameter. This table can be retrieved from IMPReSS; however, to remove the dependency on live transactions, the IMPC-AP uses an offline version of the file, called the Annotation Indexer in this document. An instance of this file is available from this link (note that the file is in the Rdata format, which requires R to open).

Continuous data – Linear mixed model

Continuous data in the IMPC, such as tail length or tibia length, are analysed with a linear mixed model implemented in the OpenStats software package. Continuous measurements are more informative than other data types, such as categorical data, in that the direction of change can be determined from the effect size (increasing, decreasing or steady effect). The steps to assign MP terms to continuous measurements are summarised below.

From the statistical results
1*. Overall effect (both sexes)
   - if pvalue ≥ threshold → no MP term
   - if pvalue < threshold (II)
      1. if effect size > 0 → Increase term
      2. if effect size < 0 → Decrease term
      3. if effect size = 0 → Steady term
- the same steps as in 1* apply to the male effect (III)
- the same steps as in 1* apply to the female effect (IV)

From the Annotation Indexer
Filter for
   1. Pipeline_stable_id
   2. Procedure_group
   3. Parameter_stable_id
Get available MP terms (I)

Find matches between I and II, III, IV

Note
- If conflicting Increase and Decrease effects are detected, assign the Abnormal MP term.
- The generally accepted threshold in the IMPC consortium is 0.0001.
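
The decision rules for the linear mixed model results can be sketched in R as follows. This is only an illustration of the logic listed above, not the code used in the IMPC-AP: the function names direction_label and resolve_direction, the plain-text labels and the example numbers are all assumptions made for this sketch.

```r
# Hypothetical helper (not part of OpenStats or DRrequiredAgeing):
# map one p-value / effect-size pair to a direction label.
direction_label <- function(pvalue, effect_size, threshold = 1e-4) {
  if (is.na(pvalue) || pvalue >= threshold) {
    return(NA_character_)              # not significant: no MP term
  }
  if (effect_size > 0) {
    "Increase"
  } else if (effect_size < 0) {
    "Decrease"
  } else {
    "Steady"
  }
}

# Hypothetical helper: combine the labels found for the overall, male and
# female effects; conflicting directions fall back to "Abnormal".
resolve_direction <- function(labels) {
  labels <- unique(labels[!is.na(labels)])
  if (all(c("Increase", "Decrease") %in% labels)) {
    return("Abnormal")
  }
  labels
}

# Made-up statistical results for one parameter.
effects <- list(
  overall = c(p = 1e-6, es = 0.8),
  male    = c(p = 1e-5, es = 1.1),
  female  = c(p = 0.2,  es = -0.1)
)
labels <- vapply(
  effects,
  function(x) direction_label(x[["p"]], x[["es"]]),
  character(1)
)
resolve_direction(labels)
```

The resolved label would then be matched against the MP terms available for the parameter in the Annotation Indexer.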

Continuous data – Reference Range plus

Due to the complexity of the data in the IMPC, not all continuous data can be analysed with a linear mixed model. Many cases are instead analysed with the Reference Range plus (RR+) method implemented in the OpenStats software package. RR+ is a heuristic method that discretizes the baselines into reference categories, namely low/normal/high. The mutants are then assigned a class based on these reference categories. Finally, Fisher's Exact test is applied to detect any significant deviation from the normal category. The MP term assignment algorithm for the RR+ results is explained below.

From the statistical results
1*. Overall effect (do not consider sex)
   - if pvalue.low ≥ threshold & pvalue.high ≥ threshold → assign no MP term
   - for each pvalue(.low/.high) < threshold → assign the temporary labels Abnormal, Increase and Decrease to the search criteria
   - remove any Low.Increase and High.Decrease from the labels (II)
- the same steps as in 1* apply to the male effect (III)
- the same steps as in 1* apply to the female effect (IV)

From the Annotation Indexer
Filter for
   1. Pipeline_stable_id
   2. Procedure_group
   3. Parameter_stable_id
Get available MP terms (I)

Find matches between I and II, III, IV regardless of the prefix Low. and High.

Note
- If conflicting Low. and High. MP terms are detected, select the Abnormal term.
- The generally accepted threshold in the IMPC consortium is 0.0001.
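
A minimal R sketch of the RR+ labelling rules above is given here. It is an illustration under stated assumptions rather than the pipeline's own code: the helper names rr_labels and resolve_rr and the p-values passed to them are invented for this example.

```r
# Hypothetical helper: build the temporary search labels for one effect
# from the two RR+ p-values (low tail and high tail).
rr_labels <- function(pvalue_low, pvalue_high, threshold = 1e-4) {
  labels <- character(0)
  for (side in c("Low", "High")) {
    p <- if (side == "Low") pvalue_low else pvalue_high
    if (!is.na(p) && p < threshold) {
      labels <- c(
        labels,
        paste(side, c("Abnormal", "Increase", "Decrease"), sep = ".")
      )
    }
  }
  # A significant low tail cannot be an increase, and a significant
  # high tail cannot be a decrease.
  setdiff(labels, c("Low.Increase", "High.Decrease"))
}

# Hypothetical helper: drop the Low./High. prefix for matching; if both
# tails are significant, the directions conflict and only "Abnormal" remains.
resolve_rr <- function(labels) {
  if (length(labels) == 0) {
    return(character(0))
  }
  if (any(startsWith(labels, "Low.")) && any(startsWith(labels, "High."))) {
    return("Abnormal")
  }
  unique(sub("^(Low|High)\\.", "", labels))
}

resolve_rr(rr_labels(1e-6, 0.5))   # only the low tail is significant
resolve_rr(rr_labels(1e-6, 1e-7))  # both tails are significant
resolve_rr(rr_labels(0.5, 0.5))    # nothing is significant
```

The remaining labels are the search criteria that would be matched against the terms available in the Annotation Indexer.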

Categorical data - Fisher's Exact Test

Categorical data in the IMPC encompass a range of qualitative measurements, such as abnormalities in the eye, ear or tail, and are analysed using Fisher's Exact test implemented in the R package OpenStats. The output MP term for this type of data is a single Abnormal phenotype term, assigned if the statistical test is significant. The algorithm to assign MP term annotations to this type of result is explained below.

From the statistical results
1*. Overall effect (do not consider sex)
   - if pvalue ≥ threshold → assign no MP term
   - if pvalue < threshold → search for an MP term (II)
- the same steps as in 1* apply to the male effect (III)
- the same steps as in 1* apply to the female effect (IV)

From the Annotation Indexer
Filter for
   1. Pipeline_stable_id
   2. Procedure_group
   3. Parameter_stable_id
Get available MP terms (I)

Find matches between I and II, III, IV

Note
- The generally accepted threshold in the IMPC consortium is 0.0001.
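
The lookup against the Annotation Indexer can be pictured with a small R sketch. The real Annotation Indexer is an Rdata file whose structure is not described in this document, so the data frame below, its column layout, every identifier in it and the function categorical_term are placeholders invented for illustration.

```r
# Toy stand-in for the Annotation Indexer; all identifiers are made up.
indexer <- data.frame(
  pipeline_stable_id  = c("PIPE_001", "PIPE_001", "PIPE_002"),
  procedure_group     = c("PROC_EYE", "PROC_EYE", "PROC_EYE"),
  parameter_stable_id = c("PARAM_EYE_001", "PARAM_EYE_002", "PARAM_EYE_001"),
  mp_term             = c("MP:TOY0001", "MP:TOY0002", "MP:TOY0003"),
  stringsAsFactors    = FALSE
)

# Hypothetical helper: return the Abnormal term(s) for a categorical result,
# or an empty vector when the test is not significant.
categorical_term <- function(pvalue, pipeline, procedure, parameter,
                             lookup, threshold = 1e-4) {
  if (is.na(pvalue) || pvalue >= threshold) {
    return(character(0))               # not significant: no MP term
  }
  hit <- lookup$pipeline_stable_id == pipeline &
    lookup$procedure_group == procedure &
    lookup$parameter_stable_id == parameter
  lookup$mp_term[hit]
}

categorical_term(1e-6, "PIPE_001", "PROC_EYE", "PARAM_EYE_001", indexer)
categorical_term(0.01, "PIPE_001", "PROC_EYE", "PARAM_EYE_001", indexer)
```

The same three-key filter applies to the continuous frameworks, where the matching additionally depends on the direction label.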

Schematic view of the IMPC-AP

How to run the IMPC annotation pipeline

To execute the IMPC-AP, follow the steps in this YouTube video.

How to make changes to the pipeline or update bits

Any change to the IMPC annotation pipeline requires a change to the DRrequiredAgeing package (link), in particular to the annotation directory (or, if you follow the video, this directory). For example, to use an updated version of the MP (Mammalian Phenotype) reference (known as mp_chooser_XXXX.json.Rdata in the annotation directory), you must create a data structure similar to that of the original Rdata file. In the simplest case, if the raw reference file is in JSON, the following steps create the Rdata file:

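  library(jsonlite)
  # Read the raw JSON reference into an R object
  a <- jsonlite::fromJSON(txt = 'Your JSON FILE.json')
  # Store the object in the Rdata format expected by the pipeline
  save(a, file = "mp_chooser_XXXX.json.Rdata")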

where XXXX is normally the date and timestamp. Next, copy the created Rdata file into this directory (or the corresponding directory in your forked version) and update the reference in this R script (or its forked version) to the new file name (search for mp_chooser to find the reference).
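
Before copying a new reference file into the package, it can be worth confirming that the Rdata file round-trips cleanly. The base R sketch below shows one way to do this; the list assigned to a is a made-up stand-in for the object returned by jsonlite::fromJSON(), and the file name and temporary location are assumptions chosen for the example.

```r
# Stand-in for the object returned by jsonlite::fromJSON(); contents are made up.
a <- list(PARAM_TOY_001 = list(ABNORMAL = "MP:TOY0001"))

# Write to a temporary location so the sketch does not touch a real package.
rdata_file <- file.path(tempdir(), "mp_chooser_TOY.json.Rdata")
save(a, file = rdata_file)

# Load into a fresh environment so the check does not alter the workspace.
check_env <- new.env()
loaded_names <- load(rdata_file, envir = check_env)

print(loaded_names)               # names of the objects stored in the file
identical(check_env$a, a)         # TRUE when the content survived the round trip
```

load() restores objects under the names they were saved with, so the object name in the new file should match the one stored in the original Rdata file.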

Finally, any change to the package requires an update on every client that executes the pipeline. To do this, open an R session and execute the following R code:

  install.packages('devtools')
  devtools::install_github(repo = "the URL to the fork from the DRrequiredAgeing repository",
                           dependencies = FALSE, upgrade = "never", force = TRUE,
                           build = TRUE, quiet = FALSE)

How to easily create the index file?

The IMPC-AP requires an index file to go through the files and assign the MP term(s) to the results. Creating the index file on Linux requires specially written scripts and can take a significant amount of time. To facilitate this step, provided you use LSF to manage your cluster computing, you can follow the steps below:

  1. Navigate to the path where the StatPackets are stored (a StatPacket is the output of the IMPC statistical pipeline)
  2. Open an R session and execute the following code: DRrequiredAgeing:::minijobsCreator(path=getwd(),depth = 3,fname = 'the output file with LSF job defined')

You can modify the code by changing the "depth" parameter, which specifies the depth of the initial jobs (the deeper, the slower the LSF job creation; the recommended value is 3). Once the output file is created, submit it to the LSF cluster ('./NAME OF THE FILE'). This creates several .Ind files that can be merged into a single index by running the command cat *.Ind >> YOUR INDEX FILENAME.
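
If you prefer to stay inside the R session, the merge step can also be sketched in base R. This is only an illustrative alternative to the cat command; the function merge_index_files, the demo directory and the file contents are all invented for the example. Note that, unlike >>, this sketch overwrites the output file instead of appending to it.

```r
# Hypothetical helper: concatenate every .Ind file in `dir` into `output`.
merge_index_files <- function(dir, output) {
  ind_files <- list.files(dir, pattern = "\\.Ind$", full.names = TRUE)
  out <- file(output, open = "w")
  on.exit(close(out))
  for (f in ind_files) {
    writeLines(readLines(f, warn = FALSE), con = out)
  }
  invisible(length(ind_files))
}

# Demo with two made-up index fragments in a temporary directory.
demo_dir <- file.path(tempdir(), "ind_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("packet_1", "packet_2"), file.path(demo_dir, "job_a.Ind"))
writeLines("packet_3", file.path(demo_dir, "job_b.Ind"))

index_file <- file.path(tempdir(), "merged_index.txt")
n_merged <- merge_index_files(demo_dir, index_file)
readLines(index_file)
```

The output file is written outside the directory that holds the .Ind files, so a second run does not pick up the merged index as an input.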

Everything in a single line

All the steps above are aggregated into a single R function in the DRrequiredAgeing package (link). To use this functionality, you need an instance of the IMPC statistical results produced by the IMPC-SP (link). If you have already executed the IMPC-SP on an LSF cluster computing infrastructure, follow the steps below to store the StatPackets and MP terms in a predefined database.

  • Prepare the environment by making sure that you have a PostgreSQL database that is already set up with a minimum of 1500 connections allowed. Take note of the server IP, port number, username, password and db name, and choose a name for the table that will be created for the results.
  • Open your favourite Unix terminal and navigate to the directory where the IMPC-SR are stored. By default, this directory is called IMPC_SR and it should be located somewhere like SP/jobs/IMPC-SR.
  • Make sure no jobs are currently running on the LSF cluster.
  • Open an R session by typing R in your terminal.
  • Run the following command after setting the proper values for the parameters: DRrequiredAgeing:::IMPC_annotationPostProcess(host = hostIp, dbname = database_name, port = database_port, user = user_name, password = database_password, outputdb = table_name_that_will_be_created, level = .0001)

Notes

  • You can set the significance level for the p-values to be processed for annotation through the level argument of the function. The default value is 0.0001.

How to debug?

  • The IMPC_annotationPostProcess function is a wrapper around Linux command-line operations, so a basic knowledge of the Unix command line helps to debug simple errors. If you encounter a more serious problem on the R side of the job, please feel free to contact us for help.

What is the output of the one line annotation production pipeline

The output of the function is stored in the database. The columns of the db table hold the experiment details, and the last column contains the StatPackets, which carry the annotation values in the MPterm and WMPterm keys. If you would like to get the MP terms not stored in the db, please refer to this link.

How to ask for help?

We are always available to help. You can reach us through the IMPC website or by email at hamedhm@ebi.ac.uk.
