
IMPC spreadsheet (reports) production pipeline

Hamed Ha edited this page Jul 26, 2022 · 1 revision

The IMPC spreadsheets are summaries of the most commonly used statistics for research projects. They include p-values, effect sizes, windowing results, MP terms, adjusted p-values (q-values) and related statistics.

Creating the spreadsheets requires an instance of the IMPC statistical pipeline, explained here. This wiki describes the steps required to create the IMPC statistical results spreadsheets.

Preparation

The IMPC Spreadsheet Production Pipeline (IMPC-SPP) requires a full run of the IMPC statistical pipeline, explained here. The whole process is managed in R by the DRrequiredAgeing package, available here. Although running the IMPC statistical pipeline already prepares the environment for post-processing pipelines such as the IMPC-SPP, the preparation step is repeated below:

  • Download and run the preparation script from here, or this script if you have already prepared the environment. To run either script, start R and type source("FULL_PATH_TO_THE_SCRIPT").

Execution of the IMPC-SPP

Running the IMPC-SPP requires a Linux/UNIX machine with LSF installed. The whole process is managed by the function DRrequiredAgeing:::IMPC_statspipelinePostProcess in the DRrequiredAgeing package. Note that this is not an official package; it is designed only for the IMPC statistical pipeline. To start the IMPC-SPP, open a terminal, navigate to the IMPC statistical pipeline directory, then start R and type DRrequiredAgeing:::IMPC_statspipelinePostProcess().

The function DRrequiredAgeing:::IMPC_statspipelinePostProcess accepts three parameters: IMPC_statspipelinePostProcess(SP.results = getwd(), waitUntillSee = 'No unfinished job found', ignoreThisLineInWaitingCheck = 0). The first parameter is the full path to the results directory of the IMPC SP; the second and third parameters exist for compatibility with the LSF cluster management interface. We recommend running the function with its default settings.
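
As a rough illustration of the call described above, the sketch below builds the R expression for a given results directory and runs it through Rscript. The helper names build_spp_call and run_spp are hypothetical and not part of DRrequiredAgeing. The page describes an interactive R session, so treating the function as usable from a non-interactive Rscript call is an assumption, as is Rscript being on PATH with the package installed.

```shell
#!/bin/sh
# Build the R expression that starts the IMPC-SPP for a given results directory.
# Only SP.results is set; the other two parameters keep their defaults,
# as the page recommends. (Hypothetical helper, not part of the package.)
build_spp_call() {
  printf 'DRrequiredAgeing:::IMPC_statspipelinePostProcess(SP.results = "%s")' "$1"
}

# Run the expression in a non-interactive R session.
# Assumption: Rscript is on PATH and DRrequiredAgeing is installed.
run_spp() {
  Rscript -e "$(build_spp_call "$1")"
}

# Show the call that would be made for the current directory.
build_spp_call "$(pwd)"
echo
```

To launch the pipeline, run_spp "$(pwd)" would be called from the IMPC statistical pipeline directory. The path is pasted into an R string literal, so this sketch only suits paths without double quotes or backslashes.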

What does the function do?

The function performs the operations listed below:

  1. Compresses the log files from the IMPC SP.
  2. Creates an index directory named SingleIndeces containing an index file of the results in the IMPC SP.
  3. Creates a directory named ExtractPvals that serves as the working directory for all operations. Every operation is followed by clean-up/compression, so there is no need to remove leftovers after the IMPC-SPP finishes.
  4. Writes two spreadsheets, for the categorical and continuous results, to the SP.results directory.
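
One way to follow the progress of a run is to check which of the directories named in the list exist. The sketch below is only an illustration: report_spp_dirs is a hypothetical helper, and it assumes the directories are created under the directory passed to it, which this page does not state. Because of the clean-up step, ExtractPvals may legitimately be absent after a completed run.

```shell
#!/bin/sh
# Report which of the directories named in the list exist under a directory.
# Assumption: they are created under the directory passed in; the page does
# not state their exact location.
report_spp_dirs() {
  base=${1:-.}
  for name in SingleIndeces ExtractPvals; do
    if [ -d "$base/$name" ]; then
      echo "$name: present"
    else
      echo "$name: missing"
    fi
  done
}

# Default to the current directory when no argument is given.
report_spp_dirs "${1:-.}"
```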

How to format the spreadsheets?

Formatting and colouring the spreadsheets makes the results easier to read and work with. To format and colour the results, open the example at this url and follow or copy the formatting used in that spreadsheet.

Frequently asked questions

The process never ends. Only the IMPC-SPP should be running on LSF. Make sure no other jobs are running before you start the IMPC-SPP.
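
The default waitUntillSee value, 'No unfinished job found', is the message LSF's bjobs command prints when a user has no pending or running jobs, which suggests the function waits for that message. Under that assumption, a quick check before starting could look like the sketch below. The helper name queue_is_empty is hypothetical.

```shell
#!/bin/sh
# Succeeds when the given text contains the LSF "no jobs" message.
queue_is_empty() {
  printf '%s\n' "$1" | grep -q 'No unfinished job found'
}

if command -v bjobs >/dev/null 2>&1; then
  # The message may be written to stderr, so capture both streams.
  out=$(bjobs 2>&1)
  if queue_is_empty "$out"; then
    echo "No unfinished LSF jobs: safe to start the IMPC-SPP."
  else
    echo "Unfinished LSF jobs found:"
    printf '%s\n' "$out"
  fi
else
  echo "bjobs not found: is LSF available on this machine?"
fi
```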

How long does the process take to complete? On the IMPC LSF cluster, it should take at most 24 hours.

Why do I get errors? Errors are accompanied by a short description. Most operations in the IMPC-SPP are file operations, so a good knowledge of the UNIX/Linux file system should be enough to debug the pipeline.

Where are the final spreadsheets? The IMPC-SPP produces two spreadsheets in the results directory passed to the function (SP.results) or, if none was given, in the R working directory (run getwd() in R to see the full path).

How do I run R in the terminal? Type R in your terminal to start an R session. If the R command fails, you need to install R first by following the official instructions on the CRAN website.
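
A minimal sketch for checking whether R is available before starting a session is shown below. The helper name have_cmd is hypothetical.

```shell
#!/bin/sh
# Succeeds when the named command can be found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd R; then
  # Print the first line of R's version banner.
  R --version | head -n 1
else
  echo "R was not found on PATH; install it from CRAN first."
fi
```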

Need help?

Help is available through the IMPC website or by email at hamedhm@ebi.ac.uk.