IMPC spreadsheet (reports) production pipeline
The IMPC spreadsheets are useful summaries of the most common statistics for research projects. They include p-values, effect sizes, windowing results, MP terms, adjusted p-values (q-values) and so on.
Creating the spreadsheets requires an instance of the IMPC statistical pipeline explained here. This wiki describes the steps required to create the IMPC statistical results spreadsheets.
The IMPC Spreadsheet Production Pipeline (IMPC-SPP) requires a full run of the IMPC statistical pipeline (IMPC-SP) explained here. The whole process is managed in R through the DRrequiredAgeing package, available here. Although running the IMPC statistical pipeline already prepares the environment for postprocessing pipelines such as the IMPC-SPP, we repeat the preparation steps here:
- Download and run the preparation script from here, or this script if you have already prepared the environment. To run either script, start R and type
source("FULL_PATH_TO_THE_SCRIPT")
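As a minimal sketch of this step, the snippet below wraps `source()` in a small helper that stops early when the path is wrong. The helper name `run_preparation_script` is an illustrative assumption and not part of the pipeline; the path remains a placeholder.

```r
# Hypothetical helper (not part of DRrequiredAgeing): source an R script
# only if the file actually exists.
run_preparation_script <- function(script_path) {
  if (!file.exists(script_path)) {
    stop("Preparation script not found: ", script_path)
  }
  # normalizePath() resolves the absolute location for the log message.
  full_path <- normalizePath(script_path)
  message("Sourcing ", full_path)
  source(full_path)
  invisible(full_path)
}

# Usage (replace the placeholder with the location of the downloaded script):
# run_preparation_script("FULL_PATH_TO_THE_SCRIPT")
```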
Running the IMPC-SPP requires a Linux/UNIX machine with LSF installed. The whole process is managed by the function DRrequiredAgeing:::IMPC_statspipelinePostProcess in the DRrequiredAgeing package. Note that this is not an official package; it is designed only for the IMPC statistical pipeline. To start the IMPC-SPP, open a terminal, navigate to the IMPC statistical pipeline directory, then start R and type DRrequiredAgeing:::IMPC_statspipelinePostProcess().
The command DRrequiredAgeing:::IMPC_statspipelinePostProcess accepts three parameters:
IMPC_statspipelinePostProcess(SP.results = getwd(), waitUntillSee = 'No unfinished job found', ignoreThisLineInWaitingCheck = 0)
The first parameter is the FULL path to the results directory of the IMPC-SP; the second and third parameters exist for compatibility with the LSF cluster management interface. We recommend running the command with the default settings.
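The call can be sketched with every argument written out. The argument names and default values come from the signature quoted above; the wrapper `run_impc_spp`, the directory check and the `requireNamespace()` guard are illustrative additions, and the results path is a placeholder.

```r
# Hypothetical wrapper around the post-processing function.
run_impc_spp <- function(sp_results = getwd()) {
  if (!dir.exists(sp_results)) {
    stop("Results directory not found: ", sp_results)
  }
  if (!requireNamespace("DRrequiredAgeing", quietly = TRUE)) {
    stop("DRrequiredAgeing is not installed; run the preparation script first.")
  }
  # The function is reached with ':::' in this wiki, i.e. it is not exported,
  # so it is looked up with getFromNamespace().
  post_process <- utils::getFromNamespace(
    "IMPC_statspipelinePostProcess", "DRrequiredAgeing"
  )
  post_process(
    SP.results                   = normalizePath(sp_results),  # full path
    waitUntillSee                = "No unfinished job found",  # default value
    ignoreThisLineInWaitingCheck = 0                           # default value
  )
}

# Usage from an R session on the LSF machine:
# run_impc_spp("FULL_PATH_TO_THE_RESULTS_DIRECTORY")
```

Calling the wrapper without arguments uses the current working directory, which matches the default `SP.results = getwd()`.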
The function performs the operations listed below:
- Compresses the log files from the IMPC-SP
- Creates an index directory named SingleIndeces containing an index file of the results in the IMPC-SP
- Creates a directory named ExtractPvals that serves as the working directory for all operations. Every operation is followed by clean-up/compression, so there is no need to remove leftovers after the IMPC-SPP has run.
- Writes the output, two spreadsheets for the categorical and continuous results, to the SP.results directory.
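After a run, the results directory can be checked with a few lines of R. This is only a sketch: it assumes that SingleIndeces and ExtractPvals are created directly under the SP.results directory, and because the spreadsheet file names are not stated here, it simply lists every top-level file. The helper name `inspect_spp_output` is hypothetical.

```r
# Hypothetical helper: report which expected directories exist and list the
# files found at the top level of the results directory.
inspect_spp_output <- function(sp_results = getwd()) {
  expected_dirs <- c("SingleIndeces", "ExtractPvals")
  # Assumption: both directories live directly under sp_results.
  present <- dir.exists(file.path(sp_results, expected_dirs))
  names(present) <- expected_dirs
  # Keep only regular files; the spreadsheets should be among them.
  entries <- list.files(sp_results, full.names = TRUE)
  top_files <- entries[!dir.exists(entries)]
  list(directories = present, files = basename(top_files))
}

# Usage:
# inspect_spp_output("FULL_PATH_TO_THE_RESULTS_DIRECTORY")
```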
Formatting and colouring the spreadsheets makes the results easier to read. To format and colour the results, open the example at this url and follow/copy its formatting.
The process never ends. Run only the IMPC-SPP on the LSF cluster and make sure no other jobs are running before you start it.
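A quick pre-flight check can be sketched as follows. It assumes that, as the default value of `waitUntillSee` suggests, the pipeline waits until the LSF `bjobs` command reports "No unfinished job found"; the helper names `lsf_is_idle` and `query_bjobs` are hypothetical.

```r
# Hypothetical helper: TRUE when the bjobs output contains the idle marker.
lsf_is_idle <- function(bjobs_output, marker = "No unfinished job found") {
  any(grepl(marker, bjobs_output, fixed = TRUE))
}

# Hypothetical helper: capture the bjobs output as a character vector.
# stderr is captured too, in case the message is not printed on stdout.
query_bjobs <- function() {
  if (!nzchar(Sys.which("bjobs"))) {
    stop("bjobs not found; is LSF available on this machine?")
  }
  suppressWarnings(system2("bjobs", stdout = TRUE, stderr = TRUE))
}

# Usage on the LSF machine, before starting the IMPC-SPP:
# if (!lsf_is_idle(query_bjobs())) {
#   message("Other jobs are still running; wait before starting the IMPC-SPP.")
# }
```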
How long does the process take to complete? On the IMPC LSF cluster, it should take at most 24 hours.
Why do I get errors? Errors are accompanied by a short description. Most operations in the IMPC-SPP are file operations, so a good knowledge of the UNIX/Linux file system should be enough to debug the pipeline.
Where are the final spreadsheets? The IMPC-SPP writes two spreadsheets to the results directory passed to the function, or to the R working directory if none is given (run getwd() in R to see the full path).
How do I run R in the terminal? Type R in your terminal to start an R session. If the R command fails, you need to install R first; follow the installation instructions on the official CRAN website.
We are always available to help through the IMPC website or by email at hamedhm@ebi.ac.uk