IMPC statistical pipeline documentation

Hamed Ha edited this page Feb 8, 2022 · 49 revisions

Contents

  • IMPC statistical pipeline (IMPC-SP)
    • EBI Codon cluster preparation
  • Preprocessing the raw data
  • Packaging the raw data for parallelisation
  • Executing the IMPC statistical pipeline
  • Preparation before running the stats pipeline
  • Executing the pipeline
  • Postprocessing the IMPC-SP
  • Frequently asked questions
  • One line execution of the IMPC-SP

IMPC statistical pipeline

Working with IMPC data is an exciting experience for any data scientist. However, high-throughput pipelines produce a very large number of data points, which complicates running the statistical analysis pipeline. This manual describes, step by step, how to execute the IMPC statistical pipeline (IMPC-SP). To follow it, the following software must be installed on your machine or server:

  1. Unix/Linux operating system
  2. Unix/Linux terminal
  3. IBM LSF platform https://en.wikipedia.org/wiki/Platform_LSF
  4. R https://cran.r-project.org/

EBI Codon cluster preparation with Conda

This is easier than working with the OS directly. To prepare the environment follow the steps below:

Install Miniconda from https://docs.conda.io/en/latest/miniconda.html

# Create a new conda environment named R
conda create -n R
# Activate it
conda activate R
# Install R dependencies
conda install -c conda-forge r-essentials geos nlopt r-rgeos r-rjags r-terra r-rgdal r-tcltk2 r-rlang r-nloptr

Create a bash script with the content below and source it in your terminal startup. Note that ~/DRs/R/packages/4_0_3 can be any path with R/W permission and enough disk space.

mkdir -p ~/DRs/R/packages/4_0_3
R_LIBS_USER=~/DRs/R/packages/4_0_3
export R_LIBS_USER
conda activate R

EBI Codon cluster preparation without Conda

The Codon cluster at EBI requires special care when it comes to preparing the environment.

  • Please add the PostgreSQL installation path and the R package library path to the .bash_profile:
PATH=/usr/pgsql-11/bin:/usr/pgsql-9.6/include:$HOME/bin:$PATH
R_LIBS_USER=~/DRs/R/packages/4_0_3 # OR any other directory that you intend to store the R packages
export PATH
export R_LIBS_USER
  • Load the dependencies by executing the following commands in your terminal:
module load r-4.0.3-gcc-9.3.0-4l6eluj
module load tcl-8.6.10-gcc-9.3.0-p524wyh
module load tk-8.6.10-gcc-9.3.0-zd6qvt3
module load cairo-1.16.0-gcc-9.3.0-3piygqz
  • The recommended method to run the IMPC statistical pipeline is to open a new screen session (by typing screen in the terminal) and load the modules above. Then start R by typing R in the terminal and run the statistical pipeline. If you do not have enough storage for the dependency R packages, you can execute the script below in R so that R stores the packages in ~/DRs/R/packages/4_0_3 or any other path specified. If you go down this route, do not forget to change R_LIBS_USER in the .bash_profile.
changeRpackageDirectory = function(path = '~/DRs/R/packages') {
  v = paste(
    R.version$major,
    gsub(
      pattern = '.',
      replacement = '_',
      R.version$minor,
      fixed = TRUE
    ),
    sep = '_',
    collapse = '_'
  )
  wdirc = file.path(path, v)
  if (!dir.exists(wdirc))
    dir.create(wdirc,
               showWarnings = FALSE,
               recursive = TRUE)
  .libPaths(new = wdirc)
  message(' => new package path set to: ', wdirc, '. Please add this to the .bash_profile under the R_LIBS_USER environment variable.')
}
changeRpackageDirectory()

Alternatively, you can store the bash script below in a file and source it in your .bash_profile:

#!/bin/bash
module load r-4.0.3-gcc-9.3.0-4l6eluj
module load tcl-8.6.10-gcc-9.3.0-p524wyh
module load tk-8.6.10-gcc-9.3.0-zd6qvt3
module load cairo-1.16.0-gcc-9.3.0-3piygqz

mkdir -p ~/DRs/R/packages/4_0_3 # OR any other directory that you intend to store the R packages
R_LIBS_USER=~/DRs/R/packages/4_0_3
export R_LIBS_USER
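
After either preparation route, it can be worth confirming that R_LIBS_USER really points to a writable directory before installing any packages. The sketch below is one possible check; the function name check_r_libs_user is hypothetical and not part of the pipeline.

```shell
# Hypothetical helper (not part of DRrequiredAgeing): verify that R_LIBS_USER
# is set and points to an existing, writable directory.
check_r_libs_user() {
  if [ -z "${R_LIBS_USER:-}" ]; then
    echo "R_LIBS_USER is not set" >&2
    return 1
  fi
  if [ ! -d "$R_LIBS_USER" ] || [ ! -w "$R_LIBS_USER" ]; then
    echo "R_LIBS_USER is not a writable directory: $R_LIBS_USER" >&2
    return 1
  fi
  echo "R user library: $R_LIBS_USER"
}

# Example use in a new terminal, after sourcing the startup script:
#   check_r_libs_user || echo "Fix the environment before installing packages"
```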

One line execution of the pipeline

If you have already prepared your environment, you may want to skip ahead to "One line execution of the IMPC-SP" for a one-line execution of the pipeline.

Preprocessing the raw data

The input data to the IMPC statistical pipeline (IMPC-SP) are comma-separated values (CSV), tab-separated values (TSV), Rdata (R data frames) or Parquet files. Parquet files must be in the flattened format (no nested structures are allowed). The CSV or TSV files can be on a remote server, but Parquet files must be available locally on disk. The entire IMPC-SP requires 300GB to 1.5TB of disk space, depending on the number of analyses/methods included in the StatPackets, the term we use for the IMPC-SP outputs. This document assumes an LSF cluster as the computing driver. The IMPC-SP can also run on a single-core machine, but that potentially takes a significant amount of time (an estimated 1.5 months).

The diagram below shows the optimal steps to run the data preparation pipeline as fast as possible.

All analysis steps in the IMPC-SP require R (any version) with the packages and dependencies listed in the table below:

R package name                                  R package name
1. DRrequiredAgeing (available from GitHub)     2. OpenStats
3. SmoothWin                                    4. base64enc
5. RJSONIO                                      6. jsonlite
7. DBI                                          8. foreach
9. doParallel                                   10. parallel
11. nlme                                        12. plyr
13. rlist                                       14. pingr
15. robustbase                                  16. abind
17. stringi                                     18. RPostgreSQL
19. data.table                                  20. Tmisc
21. devtools                                    22. miniparquet

Alternatively, you can install all dependencies via the script below. To this end, copy and paste the script into a new R session.

The driver packages, DRrequiredAgeing, OpenStats and SmoothWin, need to be updated every time the IMPC-SP runs. This ensures that the latest versions of the packages are used in the analysis pipeline.

  • One can update the driver packages by running the commands below from the terminal:
  1. R -e "file.copy(file.path(DRrequiredAgeing:::local(), 'StatsPipeline/jobs/UpdatePackagesFromGithub.R') , to = file.path(getwd(), 'UpdatePackagesFromGithub.R'))"
  2. Rscript UpdatePackagesFromGithub.R

Once the packages are updated, the first step is to read the input files. CSV, TSV and Rdata files can be read directly by the pipeline (skip to "Packaging the raw data for parallelisation"). Parquet files require an extra step to be converted into R data frames. To this end, the parquet files need to be available locally on the disk. The whole process is divided into four steps, two for creating and two for executing LSF cluster jobs:

  1. Read the parquet files and create a list of jobs for the LSF cluster to process the data.
  2. Process the data and create scattered Rdata files.
  3. Create a set of jobs to merge the scattered Rdata files into single files per IMPC procedure.
  4. Run the merging step.

The scripts for the 4 steps above are available from the R package DRrequiredAgeing.

Copy the contents of the directory below into a path on your machine/server

  • Path to the scripts: Run the following command from the terminal to see the full path
    • R -e "file.path(DRrequiredAgeing:::local(),'StatsPipeline/0-ETL')"

There are four scripts in the directory that you just copied. Run the commands below to prepare the data frames:

  1. Rscript Step1MakePar2RdataJobs.R "FULL PATH TO THE PARQUET FILES + trailing /"
  2. chmod 775 jobs_step2_Parquet2Rdata.bch
  3. ./jobs_step2_Parquet2Rdata.bch
    • This should take 10+ minutes on the LSF cluster depending on the available resources
    • Do not start the merging step (Step3MergeRdataFilesJobs.R) before this step has finished
    • The output of this step is a directory named ProcedureScatterRdata filled with many small Rdata files
  4. Rscript Step3MergeRdataFilesJobs.R "FULL PATH TO THE ProcedureScatterRdata DIRECTORY + trailing /"
  5. chmod 775 jobs_step4_MergeRdatas.bch
  6. ./jobs_step4_MergeRdatas.bch
    • This should take 1+ hour depending on the available resources on the LSF cluster
    • The output of this step is a directory named Rdata that contains individual Rdata files for each IMPC procedure, see an example here.
  7. Each step above is accompanied by log files. If no error is found in the log files, you can safely remove the ProcedureScatterRdata directory by running
    • rm -rf ProcedureScatterRdata

Packaging the raw data for parallelisation

The previous step produces bulky data files, which would be very inefficient to parallelise on the LSF cluster. In this step, the raw data are broken into small packages that can be processed independently in parallel. This step is fully automated and only requires an initialisation step. The output of this step is a set of LSF job files XXXX.bch (see an example BCM_ACS_Batch.bch) that need to be concatenated into a single file or can be used individually for each IMPC procedure. The script for this step is available from the path returned by the command below:

  • R -e "file.path(DRrequiredAgeing:::local(),'StatsPipeline/jobs')"

The script is named InputDataGenerator.R. You can customize the output XXXX.bch files for the amount of memory allocated to single jobs in LSF by tweaking the memory/CPU/etc. parameters in this script.

To run InputDataGenerator.R, follow the steps below:

  1. Create a list of LSF jobs for each raw data file from the previous section by running the command below in your terminal:
    • R -e "DRrequiredAgeing:::jobCreator('FULL PATH TO THE Rdata DIRECTORY OR RAW DATA FILES')"
    • This command creates a job file DataGenerationJobList.bch [see an example here] and an empty directory DataGeneratingLog [see an example here] that stores the log files.
    • This command is similar to listing the directory with ls and creating an LSF job for each input entry (data file)
  2. Execute the driver script:
    • chmod 775 DataGenerationJobList.bch
    • ./DataGenerationJobList.bch
    • This normally takes 15+ hours depending on the available resources on the LSF cluster.

You can check the log files in the DataGeneratingLog directory for errors. If the logs show no error, the preprocessing step can be considered successful. Run the command below from inside the log directory to list the files that report an error:

  • grep -lR "exit" *

  • As the log directory can get bulky quickly, we suggest compressing the whole directory to save disk space. You can run the command below to compress and remove the log directory (the archive name DataGeneratingLog.zip is only an example):

    • zip -rm DataGeneratingLog.zip DataGeneratingLog/
  • One important note for this step is to adjust the LSF job configuration, such as the memory limit (in the InputDataGenerator.R script). Overestimating the memory required for the LSF jobs prevents the jobs from being halted unexpectedly.
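
The log check and the clean-up can be combined so that the log directory is only compressed when no log mentions "exit". The sketch below is one way to do this. The function name check_and_archive_logs is hypothetical, and tar is used here instead of zip purely as an illustrative choice.

```shell
# Hypothetical helper (not part of the pipeline): archive a log directory
# only if none of its files contains the string "exit".
# Usage: check_and_archive_logs DataGeneratingLog
check_and_archive_logs() {
  local logdir="${1%/}"
  local failed
  if [ ! -d "$logdir" ]; then
    echo "No such directory: $logdir" >&2
    return 2
  fi
  # Same search as the grep command above; "|| true" keeps "no match" from
  # being treated as a failure of this function.
  failed=$(grep -lR "exit" "$logdir" || true)
  if [ -n "$failed" ]; then
    echo "Possible failures found in:" >&2
    echo "$failed" >&2
    return 1
  fi
  # No failures reported: compress the directory, then remove the original.
  tar -czf "$logdir.tar.gz" "$logdir" && rm -rf "$logdir"
}
```

If the function returns a non-zero status, the directory is left untouched so that the listed files can be inspected manually.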

Executing the IMPC statistical pipeline

The output of the previous steps is a set of directories for individual IMPC procedures (see an example here), each containing an XXX.bch file. The next step is to append these XXX.bch files into a single file that we call AllJobs.bch [see an example here]. You can use tools like find to search for the XXX.bch files and cat to append them. An example of the merging command is shown below:

  • cat *.bch >> AllJobs.bch
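
The glob in the command above only sees .bch files in the current directory. When the job files sit in separate procedure directories, find can collect them instead. The sketch below assumes that every *.bch file under the search directory should be merged; the function name merge_job_files is hypothetical.

```shell
# Hypothetical helper (not part of the pipeline): concatenate every *.bch file
# found under a directory into a single job file.
# Usage: merge_job_files [search_dir] [output_file]
merge_job_files() {
  local search_dir="${1:-.}"
  local out="${2:-AllJobs.bch}"
  # Start from an empty output file so that re-running does not duplicate jobs.
  : > "$out"
  # Exclude files named like the output file so it is never appended to itself.
  find "$search_dir" -type f -name '*.bch' ! -name "$(basename "$out")" \
    -exec cat {} + >> "$out"
}

# Example: merge all job files below the current directory into ./AllJobs.bch
#   merge_job_files . AllJobs.bch
```

Unlike the append (>>) in the one-line command, this sketch truncates the output first, so running it twice gives the same AllJobs.bch.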

Preparation before running the stats pipeline

Some preparation, listed below, is recommended before running the stats pipeline:

  • Updating the list of levels (for categorical data) from IMPReSS. To this end, run the command below in your terminal:
    • R -e "DRrequiredAgeing:::updateImpress(updateImpressFileInThePackage = TRUE,updateOptionalParametersList = TRUE,updateTheSkipList = TRUE)"
      • This command updates the parameters required for analysis [see here for details] and adds the metadata parameters [see here for details] to the skip list of the statistical pipeline. The skip list is available in the DRrequiredAgeing package directory [see here]. Run the command below to retrieve the full path:
        • Rscript -e "DRrequiredAgeing:::local()"

Executing the pipeline

The IMPC-SP requires a driver script written in R, function.R [see an example here], to perform the analysis on the data. The script is available from the path returned by

  • R -e "file.path(DRrequiredAgeing:::local(),'StatsPipeline/jobs')"

Put the function.R script and AllJobs.bch in the same directory and execute AllJobs.bch to start the IMPC-SP. A few notes help in understanding the IMPC-SP:

  • You can modify some parameters in function.R, such as activating SoftWindowing. Here is a typical function.R call and its parameters: mainAgeing(

    • file = suppressWarnings(tail(UnzipAndfilePath(file), 1)),
      • The input file (csv,tsv, Rdata)
    • subdir = 'Results_DR12V1OpenStats',
      • Name of the IMPC-SP output directory
    • concurrentControlSelect = FALSE,
      • Whether to apply concurrent control selection to the controls
    • seed = 123456,
      • Pseudo-random number generator seed
    • messages = FALSE,
      • Write error messages from the SoftWindowing pipeline to the output file
    • utdelim = '\t',
      • The StatPacket delimiter
    • activeWindowing = FALSE,
      • Activate SoftWindowing
    • check = 2,
      • The check type in SoftWindowing. See check argument in the SmoothWin package
    • storeplot = FALSE,
      • Store SoftWindowing plots in a file that accompanies the statpacket
    • plotWindowing = FALSE,
      • Set to TRUE to plot the SoftWindowing output
    • debug = FALSE,
      • Show more details of the process
    • MMOptimise = c(1,1,1,1,1,1),
      • See MM_Optimise in the OpenStats package
    • FERRrep = 1500,
      • Total iterations in Fisher's Exact Test framework (Monte Carlo iterations)
    • activateMulticore = FALSE,
      • Activate multi-core processing
    • coreRatio = 1,
      • The core/CPU proportion (1=100% cores, 0.5=50% cores)
    • MultiCoreErrorHandling = 'stop',
      • Error handling for multicore processing. Here the process fails if it encounters an error. Note that this fails only the specific job, not the whole IMPC-SP.
    • inorder = FALSE,
      • See inorder in foreach R package
    • verbose = TRUE,
      • See verbose in foreach R package
    • OverwriteExistingFiles = FALSE,
      • Do not overwrite the StatPacket if it already exists
    • storeRawData = TRUE,
      • Store the raw data with the StatPacket
    • outlierDetection = FALSE,
      • Activate outlier detection strategy
    • compressRawData = TRUE,
      • Zip the output raw data
    • writeOutputToDB = FALSE,
      • Write the StatPackets to a mysqlite db in a directory named db in the results directory. Note that there could be up to 10k individual mysqlite databases in the db directory. An extra step is required to merge all the databases.
    • onlyFillNotExisitingResults = FALSE
      • Only run the statistical analyses if the StatPacket does not exist )
  • It is highly recommended to remove the log files before running or re-running the IMPC-SP. To do this, navigate to the AllJobs.bch directory and run the commands below in your terminal:

    • find ./*/*_RawData/ClusterErr/ -name '*ClusterErr' -type f | xargs rm
    • find ./*/*_RawData/ClusterOut/ -name '*ClusterOut' -type f | xargs rm
  • Results: The IMPC-SP output is a directory named by the subdir argument in the function.R script. The StatPackets are located at the deepest level of the following directory structure:

    • phenotyping_center/procedure_group/parameter_stable_id/colony_id/zygosity/metadata_group
      • Note 1: All special characters in the path above get replaced by the underscore (_)
      • Note 2: Depending on the input data, there could be more than one StatPacket in a path
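
For the log clean-up recommended above, it can help to list the files first and delete them only after a look at the list. The sketch below is one way to do that. The function name clean_cluster_logs and the --delete switch are hypothetical, and unlike the find commands above it matches *_RawData directories at any depth below the starting directory.

```shell
# Hypothetical helper (not part of the pipeline): list the ClusterErr/ClusterOut
# log files of a previous run, and delete them only when --delete is given.
# Usage: clean_cluster_logs [root_dir] [--delete]
clean_cluster_logs() {
  local root="${1:-.}"
  local mode="${2:-}"
  if [ "$mode" = "--delete" ]; then
    # Print each matching log file, then remove it.
    find "$root" -type f \
      \( -path '*_RawData/ClusterErr/*' -name '*ClusterErr' \
         -o -path '*_RawData/ClusterOut/*' -name '*ClusterOut' \) \
      -print -exec rm -f {} +
  else
    # Dry run: only print what would be removed.
    find "$root" -type f \
      \( -path '*_RawData/ClusterErr/*' -name '*ClusterErr' \
         -o -path '*_RawData/ClusterOut/*' -name '*ClusterOut' \) \
      -print
  fi
}

# Example, from the AllJobs.bch directory:
#   clean_cluster_logs .            # review the list
#   clean_cluster_logs . --delete   # then remove the files
```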

Postprocessing the IMPC-SP

The IMPC-SP requires some QC checks and random validation to ensure that the results in the StatPackets are reliable and that nothing in the pipeline has failed. Here we list some typical checks on the pipeline outputs:

  • Aggregating the log files: log files are the best place to track down any error or failure in the pipeline. The difficulty is that the log files are scattered across the directories. To address this, copy all log files to a single directory and run a check there. To copy the log files, navigate to the AllJobs.bch directory and run the commands below:
    • cd ..
    • find ./*/*_RawData/ClusterOut/ -name '*ClusterOut' -type f | xargs cp --backup=numbered -t ~/XXXXXX
    • find ./*/*_RawData/ClusterErr/ -name '*ClusterErr' -type f | xargs cp --backup=numbered -t ~/XXXXXX
      • Here XXXXXX is a directory that you have created for the log files, such as logs [see the example here]
  • Searching for errors in the log files: you can search for any failure in the log files by running the command below from the directory that holds the collected logs:
    • grep -lR "exit" *
      • Any errors found must be investigated manually
  • Random checking of the results: a random check of the results in the StatPackets is highly recommended for QC of the IMPC-SP
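
The copy and search steps above can be wrapped in one small function. This is a sketch rather than the pipeline's own tooling: the function name collect_cluster_logs is hypothetical, it assumes GNU cp (for --backup=numbered and -t, as in the commands above), and it matches *_RawData directories at any depth.

```shell
# Hypothetical helper (not part of the pipeline): copy all ClusterOut and
# ClusterErr logs into one directory, then print the copies that mention "exit".
# Usage: collect_cluster_logs <root_dir> <destination_dir>
collect_cluster_logs() {
  local root="${1:-.}"
  local dest="$2"
  if [ -z "$dest" ]; then
    echo "Usage: collect_cluster_logs <root_dir> <destination_dir>" >&2
    return 2
  fi
  mkdir -p "$dest"
  # --backup=numbered keeps logs that share a file name instead of overwriting.
  find "$root" -type f -path '*_RawData/Cluster*' \
    \( -name '*ClusterOut' -o -name '*ClusterErr' \) \
    -exec cp --backup=numbered -t "$dest" {} +
  # List the collected logs that report an exit; no match is not an error here.
  grep -lR "exit" "$dest" || true
}

# Example: collect_cluster_logs . ~/logs
```

An empty result means that no collected log contains "exit"; any file that is listed still needs to be investigated manually.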

One line execution of the IMPC-SP

Given the R environment is already set up, one can run all steps above for the parquet inputs and the default settings by following the steps below:

  1. Prepare the R environment. To do this execute this script in your R environment.
  2. Make sure that you have LSF up and running on the computing cluster that you are running the IMPC-SP on.
  3. Make sure no job is already running on the LSF cluster. The LSF cluster must return No unfinished job found in response to the bjobs command.
  4. Open a new screen session by executing the screen command. This makes sure that the job keeps running even if the terminal is closed.
  5. Open a new R session and execute the following command: DRrequiredAgeing:::StatsPipeline(path = 'FULL PATH TO THE PARQUET FILES', SP.results = 'FULL PATH TO THE OUTPUT DIRECTORY', windowingPipeline = TRUE/FALSE). Note that relative paths should be avoided in the StatsPipeline function. The output directory in SP.results will be created if it does not already exist.
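
The check in step 3 can be scripted. The sketch below relies only on the bjobs message quoted above; the function names lsf_is_idle and wait_for_lsf and the 60-second default polling interval are illustrative assumptions.

```shell
# Hypothetical helpers (not part of the pipeline).

# Succeeds only when bjobs reports that there are no unfinished jobs.
# stderr is merged into stdout so the message is caught on either stream.
lsf_is_idle() {
  bjobs 2>&1 | grep -q "No unfinished job found"
}

# Blocks until the LSF queue is empty, polling every N seconds (default 60).
wait_for_lsf() {
  local interval="${1:-60}"
  until lsf_is_idle; do
    sleep "$interval"
  done
}

# Example, before starting R in the screen session:
#   lsf_is_idle || echo "There are unfinished LSF jobs; wait before starting." >&2
```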

Post processing the pipeline:

Currently, there are two processes in the IMPC-SP post-processing step:

  1. Exporting data to the Hadoop cluster for downstream processes
  2. Creating spreadsheet reports (this is different from the reports on the FTP site)

Exporting data to Hadoop Cluster for downstream processes

In summary, every StatPacket created during the IMPC-SP is assigned an MP term (if possible), integrated into text files and transferred to the Hadoop cluster for the downstream processes. The process is simple: take a StatPacket --> assign an MP term if possible --> append the StatPacket to a text file (one StatPacket per row) --> transfer the files to Hadoop (HDFS) storage.

To run this pipeline follow the steps below:

  1. If ETL (the process before IMPC-SP) provided you with a file called MP_Chooser.JSON then open a new R session and run the code below:
library(jsonlite) # this package should already be installed; otherwise install it via install.packages('jsonlite')
a = fromJSON('PATH TO THE MP_CHOOSER.JSON')
save(a, file = 'A NAME LIKE MP_CHOOSER.Rdata')

Note that you need to keep the path to 'A NAME LIKE MP_CHOOSER.Rdata' above and pass it as mp_chooser_file in the command below. Otherwise, the pipeline will use an outdated version.

  2. Make sure no job is already running on the LSF cluster
  3. Make sure that your user name has already been granted access to the Hadoop cluster, e.g. run ssh hadoop-login-02
  4. Open a new screen session by executing screen in your terminal
  5. Open a new R session (by typing R in the terminal) and run the command below. You may need to change some parameters, but you must set prefix to prevent existing files from being overwritten. When the process finishes, there should be several .statpackets files in the path specified in the /hadoop/user/USER/impc/statpackets/path/prefix. If you are not familiar with the function below or with R, just navigate to where the IMPC-SP results are stored (normally somewhere like [parquet files]/SP/Results_IMPC_SP_XXX) and run the command with only prefix set, for example DRrequiredAgeing:::IMPC_HadoopLoad(prefix='A PREFIX').
DRrequiredAgeing:::IMPC_HadoopLoad(
  SP.results = 'The path the stats results are stored - by default [parquet files]/SP/Results_IMPC_SP_XXX',
  mp_chooser_file = 'the full path to the MP_CHOOSER that you provided above - default the internally stored mp_chooser',
  host = 'Hadoop cluster host - default hh-hdp-master-02.ebi.ac.uk',
  path = 'Path to store the data on Hadoop - default impc/statpackets - no leading slash required',
  prefix = 'The DR version like DRXXX_',
  port = 'Hadoop port - default 50070',
  user =  'Your Hadoop user name - default Sys.info()["user"]',
  level = 'The significance level for the all models except RR+ - default 10^-4',
  rrlevel = 'The significance level for RR+ model - default 10^-4'
)

The whole process could take up to 8 hours depending on how busy the EBI cluster is.

Frequently asked questions

Here we answer some of the frequently asked questions.

  • Where is the IMPC-SP on GitHub?
  • Where can I find function.R?
    • This file is located in the extension directory of the DRrequiredAgeing package. See here for the files.
  • How long does the IMPC-SP normally take to complete?
    • This depends on the LSF cluster and the available resources. Setting the EMBL-EBI LSF cluster as a reference, the whole process takes from 2 days to 4 days.
  • How can I run the one-line execution of the IMPC-SP under different settings?
    • This requires changing the settings in the DRrequiredAgeing package, specifically the scripts in here. One can fork the repository and change the settings, then update the package from GitHub using the devtools::install_github command in R.
  • How can I ask for help?