This is a program that works on the wonderful CAPICE predictive model.
'PreComputeCapice' is a program developed by R.J. Sietsma and maintained by the Genomic Coordination Centre Groningen. The program calculates CAPICE scores for all entries in a given CADD file.
This program was developed in PyCharm 2020.1 (Professional Edition); performance on other systems is not guaranteed.
The program requires the following packages:
- numpy (v1.18.3; BSD 3-Clause License)
- pandas (v1.0.3; BSD 3-Clause License)
- psutil (v5.7.0; BSD 3-Clause License)
- scipy (v1.4.1; BSD 3-Clause License)
- scikit-learn (v0.19.1; BSD 3-Clause License)
- xgboost (v0.72.1; Apache 2 License)
Warning: this program works with Python 3.6. It does not work with Python 3.7 or higher because of version dependency issues with numpy, scikit-learn and xgboost.
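The version constraint above can be checked before anything else runs. The following is a minimal sketch, not part of PreComputeCapice itself; the function names is_supported and check_or_exit are hypothetical and only illustrate the idea.

```python
import sys

# Assumption: only Python 3.6.x is supported, as stated in the warning above.
SUPPORTED_VERSION = (3, 6)


def is_supported(version_info=None):
    """Return True when the major.minor version is exactly 3.6."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == SUPPORTED_VERSION


def check_or_exit():
    """Stop with a readable message when the interpreter is not Python 3.6."""
    if not is_supported():
        found = "{}.{}".format(*sys.version_info[:2])
        sys.exit("Python 3.6 is required, but this is Python " + found)
```

Calling check_or_exit() as the first statement of a script would stop early on an unsupported interpreter instead of failing later during package imports.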
Step 1: acquire the source files
Either clone or download the source files.
Step 2: create and activate the virtual environment
- Open a terminal in the cloned or downloaded folder.
- Make sure you have Python 3.6 installed by running python3 --version.
- Execute the following:
mkdir ./venv
cd ./venv
python3 -m venv ./
cd ..
- Activate the virtual environment:
source ./venv/bin/activate
- Install the required packages by executing:
pip install -r requirements.txt
Note: if any package fails to install, please try to install the package using:
pip install package==version
Make sure you have the virtual environment enabled before you install packages!
The program requires the following arguments:
- -f / --file: the CADD file in gzip (.gz) format.
- -m / --model: the pickled CAPICE model in .dat format.
- -o / --output: the location where the program should place its files.
Optional argument:
- -s / --batchsize: the number of rows the program should read from the CADD file in each iteration.
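The arguments listed above could be declared with argparse roughly as follows. This is a sketch under assumptions, not the program's actual parser: the flag names come from the list above, while the help texts, the function name build_parser and the default of None for the batch size are illustrative guesses.

```python
import argparse


def build_parser():
    """Build a parser with the three required arguments and one optional one."""
    parser = argparse.ArgumentParser(prog="PreComputeCapice.py")
    parser.add_argument("-f", "--file", required=True,
                        help="CADD file in gzip (.gz) format")
    parser.add_argument("-m", "--model", required=True,
                        help="pickled CAPICE model in .dat format")
    parser.add_argument("-o", "--output", required=True,
                        help="folder where the output files are placed")
    # Assumption: the real default batch size is not documented, so None is used here.
    parser.add_argument("-s", "--batchsize", type=int, default=None,
                        help="number of rows read from the CADD file per iteration")
    return parser
```

With this layout, build_parser().parse_args() exposes the values as args.file, args.model, args.output and args.batchsize.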
Example usage with a batch size of 1 million:
python3 PreComputeCapice.py -f path/to/cadd/file.gz -m path/to/model.dat -o path/to/output/folder -s 1000000
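To illustrate what the batch size means, the sketch below reads a gzipped, tab-separated file in chunks of at most batch_size rows using only the standard library. It is an assumption about the general approach, not the program's own reader; the function name read_batches is hypothetical, and handling of CADD header or comment lines is left out.

```python
import csv
import gzip
from itertools import islice


def read_batches(path, batch_size):
    """Yield lists of at most batch_size rows from a gzipped TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        while True:
            # islice pulls the next batch_size rows without loading the whole file.
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch
```

A larger batch size means fewer iterations but more rows held in memory at once, which is the trade-off the -s / --batchsize argument controls.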
The program will output the following files:
- For each chromosome in the CADD file, it makes a folder named chrx (where x = chromosome) and places a gzipped tsv of all CADD entries for that chromosome in it. Note: the program continually adds entries to this file. Do NOT remove or replace this file until the program has finished!
- Log_output: a file with timed messages on updates within the program. It does not contain error messages or warnings.
- progression_json: a json file containing set parameters, like batch_size, to keep track of progress during the program's execution.
To do:
- Make the progression.json also specific to the input file (-f / --file).
- Refactoring and optimization.
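The per-chromosome output and the progress file described under the output files can be pictured with the sketch below. It is an illustration under assumptions, not the program's real code: the function names, the chr<x>.tsv.gz file name and the rows_done field are hypothetical, and only batch_size is taken from the description of the progress file.

```python
import gzip
import json
import os


def append_rows(output_dir, chromosome, rows):
    """Append rows to a gzipped TSV inside the chr<chromosome> folder."""
    folder = os.path.join(output_dir, "chr{}".format(chromosome))
    os.makedirs(folder, exist_ok=True)
    # Hypothetical file name; the real program may name the file differently.
    path = os.path.join(folder, "chr{}.tsv.gz".format(chromosome))
    # Mode "at" appends a new gzip member, so earlier entries stay intact.
    with gzip.open(path, "at") as handle:
        for row in rows:
            handle.write("\t".join(row) + "\n")
    return path


def save_progress(output_dir, batch_size, rows_done):
    """Write the progress file via a temporary file and an atomic rename."""
    path = os.path.join(output_dir, "progression.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump({"batch_size": batch_size, "rows_done": rows_done}, handle)
    os.replace(tmp_path, path)
    return path


def load_progress(output_dir):
    """Return the saved progress as a dict, or None when no file exists yet."""
    path = os.path.join(output_dir, "progression.json")
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)
```

Because each batch is appended to the same file, removing or replacing that file mid-run would lose earlier entries, which is why the output files should be left alone until the program has finished.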