PreComputeCapice

When you want the full capice experience, but don't want to use the model

This is a program that works on the wonderful CAPICE predictive model.

Introduction

'PreComputeCapice' is a program developed by R.J. Sietsma and maintained by the Genomic Coordination Centre Groningen. The program calculates CAPICE scores for all entries in a given CADD file.

Prerequisites

This program was developed in PyCharm 2020.1 (Professional Edition); performance on other systems is not guaranteed.

The program requires the following packages:

numpy (v1.18.3; BSD 3-Clause License)
pandas (v1.0.3; BSD 3-Clause License)
psutil (v5.7.0; BSD 3-Clause License)
scipy (v1.4.1; BSD 3-Clause License)
scikit-learn (v0.19.1; BSD 3-Clause License)
xgboost (v0.72.1; Apache 2 License)
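The pinned versions above can be compared against what is actually installed. The following is a minimal sketch, not part of the program: the function name `find_mismatches` and the shape of the `installed` mapping are assumptions made for illustration.

```python
# Pinned versions copied from the package list above.
REQUIRED = {
    "numpy": "1.18.3",
    "pandas": "1.0.3",
    "psutil": "5.7.0",
    "scipy": "1.4.1",
    "scikit-learn": "0.19.1",
    "xgboost": "0.72.1",
}


def find_mismatches(installed):
    """Return {package: (required, found)} for each package that is
    missing or installed at a different version.

    `installed` maps package names to version strings (hypothetical input);
    a missing package is reported with found=None.
    """
    mismatches = {}
    for name, required in REQUIRED.items():
        found = installed.get(name)
        if found != required:
            mismatches[name] = (required, found)
    return mismatches
```

The `installed` mapping could be built from the output of `pip freeze` or from `pkg_resources.working_set`; an empty result means every pinned version matches.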

Warning: this program works with Python 3.6. It does not work with Python 3.7 or higher because of numpy, scikit-learn and xgboost version dependency issues.
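A script can guard against the wrong interpreter before importing any of the pinned packages. This is a small sketch under the assumption that only Python 3.6.x is acceptable, as the warning states; the function name `is_supported_python` is hypothetical and does not come from the program.

```python
import sys


def is_supported_python(version_info=None):
    """Return True only for Python 3.6.x, the version this README names."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the major and minor components.
    return tuple(version_info[:2]) == (3, 6)
```

Calling `is_supported_python()` without arguments checks the running interpreter; a caller could exit with a clear message when it returns False.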

Installing

Step 1: acquire the source files

Either clone or download the source files.

Step 2: create and activate the virtual environment

Open a terminal in the cloned or downloaded folder.
Make sure you have Python 3.6 installed by typing python3 --version.
Execute the following:

mkdir ./venv
cd ./venv
python3 -m venv ./
cd ..

Activate the virtual environment:

source ./venv/bin/activate

Install the required packages by executing:

pip install -r requirements.txt

Note: if any package fails to install, please try to install the package using:

pip install package==version

Make sure you have the virtual environment enabled before you install packages!
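Whether the virtual environment is enabled can also be checked from Python itself. The sketch below relies on the fact that, inside an environment created with `venv`, `sys.prefix` differs from `sys.base_prefix`; the helper name `venv_is_active` is an assumption for illustration.

```python
import sys


def venv_is_active(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv-created environment.

    Inside such an environment sys.prefix points at the environment,
    while sys.base_prefix still points at the base installation.
    """
    if prefix is None:
        prefix = sys.prefix
    if base_prefix is None:
        base_prefix = sys.base_prefix
    return prefix != base_prefix
```

Running `python3 -c "import sys; print(sys.prefix != sys.base_prefix)"` in the terminal gives the same answer without a helper.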

Usage

The program requires the following arguments:

-f / --file: the CADD file in gzip (.gz) format.
-m / --model: the pickled CAPICE model in .dat format.
-o / --output: the location where the program should place its files.

Optional argument:

-s / --batchsize: the number of rows the program should read from the CADD file in each iteration.
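The arguments listed above map naturally onto `argparse`. The sketch below only mirrors the flags documented here; the actual parser in PreComputeCapice.py may differ, and the `None` default for the batch size is an assumption, since this README does not state one.

```python
import argparse


def build_parser():
    """Build a parser for the flags documented in this README."""
    parser = argparse.ArgumentParser(prog="PreComputeCapice.py")
    parser.add_argument("-f", "--file", required=True,
                        help="the CADD file in gzip (.gz) format")
    parser.add_argument("-m", "--model", required=True,
                        help="the pickled CAPICE model in .dat format")
    parser.add_argument("-o", "--output", required=True,
                        help="the location where the output files are placed")
    # Assumption: None stands for "no batch size given by the user".
    parser.add_argument("-s", "--batchsize", type=int, default=None,
                        help="number of rows to read per iteration")
    return parser
```

Declaring `type=int` lets `argparse` reject a non-numeric batch size before any file is opened.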

Example usage with a batch size of 1 million:

python3 PreComputeCapice.py -f path/to/cadd/file.gz -m path/to/model.dat -o path/to/output/folder -s 1000000
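To illustrate what a batch size means in practice, the sketch below reads a gzipped text file a fixed number of lines at a time, using only the standard library. It is not the program's own reader: the function name `iter_batches` is hypothetical, and header handling and column parsing are left out.

```python
import gzip
from itertools import islice


def iter_batches(path, batch_size):
    """Yield lists of at most `batch_size` lines from a gzipped text file."""
    with gzip.open(path, "rt") as handle:
        while True:
            # islice pulls the next batch_size lines without loading the rest.
            batch = list(islice(handle, batch_size))
            if not batch:
                return
            yield batch
```

Only one batch is held in memory at a time, so a larger batch size trades memory for fewer iterations.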

Output

The program will output the following files:

For each chromosome in the CADD file, the program creates a folder named chrx (where x is the chromosome) and places in it a gzipped TSV of all CADD entries for that chromosome. Note: the program continually appends entries to this file, so do NOT remove or replace it until the program is done!
Log_output: a file with timestamped messages on updates within the program (it does not contain error messages or warnings).
progression_json: a JSON file containing set parameters, like batch_size, to keep track of progress during the program's execution.
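The per-chromosome output and the progress file can be pictured with the following sketch. It is an illustration under stated assumptions, not the program's code: the file name `scores.tsv.gz`, the function names, and the `rows_done` key are hypothetical, while `batch_size` and the `progression.json` name follow this README.

```python
import gzip
import json
import os


def append_rows(output_dir, chromosome, rows):
    """Append tab-separated rows to the gzipped TSV in the chromosome's folder."""
    folder = os.path.join(output_dir, "chr{}".format(chromosome))
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, "scores.tsv.gz")  # hypothetical file name
    # Mode "at" adds a new gzip member; readers see one continuous stream.
    with gzip.open(target, "at") as handle:
        for row in rows:
            handle.write("\t".join(str(value) for value in row) + "\n")
    return target


def save_progress(output_dir, state):
    """Write the progress state as JSON, replacing the old file in one step."""
    target = os.path.join(output_dir, "progression.json")
    temporary = target + ".tmp"
    with open(temporary, "w") as handle:
        json.dump(state, handle)
    # os.replace swaps the file in a single operation.
    os.replace(temporary, target)
    return target
```

Because each batch is appended to the same file, removing or replacing it mid-run would lose the entries written so far, which is why the note above asks to leave it alone until the program is done.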

TODO:

Make the input file (-f / --file) specific to the progression.json as well.
Refactoring and optimization.

