CNV-P: A machine-learning framework for filtering copy number variations

CNV-P is a novel post-processing approach for CNV filtering.

Cite this as
Wang T, Sun J, Zhang X, Wang W, Zhou Q. 2021. CNV-P: a machine-learning framework for predicting high confident copy number variations. PeerJ 9:e12564 https://doi.org/10.7717/peerj.12564

Prerequisites:

python3
sklearn
matplotlib
pysam
pandas
numpy

Install by conda

conda install python=3.7.0
conda install -c anaconda scikit-learn=0.21
conda install -c conda-forge matplotlib
conda install -c bioconda pysam
conda install -c anaconda pandas
conda install -c anaconda numpy

Getting started

1. CNV Predicting

Run "python script/CNV-P_predict_main.py -h" to see the usage:

usage: CNV-P_predict_main.py [-h] [-m model] -b bamfile -s CNVcaller -bed
                             BEDfile -bas basfile [-sam Samplename]
                             [-o outdir]
optional arguments:
  -h, --help            show this help message and exit
  -m model, --model model
                        which model you want to used
  -b bamfile, --bam bamfile
                        file provides features
  -s CNVcaller, --soft CNVcaller
                        which CNV caller you used
  -bed bedfile, --CNV_bed bedfile
                        the format of input file
  -bas basfile, --basfile basfile
                        file that provide mean insert-size and sequencing
                        depth
  -n Samplename, --Samplename Samplename
                        Samplename that use as prefix of result
  -o outdir, --output outdir
                         output directory

1.1 input parameters

model: one of RF (Random Forest), GBC (Gradient Boosting Classifier) or SVM (Support Vector Machine)
CNVcaller: Lumpy, Manta, Pindel, Delly and breakdancer are currently supported; for any other caller a model must be trained first (see 2. training for other CNV callers)
bamfile: the BAM file should be generated by a read aligner that supports partial read alignments, such as BWA-MEM
bedfile: this file should have 5 columns: chromosome, start, end, length of CNV, type of CNV (DUP:1, DEL:0)
for example (test-data/HG002.Lumpy.fil.mer.bed):

chr19	350768	351961	1194	1
chr19	434243	434587	345	1
chr19	566222	569347	3126	0
chr19	878739	879857	1119	1
chr19	1182660	1183097	438	0
chr19	1572816	1573149	334	0
chr19	2033040	2033182	143	0
chr19	2713161	2714159	999	0

basfile: this file should have 4 columns: Samplename, median value of insert size, standard deviation of insert size, coverage
for example (test-data/HG002.bam.bas):

Samplename	median_insert_size	insert_size_median_sd	coverage
HG002	568.177944	163.819637	35.41
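
Before running the prediction it can help to check that both input files follow the layouts described above. The following is a minimal sketch using only the Python standard library; the function names `read_cnv_bed` and `read_bas` are hypothetical helpers for illustration and are not part of CNV-P. It assumes both files are tab-separated and that the bas file starts with the header line shown in the example.

```python
import csv
import io


def read_cnv_bed(handle):
    """Parse a 5-column, tab-separated CNV BED file into a list of dicts."""
    records = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(row)}")
        chrom = row[0]
        start, end, length, cnv_type = (int(value) for value in row[1:])
        if cnv_type not in (0, 1):
            raise ValueError(f"line {line_no}: CNV type must be 1 (DUP) or 0 (DEL)")
        if end < start:
            raise ValueError(f"line {line_no}: end is smaller than start")
        records.append({"chrom": chrom, "start": start, "end": end,
                        "length": length, "type": cnv_type})
    return records


def read_bas(handle):
    """Parse the 4-column bas file (one header line, then one line per sample)."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        samples[row["Samplename"]] = {
            "median_insert_size": float(row["median_insert_size"]),
            "insert_size_median_sd": float(row["insert_size_median_sd"]),
            "coverage": float(row["coverage"]),
        }
    return samples


# Two rows of the BED example and the bas example from above, as in-memory text.
bed_text = "chr19\t350768\t351961\t1194\t1\nchr19\t566222\t569347\t3126\t0\n"
bas_text = ("Samplename\tmedian_insert_size\tinsert_size_median_sd\tcoverage\n"
            "HG002\t568.177944\t163.819637\t35.41\n")

cnvs = read_cnv_bed(io.StringIO(bed_text))
bas = read_bas(io.StringIO(bas_text))
print(len(cnvs), "CNVs; coverage:", bas["HG002"]["coverage"])
```

For files on disk, pass `open(path, newline="")` instead of the `io.StringIO` objects.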

1.2 output

samplename.feature.txt: the extracted feature matrix.
samplename.pre.prop.txt: the prediction results and probability scores, in 7 columns:

ChrID: chromosome (e.g. chr3, chrY)
start: start coordinate on the chromosome
end: end coordinate on the chromosome
length: length of the CNV
CNV_type: type of CNV (DUP:1, DEL:0)
class: prediction result (true CNV: 1, false CNV: 0)
probability_score: probability that this CNV is true
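
The prediction file can be post-filtered on the class and probability columns. The sketch below is an illustration only: it assumes the file is tab-separated with the 7 columns in the order listed above, tolerates an optional header line, and uses a hypothetical threshold of 0.9. The function names and the example rows (including their probability values) are made up for the demonstration and are not CNV-P output.

```python
import csv
import io

COLUMNS = ["ChrID", "start", "end", "length", "CNV_type", "class", "probability_score"]


def load_predictions(handle):
    """Read prediction rows, skipping blank lines, a header line, or malformed rows."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != len(COLUMNS) or not fields[1].isdigit():
            continue
        record = dict(zip(COLUMNS, fields))
        for key in ("start", "end", "length", "CNV_type", "class"):
            record[key] = int(record[key])
        record["probability_score"] = float(record["probability_score"])
        rows.append(record)
    return rows


def keep_confident(rows, min_prob=0.9):
    """Keep CNVs predicted as true (class 1) with probability >= min_prob."""
    return [r for r in rows if r["class"] == 1 and r["probability_score"] >= min_prob]


# Invented example rows; the probability values are placeholders.
example = ("ChrID\tstart\tend\tlength\tCNV_type\tclass\tprobability_score\n"
           "chr19\t350768\t351961\t1194\t1\t1\t0.95\n"
           "chr19\t434243\t434587\t345\t1\t0\t0.20\n"
           "chr19\t566222\t569347\t3126\t0\t1\t0.60\n")

predictions = load_predictions(io.StringIO(example))
confident = keep_confident(predictions, min_prob=0.9)
print(f"{len(confident)} of {len(predictions)} CNVs kept")
```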

1.3 running example

python script/CNV-P_predict_main.py -m RF -b test-data/HG002.test.bam -s Lumpy -bed test-data/HG002.Lumpy.fil.mer.bed -bas test-data/HG002.bam.bas -sam HG002 -o test-data/out/
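
When several samples or callers need to be processed, the same command can be assembled programmatically. This is a rough sketch: `build_predict_command` and `run_predict` are hypothetical wrapper names, and the flags are copied from the running example above rather than from any other interface of the script.

```python
import subprocess


def build_predict_command(sample, bam, bed, bas, caller="Lumpy", model="RF",
                          outdir="test-data/out/"):
    """Assemble the argument list for script/CNV-P_predict_main.py."""
    return ["python", "script/CNV-P_predict_main.py",
            "-m", model, "-b", bam, "-s", caller,
            "-bed", bed, "-bas", bas, "-sam", sample, "-o", outdir]


def run_predict(command):
    """Run the command and raise CalledProcessError if the script fails."""
    subprocess.run(command, check=True)


cmd = build_predict_command(
    sample="HG002",
    bam="test-data/HG002.test.bam",
    bed="test-data/HG002.Lumpy.fil.mer.bed",
    bas="test-data/HG002.bam.bas",
)
print(" ".join(cmd))  # inspect the command; call run_predict(cmd) to execute it
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.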

2. training for other CNV callers

To train a model for another CNV caller, first use 'CNV-P_featureExtract_main.py' to extract features:

python script/CNV-P_featureExtract_main.py -b test-data/HG002.test.bam -bed test-data/HG002.Lumpy.fil.mer.bed -bas test-data/HG002.bam.bas -sam HG002 -o test-data/out/

Then run "script/CNV-P_training_main.py" to train a model.
Run "python script/CNV-P_training_main.py -h" to see the usage:

usage: CNV-P_training_main.py [-h] [-m model] -s CNVcaller -fea featuresfile
                              -lab labelfile [-o outdir]
optional arguments:
  -h, --help            show this help message and exit
  -m model, --model model
                        which model you want to used
  -s CNVcaller, --soft CNVcaller
                        which CNV caller you used
  -fea featuresfile, --features featuresfile
                        file that provide traing features
  -lab labelfile, --labelfile labelfile
                        file that provide CNV label, true CNVs labeled as
                        1,false CNVs labeled as 0, The order should
                        corresponds to CNV_bed file(-bed/--CNV_bed) one to one
  -o outdir, --output outdir
                         output directory

2.1 input parameters

featuresfile: file that provides the training features, as produced by 'CNV-P_featureExtract_main.py'
labelfile: one column; true CNVs are labeled 1 and false CNVs are labeled 0
for an example, see test-data/HG002.Lumpy.chr1.label.txt
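
Because the labels must correspond one to one, in order, with the CNVs in the BED file passed via -bed/--CNV_bed, a quick consistency check before training can catch mismatches. The helper below is a hypothetical sketch, not part of CNV-P; it only verifies that the two files have the same number of non-empty lines and that every label is 0 or 1.

```python
def check_labels(bed_lines, label_lines):
    """Return the labels as ints after checking them against the BED lines."""
    bed = [line for line in bed_lines if line.strip()]
    labels = [line.strip() for line in label_lines if line.strip()]
    if len(bed) != len(labels):
        raise ValueError(f"{len(bed)} CNVs but {len(labels)} labels")
    bad = [i for i, value in enumerate(labels, start=1) if value not in ("0", "1")]
    if bad:
        raise ValueError(f"labels must be 0 or 1; offending lines: {bad}")
    return [int(value) for value in labels]


# In-memory example; with real files, pass the opened file objects instead.
bed_lines = ["chr19\t350768\t351961\t1194\t1\n", "chr19\t566222\t569347\t3126\t0\n"]
label_lines = ["1\n", "0\n"]
print(check_labels(bed_lines, label_lines))
```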

2.2 outputs:

CNVcaller.model.train_model.m: the trained classifier
CNV-P_CNVcaller_model_Classifier.ROC.pdf, CNV-P_CNVcaller_model_Classifier.ROC.png: the ROC curve from 10-fold cross-validation
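
To give a feel for what a 10-fold cross-validated ROC involves, here is a small scikit-learn sketch. It is not the code used by CNV-P_training_main.py: the data are synthetic (generated with `make_classification`), the Random Forest settings are arbitrary, and the out-of-fold predictions are pooled into a single curve, which may differ from how the script draws its plots.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for a feature matrix (X) and 0/1 CNV labels (y).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each sample is scored by a model that did not see it during training.
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

fpr, tpr, _ = roc_curve(y, proba)  # points of the pooled ROC curve
auc = roc_auc_score(y, proba)
print(f"10-fold cross-validated AUC: {auc:.3f}")
```

The `fpr` and `tpr` arrays can be passed to matplotlib's `plot` to draw the curve.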

2.3 running example

python script/CNV-P_training_main.py -s Lumpy -fea test-data/HG002.Lumpy.chr1.feature.txt -lab test-data/HG002.Lumpy.chr1.label.txt -o test-data/out/

Please help us improve CNV-P by reporting bugs or ideas on how to make things better.

Comparison with CNV-JACG, MetaSV and hard cutoff method

We compared the performance of CNV-P with that of CNV-JACG (Zhuang et al. 2020), MetaSV (Mohiyuddin et al. 2015) and a hard cutoff method on the same datasets. Since MetaSV currently does not support Delly's output, only four CNV detection tools (Lumpy, Manta, Pindel and breakdancer) were considered. CNV-JACG was run with default parameters, and MetaSV was run in complete mode. For the hard cutoff method, we used split reads (SR) and read pairs (RP) as the evidence supporting a CNV, so CNVs were kept when the number of SR and RP was greater than 2, 5 or 10, with each cutoff evaluated separately. SURVIVOR (Jeffares et al. 2017) was used to merge fragments with 80% overlap after filtering by CNV-P, CNV-JACG, MetaSV and the hard cutoff method.
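
The hard cutoff baseline can be pictured as a simple threshold on read support. The sketch below rests on an assumption: it treats "the number of SR and RP" as the sum of the two counts, which the text above does not state explicitly. The function name and the example records are invented for illustration.

```python
def hard_cutoff_filter(cnvs, cutoff):
    """Keep CNVs whose combined SR + RP support is greater than `cutoff`.

    Summing the two counts is an assumption made for this sketch.
    """
    return [cnv for cnv in cnvs if cnv["SR"] + cnv["RP"] > cutoff]


# Invented support counts for four candidate CNVs.
candidates = [
    {"id": "a", "SR": 1, "RP": 1},
    {"id": "b", "SR": 2, "RP": 2},
    {"id": "c", "SR": 3, "RP": 4},
    {"id": "d", "SR": 8, "RP": 6},
]

for cutoff in (2, 5, 10):
    kept = hard_cutoff_filter(candidates, cutoff)
    print(f"cutoff {cutoff}: kept {[cnv['id'] for cnv in kept]}")
```

Raising the cutoff removes more weakly supported calls, which is the precision and recall trade-off visible in the Hard_Cutoff rows of the comparison.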

[Figure: process framework]

Comparison with CNV-JACG, MetaSV and the hard cutoff method on NA12878 and HG002:

Sample	method	precision	recall	F1-score
NA12878	RAW	0.6032	1.0000	0.7525
	Hard_Cutoff_2	0.6197	0.9792	0.7590
	Hard_Cutoff_5	0.7145	0.8630	0.7818
	Hard_Cutoff_10	0.7780	0.6976	0.7356
	CNV-JACG	0.6828	0.7496	0.7146
	MetaSV	0.7094	0.8817	0.7862
	CNV-P	0.9007	0.7977	0.8461

HG002	RAW	0.2054	1.0000	0.3408
	Hard_Cutoff_2	0.4026	0.9729	0.5695
	Hard_Cutoff_5	0.5740	0.8653	0.6901
	Hard_Cutoff_10	0.6642	0.7482	0.7037
	CNV-JACG	0.5443	0.7076	0.6153
	MetaSV	0.5917	0.8274	0.6900
	CNV-P	0.7078	0.7516	0.7290
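
The F1-score column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick check, the short sketch below recomputes it for a few rows of the table; the function name `f1_score` is just a local helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


rows = [
    ("NA12878 RAW", 0.6032, 1.0000),
    ("NA12878 CNV-P", 0.9007, 0.7977),
    ("HG002 RAW", 0.2054, 1.0000),
    ("HG002 CNV-P", 0.7078, 0.7516),
]

for name, precision, recall in rows:
    print(f"{name}: F1 = {f1_score(precision, recall):.4f}")
```

Rounded to four decimals, these values agree with the F1-score column above.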
