Tumor Phylogeny Reconstruction via Integrative use of Single Cell and Bulk Sequencing Data

sfu-compbio/PhISCS

PhISCS

PhISCS is a tool for sub-perfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data. If bulk sequencing data is used, we expect that mutations originate from diploid regions of the genome. Due to variance in variant allele frequency (VAF) values, we recommend using bulk data only when the sequencing depth is at least 1000x (haploid coverage). As output, PhISCS reports a tree of tumor evolution together with a set of eliminated mutations. Eliminated mutations are either mutations that violate the Infinite Sites Assumption (ISA), due to deletion of the variant allele or due to recurrent mutation, or mutations affected by copy number aberrations that were missed during tumor copy number profiling (e.g. gain of the non-variant allele).

PhISCS has been published in Genome Research (doi:10.1101/gr.234435.118). If you find this code useful in your research, please consider citing the paper:

@article{malikic2019phiscs,
  doi           = {10.1101/gr.234435.118},
  url           = {https://doi.org/10.1101/gr.234435.118},
  year          = 2019,
  month         = oct,
  publisher     = {Cold Spring Harbor Laboratory},
  volume        = {29},
  number        = {11},
  pages         = {1860--1877},
  author        = {Salem Malikic and Farid {Rashidi Mehrabadi} and Simone Ciccolella and Md. Khaledur Rahman and Camir Ricketts and Ehsan Haghshenas and Daniel Seidman and Faraz Hach and Iman Hajirasouliha and S. Cenk Sahinalp},
  title         = {{{PhISCS}: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data}},
  journal       = {Genome Research}
}

Contents

  1. Installation
  2. Running
  3. Example
  4. Contact

Installation

PhISCS is written in Python and C. It supports both Python 2.7 and 3. Currently it is intended to be run on POSIX-based systems (only Linux and macOS have been tested).

RECOMMENDATION: At the moment, when both single-cell and bulk data are used as input, we recommend PhISCS-I over PhISCS-B, because PhISCS-I has undergone more thorough testing and software validation. When single-cell data is the only input, we have extensively tested both implementations; since PhISCS-B can have a running time advantage in this case, we recommend it over PhISCS-I.

PhISCS-I

git clone --recursive https://github.com/sfu-compbio/PhISCS.git
cd PhISCS
python PhISCS-I --help

Prerequisite: ILP solver

The main requirement for running PhISCS-I is the Gurobi solver. Gurobi is a commercial solver that is free for academic purposes. After installing Gurobi, the gurobipy package must also be installed before PhISCS-I can run successfully (examples of the input and of the commands used to run the tool are given below).
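Before launching PhISCS-I, it can help to confirm that gurobipy is importable from the Python interpreter you intend to use. The snippet below is a minimal Python 3 sketch of such a check; the helper name `has_module` is hypothetical and not part of PhISCS, and a successful import does not guarantee that a valid Gurobi license is configured.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


if has_module("gurobipy"):
    print("gurobipy found; PhISCS-I should be able to import it")
else:
    print("gurobipy not found; install Gurobi and its Python package first")
```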

PhISCS-B

git clone --recursive https://github.com/sfu-compbio/PhISCS.git
cd PhISCS
./PhISCS-B-configure
python PhISCS-B --help

Prerequisite: CSP solver

Some CSP solvers are already included in the PhISCS package. A new CSP solver can be added to PhISCS-B by providing the path to the executable of the desired solver.

Running

Input

1. Single-cell Matrix

Single-cell input is expected as a ternary, tab-delimited matrix with rows corresponding to single cells and columns corresponding to mutations. The file must contain headers. An entry of 0 denotes the absence and 1 the presence of a mutation in a given cell, whereas ? denotes a missing entry (i.e. no information about the presence or absence of the mutation in that cell). To simplify parsing, the upper-left corner of the matrix must be the string cellID/mutID.

Below is an example of a single-cell data matrix. Mutation and cell names are arbitrary strings that contain no tabs or spaces, but they must be unique.

cellID/mutID  mut0  mut1  mut2  mut3  mut4  mut5  mut6  mut7
cell0         0     0     ?     0     0     0     0     0
cell1         0     ?     1     0     0     0     1     1
cell2         0     0     1     0     0     0     1     1
cell3         1     1     0     0     0     0     0     0
cell4         0     0     1     0     0     0     0     0
cell5         1     0     0     0     0     0     0     0
cell6         0     0     1     0     0     0     1     1
cell7         0     0     1     0     0     0     0     0
cell8         ?     0     0     0     ?     0     ?     1
cell9         0     1     0     0     0     0     0     0
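The format rules above can be checked before running PhISCS. The following is a minimal Python 3 sketch, not the parser used by PhISCS itself; the function name `read_sc_matrix` is hypothetical, and the checks only cover the rules stated in this section (header corner, ternary entries, unique names, rectangular shape).

```python
import csv
import io

ALLOWED = {"0", "1", "?"}


def read_sc_matrix(handle):
    """Parse a ternary, tab-delimited single-cell matrix.

    Returns (cell_ids, mut_ids, rows), where rows[i][j] is '0', '1' or '?'.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[0] != "cellID/mutID":
        raise ValueError("upper-left corner must be 'cellID/mutID'")
    mut_ids = header[1:]
    if len(set(mut_ids)) != len(mut_ids):
        raise ValueError("mutation names must be unique")

    cell_ids, rows = [], []
    for line in reader:
        if not line:  # skip blank lines
            continue
        cell, values = line[0], line[1:]
        if len(values) != len(mut_ids):
            raise ValueError("row %s has the wrong number of entries" % cell)
        if set(values) - ALLOWED:
            raise ValueError("row %s contains entries other than 0, 1, ?" % cell)
        cell_ids.append(cell)
        rows.append(values)

    if len(set(cell_ids)) != len(cell_ids):
        raise ValueError("cell names must be unique")
    return cell_ids, mut_ids, rows


# Tiny in-memory example (real files would be opened with open(path)).
example = "cellID/mutID\tmut0\tmut1\tmut2\ncell0\t0\t0\t?\ncell1\t0\t?\t1\n"
cells, muts, rows = read_sc_matrix(io.StringIO(example))
print(cells, muts, rows)
```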

2. Bulk Data

Bulk data input is also expected as a tab-delimited file, with the following columns:

ID: the mutation ID (the same ID that is used for this mutation in the single-cell data matrix)
Chromosome: the chromosome of the mutation (any string not containing tabs or spaces)
Position: the position of the mutation on the chromosome (any string or number not containing tabs or spaces)
MutantCount: the number of mutant reads in the bulk data. If multiple bulk samples are used, values are semicolon-delimited and listed in a fixed sample order that is the same for all mutations (e.g. the first number is always the read count in sample 1, the second number the read count in sample 2, etc.)
ReferenceCount: the number of reference reads in the bulk data. If multiple bulk samples are used, values are semicolon-delimited and listed in the same sample order as MutantCount
INFO: additional, semicolon-delimited information about the mutation. Entries in this column have the form entryID=values, where values are delimited by commas. An example of an INFO column is "sampleIDs=S0,S1;synonymous=false;exonic=true". The only entry currently required is the sample origin; if sample names are not available, arbitrary distinct strings can be used (e.g. sampleIDs=S0,S1,S2;)
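The INFO column layout described above (semicolon-delimited entryID=values pairs, with comma-delimited values) can be split as in the following Python 3 sketch. The function name `parse_info` is hypothetical and this is not the parser used by PhISCS.

```python
def parse_info(info):
    """Split an INFO string into {entryID: [values]}."""
    entries = {}
    for field in info.strip().strip('"').split(";"):
        if not field:  # tolerate a trailing semicolon
            continue
        key, _, values = field.partition("=")
        entries[key] = values.split(",")
    return entries


info = parse_info("sampleIDs=S0,S1;synonymous=false;exonic=true")
print(info["sampleIDs"])
```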

An example bulk data file:

ID    Chromosome  Position  MutantCount     ReferenceCount    INFO
mut0  1           0         766;511;688     4234;4489;4312    sampleIDs=primary,metastasis1,metastasis2
mut1  1           1         719;479;719     4281;4521;4281    sampleIDs=primary,metastasis1,metastasis2
mut2  1           2         1246;1094;859   3754;3906;4141    sampleIDs=primary,metastasis1,metastasis2
mut3  1           3         298;226;272     4702;4774;4728    sampleIDs=primary,metastasis1,metastasis2
mut4  1           4         353;227;255     4647;4773;4745    sampleIDs=primary,metastasis1,metastasis2
mut5  1           5         306;232;279     4694;4768;4721    sampleIDs=primary,metastasis1,metastasis2
mut6  1           6         725;449;492     4275;4551;4508    sampleIDs=primary,metastasis1,metastasis2
mut7  1           7         703;417;507     4297;4583;4493    sampleIDs=primary,metastasis1,metastasis2

(In the bulk file shown above, mut0 has 766 mutant and 4234 reference reads in the first sample, 511 and 4489 in the second sample, and 688 and 4312 in the third sample.)
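To inspect the read counts in a bulk file, the per-sample variant allele frequency can be computed as mutant reads divided by total reads. The Python 3 sketch below assumes this standard definition of VAF and a real tab-delimited file with the column names listed above; the function name `read_bulk_vafs` is hypothetical, and how PhISCS itself uses these values (for example together with -delta) is not shown here.

```python
import csv
import io


def read_bulk_vafs(handle):
    """Return {mutation ID: [VAF per sample]} from a tab-delimited bulk file."""
    vafs = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        mutant = [int(x) for x in row["MutantCount"].split(";")]
        reference = [int(x) for x in row["ReferenceCount"].split(";")]
        if len(mutant) != len(reference):
            raise ValueError("sample count mismatch for %s" % row["ID"])
        # VAF = mutant / (mutant + reference); None when a sample has no reads.
        vafs[row["ID"]] = [
            m / (m + r) if (m + r) > 0 else None
            for m, r in zip(mutant, reference)
        ]
    return vafs


# First data row of the example bulk file, with real tab characters.
example = (
    "ID\tChromosome\tPosition\tMutantCount\tReferenceCount\tINFO\n"
    "mut0\t1\t0\t766;511;688\t4234;4489;4312\t"
    "sampleIDs=primary,metastasis1,metastasis2\n"
)
vafs = read_bulk_vafs(io.StringIO(example))
print(vafs["mut0"])
```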

Output

The program generates two files in the OUT_DIR folder (set by the argument -o or --outDir). This folder is created automatically if it does not exist.

1. Output Matrix File

The output matrix is also a tab-delimited file in the same format as the input matrix, except that eliminated mutations (columns) are excluded. When mutation elimination is allowed, this matrix therefore typically contains fewer columns than the input matrix. The output matrix is the genotype-corrected matrix: false positives and false negatives from the input are corrected, and each missing entry is set to 0 or 1. If the input file is INPUT_MATRIX.ext, the output matrix is stored in the file OUT_DIR/INPUT_MATRIX.CFMatrix. For example:

 input file: data/ALL2.SC
output file: OUT_DIR/ALL2.CFMatrix
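One way to see what was corrected is to compare the input matrix with the output matrix entry by entry, skipping eliminated mutations. The following Python 3 sketch assumes both matrices have already been loaded into nested dictionaries of the form {cell: {mutation: value}}; the function name `count_corrections` and this in-memory layout are illustrative assumptions, not part of PhISCS.

```python
def count_corrections(noisy, corrected):
    """Count entry changes between the input and the corrected matrix.

    Mutations missing from `corrected` (eliminated columns) are ignored.
    """
    counts = {"0->1": 0, "1->0": 0, "?->0": 0, "?->1": 0}
    for cell, genotype in corrected.items():
        for mut, new in genotype.items():
            old = noisy[cell][mut]
            if old != new:
                counts[old + "->" + new] += 1
    return counts


# Toy data: mut2 was eliminated, so it is absent from the corrected matrix.
noisy = {
    "cell0": {"mut0": "0", "mut1": "?", "mut2": "1"},
    "cell1": {"mut0": "0", "mut1": "1", "mut2": "0"},
}
corrected = {
    "cell0": {"mut0": "0", "mut1": "1"},
    "cell1": {"mut0": "1", "mut1": "1"},
}
counts = count_corrections(noisy, corrected)
print(counts)
```

Here "0->1" counts entries treated as false negatives, "1->0" entries treated as false positives, and the two "?" keys count missing entries that were filled in.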

2. Log File

The log file contains information about the particular run of PhISCS (e.g. the eliminated mutations and the likelihood value). The relevant entries in this file are self-explanatory. If the input file is INPUT_MATRIX.ext, the log is stored in the file OUT_DIR/INPUT_MATRIX.log. For example:

input file: data/ALL2.SC
  log file: OUT_DIR/ALL2.log
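The naming convention for the two output files can be reproduced in a script that post-processes results. This is a minimal Python 3 sketch based only on the examples above; the function name `output_paths` is hypothetical, and it assumes that only the final extension of the input file name is dropped.

```python
import os


def output_paths(sc_file, out_dir):
    """Return the expected (matrix path, log path) for a given input file."""
    base = os.path.splitext(os.path.basename(sc_file))[0]
    return (
        os.path.join(out_dir, base + ".CFMatrix"),
        os.path.join(out_dir, base + ".log"),
    )


matrix_path, log_path = output_paths("data/ALL2.SC", "OUT_DIR")
print(matrix_path)
print(log_path)
```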

Parameters

Parameter    Description                                          Default             Mandatory
-SCFile      Path to single-cell data matrix file                 -                   yes
-fn          Probability of false negative                        -                   yes
-fp          Probability of false positive                        -                   yes
-o           Output directory                                     current             no
-kmax        Max number of mutations to be eliminated             0                   no
-threads     Number of threads (supported by PhISCS-I)            1                   no
-bulkFile    Path to bulk data file                               -                   no
-delta       Delta parameter accounting for VAF variance          0.20                no
-time        Max time (in seconds) allowed for the computation    86400 (24 hours)    no
--drawTree   Draw output tree with Graphviz                       -                   no

Example

To run PhISCS without VAF information and without ISA violations:

python PhISCS-I -SCFile example/input.SC -fn 0.2 -fp 0.0001 -o result/

To run PhISCS without VAF information but with ISA violations:

python PhISCS-I -SCFile example/input.SC -fn 0.2 -fp 0.0001 -o result/ -kmax 1

To run PhISCS with both VAF information and ISA violations (with a time limit of 24 hours):

python PhISCS-I -SCFile example/input.SC -fn 0.2 -fp 0.0001 -o result/ -kmax 1 -bulkFile example/input.bulk -time 86400

To run PhISCS with VAF information but without ISA violations (and draw the output tree):

python PhISCS-I -SCFile example/input.SC -fn 0.2 -fp 0.0001 -o result/ -bulkFile example/input.bulk --drawTree
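When PhISCS is called from a pipeline, the commands above can be assembled programmatically. The Python 3 sketch below only builds the argument list from the flags documented in the Parameters table; the function name `build_command` and its keyword names are hypothetical, and nothing is executed here.

```python
def build_command(sc_file, fn, fp, out_dir, kmax=None, bulk_file=None,
                  time_limit=None, draw_tree=False, variant="PhISCS-I"):
    """Assemble a PhISCS command line as a list of strings."""
    cmd = ["python", variant, "-SCFile", sc_file,
           "-fn", str(fn), "-fp", str(fp), "-o", out_dir]
    if kmax is not None:
        cmd += ["-kmax", str(kmax)]
    if bulk_file is not None:
        cmd += ["-bulkFile", bulk_file]
    if time_limit is not None:
        cmd += ["-time", str(time_limit)]  # seconds
    if draw_tree:
        cmd.append("--drawTree")
    return cmd


cmd = build_command("example/input.SC", 0.2, 0.0001, "result/", kmax=1)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the PhISCS directory, assuming the prerequisites described under Installation are in place.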

Contact

If you have any questions, please e-mail us at smalikic@sfu.ca or frashidi@iu.edu.
