Merge branch 'develop'
bentsherman committed Dec 17, 2018
2 parents 16d4bb1 + 38ab56d commit c9a3177
Showing 147 changed files with 8,288 additions and 3,781 deletions.
64 changes: 39 additions & 25 deletions README.md
@@ -9,47 +9,61 @@ Correlation
- Spearman

Clustering
- K-means
- Gaussian mixture models

Thresholding
- Power-law
- Random matrix theory

# Installation
KINC is built with [ACE](https://github.com/SystemsGenetics/ACE), a framework which provides mechanisms for large-scale heterogeneous computing and data management. As such, KINC can be run in a variety of compute configurations, including single-core / single-GPU and multi-core / multi-GPU, and KINC uses its own binary file formats to represent the data objects that it produces. Each of these binary formats can be exported to a plain-text format for use in other applications.

This software uses GSL, OpenCL, and [ACE](https://github.com/SystemsGenetics/ACE). For instructions on installing ACE, see the project repository. For all other dependencies, consult your package manager. For example, to install dependencies on Ubuntu:
```
sudo apt install libgsl2 ocl-icd-opencl-dev libopenmpi-dev
```
## Installation

Refer to the files under `docs` for installation instructions. KINC is currently supported on most flavors of Linux.

To build & install KINC:
### Palmetto

To use KINC on Palmetto, you must add the following modules in lieu of installing dependencies through a package manager:
```bash
module add cuda-toolkit/9.2
module add gcc/5.4.0
module add git
module add gsl/2.3
module add openmpi/1.10.7
module add Qt/5.9.2
```
cd build
qmake ../src/KINC.pro
make qmake_all
make
make qmake_all
make install

## Usage

KINC provides two executables: `kinc`, the command-line version, and `qkinc`, the GUI version. The command-line version can use MPI while the GUI version can display data object files that are produced by KINC. KINC produces a gene-coexpression network in several steps:
1. `import-emx`: Import expression matrix text file into binary format
2. `similarity`: Compute a cluster matrix and correlation matrix from expression matrix
3. `threshold`: Determine an appropriate correlation threshold for correlation matrix
4. `extract`: Extract an edge list from a correlation matrix given a threshold

Below is an example usage of `kinc` on the Yeast dataset:
```
# import expression matrix into binary format
kinc run import-emx --input Yeast-GEM.txt --output Yeast.emx --nan NA
## Using the KINC GUI or Console
# compute similarity matrix (with GMM clustering)
mpirun -np 8 kinc run similarity --input Yeast.emx --ccm Yeast.ccm --cmx Yeast.cmx --clusmethod gmm --corrmethod spearman --minclus 1 --maxclus 5
ACE provides two different libraries for GUI and console applications. The `kinc` executable is the console or command line version and the `qkinc` executable is the GUI version.
# determine correlation threshold
kinc run rmt --input Yeast.cmx --log Yeast.log
# Usage
# read threshold from log file
THRESHOLD=$(tail -n 1 Yeast.log)
To build a GCN involves several steps:
# extract network file from thresholded similarity matrix
kinc run extract --emx Yeast.emx --ccm Yeast.ccm --cmx Yeast.cmx --output Yeast-net.txt --mincorr $THRESHOLD
```

1. Import expression matrix
2. Compute cluster composition matrix
3. Compute correlation matrix
4. Compute thresholded correlation matrix
A more thorough example usage is provided in `scripts/run-all.sh`.
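The shell example above reads the threshold with `tail -n 1 Yeast.log`. For workflows driven from Python, the same step might look like the sketch below. It assumes, as the shell example implies, that the final line of the RMT log holds only the threshold value; the function name `read_threshold` is hypothetical and not part of KINC.

```python
from pathlib import Path


def read_threshold(log_path):
    """Return the last non-empty line of an RMT log file as a float.

    Assumption: the final line of the log contains only the threshold,
    mirroring `THRESHOLD=$(tail -n 1 Yeast.log)` in the shell example.
    """
    lines = [line.strip() for line in Path(log_path).read_text().splitlines()]
    lines = [line for line in lines if line]  # ignore blank lines
    if not lines:
        raise ValueError(f"no threshold found in {log_path}")
    return float(lines[-1])
```

Unlike `tail -n 1`, this sketch skips blank lines at the end of the file and raises an error if the last line is not a number, which may be preferable before passing the value to `--mincorr`.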

# Troubleshooting
## An error occurred in MPI_Init
KINC requires MPI as a dependency, but on most systems you can execute the command-line KINC as a stand-alone tool without using 'mpirun'. This is because KINC checks during runtime if MPI is appropriate for execution. However, on a SLURM cluster where MPI jobs must be run using the srun command and where PMI2 is compiled into MPI, then KINC cannot be executed stand-alone. It must be executed using srun with the --mpi argument set to pmi2. For example:
### Running KINC on SLURM

Although KINC is an MPI application, generally you can run `kinc` as a stand-alone application without `mpirun` and achieve normal serial behavior. However, on a SLURM cluster where MPI jobs must be run with the `srun` command and where PMI2 is compiled into MPI, `kinc` cannot be executed stand-alone. It must be executed using `srun` with the additional argument `--mpi=pmi2`. For example:
```
srun --mpi=pmi2 kinc run import-emx --input Yeast-ematrix.txt --output Yeast.emx --nan NA
```

3 changes: 0 additions & 3 deletions build-tests/.gitignore

This file was deleted.

15 changes: 9 additions & 6 deletions docs/Ubuntu_16_04_Setup.md
@@ -7,7 +7,7 @@ Use the following steps to setup KINC for development on Ubuntu 16.04:

Most of the dependencies are available as packages:
```bash
sudo apt install g++ libgsl-dev libopenblas-dev libopenmpi-dev ocl-icd-opencl-dev
sudo apt install build-essential libgsl-dev libopenblas-dev libopenmpi-dev ocl-icd-opencl-dev
```

For device drivers (AMD, Intel, NVIDIA, etc), refer to the manufacturer's website.
@@ -25,7 +25,7 @@ If you install Qt locally then you must add Qt to the executable path:

```bash
# append to ~/.bashrc
export QTDIR="$HOME/Qt/5.10.1/gcc_64"
export QTDIR="$HOME/Qt/5.7.1/gcc_64"
export PATH="$QTDIR/bin:$PATH"
```

@@ -34,8 +34,8 @@ export PATH="$QTDIR/bin:$PATH"
Clone the ACE and KINC repositories from GitHub.

```bash
git clone git@github.com:SystemsGenetics/ACE.git
git clone git@github.com:SystemsGenetics/KINC.git
git clone https://github.com/SystemsGenetics/ACE.git
git clone https://github.com/SystemsGenetics/KINC.git
```

## Step 3: Build ACE and KINC
@@ -45,14 +45,17 @@ Follow the ACE instructions to build ACE. If you install ACE locally then you mu
```bash
# append to ~/.bashrc
export INSTALL_PREFIX="$HOME/software"
export PATH="$INSTALL_PREFIX/bin:$PATH"
export CPLUS_INCLUDE_PATH="$INSTALL_PREFIX/include:$CPLUS_INCLUDE_PATH"
export LIBRARY_PATH="$INSTALL_PREFIX/lib:$LIBRARY_PATH"
export LD_LIBRARY_PATH="$INSTALL_PREFIX/lib:$LD_LIBRARY_PATH"
```

Build & install KINC:

```bash
cd build
qmake ../src/KINC.pro
qmake ../src/KINC.pro PREFIX=$INSTALL_PREFIX
make qmake_all
make
make qmake_all
@@ -63,4 +66,4 @@ You should now be able to run KINC.

## (Optional) Use QtCreator

Select **File** > **Open File or Project** and then navigate in the file browser to the ACE directory and select the ACE.pro file. Navigate through configure setup. Repeat for KINC.
Select __File__ > __Open File or Project__ and then navigate in the file browser to the ACE directory and select the ACE.pro file. Navigate through configure setup. Repeat for KINC.
58 changes: 58 additions & 0 deletions scripts/extract.py
@@ -0,0 +1,58 @@
import argparse
import pandas as pd



if __name__ == "__main__":
    # parse command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--emx", required=True, help="expression matrix file", dest="EMX")
    parser.add_argument("--cmx", required=True, help="correlation matrix file", dest="CMX")
    parser.add_argument("-o", "--output", required=True, help="output net file", dest="OUTPUT")
    parser.add_argument("--mincorr", type=float, default=0, help="minimum absolute correlation threshold", dest="MINCORR")
    parser.add_argument("--maxcorr", type=float, default=1, help="maximum absolute correlation threshold", dest="MAXCORR")

    args = parser.parse_args()

    # load data
    emx = pd.read_table(args.EMX)
    cmx = pd.read_table(args.CMX, header=None, names=[
        "x",
        "y",
        "Cluster",
        "Num_Clusters",
        "Cluster_Samples",
        "Missing_Samples",
        "Cluster_Outliers",
        "Pair_Outliers",
        "Too_Low",
        "sc",
        "Samples"
    ])

    # extract correlations within thresholds
    cmx = cmx[(args.MINCORR <= abs(cmx["sc"])) & (abs(cmx["sc"]) <= args.MAXCORR)]

    # insert additional columns used in netlist format
    cmx.insert(len(cmx.columns), "Source", [emx.index[x] for x in cmx["x"]])
    cmx.insert(len(cmx.columns), "Target", [emx.index[y] for y in cmx["y"]])
    cmx.insert(len(cmx.columns), "Interaction", ["co" for idx in cmx.index])

    # reorder columns to netlist format
    cmx = cmx[[
        "Source",
        "Target",
        "sc",
        "Interaction",
        "Cluster",
        "Num_Clusters",
        "Cluster_Samples",
        "Missing_Samples",
        "Cluster_Outliers",
        "Pair_Outliers",
        "Too_Low",
        "Samples"
    ]]

    # save output data
    cmx.to_csv(args.OUTPUT, sep="\t", index=False)
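
The filter in `extract.py` keeps an edge when the absolute value of its correlation lies between `--mincorr` and `--maxcorr`, so strong negative correlations are retained alongside positive ones. The sketch below illustrates that rule on a toy edge list using only the standard library; the helper name `within_threshold` and the sample edges are made up for illustration and do not appear in KINC.

```python
def within_threshold(corr, mincorr=0.0, maxcorr=1.0):
    """Return True if |corr| falls in [mincorr, maxcorr], as in extract.py."""
    return mincorr <= abs(corr) <= maxcorr


# Hypothetical (source, target, correlation) triples.
edges = [
    ("g1", "g2", 0.92),
    ("g1", "g3", -0.75),  # negative, but |corr| passes the 0.5 cutoff
    ("g2", "g3", 0.10),   # too weak, dropped
]

kept = [edge for edge in edges if within_threshold(edge[2], mincorr=0.5)]
```

With a minimum of 0.5, `kept` holds the first two edges and drops the third.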
73 changes: 73 additions & 0 deletions scripts/run-all-py.sh
@@ -0,0 +1,73 @@
#!/bin/bash

# parse command-line arguments
if [[ $# != 1 ]]; then
echo "usage: $0 <infile>"
exit -1
fi

# define analytic flags
DO_SIMILARITY=1
DO_THRESHOLD=1
DO_EXTRACT=1

# define input/output files
DATA="data"
EMX_FILE="$1"
CMX_FILE="$DATA/$(basename $EMX_FILE .txt)-cmx-py.txt"
NET_FILE="$DATA/$(basename $EMX_FILE .txt)-net-py.txt"

# similarity
if [[ $DO_SIMILARITY = 1 ]]; then
CLUSMETHOD="gmm"
CORRMETHOD="pearson"
MINEXPR="-inf"
MINCLUS=1
MAXCLUS=5
CRITERION="bic"
PREOUT="--preout"
POSTOUT="--postout"
MINCORR=0
MAXCORR=1

python scripts/similarity.py \
-i $EMX_FILE \
-o $CMX_FILE \
--clusmethod $CLUSMETHOD \
--corrmethod $CORRMETHOD \
--minexpr=$MINEXPR \
--minclus $MINCLUS --maxclus $MAXCLUS \
--crit $CRITERION \
$PREOUT $POSTOUT \
--mincorr $MINCORR --maxcorr $MAXCORR
fi

# threshold
if [[ $DO_THRESHOLD = 1 ]]; then
NUM_GENES=$(expr $(cat $EMX_FILE | wc -l) - 1)
METHOD="rmt"
TSTART=0.99
TSTEP=0.001
TSTOP=0.50

python scripts/threshold.py \
-i $CMX_FILE \
--genes $NUM_GENES \
--method $METHOD \
--tstart $TSTART \
--tstep $TSTEP \
--tstop $TSTOP
fi

# extract
if [[ $DO_EXTRACT = 1 ]]; then
MINCORR=0
MAXCORR=1

python scripts/extract.py \
--emx $EMX_FILE \
--cmx $CMX_FILE \
--output $NET_FILE \
--mincorr $MINCORR \
--maxcorr $MAXCORR
fi
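
The threshold step in this script computes `NUM_GENES` as the line count of the expression matrix minus one, which presumably accounts for a single header row of sample names. A rough Python equivalent is sketched below; the function name `count_genes` is hypothetical, and the clamp to zero is an added guard that the shell version does not have.

```python
def count_genes(emx_path):
    """Count data rows in a plain-text expression matrix.

    Assumption: exactly one header row precedes the gene rows,
    matching `expr $(cat $EMX_FILE | wc -l) - 1` in the script.
    """
    with open(emx_path) as handle:
        num_lines = sum(1 for _ in handle)
    return max(num_lines - 1, 0)
```

Note that `wc -l` counts newline characters, so the two approaches differ by one when the file lacks a trailing newline.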
108 changes: 108 additions & 0 deletions scripts/run-all.sh
@@ -0,0 +1,108 @@
#!/bin/bash

# parse command-line arguments
if [[ $# != 1 ]]; then
echo "usage: $0 <infile>"
exit -1
fi

GPU=1

# define analytic flags
DO_IMPORT_EMX=1
DO_SIMILARITY=1
DO_EXPORT_CMX=1
DO_THRESHOLD=1
DO_EXTRACT=1

# define input/output files
INFILE="$1"
DATA="data"
EMX_FILE="$DATA/$(basename $INFILE .txt).emx"
CCM_FILE="$DATA/$(basename $EMX_FILE .emx).ccm"
CMX_FILE="$DATA/$(basename $EMX_FILE .emx).cmx"
LOGS="logs"
RMT_FILE="$LOGS/$(basename $CMX_FILE .cmx).txt"

# apply settings
if [[ $GPU == 1 ]]; then
kinc settings set opencl 0:0
kinc settings set threads 4
kinc settings set logging off

NP=1
else
kinc settings set opencl none
kinc settings set logging off

NP=$(nproc)
fi

# import emx
if [[ $DO_IMPORT_EMX = 1 ]]; then
kinc run import-emx \
--input $INFILE \
--output $EMX_FILE \
--nan NA
fi

# similarity
if [[ $DO_SIMILARITY = 1 ]]; then
CLUSMETHOD="gmm"
CORRMETHOD="pearson"
MINEXPR="-inf"
MINCLUS=1
MAXCLUS=5
CRITERION="BIC"
PREOUT="--preout"
POSTOUT="--postout"
MINCORR=0.5
MAXCORR=1

mpirun -np $NP kinc run similarity \
--input $EMX_FILE \
--ccm $CCM_FILE \
--cmx $CMX_FILE \
--clusmethod $CLUSMETHOD \
--corrmethod $CORRMETHOD \
--minexpr $MINEXPR \
--minclus $MINCLUS --maxclus $MAXCLUS \
--crit $CRITERION \
$PREOUT $POSTOUT \
--mincorr $MINCORR --maxcorr $MAXCORR
fi

# export cmx
if [[ $DO_EXPORT_CMX = 1 ]]; then
OUTFILE="$DATA/$(basename $CMX_FILE .cmx)-cmx.txt"

kinc run export-cmx \
--emx $EMX_FILE \
--ccm $CCM_FILE \
--cmx $CMX_FILE \
--output $OUTFILE
fi

# threshold
if [[ $DO_THRESHOLD = 1 ]]; then
mkdir -p $LOGS

kinc run rmt \
--input $CMX_FILE \
--log $RMT_FILE
fi

# extract
if [[ $DO_EXTRACT = 1 ]]; then
NET_FILE="$DATA/$(basename $EMX_FILE .emx)-net.txt"
MINCORR=0
MAXCORR=1

kinc run extract \
--emx $EMX_FILE \
--ccm $CCM_FILE \
--cmx $CMX_FILE \
--output $NET_FILE \
--mincorr $MINCORR \
--maxcorr $MAXCORR
fi
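
The script above derives every intermediate file name from the input file with `basename`. The sketch below mirrors that naming scheme in Python so the outputs can be located from other tooling; the function name `derive_paths` and the dictionary keys are hypothetical, while the directory names and suffixes follow the script.

```python
from pathlib import PurePosixPath


def derive_paths(infile, data_dir="data", logs_dir="logs"):
    """Mirror the file naming used in run-all.sh for a given input file."""
    name = PurePosixPath(infile).name
    # Like `basename FILE .txt`: strip the suffix only if something remains.
    if name.endswith(".txt") and name != ".txt":
        name = name[: -len(".txt")]
    return {
        "emx": f"{data_dir}/{name}.emx",
        "ccm": f"{data_dir}/{name}.ccm",
        "cmx": f"{data_dir}/{name}.cmx",
        "cmx_txt": f"{data_dir}/{name}-cmx.txt",  # export-cmx output
        "rmt_log": f"{logs_dir}/{name}.txt",
        "net": f"{data_dir}/{name}-net.txt",
    }
```

For example, `derive_paths("input/Yeast-GEM.txt")["net"]` gives `data/Yeast-GEM-net.txt`.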