diff --git a/README.md b/README.md index 4e4738c..dcc6a36 100644 --- a/README.md +++ b/README.md @@ -9,47 +9,61 @@ Correlation - Spearman Clustering -- K-means - Gaussian mixture models Thresholding +- Power-law - Random matrix theory -# Installation +KINC is built with [ACE](https://github.com/SystemsGenetics/ACE), a framework which provides mechanisms for large-scale heterogeneous computing and data management. As such, KINC can be run in a variety of compute configurations, including single-core / single-GPU and multi-core / multi-GPU, and KINC uses its own binary file formats to represent the data objects that it produces. Each of these binary formats can be exported to a plain-text format for use in other applications. -This software uses GSL, OpenCL, and [ACE](https://github.com/SystemsGenetics/ACE). For instructions on installing ACE, see the project repository. For all other dependencies, consult your package manager. For example, to install dependencies on Ubuntu: -``` -sudo apt install libgsl2 ocl-icd-opencl-dev libopenmpi-dev -``` +## Installation + +Refer to the files under `docs` for installation instructions. KINC is currently supported on most flavors of Linux. -To build & install KINC: +### Palmetto + +To use KINC on Palmetto, you must add the following modules in lieu of installing dependencies through a package manager: +```bash +module add cuda-toolkit/9.2 +module add gcc/5.4.0 +module add git +module add gsl/2.3 +module add openmpi/1.10.7 +module add Qt/5.9.2 ``` -cd build -qmake ../src/KINC.pro -make qmake_all -make -make qmake_all -make install + +## Usage + +KINC provides two executables: `kinc`, the command-line version, and `qkinc`, the GUI version. The command-line version can use MPI while the GUI version can display data object files that are produced by KINC. KINC produces a gene-coexpression network in several steps: +1. `import-emx`: Import expression matrix text file into binary format +2. 
`similarity`: Compute a cluster matrix and correlation matrix from expression matrix +3. `threshold`: Determine an appropriate correlation threshold for correlation matrix +4. `extract`: Extract an edge list from a correlation matrix given a threshold + +Below is an example usage of `kinc` on the Yeast dataset: ``` +# import expression matrix into binary format +kinc run import-emx --input Yeast-GEM.txt --output Yeast.emx --nan NA -## Using the KINC GUI or Console +# compute similarity matrix (with GMM clustering) +mpirun -np 8 kinc run similarity --input Yeast.emx --ccm Yeast.ccm --cmx Yeast.cmx --clusmethod gmm --corrmethod spearman --minclus 1 --maxclus 5 -ACE provides two different libraries for GUI and console applications. The `kinc` executable is the console or command line version and the `qkinc` executable is the GUI version. +# determine correlation threshold +kinc run rmt --input Yeast.cmx --log Yeast.log -# Usage +# read threshold from log file +THRESHOLD=$(tail -n 1 Yeast.log) -To build a GCN involves several steps: +# extract network file from thresholded similarity matrix +kinc run extract --emx Yeast.emx --ccm Yeast.ccm --cmx Yeast.cmx --output Yeast-net.txt --mincorr $THRESHOLD +``` -1. Import expression matrix -2. Compute cluster composition matrix -3. Compute correlation matrix -4. Compute thresholded correlation matrix +A more thorough example usage is provided in `scripts/run-all.sh`. -# Troubleshooting -## An error occurred in MPI_Init -KINC requires MPI as a dependency, but on most systems you can execute the command-line KINC as a stand-alone tool without using 'mpirun'. This is because KINC checks during runtime if MPI is appropriate for execution. However, on a SLURM cluster where MPI jobs must be run using the srun command and where PMI2 is compiled into MPI, then KINC cannot be executed stand-alone. It must be executed using srun with the --mpi argument set to pmi2. 
For example: +### Running KINC on SLURM +Although KINC is an MPI application, generally you can run `kinc` as a stand-alone application without `mpirun` and achieve normal serial behavior. However, on a SLURM cluster where MPI jobs must be run with the `srun` command and where PMI2 is compiled into MPI, `kinc` cannot be executed stand-alone. It must be executed using `srun` with the additional argument `--mpi=pmi2`. For example: ``` srun --mpi=pmi2 kinc run import_emx --input Yeast-ematrix.txt --output Yeast.emx --nan NA ``` - diff --git a/build-tests/.gitignore b/build-tests/.gitignore deleted file mode 100644 index a5baada..0000000 --- a/build-tests/.gitignore +++ /dev/null @@ -1,3 +0,0 @@ -* -!.gitignore - diff --git a/docs/Ubuntu_16_04_Setup.md b/docs/Ubuntu_16_04_Setup.md index a9274ae..54d1a45 100644 --- a/docs/Ubuntu_16_04_Setup.md +++ b/docs/Ubuntu_16_04_Setup.md @@ -7,7 +7,7 @@ Use the following steps to setup KINC for development on Ubuntu 16.04: Most of the dependencies are available as packages: ```bash -sudo apt install g++ libgsl-dev libopenblas-dev libopenmpi-dev ocl-icd-opencl-dev +sudo apt install build-essential libgsl-dev libopenblas-dev libopenmpi-dev ocl-icd-opencl-dev ``` For device drivers (AMD, Intel, NVIDIA, etc), refer to the manufacturer's website. @@ -25,7 +25,7 @@ If you install Qt locally then you must add Qt to the executable path: ```bash # append to ~/.bashrc -export QTDIR="$HOME/Qt/5.10.1/gcc_64" +export QTDIR="$HOME/Qt/5.7.1/gcc_64" export PATH="$QTDIR/bin:$PATH" ``` @@ -34,8 +34,8 @@ export PATH="$QTDIR/bin:$PATH" Clone the ACE and KINC repositories from Github. ```bash -git clone git@github.com:SystemsGenetics/ACE.git -git clone git@github.com:SystemsGenetics/KINC.git +git clone https://github.com/SystemsGenetics/ACE.git +git clone https://github.com/SystemsGenetics/KINC.git ``` ## Step 3: Build ACE and KINC @@ -45,6 +45,9 @@ Follow the ACE instructions to build ACE. 
If you install ACE locally then you mu ```bash # append to ~/.bashrc export INSTALL_PREFIX="$HOME/software" +export PATH="$INSTALL_PREFIX/bin:$PATH" +export CPLUS_INCLUDE_PATH="$INSTALL_PREFIX/include:$CPLUS_INCLUDE_PATH" +export LIBRARY_PATH="$INSTALL_PREFIX/lib:$LIBRARY_PATH" export LD_LIBRARY_PATH="$INSTALL_PREFIX/lib:$LD_LIBRARY_PATH" ``` @@ -52,7 +55,7 @@ Build & install KINC: ```bash cd build -qmake ../src/KINC.pro +qmake ../src/KINC.pro PREFIX=$INSTALL_PREFIX make qmake_all make make qmake_all @@ -63,4 +66,4 @@ You should now be able to run KINC. ## (Optional) Use QtCreator -Select **File** > **Open File or Project** and then navigate in the file browser to the ACE directory and select the ACE.pro file. Navigate through configure setup. Repeat for KINC. +Select __File__ > __Open File or Project__ and then navigate in the file browser to the ACE directory and select the ACE.pro file. Navigate through configure setup. Repeat for KINC. diff --git a/scripts/extract.py b/scripts/extract.py new file mode 100644 index 0000000..c571957 --- /dev/null +++ b/scripts/extract.py @@ -0,0 +1,58 @@ +import argparse +import pandas as pd + + + +if __name__ == "__main__": + # parse command-line arguments + parser = argparse.ArgumentParser() + parser.add_argument("--emx", required=True, help="expression matrix file", dest="EMX") + parser.add_argument("--cmx", required=True, help="correlation matrix file", dest="CMX") + parser.add_argument("-o", "--output", required=True, help="output net file", dest="OUTPUT") + parser.add_argument("--mincorr", type=float, default=0, help="minimum absolute correlation threshold", dest="MINCORR") + parser.add_argument("--maxcorr", type=float, default=1, help="maximum absolute correlation threshold", dest="MAXCORR") + + args = parser.parse_args() + + # load data + emx = pd.read_table(args.EMX) + cmx = pd.read_table(args.CMX, header=None, names=[ + "x", + "y", + "Cluster", + "Num_Clusters", + "Cluster_Samples", + "Missing_Samples", + 
"Cluster_Outliers", + "Pair_Outliers", + "Too_Low", + "sc", + "Samples" + ]) + + # extract correlations within thresholds + cmx = cmx[(args.MINCORR <= abs(cmx["sc"])) & (abs(cmx["sc"]) <= args.MAXCORR)] + + # insert additional columns used in netlist format + cmx.insert(len(cmx.columns), "Source", [emx.index[x] for x in cmx["x"]]) + cmx.insert(len(cmx.columns), "Target", [emx.index[y] for y in cmx["y"]]) + cmx.insert(len(cmx.columns), "Interaction", ["co" for idx in cmx.index]) + + # reorder columns to netlist format + cmx = cmx[[ + "Source", + "Target", + "sc", + "Interaction", + "Cluster", + "Num_Clusters", + "Cluster_Samples", + "Missing_Samples", + "Cluster_Outliers", + "Pair_Outliers", + "Too_Low", + "Samples" + ]] + + # save output data + cmx.to_csv(args.OUTPUT, sep="\t", index=False) diff --git a/scripts/run-all-py.sh b/scripts/run-all-py.sh new file mode 100755 index 0000000..3e21d89 --- /dev/null +++ b/scripts/run-all-py.sh @@ -0,0 +1,73 @@ +#!/bin/bash + +# parse command-line arguments +if [[ $# != 1 ]]; then + echo "usage: $0 " + exit -1 +fi + +# define analytic flags +DO_SIMILARITY=1 +DO_THRESHOLD=1 +DO_EXTRACT=1 + +# define input/output files +DATA="data" +EMX_FILE="$1" +CMX_FILE="$DATA/$(basename $EMX_FILE .txt)-cmx-py.txt" +NET_FILE="$DATA/$(basename $EMX_FILE .txt)-net-py.txt" + +# similarity +if [[ $DO_SIMILARITY = 1 ]]; then + CLUSMETHOD="gmm" + CORRMETHOD="pearson" + MINEXPR="-inf" + MINCLUS=1 + MAXCLUS=5 + CRITERION="bic" + PREOUT="--preout" + POSTOUT="--postout" + MINCORR=0 + MAXCORR=1 + + python scripts/similarity.py \ + -i $EMX_FILE \ + -o $CMX_FILE \ + --clusmethod $CLUSMETHOD \ + --corrmethod $CORRMETHOD \ + --minexpr=$MINEXPR \ + --minclus $MINCLUS --maxclus $MAXCLUS \ + --crit $CRITERION \ + $PREOUT $POSTOUT \ + --mincorr $MINCORR --maxcorr $MAXCORR +fi + +# threshold +if [[ $DO_THRESHOLD = 1 ]]; then + NUM_GENES=$(expr $(cat $EMX_FILE | wc -l) - 1) + METHOD="rmt" + TSTART=0.99 + TSTEP=0.001 + TSTOP=0.50 + + python scripts/threshold.py \ 
+ -i $CMX_FILE \ + --genes $NUM_GENES \ + --method $METHOD \ + --tstart $TSTART \ + --tstep $TSTEP \ + --tstop $TSTOP +fi + +# extract +if [[ $DO_EXTRACT = 1 ]]; then + MINCORR=0 + MAXCORR=1 + + python scripts/extract.py \ + --emx $EMX_FILE \ + --cmx $CMX_FILE \ + --output $NET_FILE \ + --mincorr $MINCORR \ + --maxcorr $MAXCORR +fi diff --git a/scripts/run-all.sh b/scripts/run-all.sh new file mode 100755 index 0000000..0fc3e0c --- /dev/null +++ b/scripts/run-all.sh @@ -0,0 +1,108 @@ +#!/bin/bash + +# parse command-line arguments +if [[ $# != 1 ]]; then + echo "usage: $0 " + exit -1 +fi + +GPU=1 + +# define analytic flags +DO_IMPORT_EMX=1 +DO_SIMILARITY=1 +DO_EXPORT_CMX=1 +DO_THRESHOLD=1 +DO_EXTRACT=1 + +# define input/output files +INFILE="$1" +DATA="data" +EMX_FILE="$DATA/$(basename $INFILE .txt).emx" +CCM_FILE="$DATA/$(basename $EMX_FILE .emx).ccm" +CMX_FILE="$DATA/$(basename $EMX_FILE .emx).cmx" +LOGS="logs" +RMT_FILE="$LOGS/$(basename $CMX_FILE .cmx).txt" + +# apply settings +if [[ $GPU == 1 ]]; then + kinc settings set opencl 0:0 + kinc settings set threads 4 + kinc settings set logging off + + NP=1 +else + kinc settings set opencl none + kinc settings set logging off + + NP=$(nproc) +fi + +# import emx +if [[ $DO_IMPORT_EMX = 1 ]]; then + kinc run import-emx \ + --input $INFILE \ + --output $EMX_FILE \ + --nan NA +fi + +# similarity +if [[ $DO_SIMILARITY = 1 ]]; then + CLUSMETHOD="gmm" + CORRMETHOD="pearson" + MINEXPR="-inf" + MINCLUS=1 + MAXCLUS=5 + CRITERION="BIC" + PREOUT="--preout" + POSTOUT="--postout" + MINCORR=0.5 + MAXCORR=1 + + mpirun -np $NP kinc run similarity \ + --input $EMX_FILE \ + --ccm $CCM_FILE \ + --cmx $CMX_FILE \ + --clusmethod $CLUSMETHOD \ + --corrmethod $CORRMETHOD \ + --minexpr $MINEXPR \ + --minclus $MINCLUS --maxclus $MAXCLUS \ + --crit $CRITERION \ + $PREOUT $POSTOUT \ + --mincorr $MINCORR --maxcorr $MAXCORR +fi + +# export cmx +if [[ $DO_EXPORT_CMX = 1 ]]; then + OUTFILE="$DATA/$(basename $CMX_FILE .cmx)-cmx.txt" + + kinc run 
export-cmx \ + --emx $EMX_FILE \ + --ccm $CCM_FILE \ + --cmx $CMX_FILE \ + --output $OUTFILE +fi + +# threshold +if [[ $DO_THRESHOLD = 1 ]]; then + mkdir -p $LOGS + + kinc run rmt \ + --input $CMX_FILE \ + --log $RMT_FILE +fi + +# extract +if [[ $DO_EXTRACT = 1 ]]; then + NET_FILE="$DATA/$(basename $EMX_FILE .emx)-net.txt" + MINCORR=0 + MAXCORR=1 + + kinc run extract \ + --emx $EMX_FILE \ + --ccm $CCM_FILE \ + --cmx $CMX_FILE \ + --output $NET_FILE \ + --mincorr $MINCORR \ + --maxcorr $MAXCORR +fi diff --git a/scripts/similarity.py b/scripts/similarity.py new file mode 100644 index 0000000..e88f9bf --- /dev/null +++ b/scripts/similarity.py @@ -0,0 +1,230 @@ +import argparse +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import pprint +import scipy.stats +import seaborn as sns +import sklearn.cluster +import sklearn.mixture + + + +def create_gmm(n_clusters): + return sklearn.mixture.GaussianMixture(n_components=n_clusters) + + + +def create_kmeans(n_clusters): + return sklearn.cluster.KMeans(n_clusters=n_clusters, n_jobs=-1) + + + +def fetch_pair(emx, i, j, min_expression): + # extract pairwise data + X = emx.iloc[[i, j]].values.T + + # initialize labels + y = np.zeros((X.shape[0],), dtype=int) + + # mark thresholded samples + y[(X[:, 0] < min_expression) | (X[:, 1] < min_expression)] = -6 + + # mark nan samples + y[np.isnan(X[:, 0]) | np.isnan(X[:, 1])] = -9 + + return (X, y) + + + +def mark_outliers(X, labels, k, marker): + # extract samples in cluster k + mask = (labels == k) + x = np.copy(X[mask, 0]) + y = np.copy(X[mask, 1]) + + # make sure cluster is not empty + if len(x) == 0 or len(y) == 0: + return + + # sort arrays + x.sort() + y.sort() + + # compute quartiles and thresholds for each axis + n = len(x) + + Q1_x = x[n * 1 // 4] + Q3_x = x[n * 3 // 4] + T_x_min = Q1_x - 1.5 * (Q3_x - Q1_x) + T_x_max = Q3_x + 1.5 * (Q3_x - Q1_x) + + Q1_y = y[n * 1 // 4] + Q3_y = y[n * 3 // 4] + T_y_min = Q1_y - 1.5 * (Q3_y - Q1_y) + T_y_max = Q3_y + 
1.5 * (Q3_y - Q1_y) + + # mark outliers + for i in range(len(labels)): + if labels[i] == k: + outlier_x = (X[i, 0] < T_x_min or T_x_max < X[i, 0]) + outlier_y = (X[i, 1] < T_y_min or T_y_max < X[i, 1]) + + if outlier_x or outlier_y: + labels[i] = marker + + + +def compute_clustering(X, y, create_model, min_samples, min_clusters, max_clusters, criterion): + # extract clean pairwise data + mask = (y == 0) + X_clean = X[mask] + N = X_clean.shape[0] + + # make sure there are enough samples + K = 0 + + if N >= min_samples: + # initialize clustering models + models = [create_model(K) for K in range(min_clusters, max_clusters+1)] + min_crit = float("inf") + + # identify number of clusters + for model in models: + # fit model + model.fit(X_clean) + + # compute criterion value + if criterion == "aic": + crit = model.aic(X_clean) + elif criterion == "bic": + crit = model.bic(X_clean) + + # save the best model + if crit < min_crit: + min_crit = crit + K = len(model.weights_) + y[mask] = model.predict(X_clean) + + return K, y + + + +def compute_correlation(X, y, k, method, min_samples, visualize): + # extract samples in cluster k + X_k = X[y == k] + + # make sure there are enough samples + if X_k.shape[0] < min_samples: + return None, None + + # compute correlation + corr, p = method(X_k[:, 0], X_k[:, 1]) + + # plot results + if visualize: + sns.jointplot(x=X_k[:, 0], y=X_k[:, 1], kind="reg", stat_func=method) + plt.show() + + return corr, p + + + +if __name__ == "__main__": + # define clustering methods + CLUSTERING_METHODS = { + "none": None, + "gmm": create_gmm, + "kmeans": create_kmeans + } + + # define correlation methods + CORRELATION_METHODS = { + "kendall": scipy.stats.kendalltau, + "pearson": scipy.stats.pearsonr, + "spearman": scipy.stats.spearmanr + } + + # parse command-line arguments + parser = argparse.ArgumentParser() + parser.add_argument("-i", "--input", required=True, help="expression matrix file", dest="INPUT") + parser.add_argument("-o", "--output", 
required=True, help="correlation file", dest="OUTPUT") + parser.add_argument("--clusmethod", default="none", choices=["none", "gmm", "kmeans"], help="clustering method", dest="CLUSMETHOD") + parser.add_argument("--corrmethod", default="pearson", choices=["kendall", "pearson", "spearman"], help="correlation method", dest="CORRMETHOD") + parser.add_argument("--minexpr", type=float, default=-float("inf"), help="minimum expression threshold", dest="MINEXPR") + parser.add_argument("--minsamp", type=int, default=30, help="minimum sample size", dest="MINSAMP") + parser.add_argument("--minclus", type=int, default=1, help="minimum clusters", dest="MINCLUS") + parser.add_argument("--maxclus", type=int, default=5, help="maximum clusters", dest="MAXCLUS") + parser.add_argument("--crit", default="bic", choices=["aic", "bic"], help="model selection criterion", dest="CRITERION") + parser.add_argument("--preout", action="store_true", help="whether to remove pre-clustering outliers", dest="PREOUT") + parser.add_argument("--postout", action="store_true", help="whether to remove post-clustering outliers", dest="POSTOUT") + parser.add_argument("--mincorr", type=float, default=0, help="minimum absolute correlation threshold", dest="MINCORR") + parser.add_argument("--maxcorr", type=float, default=1, help="maximum absolute correlation threshold", dest="MAXCORR") + parser.add_argument("--pvalue", type=float, default=float("inf"), help="maximum p-value threshold for correlations", dest="MAXPVALUE") + parser.add_argument("--visualize", action="store_true", help="whether to visualize results", dest="VISUALIZE") + + args = parser.parse_args() + + # print arguments + pprint.pprint(vars(args)) + + # load data + emx = pd.read_table(args.INPUT) + cmx = open(args.OUTPUT, "w"); + + # iterate through each pair + for i in range(len(emx.index)): + for j in range(i): + # fetch pairwise input data + X, y = fetch_pair(emx, i, j, args.MINEXPR) + + # remove pre-clustering outliers + if args.PREOUT: + 
mark_outliers(X, y, 0, -7) + + # perform clustering + K = 1 + + if args.CLUSMETHOD != "none": + K, y = compute_clustering(X, y, CLUSTERING_METHODS[args.CLUSMETHOD], args.MINSAMP, args.MINCLUS, args.MAXCLUS, args.CRITERION) + + print("%4d %4d %d" % (i, j, K)) + + # remove post-clustering outliers + if K > 1 and args.POSTOUT: + for k in range(K): + mark_outliers(X, y, k, -8) + + # perform correlation + correlations = [compute_correlation(X, y, k, CORRELATION_METHODS[args.CORRMETHOD], args.MINSAMP, args.VISUALIZE) for k in range(K)] + + # save correlation matrix + valid = [(corr != None and args.MINCORR <= abs(corr) and abs(corr) <= args.MAXCORR and p <= args.MAXPVALUE) for corr, p in correlations] + num_clusters = sum(valid) + cluster_idx = 0 + + for k in range(K): + corr, p = correlations[k] + + # make sure correlation, p-value meets thresholds + if valid[k]: + # compute sample mask + y_k = np.copy(y) + y_k[(y_k >= 0) & (y_k != k)] = 0 + y_k[y_k == k] = 1 + y_k[y_k < 0] *= -1 + + sample_mask = "".join([str(y_i) for y_i in y_k]) + + # compute summary statistics + num_samples = sum(y_k == 1) + num_threshold = sum(y_k == 6) + num_preout = sum(y_k == 7) + num_postout = sum(y_k == 8) + num_missing = sum(y_k == 9) + + # write correlation to file + cmx.write("%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%0.8f\t%s\n" % (i, j, cluster_idx, num_clusters, num_samples, num_missing, num_postout, num_preout, num_threshold, corr, sample_mask)) + + # increment cluster index + cluster_idx += 1 diff --git a/scripts/test-gmm.py b/scripts/test-gmm.py deleted file mode 100644 index ca00d68..0000000 --- a/scripts/test-gmm.py +++ /dev/null @@ -1,59 +0,0 @@ -import matplotlib.pyplot as plt -import pandas as pd -import sklearn.mixture -import sys - - - -if __name__ == "__main__": - if len(sys.argv) != 2: - print "usage: python test-gmm.py [infile]" - sys.exit(1) - - # load data - emx = pd.read_csv(sys.argv[1], sep="\t") - - # iterate through each pair - for i in xrange(len(emx.index)): - for j in 
xrange(i): - # extract pairwise data - X = emx.iloc[[i, j]].dropna(axis=1, how="any") - X = X.values.T - N = X.shape[0] - - # make sure there are enough samples - min_K = 0 - min_crit = float("inf") - - if N >= 30: - # initialize clustering models - models = [sklearn.mixture.GaussianMixture(n_components=n+1) for n in xrange(5)] - - # identify number of clusters - for k, model in enumerate(models): - # fit model - model.fit(X) - - # save the best model - crit = model.aic(X) - if crit < min_crit: - min_K = len(model.weights_) - min_crit = crit - - # plot clustering results - plt.subplots(1, len(models), True, True, figsize=(5 * len(models), 5)) - - for k, model in enumerate(models): - K = len(model.weights_) - crit = model.aic(X) - y = model.predict(X) - - plt.subplot(1, len(models), k + 1) - plt.scatter(X[:, 0], X[:, 1], s=20, c=y, cmap="brg") - plt.title("N = %d, K = %d, crit = %g" % (N, K, crit)) - plt.xlabel(emx.index[i]) - plt.ylabel(emx.index[j]) - - plt.show() - - print "%4d %4d: N = %4d, K = %4d" % (i, j, N, min_K) diff --git a/scripts/test-vbgmm.py b/scripts/test-vbgmm.py index d8ac6ba..743c97f 100644 --- a/scripts/test-vbgmm.py +++ b/scripts/test-vbgmm.py @@ -1,47 +1,129 @@ import matplotlib.pyplot as plt +import numpy as np import pandas as pd import sklearn.mixture import sys +def fetch_pair(emx, i, j, min_expression): + # extract pairwise data + X = emx.iloc[[i, j]].values.T + + # initialize labels + y = np.zeros((X.shape[0],), dtype=int) + + # mark thresholded samples + y[(X[:, 0] < min_expression) | (X[:, 1] < min_expression)] = -6 + + # mark nan samples + y[np.isnan(X[:, 0]) | np.isnan(X[:, 1])] = -9 + + return X, y + + + +def compute_gmm(X, n_components): + # initialize clustering model + model = sklearn.mixture.GaussianMixture(n_components) + + # fit clustering model + model.fit(X) + + # save clustering results + K = n_components + y = model.predict(X) + + # compute criterion value + crit = model.bic(X) + + # print results + print("%4d %4d: %8s: K = 
%4d, crit = %g" % (i, j, "GMM", K, crit)) + + return K, y, crit + + + +def compute_vbgmm(X, n_components, weight_concentration_prior, weight_threshold): + # initialize clustering model + model = sklearn.mixture.BayesianGaussianMixture(n_components, weight_concentration_prior=weight_concentration_prior) + + # fit clustering model + model.fit(X) + + print("".join(["%8.3f" % w for w in model.weights_])) + + # compute number of effective components + K = sum([(w > weight_threshold) for w in model.weights_]) + + # save clustering results + y = model.predict(X) + + # print results + print("%4d %4d: %8s: y_0 = %g, K = %d" % (i, j, "VBGMM", weight_concentration_prior, K)) + + return K, y + + + if __name__ == "__main__": if len(sys.argv) != 2: - print "usage: python test-vbgmm.py [infile]" + print("usage: python test-vbgmm.py [infile]") sys.exit(1) + # define parameters + min_expression = float("-inf") + min_samples = 30 + min_clusters = 1 + max_clusters = 5 + weight_concentration_priors = [1e-6, 1e-3, 1e0, 1e3, 1e6] + weight_threshold = 0.05 + # load data - emx = pd.read_csv(sys.argv[1], sep="\t") + emx = pd.read_table(sys.argv[1]) # iterate through each pair - for i in xrange(len(emx.index)): - for j in xrange(i): - # extract pairwise data - X = emx.iloc[[i, j]].dropna(axis=1, how="any") - X = X.values.T - N = X.shape[0] + for i in range(len(emx.index)): + for j in range(i): + # extract clean pairwise data + X, y = fetch_pair(emx, i, j, min_expression) + X = X[y == 0] + + if len(X) < min_samples: + continue + + # compute Gaussian mixture models + gmms = [] + + for n_components in range(min_clusters, max_clusters + 1): + gmms.append(compute_gmm(X, n_components)) - # make sure there are enough samples - K = 0 + # compute variational Bayesian Gaussian mixture models + vbgmms = [] - if N >= 30: - # initialize clustering model - model = sklearn.mixture.BayesianGaussianMixture(n_components=5, weight_concentration_prior=1e3) + for weight_concentration_prior in 
weight_concentration_priors: + vbgmms.append(compute_vbgmm(X, max_clusters, weight_concentration_prior, weight_threshold)) - # fit clustering model - model.fit(X) + # plot comparison of GMMs and VBGMMs + rows, cols = 2, max(len(gmms), len(vbgmms)) + plt.figure(figsize=(5 * cols, 5 * rows)) - print "".join(["%8.3f" % w for w in model.weights_]) + for k in range(len(gmms)): + K, y, crit = gmms[k] + + plt.subplot(rows, cols, k + 1) + plt.scatter(X[:, 0], X[:, 1], s=20, c=y, cmap="brg") + plt.title("GMM: K = %d, crit = %g" % (K, crit)) + plt.xlabel(emx.index[i]) + plt.ylabel(emx.index[j]) - # compute number of effective components - K = sum([1 for w in model.weights_ if w > 0.05]) + for k in range(len(vbgmms)): + K, y = vbgmms[k] - # plot clustering results - y = model.predict(X) + plt.subplot(rows, cols, cols + k + 1) plt.scatter(X[:, 0], X[:, 1], s=20, c=y, cmap="brg") - plt.title("N = %d, K = %d" % (N, K)) + plt.title("VBGMM: y_0 = %.0e, K = %d" % (weight_concentration_priors[k], K)) plt.xlabel(emx.index[i]) plt.ylabel(emx.index[j]) - plt.show() - print "%4d %4d: N = %4d, K = %4d" % (i, j, N, K) + plt.show() diff --git a/scripts/threshold.py b/scripts/threshold.py new file mode 100644 index 0000000..7ba4b0e --- /dev/null +++ b/scripts/threshold.py @@ -0,0 +1,273 @@ +import argparse +import math +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import pprint +import scipy.interpolate +import scipy.stats +import seaborn as sns +import sklearn.cluster +import sklearn.mixture +import sys + + + +def load_cmx(filename, num_genes, num_clusters): + netlist = pd.read_table(filename, header=None) + cmx = np.zeros((num_genes * num_clusters, num_genes * num_clusters), dtype=np.float32) + + for idx in range(len(netlist.index)): + i = netlist.iloc[idx, 0] + j = netlist.iloc[idx, 1] + k = netlist.iloc[idx, 2] + r = netlist.iloc[idx, 9] + + cmx[i * num_clusters + k, j * num_clusters + k] = r + cmx[j * num_clusters + k, i * num_clusters + k] = r + + return
cmx + + + +def powerlaw(args): + # load correlation matrix + S = load_cmx(args.INPUT, args.NUM_GENES, args.MAX_CLUSTERS) + + # iterate until network is sufficiently scale-free + threshold = args.TSTART + + while True: + # compute thresholded adjacency matrix + A = (abs(S) >= threshold) + + # compute degree of each node + for i in range(A.shape[0]): + A[i, i] = 0 + + degrees = np.array([sum(A[i]) for i in range(A.shape[0])]) + + # compute degree distribution + bins = max(5, degrees.max()) + hist, _ = np.histogram(degrees, bins=bins, range=(1, bins)) + bin_edges = range(1, len(hist) + 1) + + # modify histogram values to work with loglog plot + hist += 1 + + # plot degree distribution + if args.VISUALIZE: + plt.subplots(1, 2, figsize=(10, 5)) + plt.subplot(121) + plt.plot(bin_edges, hist, "ko") + plt.subplot(122) + plt.loglog(bin_edges, hist, "ko") + plt.savefig("plots/powerlaw/%03d.png" % (int(threshold * 100))) + plt.close() + + # compute correlation + x = np.log(bin_edges) + y = np.log(hist) + + r, p = scipy.stats.pearsonr(x, y) + + # output results of threshold test + print("%g\t%g\t%g" % (threshold, r, p)) + + # break if power law is satisfied + if r < 0 and p < 1e-20: + break + + # decrement threshold and fail if minimum threshold is reached + threshold -= args.TSTEP + if threshold < args.TSTOP: + print("error: could not find an adequate threshold above stopping threshold") + sys.exit(0) + + return threshold + + + +def compute_pruned_matrix(S, threshold): + S_pruned = np.copy(S) + S_pruned[abs(S) < threshold] = 0 + S_pruned = S_pruned[~np.all(S_pruned == 0, axis=1)] + S_pruned = S_pruned[:, ~np.all(S_pruned == 0, axis=0)] + + return S_pruned + + + +def compute_degenerate(eigens): + unique = [] + + for i in range(len(eigens)): + if len(unique) == 0 or abs(eigens[i] - unique[-1]) > 1e-6: + unique.append(eigens[i]) + + return unique + + + +def compute_spacings(eigens, pace): + # extract eigenvalues for spline based on pace + x = eigens[::pace] + y = np.linspace(0, 
1, len(x)) + + # compute spline + spl = scipy.interpolate.splrep(x, y) + + # extract interpolated eigenvalues from spline + spline_eigens = scipy.interpolate.splev(eigens, spl) + + # compute spacings between interpolated eigenvalues + spacings = np.empty(len(eigens) - 1) + + for i in range(len(spacings)): + spacings[i] = (spline_eigens[i + 1] - spline_eigens[i]) * len(eigens) + + return spacings + + + +def compute_chi_square_pace(eigens, pace): + # compute eigenvalue spacings + spacings = compute_spacings(eigens, pace) + + # compute nearest-neighbor spacing distribution + hist_min = 0.0 + hist_max = 3.0 + num_bins = 60 + bin_width = (hist_max - hist_min) / num_bins + + hist, _ = np.histogram(spacings, num_bins, (hist_min, hist_max)) + + # compute chi-square value from NNSD + chi = 0 + + for i in range(len(hist)): + # compute O_i, the number of elements in bin i + O_i = hist[i] + + # compute E_i, the expected value of Poisson distribution for bin i + E_i = (math.exp(-i * bin_width) - math.exp(-(i + 1) * bin_width)) * len(eigens) + + # update chi-square value based on difference between O_i and E_i + chi += (O_i - E_i) * (O_i - E_i) / E_i + + print("pace: %d, chi: %g" % (pace, chi)) + + return chi + + + +def compute_chi_square(eigens): + # compute unique eigenvalues + unique = compute_degenerate(eigens) + + print("eigenvalues: %d" % len(eigens)) + print("unique eigenvalues: %d" % len(unique)) + + # make sure there are enough eigenvalues + if len(unique) < 50: + return -1 + + # perform several chi-square tests by varying the pace + chi = 0 + num_tests = 0 + + for pace in range(10, 41): + # make sure there are enough eigenvalues for pace + if len(unique) / pace < 5: + break + + chi += compute_chi_square_pace(unique, pace) + num_tests += 1 + + # compute average of chi-square tests + chi /= num_tests + + # return chi value + return chi + + + +def rmt(args): + # load correlation matrix + S = load_cmx(args.INPUT, args.NUM_GENES, args.MAX_CLUSTERS) + + # iterate until chi
value goes below 99.607 then above 200 + final_threshold = 0 + final_chi = float("inf") + max_chi = -float("inf") + threshold = args.TSTART + + while max_chi < 200: + # compute pruned matrix + S_pruned = compute_pruned_matrix(S, threshold) + + # make sure pruned matrix is not empty + chi = -1 + + if S_pruned.shape[0] > 0: + # compute eigenvalues of pruned matrix + eigens, _ = np.linalg.eigh(S_pruned) + + # compute chi-square value from NNSD of eigenvalues + chi = compute_chi_square(eigens) + + # make sure chi-square test succeeded + if chi != -1: + # save most recent chi-square value less than critical value + if chi < 99.607: + final_chi = chi + final_threshold = threshold + + # save largest chi-square value which occurs after final_chi + if final_chi < 99.607 and chi > final_chi: + max_chi = chi + + # output results of threshold test + print("%f\t%d\t%f" % (threshold, S_pruned.shape[0], chi)) + + # decrement threshold and fail if minimum threshold is reached + threshold -= args.TSTEP + if threshold < args.TSTOP: + print("error: could not find an adequate threshold above stopping threshold") + sys.exit(0) + + return final_threshold + + + +if __name__ == "__main__": + # define threshold methods + METHODS = { + "powerlaw": powerlaw, + "rmt": rmt + } + + # parse command-line arguments + parser = argparse.ArgumentParser() + parser.add_argument("-i", "--input", required=True, help="correlation matrix file", dest="INPUT") + parser.add_argument("--genes", type=int, required=True, help="number of genes", dest="NUM_GENES") + parser.add_argument("--method", default="rmt", choices=["powerlaw", "rmt"], help="thresholding method", dest="METHOD") + parser.add_argument("--tstart", type=float, default=0.99, help="starting threshold", dest="TSTART") + parser.add_argument("--tstep", type=float, default=0.001, help="threshold step size", dest="TSTEP") + parser.add_argument("--tstop", type=float, default=0.5, help="stopping threshold", dest="TSTOP") + parser.add_argument("--minclus", 
type=int, default=1, help="minimum clusters", dest="MIN_CLUSTERS") + parser.add_argument("--maxclus", type=int, default=5, help="maximum clusters", dest="MAX_CLUSTERS") + parser.add_argument("--visualize", action="store_true", help="whether to visualize results", dest="VISUALIZE") + + args = parser.parse_args() + + # print arguments + pprint.pprint(vars(args)) + + # load data + cmx = pd.read_table(args.INPUT) + + # initialize method + compute_threshold = METHODS[args.METHOD] + + print(compute_threshold(args)) diff --git a/scripts/validate.py b/scripts/validate.py new file mode 100644 index 0000000..1cef709 --- /dev/null +++ b/scripts/validate.py @@ -0,0 +1,108 @@ +import argparse +import numpy as np +import pandas as pd + + + +def pairwise_error(pair_true, pair_test, K, column_idx): + error = 0.0 + + for k in range(K): + x_true = pair_true.iloc[k, column_idx] + x_test = pair_test.iloc[k, column_idx] + + error += abs(x_true - x_test) / K + + return error + + + +if __name__ == "__main__": + # parse command-line arguments + parser = argparse.ArgumentParser() + parser.add_argument("--true", required=True, help="true correlation file", dest="CMX_TRUE") + parser.add_argument("--test", required=True, help="test correlation file", dest="CMX_TEST") + + args = parser.parse_args() + + # load input data + cmx_true = pd.read_table(args.CMX_TRUE, header=None, index_col=False) + cmx_test = pd.read_table(args.CMX_TEST, header=None, index_col=False) + + # compare number of pairs + print("Number of pairs (true): %d" % len(cmx_true.index)) + print("Number of pairs (test): %d" % len(cmx_test.index)) + + # get list of all pairs + pairs_true = [(cmx_true.iloc[idx, 0], cmx_true.iloc[idx, 1]) for idx in cmx_true.index] + pairs_test = [(cmx_test.iloc[idx, 0], cmx_test.iloc[idx, 1]) for idx in cmx_test.index] + pairs = list(set(pairs_true + pairs_test)) + + pairs.sort() + + # compute pairwise statistics + error_K = 0.0 + error_N_c = 0.0 + error_N_m = 0.0 + error_N_t = 0.0 + error_N_o1 = 0.0
+ error_N_o2 = 0.0 + error_r = 0.0 + error_S = 0.0 + + for idx in pairs: + # extract pair from each cmx + pair_true = cmx_true.loc[(cmx_true[0] == idx[0]) & (cmx_true[1] == idx[1])] + pair_test = cmx_test.loc[(cmx_test[0] == idx[0]) & (cmx_test[1] == idx[1])] + + # compute error in number of clusters + K_true = 0 if pair_true.empty else pair_true.iloc[0, 3] + K_test = 0 if pair_test.empty else pair_test.iloc[0, 3] + + error_K += abs(K_true - K_test) / len(pairs) + + # report errors + if K_true != K_test: + print("%4d %4d: %d != %d" % (idx[0], idx[1], K_true, K_test)) + + # use smaller K for cluster-wise comparisons + K = min(K_true, K_test) + + # compute error in clean sample size + error_N_c += pairwise_error(pair_true, pair_test, K, 4) / len(pairs) + + # compute error in missing sample size + error_N_m += pairwise_error(pair_true, pair_test, K, 5) / len(pairs) + + # compute error in thresholded sample size + error_N_t += pairwise_error(pair_true, pair_test, K, 6) / len(pairs) + + # compute error in pre-outlier sample size + error_N_o1 += pairwise_error(pair_true, pair_test, K, 7) / len(pairs) + + # compute error in post-outlier sample size + error_N_o2 += pairwise_error(pair_true, pair_test, K, 8) / len(pairs) + + # compute error in correlation + error_r += pairwise_error(pair_true, pair_test, K, 9) / len(pairs) + + # compute error in sample mask + error_S_pair = 0.0 + + for k in range(K): + S_true = pair_true.iloc[k, 10] + S_test = pair_test.iloc[k, 10] + + error_S_pair += sum([(s_true != s_test) for s_true, s_test in zip(S_true, S_test)]) / len(S_true) / K + + error_S += error_S_pair / len(pairs) + + print("\nError summary:") + print(" Number of clusters: %8.3f" % (error_K)) + print(" Clean sample size: %8.3f" % (error_N_c)) + print(" Missing sample size: %8.3f" % (error_N_m)) + print(" Thresholded sample size: %8.3f" % (error_N_t)) + print(" Pre-outlier sample size: %8.3f" % (error_N_o1)) + print(" Post-outlier sample size: %8.3f" % (error_N_o2)) + print(" 
Correlation: %8.3f" % (error_r)) + print(" Sample mask: %8.3f" % (error_S)) diff --git a/scripts/visualize.py b/scripts/visualize.py new file mode 100644 index 0000000..e018ebd --- /dev/null +++ b/scripts/visualize.py @@ -0,0 +1,83 @@ +import argparse +import matplotlib.pyplot as plt +import numpy as np +import os +import pandas as pd +import scipy.stats +import seaborn as sns + + + +if __name__ == "__main__": + # parse command-line arguments + parser = argparse.ArgumentParser() + parser.add_argument("-e", "--emx", required=True, help="expression matrix file", dest="EMX") + parser.add_argument("-n", "--netlist", required=True, help="netlist file", dest="NETLIST") + parser.add_argument("-o", "--output", required=True, help="output directory", dest="OUTPUT") + parser.add_argument("-s", "--scale", action="store_true", help="use a uniform global scale", dest="SCALE") + + args = parser.parse_args() + + # load input data + emx = pd.read_table(args.EMX, index_col=0) + netlist = pd.read_table(args.NETLIST) + + print("Loaded expression matrix (%d genes, %d samples)" % emx.shape) + print("Loaded netlist (%d edges)" % len(netlist.index)) + + # setup plot limits + if args.SCALE: + limits = (emx.min().min(), emx.max().max()) + else: + limits = None + + # initialize output directory + if not os.path.exists(args.OUTPUT): + os.mkdir(args.OUTPUT) + + # iterate through each network edge + for idx in netlist.index: + edge = netlist.iloc[idx] + x = edge["Source"] + y = edge["Target"] + k = edge["Cluster"] + + print(x, y, k) + + # extract pairwise data + labels = np.array([int(s) for s in edge["Samples"]]) + mask1 = (labels != 9) + + labels = labels[mask1] + mask2 = (labels == 1) + + data = emx.loc[[x, y]].values[:, mask1] + + # highlight samples in the edge + colors = np.array(["k" for _ in labels]) + colors[mask2] = "r" + + # compute Spearman correlation + r, p = scipy.stats.spearmanr(data[0, mask2], data[1, mask2]) + + # create figure + plt.subplots(1, 2, sharex=True, sharey=True, 
figsize=(10, 5)) + + # create density plot + plt.subplot(121) + plt.xlim(limits) + plt.ylim(limits) + sns.kdeplot(data[0], data[1], shade=True, shade_lowest=False) + + # create scatter plot + plt.subplot(122) + plt.title("k=%d, samples=%d, spearmanr=%0.2f" % (k, edge["Cluster_Samples"], r)) + plt.xlim(limits) + plt.ylim(limits) + plt.xlabel(x) + plt.ylabel(y) + plt.scatter(data[0], data[1], color="w", edgecolors=colors) + + # save plot to file + plt.savefig("%s/%s_%s_%d.png" % (args.OUTPUT, x, y, k)) + plt.close() diff --git a/src/KINC.pri b/src/KINC.pri index f4f1a1d..65f1e28 100644 --- a/src/KINC.pri +++ b/src/KINC.pri @@ -1,11 +1,9 @@ -# Default settings for MPI CXX include -isEmpty(MPICXX) { MPICXX = "yes" } - # Versions MAJOR_VERSION = 3 MINOR_VERSION = 2 -REVISION = 0 +REVISION = 2 + VERSION = $${MAJOR_VERSION}.$${MINOR_VERSION}.$${REVISION} # Version compiler defines @@ -17,24 +15,22 @@ DEFINES += \ # Basic settings QT += core -TEMPLATE = app QMAKE_CXX = mpic++ CONFIG += c++11 -# Compiler defines -DEFINES += QT_DEPRECATED_WARNINGS - -# External libraries -LIBS += -lmpi -equals(MPICXX,"yes") { LIBS += -lmpi_cxx } -LIBS += -lacecore -lOpenCL -lgsl -lgslcblas -L$${PWD}/../build/libs -lkinccore - # Used to ignore useless warnings with OpenCL QMAKE_CXXFLAGS += -Wno-ignored-attributes -# Source files -SOURCES += \ - ../main.cpp \ +# Default settings for MPI CXX include +isEmpty(MPICXX) { MPICXX = "yes" } + +# External libraries +LIBS += \ + -L$${PWD}/../build/libs -lkinccore \ + -lacecore \ + -lgsl -lgslcblas -llapack -llapacke \ + -lOpenCL -lmpi +equals(MPICXX,"yes") { LIBS += -lmpi_cxx } # Resource files RESOURCES += \ diff --git a/src/KINC.pro b/src/KINC.pro index 4dfb103..29395d6 100644 --- a/src/KINC.pro +++ b/src/KINC.pro @@ -1,4 +1,8 @@ +# Minimum Qt version +lessThan(QT_MAJOR_VERSION,5): error("Requires Qt 5") +lessThan(QT_MINOR_VERSION,7): error("Requires Qt 5.7") + # Default setting for GUI isEmpty(GUI) { GUI = "yes" } @@ -8,10 +12,12 @@ TEMPLATE = 
subdirs # Subdir projects SUBDIRS += \ core \ - cli + cli \ + tests # Dependencies cli.depends = core +tests.depends = core # This is if GUI is enabled equals(GUI,"yes") { diff --git a/src/cli/cli.pro b/src/cli/cli.pro index 27caef9..17610dd 100644 --- a/src/cli/cli.pro +++ b/src/cli/cli.pro @@ -4,6 +4,7 @@ include (../KINC.pri) # Basic settings TARGET = kinc +TEMPLATE = app # External libraries LIBS += -laceconsole @@ -11,6 +12,10 @@ LIBS += -laceconsole # Compiler defines DEFINES += GUI=0 +# Source files +SOURCES += \ + ../main.cpp + # Installation instructions isEmpty(PREFIX) { PREFIX = /usr/local } program.path = $${PREFIX}/bin diff --git a/src/core/analyticfactory.cpp b/src/core/analyticfactory.cpp index 0e409ad..21d4a6c 100644 --- a/src/core/analyticfactory.cpp +++ b/src/core/analyticfactory.cpp @@ -4,6 +4,7 @@ #include "importcorrelationmatrix.h" #include "exportcorrelationmatrix.h" #include "similarity.h" +#include "powerlaw.h" #include "rmt.h" #include "extract.h" @@ -16,8 +17,13 @@ using namespace std; +/*! + * Return the total number of analytic types that this program implements. + */ quint16 AnalyticFactory::size() const { + EDEBUG_FUNC(this); + return Total; } @@ -26,8 +32,15 @@ quint16 AnalyticFactory::size() const +/*! + * Return the display name for the given analytic type. 
+ * + * @param type + */ QString AnalyticFactory::name(quint16 type) const { + EDEBUG_FUNC(this,type); + switch (type) { case ImportExpressionMatrixType: return "Import Expression Matrix"; @@ -35,7 +48,8 @@ QString AnalyticFactory::name(quint16 type) const case ImportCorrelationMatrixType: return "Import Correlation Matrix"; case ExportCorrelationMatrixType: return "Export Correlation Matrix"; case SimilarityType: return "Similarity"; - case RMTType: return "RMT Thresholding"; + case PowerLawType: return "Threshold (Power-law)"; + case RMTType: return "Threshold (RMT)"; case ExtractType: return "Extract Network"; default: return QString(); } @@ -46,8 +60,15 @@ QString AnalyticFactory::name(quint16 type) const +/*! + * Return the command line name for the given analytic type. + * + * @param type + */ QString AnalyticFactory::commandName(quint16 type) const { + EDEBUG_FUNC(this,type); + switch (type) { case ImportExpressionMatrixType: return "import-emx"; @@ -55,6 +76,7 @@ QString AnalyticFactory::commandName(quint16 type) const case ImportCorrelationMatrixType: return "import-cmx"; case ExportCorrelationMatrixType: return "export-cmx"; case SimilarityType: return "similarity"; + case PowerLawType: return "powerlaw"; case RMTType: return "rmt"; case ExtractType: return "extract"; default: return QString(); @@ -66,8 +88,15 @@ QString AnalyticFactory::commandName(quint16 type) const +/*! + * Make and return a new abstract analytic object of the given type. 
+ * + * @param type + */ std::unique_ptr AnalyticFactory::make(quint16 type) const { + EDEBUG_FUNC(this,type); + switch (type) { case ImportExpressionMatrixType: return unique_ptr(new ImportExpressionMatrix); @@ -75,6 +104,7 @@ std::unique_ptr AnalyticFactory::make(quint16 type) const case ImportCorrelationMatrixType: return unique_ptr(new ImportCorrelationMatrix); case ExportCorrelationMatrixType: return unique_ptr(new ExportCorrelationMatrix); case SimilarityType: return unique_ptr(new Similarity); + case PowerLawType: return unique_ptr(new PowerLaw); case RMTType: return unique_ptr(new RMT); case ExtractType: return unique_ptr(new Extract); default: return nullptr; diff --git a/src/core/analyticfactory.h b/src/core/analyticfactory.h index c576795..0df612f 100644 --- a/src/core/analyticfactory.h +++ b/src/core/analyticfactory.h @@ -4,9 +4,17 @@ +/*! + * This class implements the ACE analytic factory for producing new analytic + * objects and giving basic information about all available analytic types. + */ class AnalyticFactory : public EAbstractAnalyticFactory { public: + /*! + * Defines all available analytic types this program implements along with the + * total size. + */ enum Type { ImportExpressionMatrixType = 0 @@ -14,6 +22,7 @@ class AnalyticFactory : public EAbstractAnalyticFactory ,ImportCorrelationMatrixType ,ExportCorrelationMatrixType ,SimilarityType + ,PowerLawType ,RMTType ,ExtractType ,Total diff --git a/src/core/ccmatrix.cpp b/src/core/ccmatrix.cpp index 872496b..7e5959e 100644 --- a/src/core/ccmatrix.cpp +++ b/src/core/ccmatrix.cpp @@ -1,60 +1,20 @@ #include "ccmatrix.h" +#include "ccmatrix_model.h" -using namespace std; -using namespace Pairwise; - - - - - - +/*! + * Return a qt table model that represents this data object as a table. 
+ */ QAbstractTableModel* CCMatrix::model() { - return nullptr; -} + EDEBUG_FUNC(this); - - - - - -QVariant CCMatrix::headerData(int section, Qt::Orientation orientation, int role) const -{ - // orientation is not used - Q_UNUSED(orientation); - - // if role is not display return nothing - if ( role != Qt::DisplayRole ) - { - return QVariant(); - } - - // get genes metadata and make sure it is an array - const EMetadata& genes {geneNames()}; - if ( genes.isArray() ) + if ( !_model ) { - // make sure section is within limits of gene name array - if ( section >= 0 && section < genes.toArray().size() ) - { - // return gene name - return genes.toArray().at(section).toString(); - } + _model = new Model(this); } - - // no gene found return nothing - return QVariant(); -} - - - - - - -int CCMatrix::rowCount(const QModelIndex&) const -{ - return geneSize(); + return _model; } @@ -62,57 +22,24 @@ int CCMatrix::rowCount(const QModelIndex&) const -int CCMatrix::columnCount(const QModelIndex&) const +/*! + * Initialize this cluster matrix with a list of gene names, the max cluster + * size, and a list of sample names. 
+ * + * @param geneNames + * @param maxClusterSize + * @param sampleNames + */ +void CCMatrix::initialize(const EMetaArray& geneNames, int maxClusterSize, const EMetaArray& sampleNames) { - return geneSize(); -} - - - - - + EDEBUG_FUNC(this,&geneNames,maxClusterSize,&sampleNames); -QVariant CCMatrix::data(const QModelIndex &index, int role) const -{ - // if role is not display return nothing - if ( role != Qt::DisplayRole ) - { - return QVariant(); - } - - // if row and column are equal return empty string - if ( index.row() == index.column() ) - { - return ""; - } - - // get constant pair and read in values - const Pair pair(this); - int x {index.row()}; - int y {index.column()}; - if ( y > x ) - { - swap(x,y); - } - pair.read({x,y}); - - // Return value of pair as a string - return pair.toString(); -} - - - - - - -void CCMatrix::initialize(const EMetadata &geneNames, int maxClusterSize, const EMetadata &sampleNames) -{ - // make sure sample names is an array and is not empty - if ( !sampleNames.isArray() || sampleNames.toArray().isEmpty() ) + // make sure sample names is not empty + if ( sampleNames.isEmpty() ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("Domain Error")); - e.setDetails(tr("Sample names metadata is not an array or is empty.")); + e.setDetails(tr("Sample names metadata is empty.")); throw e; } @@ -122,8 +49,8 @@ void CCMatrix::initialize(const EMetadata &geneNames, int maxClusterSize, const setMeta(metaObject); // save sample size and initialize base class - _sampleSize = sampleNames.toArray().size(); - Matrix::initialize(geneNames, maxClusterSize, (_sampleSize + 1) / 2 * sizeof(qint8), DATA_OFFSET); + _sampleSize = sampleNames.size(); + Matrix::initialize(geneNames, maxClusterSize, (_sampleSize + 1) / 2 * sizeof(qint8), SUBHEADER_SIZE); } @@ -131,141 +58,12 @@ void CCMatrix::initialize(const EMetadata &geneNames, int maxClusterSize, const -EMetadata CCMatrix::sampleNames() const +/*! + * Return the list of sample names in this cluster matrix. 
+ */ +EMetaArray CCMatrix::sampleNames() const { - return meta().toObject().at("samples"); -} - - - - - - -void CCMatrix::Pair::addCluster(int amount) const -{ - // keep adding a new list of sample masks for given amount - while ( amount-- > 0 ) - { - _sampleMasks.append(QVector(_cMatrix->_sampleSize, 0)); - } -} - - - - - - -QString CCMatrix::Pair::toString() const -{ - // if there are no clusters return empty string - if ( _sampleMasks.isEmpty() ) - { - return ""; - } - - // initialize list of strings and iterate through all clusters - QStringList ret; - for (const auto& sampleMask : _sampleMasks) - { - // initialize list of strings for sample mask and iterate through each sample - QString clusterString("("); - for (const auto& sample : sampleMask) - { - // add new sample token as hexadecimal allowing 16 different possible values - switch (sample) - { - case 0: - case 1: - case 2: - case 3: - case 4: - case 5: - case 6: - case 7: - case 8: - case 9: - clusterString.append(QString::number(sample)); - break; - case 10: - clusterString.append("A"); - break; - case 11: - clusterString.append("B"); - break; - case 12: - clusterString.append("C"); - break; - case 13: - clusterString.append("D"); - break; - case 14: - clusterString.append("E"); - break; - case 15: - clusterString.append("F"); - break; - } - } - - // join all cluster string into one string - ret << clusterString.append(')'); - } - - // join all clusters and return as string - return ret.join(','); -} - - - + EDEBUG_FUNC(this); - - -void CCMatrix::Pair::writeCluster(EDataStream &stream, int cluster) -{ - // make sure cluster value is within range - if ( cluster >= 0 && cluster < _sampleMasks.size() ) - { - // write each sample to output stream - auto& samples {_sampleMasks.at(cluster)}; - - for ( int i = 0; i < samples.size(); i += 2 ) - { - qint8 value {(qint8)(samples[i] & 0x0F)}; - - if ( i + 1 < samples.size() ) - { - value |= (samples[i + 1] << 4); - } - - stream << value; - } - } -} - - - - - - 
-void CCMatrix::Pair::readCluster(const EDataStream &stream, int cluster) const -{ - // make sure cluster value is within range - if ( cluster >= 0 && cluster < _sampleMasks.size() ) - { - // read each sample from input stream - auto& samples {_sampleMasks[cluster]}; - - for ( int i = 0; i < samples.size(); i += 2 ) - { - qint8 value; - stream >> value; - - samples[i] = value & 0x0F; - - if ( i + 1 < samples.size() ) - { - samples[i + 1] = (value >> 4) & 0x0F; - } - } - } + return meta().toObject().at("samples").toArray(); } diff --git a/src/core/ccmatrix.h b/src/core/ccmatrix.h index a19109d..d9d9e69 100644 --- a/src/core/ccmatrix.h +++ b/src/core/ccmatrix.h @@ -4,52 +4,50 @@ +/*! + * This class implements the cluster matrix data object. A cluster matrix is a + * pairwise matrix where each pair-cluster element is a sample mask denoting + * whether a sample belongs in the cluster. The matrix data can be accessed + * using the pairwise iterator for this class. + */ class CCMatrix : public Pairwise::Matrix { Q_OBJECT public: class Pair; +public: virtual QAbstractTableModel* model() override final; - QVariant headerData(int section, Qt::Orientation orientation, int role) const; - int rowCount(const QModelIndex&) const; - int columnCount(const QModelIndex&) const; - QVariant data(const QModelIndex& index, int role) const; - void initialize(const EMetadata& geneNames, int maxClusterSize, const EMetadata& sampleNames); - EMetadata sampleNames() const; +public: + void initialize(const EMetaArray& geneNames, int maxClusterSize, const EMetaArray& sampleNames); + EMetaArray sampleNames() const; + /*! + * Return the number of samples in the cluster matrix. + */ int sampleSize() const { return _sampleSize; } private: + class Model; +private: + /*! + * Write the sub-header to the data object file. + */ virtual void writeHeader() { stream() << _sampleSize; } + /*! + * Read the sub-header from the data object file. 
+ */ virtual void readHeader() { stream() >> _sampleSize; } - static const int DATA_OFFSET {4}; + /*! + * The size (in bytes) of the sub-header. The sub-header consists of the + * sample size. + */ + constexpr static int SUBHEADER_SIZE {4}; + /*! + * The number of samples in each sample mask. + */ qint32 _sampleSize {0}; -}; - - - -class CCMatrix::Pair : public Pairwise::Matrix::Pair -{ -public: - Pair(CCMatrix* matrix): - Matrix::Pair(matrix), - _cMatrix(matrix) - {} - Pair(const CCMatrix* matrix): - Matrix::Pair(matrix), - _cMatrix(matrix) - {} - Pair() = default; - virtual void clearClusters() const { _sampleMasks.clear(); } - virtual void addCluster(int amount = 1) const; - virtual int clusterSize() const { return _sampleMasks.size(); } - virtual bool isEmpty() const { return _sampleMasks.isEmpty(); } - QString toString() const; - const qint8& at(int cluster, int sample) const { return _sampleMasks.at(cluster).at(sample); } - qint8& at(int cluster, int sample) { return _sampleMasks[cluster][sample]; } -private: - virtual void writeCluster(EDataStream& stream, int cluster); - virtual void readCluster(const EDataStream& stream, int cluster) const; - mutable QVector> _sampleMasks; - const CCMatrix* _cMatrix; + /*! + * Pointer to a qt table model for this class. + */ + Model* _model {nullptr}; }; diff --git a/src/core/ccmatrix_model.cpp b/src/core/ccmatrix_model.cpp new file mode 100644 index 0000000..866dfc2 --- /dev/null +++ b/src/core/ccmatrix_model.cpp @@ -0,0 +1,134 @@ +#include "ccmatrix_model.h" +#include "ccmatrix_pair.h" + + + +using namespace std; + + + + + + +/*! + * Construct a table model for a cluster matrix. + * + * @param matrix + */ +CCMatrix::Model::Model(CCMatrix* matrix): + _matrix(matrix) +{ + EDEBUG_FUNC(this,matrix); + + setParent(matrix); +} + + + + + + +/*! + * Return a header name for the table model using a given index. 
+ * + * @param section + * @param orientation + * @param role + */ +QVariant CCMatrix::Model::headerData(int section, Qt::Orientation orientation, int role) const +{ + EDEBUG_FUNC(this,section,orientation,role); + + // orientation is not used + Q_UNUSED(orientation); + + // if role is not display return nothing + if ( role != Qt::DisplayRole ) + { + return QVariant(); + } + + // get gene names + EMetaArray geneNames {_matrix->geneNames()}; + + // make sure section is within limits of gene name array + if ( section >= 0 && section < geneNames.size() ) + { + // return gene name + return geneNames.at(section).toString(); + } + + // no gene found return nothing + return QVariant(); +} + + + + + + +/*! + * Return the number of rows in the table model. + */ +int CCMatrix::Model::rowCount(const QModelIndex&) const +{ + EDEBUG_FUNC(this); + + return _matrix->geneSize(); +} + + + + + + +/*! + * Return the number of columns in the table model. + */ +int CCMatrix::Model::columnCount(const QModelIndex&) const +{ + EDEBUG_FUNC(this); + + return _matrix->geneSize(); +} + + + + + + +/*! + * Return a data element in the table model using the given index. 
+ * + * @param index + * @param role + */ +QVariant CCMatrix::Model::data(const QModelIndex& index, int role) const +{ + EDEBUG_FUNC(this,&index,role); + + // if role is not display return nothing + if ( role != Qt::DisplayRole ) + { + return QVariant(); + } + + // if row and column are equal return empty string + if ( index.row() == index.column() ) + { + return ""; + } + + // get constant pair and read in values + const Pair pair(_matrix); + int x {index.row()}; + int y {index.column()}; + if ( y > x ) + { + swap(x,y); + } + pair.read({x,y}); + + // Return value of pair as a string + return pair.toString(); +} diff --git a/src/core/ccmatrix_model.h b/src/core/ccmatrix_model.h new file mode 100644 index 0000000..419344d --- /dev/null +++ b/src/core/ccmatrix_model.h @@ -0,0 +1,28 @@ +#ifndef CCMATRIX_MODEL_H +#define CCMATRIX_MODEL_H +#include "ccmatrix.h" + + + +/*! + * This class implements the qt table model for the cluster matrix + * data object, which represents the cluster matrix as a table. + */ +class CCMatrix::Model : public QAbstractTableModel +{ +public: + Model(CCMatrix* matrix); + virtual QVariant headerData(int section, Qt::Orientation orientation, int role) const override final; + virtual int rowCount(const QModelIndex& parent) const override final; + virtual int columnCount(const QModelIndex& parent) const override final; + virtual QVariant data(const QModelIndex& index, int role) const override final; +private: + /*! + * Pointer to the data object for this table model. + */ + CCMatrix* _matrix; +}; + + + +#endif diff --git a/src/core/ccmatrix_pair.cpp b/src/core/ccmatrix_pair.cpp new file mode 100644 index 0000000..98ab46a --- /dev/null +++ b/src/core/ccmatrix_pair.cpp @@ -0,0 +1,161 @@ +#include "ccmatrix_pair.h" + + + +/*! + * Add one or more clusters to this pair. 
+ * + * @param amount + */ +void CCMatrix::Pair::addCluster(int amount) const +{ + EDEBUG_FUNC(this,amount); + + // keep adding a new list of sample masks for given amount + while ( amount-- > 0 ) + { + _sampleMasks.append(QVector(_cMatrix->_sampleSize, 0)); + } +} + + + + + + +/*! + * Return the string representation of this pair, which is a comma-delimited + * string of each sample mask in the pair. + */ +QString CCMatrix::Pair::toString() const +{ + EDEBUG_FUNC(this); + + // if there are no clusters return empty string + if ( _sampleMasks.isEmpty() ) + { + return ""; + } + + // initialize list of strings and iterate through all clusters + QStringList ret; + for (const auto& sampleMask : _sampleMasks) + { + // initialize list of strings for sample mask and iterate through each sample + QString clusterString("("); + for (const auto& sample : sampleMask) + { + // add new sample token as hexadecimal allowing 16 different possible values + switch (sample) + { + case 0: + case 1: + case 2: + case 3: + case 4: + case 5: + case 6: + case 7: + case 8: + case 9: + clusterString.append(QString::number(sample)); + break; + case 10: + clusterString.append("A"); + break; + case 11: + clusterString.append("B"); + break; + case 12: + clusterString.append("C"); + break; + case 13: + clusterString.append("D"); + break; + case 14: + clusterString.append("E"); + break; + case 15: + clusterString.append("F"); + break; + } + } + + // join all cluster string into one string + ret << clusterString.append(')'); + } + + // join all clusters and return as string + return ret.join(','); +} + + + + + + +/*! + * Write a cluster in the iterator's pairwise data to the data object file. 
+ * + * @param stream + * @param cluster + */ +void CCMatrix::Pair::writeCluster(EDataStream& stream, int cluster) +{ + EDEBUG_FUNC(this,&stream,cluster); + + // make sure cluster value is within range + if ( cluster >= 0 && cluster < _sampleMasks.size() ) + { + // write each sample to output stream + auto& samples {_sampleMasks.at(cluster)}; + + for ( int i = 0; i < samples.size(); i += 2 ) + { + qint8 value {(qint8)(samples[i] & 0x0F)}; + + if ( i + 1 < samples.size() ) + { + value |= (samples[i + 1] << 4); + } + + stream << value; + } + } +} + + + + + + +/*! + * Read a cluster from the data object file into memory. + * + * @param stream + * @param cluster + */ +void CCMatrix::Pair::readCluster(const EDataStream& stream, int cluster) const +{ + EDEBUG_FUNC(this,&stream,cluster); + + // make sure cluster value is within range + if ( cluster >= 0 && cluster < _sampleMasks.size() ) + { + // read each sample from input stream + auto& samples {_sampleMasks[cluster]}; + + for ( int i = 0; i < samples.size(); i += 2 ) + { + qint8 value; + stream >> value; + + samples[i] = value & 0x0F; + + if ( i + 1 < samples.size() ) + { + samples[i + 1] = (value >> 4) & 0x0F; + } + } + } +} diff --git a/src/core/ccmatrix_pair.h b/src/core/ccmatrix_pair.h new file mode 100644 index 0000000..53e6374 --- /dev/null +++ b/src/core/ccmatrix_pair.h @@ -0,0 +1,47 @@ +#ifndef CCMATRIX_PAIR_H +#define CCMATRIX_PAIR_H +#include "ccmatrix.h" +#include "pairwise_matrix_pair.h" + + + +/*! + * This class implements the pairwise iterator for the cluster matrix data + * object. This class extends the behavior of the base pairwise iterator to read + * and write sample masks. 
+ */ +class CCMatrix::Pair : public Pairwise::Matrix::Pair +{ +public: + Pair(CCMatrix* matrix): + Matrix::Pair(matrix), + _cMatrix(matrix) + {} + Pair(const CCMatrix* matrix): + Matrix::Pair(matrix), + _cMatrix(matrix) + {} + Pair() = default; + virtual void clearClusters() const { _sampleMasks.clear(); } + virtual void addCluster(int amount = 1) const; + virtual int clusterSize() const { return _sampleMasks.size(); } + virtual bool isEmpty() const { return _sampleMasks.isEmpty(); } + QString toString() const; + const qint8& at(int cluster, int sample) const { return _sampleMasks.at(cluster).at(sample); } + qint8& at(int cluster, int sample) { return _sampleMasks[cluster][sample]; } +private: + virtual void writeCluster(EDataStream& stream, int cluster); + virtual void readCluster(const EDataStream& stream, int cluster) const; + /*! + * Array of sample masks for the current pair. + */ + mutable QVector<QVector<qint8>> _sampleMasks; + /*! + * Constant pointer to parent cluster matrix. + */ + const CCMatrix* _cMatrix; +}; + + + +#endif diff --git a/src/core/core.pro b/src/core/core.pro index 8209dd6..a70e023 100644 --- a/src/core/core.pro +++ b/src/core/core.pro @@ -1,3 +1,7 @@ + +# Include common settings +include (../KINC.pri) + # Basic Settings TARGET = kinccore TEMPLATE = lib @@ -6,25 +10,22 @@ CONFIG += staticlib # Build settings DESTDIR = $$PWD/../../build/libs/ -# Qt libraries -QT += core - -# Preprocessor defines -DEFINES += QT_DEPRECATED_WARNINGS - -# Used to ignore useless warnings from OpenCL -QMAKE_CXXFLAGS += -Wno-ignored-attributes - # Source files SOURCES += \ analyticfactory.cpp \ + ccmatrix_model.cpp \ + ccmatrix_pair.cpp \ ccmatrix.cpp \ + correlationmatrix_model.cpp \ + correlationmatrix_pair.cpp \ correlationmatrix.cpp \ datafactory.cpp \ exportcorrelationmatrix_input.cpp \ exportcorrelationmatrix.cpp \ exportexpressionmatrix_input.cpp \ exportexpressionmatrix.cpp \ + expressionmatrix_gene.cpp \ + expressionmatrix_model.cpp \ expressionmatrix.cpp \ 
extract_input.cpp \ extract.cpp \ @@ -32,21 +33,23 @@ SOURCES += \ importcorrelationmatrix.cpp \ importexpressionmatrix_input.cpp \ importexpressionmatrix.cpp \ - pairwise_clustering.cpp \ - pairwise_correlation.cpp \ + pairwise_clusteringmodel.cpp \ + pairwise_correlationmodel.cpp \ pairwise_gmm.cpp \ pairwise_index.cpp \ - pairwise_kmeans.cpp \ pairwise_linalg.cpp \ + pairwise_matrix_pair.cpp \ pairwise_matrix.cpp \ pairwise_pearson.cpp \ pairwise_spearman.cpp \ + powerlaw_input.cpp \ + powerlaw.cpp \ rmt_input.cpp \ rmt.cpp \ similarity_input.cpp \ similarity_opencl_fetchpair.cpp \ similarity_opencl_gmm.cpp \ - similarity_opencl_kmeans.cpp \ + similarity_opencl_outlier.cpp \ similarity_opencl_pearson.cpp \ similarity_opencl_spearman.cpp \ similarity_opencl_worker.cpp \ @@ -59,13 +62,19 @@ SOURCES += \ # Header files HEADERS += \ analyticfactory.h \ + ccmatrix_model.h \ + ccmatrix_pair.h \ ccmatrix.h \ + correlationmatrix_model.h \ + correlationmatrix_pair.h \ correlationmatrix.h \ datafactory.h \ exportcorrelationmatrix_input.h \ exportcorrelationmatrix.h \ exportexpressionmatrix_input.h \ exportexpressionmatrix.h \ + expressionmatrix_gene.h \ + expressionmatrix_model.h \ expressionmatrix.h \ extract_input.h \ extract.h \ @@ -73,21 +82,23 @@ HEADERS += \ importcorrelationmatrix.h \ importexpressionmatrix_input.h \ importexpressionmatrix.h \ - pairwise_clustering.h \ - pairwise_correlation.h \ + pairwise_clusteringmodel.h \ + pairwise_correlationmodel.h \ pairwise_gmm.h \ pairwise_index.h \ - pairwise_kmeans.h \ pairwise_linalg.h \ + pairwise_matrix_pair.h \ pairwise_matrix.h \ pairwise_pearson.h \ pairwise_spearman.h \ + powerlaw_input.h \ + powerlaw.h \ rmt_input.h \ rmt.h \ similarity_input.h \ similarity_opencl_fetchpair.h \ similarity_opencl_gmm.h \ - similarity_opencl_kmeans.h \ + similarity_opencl_outlier.h \ similarity_opencl_pearson.h \ similarity_opencl_spearman.h \ similarity_opencl_worker.h \ diff --git a/src/core/correlationmatrix.cpp 
b/src/core/correlationmatrix.cpp index 87b7029..3c32ecf 100644 --- a/src/core/correlationmatrix.cpp +++ b/src/core/correlationmatrix.cpp @@ -1,83 +1,21 @@ #include "correlationmatrix.h" +#include "correlationmatrix_model.h" +#include "correlationmatrix_pair.h" -using namespace std; -using namespace Pairwise; - - - - - - +/*! + * Return a qt table model that represents this data object as a table. + */ QAbstractTableModel* CorrelationMatrix::model() { - return nullptr; -} - - - - - - -QVariant CorrelationMatrix::headerData(int section, Qt::Orientation orientation, int role) const -{ - // orientation is not used - Q_UNUSED(orientation); - - // if role is not display return nothing - if ( role != Qt::DisplayRole ) - { - return QVariant(); - } - - // get genes metadata and make sure it is an array - const EMetadata& genes {geneNames()}; - if ( genes.isArray() ) - { - // make sure section is within limits of gene name array - if ( section >= 0 && section < genes.toArray().size() ) - { - // return gene name - return genes.toArray().at(section).toString(); - } - } - - // no gene found return nothing - return QVariant(); -} - - - - - - -QVariant CorrelationMatrix::data(const QModelIndex& index, int role) const -{ - // if role is not display return nothing - if ( role != Qt::DisplayRole ) - { - return QVariant(); - } - - // if row and column are equal return one - if ( index.row() == index.column() ) - { - return "1"; - } + EDEBUG_FUNC(this); - // get constant pair and read in values - const Pair pair(this); - int x {index.row()}; - int y {index.column()}; - if ( y > x ) + if ( !_model ) { - swap(x,y); + _model = new Model(this); } - pair.read({x,y}); - - // Return value of pair as a string - return pair.toString(); + return _model; } @@ -85,34 +23,24 @@ QVariant CorrelationMatrix::data(const QModelIndex& index, int role) const -int CorrelationMatrix::rowCount(const QModelIndex&) const +/*! 
+ * Initialize this correlation matrix with a list of gene names, the max cluster + * size, and a list of correlation names. + * + * @param geneNames + * @param maxClusterSize + * @param correlationNames + */ +void CorrelationMatrix::initialize(const EMetaArray& geneNames, int maxClusterSize, const EMetaArray& correlationNames) { - return geneSize(); -} - - - + EDEBUG_FUNC(this,&geneNames,maxClusterSize,&correlationNames); - - -int CorrelationMatrix::columnCount(const QModelIndex&) const -{ - return geneSize(); -} - - - - - - -void CorrelationMatrix::initialize(const EMetadata &geneNames, int maxClusterSize, const EMetadata &correlationNames) -{ - // make sure correlation names is an array and is not empty - if ( !correlationNames.isArray() || correlationNames.toArray().isEmpty() ) + // make sure correlation names is not empty + if ( correlationNames.isEmpty() ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("Domain Error")); - e.setDetails(tr("Correlation names metadata is not an array or is empty.")); + e.setDetails(tr("Correlation names metadata is empty.")); throw e; } @@ -122,8 +50,8 @@ void CorrelationMatrix::initialize(const EMetadata &geneNames, int maxClusterSiz setMeta(metaObject); // save correlation size and initialize base class - _correlationSize = correlationNames.toArray().size(); - Matrix::initialize(geneNames, maxClusterSize, _correlationSize * sizeof(float), DATA_OFFSET); + _correlationSize = correlationNames.size(); + Matrix::initialize(geneNames, maxClusterSize, _correlationSize * sizeof(float), SUBHEADER_SIZE); } @@ -131,9 +59,14 @@ void CorrelationMatrix::initialize(const EMetadata &geneNames, int maxClusterSiz -EMetadata CorrelationMatrix::correlationNames() const +/*! + * Return the list of correlation names in this correlation matrix. 
+ */ +EMetaArray CorrelationMatrix::correlationNames() const { - return meta().toObject().at("correlations"); + EDEBUG_FUNC(this); + + return meta().toObject().at("correlations").toArray(); } @@ -141,16 +74,16 @@ EMetadata CorrelationMatrix::correlationNames() const -QVector<float> CorrelationMatrix::dumpRawData() const +/*! + * Return a list of correlation pairs in raw form. + */ +QVector<CorrelationMatrix::RawPair> CorrelationMatrix::dumpRawData() const { - // if there are no genes do nothing - if ( geneSize() == 0 ) - { - return QVector<float>(); - } + EDEBUG_FUNC(this); - // create new correlation matrix - QVector<float> data(geneSize() * geneSize() * maxClusterSize()); + // create list of raw pairs + QVector<RawPair> pairs; + pairs.reserve(size()); // iterate through all pairs Pair pair(this); @@ -160,103 +93,18 @@ QVector<float> CorrelationMatrix::dumpRawData() const // read in next pair pair.readNext(); - // load cluster data - int i = pair.index().getX(); - int j = pair.index().getY(); + // copy pair to raw list + RawPair rawPair; + rawPair.index = pair.index(); + rawPair.correlations.resize(pair.clusterSize()); for ( int k = 0; k < pair.clusterSize(); ++k ) { - float correlation = pair.at(k, 0); - - data[i * geneSize() * maxClusterSize() + j * maxClusterSize() + k] = correlation; - data[j * geneSize() * maxClusterSize() + i * maxClusterSize() + k] = correlation; + rawPair.correlations[k] = pair.at(k, 0); } + + pairs.append(rawPair); } - return data; -} - - - - - - -void CorrelationMatrix::Pair::addCluster(int amount) const -{ - // keep adding a new list of floats for given amount - while ( amount-- > 0 ) - { - _correlations.append(QVector<float>(_cMatrix->_correlationSize, NAN)); - } -} - - - - - - -QString CorrelationMatrix::Pair::toString() const -{ - // if there are no correlations simply return null - if ( _correlations.isEmpty() ) - { - return tr(""); - } - - // initialize list of strings and iterate through all clusters - QStringList ret; - for (const auto& cluster : _correlations) - { - // initialize list of strings for
cluster and iterate through each correlation - QStringList clusterStrings; - for (const auto& correlation : cluster) - { - // add correlation value as string - clusterStrings << QString::number(correlation); - } - - // join all cluster strings into one string - ret << clusterStrings.join(','); - } - - // join all clusters and return as string - return ret.join(','); -} - - - - - - -void CorrelationMatrix::Pair::writeCluster(EDataStream& stream, int cluster) -{ - // make sure cluster value is within range - if ( cluster >= 0 && cluster < _correlations.size() ) - { - // write correlations per cluster to output stream - for (const auto& correlation : _correlations.at(cluster)) - { - stream << correlation; - } - } -} - - - - - - -void CorrelationMatrix::Pair::readCluster(const EDataStream& stream, int cluster) const -{ - // make sure cluster value is within range - if ( cluster >= 0 && cluster < _correlations.size() ) - { - // read correlations per cluster from input stream - for (int i = 0; i < _cMatrix->_correlationSize ;++i) - { - float value; - stream >> value; - _correlations[cluster][i] = value; - } - } + return pairs; } diff --git a/src/core/correlationmatrix.h b/src/core/correlationmatrix.h index c8a5c6c..718dec3 100644 --- a/src/core/correlationmatrix.h +++ b/src/core/correlationmatrix.h @@ -4,53 +4,52 @@ +/*! + * This class implements the correlation matrix data object. A correlation matrix + * is a pairwise matrix where each pair-cluster element is a correlation value. The + * matrix data can be accessed using the pairwise iterator for this class. 
+ */ class CorrelationMatrix : public Pairwise::Matrix { Q_OBJECT public: class Pair; +public: + struct RawPair + { + Pairwise::Index index; + QVector<float> correlations; + }; +public: virtual QAbstractTableModel* model() override final; - QVariant headerData(int section, Qt::Orientation orientation, int role) const; - int rowCount(const QModelIndex&) const; - int columnCount(const QModelIndex&) const; - QVariant data(const QModelIndex& index, int role) const; - void initialize(const EMetadata& geneNames, int maxClusterSize, const EMetadata& correlationNames); - EMetadata correlationNames() const; - QVector<float> dumpRawData() const; +public: + void initialize(const EMetaArray& geneNames, int maxClusterSize, const EMetaArray& correlationNames); + EMetaArray correlationNames() const; + QVector<RawPair> dumpRawData() const; private: + class Model; +private: + /*! + * Write the sub-header to the data object file. + */ virtual void writeHeader() { stream() << _correlationSize; } + /*! + * Read the sub-header from the data object file. + */ virtual void readHeader() { stream() >> _correlationSize; } - static const int DATA_OFFSET {1}; + /*! + * The size (in bytes) of the sub-header. The sub-header consists of the + * correlation size. + */ + constexpr static int SUBHEADER_SIZE {1}; + /*! + * The number of correlations in each pair-cluster.
+ */ qint8 _correlationSize {0}; -}; - - - -class CorrelationMatrix::Pair : public Pairwise::Matrix::Pair -{ -public: - Pair(CorrelationMatrix* matrix): - Matrix::Pair(matrix), - _cMatrix(matrix) - {} - Pair(const CorrelationMatrix* matrix): - Matrix::Pair(matrix), - _cMatrix(matrix) - {} - Pair() = default; - virtual void clearClusters() const { _correlations.clear(); } - virtual void addCluster(int amount = 1) const; - virtual int clusterSize() const { return _correlations.size(); } - virtual bool isEmpty() const { return _correlations.isEmpty(); } - QString toString() const; - const float& at(int cluster, int correlation) const - { return _correlations.at(cluster).at(correlation); } - float& at(int cluster, int correlation) { return _correlations[cluster][correlation]; } -private: - virtual void writeCluster(EDataStream& stream, int cluster); - virtual void readCluster(const EDataStream& stream, int cluster) const; - mutable QVector<QVector<float>> _correlations; - const CorrelationMatrix* _cMatrix; + /*! + * Pointer to a qt table model for this class. + */ + Model* _model {nullptr}; }; diff --git a/src/core/correlationmatrix_model.cpp b/src/core/correlationmatrix_model.cpp new file mode 100644 index 0000000..696ff6d --- /dev/null +++ b/src/core/correlationmatrix_model.cpp @@ -0,0 +1,134 @@ +#include "correlationmatrix_model.h" +#include "correlationmatrix_pair.h" + + + +using namespace std; + + + + + + +/*! + * Construct a table model for a correlation matrix. + * + * @param matrix + */ +CorrelationMatrix::Model::Model(CorrelationMatrix* matrix): + _matrix(matrix) +{ + EDEBUG_FUNC(this,matrix); + + setParent(matrix); +} + + + + + + +/*! + * Return a header name for the table model using a given index.
+ * + * @param section + * @param orientation + * @param role + */ +QVariant CorrelationMatrix::Model::headerData(int section, Qt::Orientation orientation, int role) const +{ + EDEBUG_FUNC(this,section,orientation,role); + + // orientation is not used + Q_UNUSED(orientation); + + // if role is not display return nothing + if ( role != Qt::DisplayRole ) + { + return QVariant(); + } + + // get gene names + EMetaArray geneNames {_matrix->geneNames()}; + + // make sure section is within limits of gene name array + if ( section >= 0 && section < geneNames.size() ) + { + // return gene name + return geneNames.at(section).toString(); + } + + // no gene found return nothing + return QVariant(); +} + + + + + + +/*! + * Return the number of rows in the table model. + */ +int CorrelationMatrix::Model::rowCount(const QModelIndex&) const +{ + EDEBUG_FUNC(this); + + return _matrix->geneSize(); +} + + + + + + +/*! + * Return the number of columns in the table model. + */ +int CorrelationMatrix::Model::columnCount(const QModelIndex&) const +{ + EDEBUG_FUNC(this); + + return _matrix->geneSize(); +} + + + + + + +/*! + * Return a data element in the table model using the given index. 
+ * + * @param index + * @param role + */ +QVariant CorrelationMatrix::Model::data(const QModelIndex& index, int role) const +{ + EDEBUG_FUNC(this,&index,role); + + // if role is not display return nothing + if ( role != Qt::DisplayRole ) + { + return QVariant(); + } + + // if row and column are equal return empty string + if ( index.row() == index.column() ) + { + return ""; + } + + // get constant pair and read in values + const Pair pair(_matrix); + int x {index.row()}; + int y {index.column()}; + if ( y > x ) + { + swap(x,y); + } + pair.read({x,y}); + + // Return value of pair as a string + return pair.toString(); +} diff --git a/src/core/correlationmatrix_model.h b/src/core/correlationmatrix_model.h new file mode 100644 index 0000000..4318c1d --- /dev/null +++ b/src/core/correlationmatrix_model.h @@ -0,0 +1,28 @@ +#ifndef CORRELATIONMATRIX_MODEL_H +#define CORRELATIONMATRIX_MODEL_H +#include "correlationmatrix.h" + + + +/*! + * This class implements the qt table model for the correlation matrix + * data object, which represents the correlation matrix as a table. + */ +class CorrelationMatrix::Model : public QAbstractTableModel +{ +public: + Model(CorrelationMatrix* matrix); + virtual QVariant headerData(int section, Qt::Orientation orientation, int role) const override final; + virtual int rowCount(const QModelIndex& parent) const override final; + virtual int columnCount(const QModelIndex& parent) const override final; + virtual QVariant data(const QModelIndex& index, int role) const override final; +private: + /*! + * Pointer to the data object for this table model. + */ + CorrelationMatrix* _matrix; +}; + + + +#endif diff --git a/src/core/correlationmatrix_pair.cpp b/src/core/correlationmatrix_pair.cpp new file mode 100644 index 0000000..7c926f1 --- /dev/null +++ b/src/core/correlationmatrix_pair.cpp @@ -0,0 +1,112 @@ +#include "correlationmatrix_pair.h" + + + +/*! + * Add one or more clusters to this pair. 
+ * + * @param amount + */ +void CorrelationMatrix::Pair::addCluster(int amount) const +{ + EDEBUG_FUNC(this,amount); + + // keep adding a new list of floats for given amount + while ( amount-- > 0 ) + { + _correlations.append(QVector<float>(_cMatrix->_correlationSize, NAN)); + } +} + + + + + + +/*! + * Return the string representation of this pair, which is a comma-delimited + * string of each correlation in the pair. + */ +QString CorrelationMatrix::Pair::toString() const +{ + EDEBUG_FUNC(this); + + // if there are no correlations simply return null + if ( _correlations.isEmpty() ) + { + return tr(""); + } + + // initialize list of strings and iterate through all clusters + QStringList ret; + for (const auto& cluster : _correlations) + { + // initialize list of strings for cluster and iterate through each correlation + QStringList clusterStrings; + for (const auto& correlation : cluster) + { + // add correlation value as string + clusterStrings << QString::number(correlation); + } + + // join all cluster strings into one string + ret << clusterStrings.join(','); + } + + // join all clusters and return as string + return ret.join(','); +} + + + + + + +/*! + * Write a cluster in the iterator's pairwise data to the data object file. + * + * @param stream + * @param cluster + */ +void CorrelationMatrix::Pair::writeCluster(EDataStream& stream, int cluster) +{ + EDEBUG_FUNC(this,&stream,cluster); + + // make sure cluster value is within range + if ( cluster >= 0 && cluster < _correlations.size() ) + { + // write correlations per cluster to output stream + for (const auto& correlation : _correlations.at(cluster)) + { + stream << correlation; + } + } +} + + + + + + +/*! + * Read a cluster from the data object file into memory.
+ * + * @param stream + * @param cluster + */ +void CorrelationMatrix::Pair::readCluster(const EDataStream& stream, int cluster) const +{ + EDEBUG_FUNC(this,&stream,cluster); + + // make sure cluster value is within range + if ( cluster >= 0 && cluster < _correlations.size() ) + { + // read correlations per cluster from input stream + for (int i = 0; i < _cMatrix->_correlationSize ;++i) + { + float value; + stream >> value; + _correlations[cluster][i] = value; + } + } +} diff --git a/src/core/correlationmatrix_pair.h b/src/core/correlationmatrix_pair.h new file mode 100644 index 0000000..85bce3c --- /dev/null +++ b/src/core/correlationmatrix_pair.h @@ -0,0 +1,48 @@ +#ifndef CORRELATIONMATRIX_PAIR_H +#define CORRELATIONMATRIX_PAIR_H +#include "correlationmatrix.h" +#include "pairwise_matrix_pair.h" + + + +/*! + * This class implements the pairwise iterator for the correlation matrix data + * object. This class extends the behavior of the base pairwise iterator to read + * and write correlations. + */ +class CorrelationMatrix::Pair : public Pairwise::Matrix::Pair +{ +public: + Pair(CorrelationMatrix* matrix): + Matrix::Pair(matrix), + _cMatrix(matrix) + {} + Pair(const CorrelationMatrix* matrix): + Matrix::Pair(matrix), + _cMatrix(matrix) + {} + Pair() = default; + virtual void clearClusters() const { _correlations.clear(); } + virtual void addCluster(int amount = 1) const; + virtual int clusterSize() const { return _correlations.size(); } + virtual bool isEmpty() const { return _correlations.isEmpty(); } + QString toString() const; + const float& at(int cluster, int correlation) const + { return _correlations.at(cluster).at(correlation); } + float& at(int cluster, int correlation) { return _correlations[cluster][correlation]; } +private: + virtual void writeCluster(EDataStream& stream, int cluster); + virtual void readCluster(const EDataStream& stream, int cluster) const; + /*! + * Array of correlations for the current pair. 
+ */ + mutable QVector<QVector<float>> _correlations; + /*! + * Constant pointer to parent correlation matrix. + */ + const CorrelationMatrix* _cMatrix; +}; + + + +#endif diff --git a/src/core/datafactory.cpp b/src/core/datafactory.cpp index c5812fa..60506f0 100644 --- a/src/core/datafactory.cpp +++ b/src/core/datafactory.cpp @@ -12,8 +12,13 @@ using namespace std; +/*! + * Return the total number of data types this program implements. + */ quint16 DataFactory::size() const { + EDEBUG_FUNC(this); + return Total; } @@ -22,8 +27,15 @@ quint16 DataFactory::size() const +/*! + * Return the display name for the given data type. + * + * @param type + */ QString DataFactory::name(quint16 type) const { + EDEBUG_FUNC(this,type); + switch (type) { case ExpressionMatrixType: return "Expression Matrix"; @@ -38,8 +50,15 @@ QString DataFactory::name(quint16 type) const +/*! + * Return the file extension for the given data type as a string. + * + * @param type + */ QString DataFactory::fileExtension(quint16 type) const { + EDEBUG_FUNC(this,type); + switch (type) { case ExpressionMatrixType: return "emx"; @@ -54,8 +73,15 @@ QString DataFactory::fileExtension(quint16 type) const +/*! + * Make and return a new abstract data object of the given type. + * + * @param type + */ unique_ptr<EAbstractData> DataFactory::make(quint16 type) const { + EDEBUG_FUNC(this,type); + switch (type) { case ExpressionMatrixType: return unique_ptr<EAbstractData>(new ExpressionMatrix); diff --git a/src/core/datafactory.h b/src/core/datafactory.h index d7d39b6..88c921c 100644 --- a/src/core/datafactory.h +++ b/src/core/datafactory.h @@ -4,9 +4,17 @@ +/*! + * This class implements the ACE data factory for producing new data objects + * and giving basic information about all available data types. + */ class DataFactory : public EAbstractDataFactory { public: + /*! + * Defines all available data types this program implements along with the total + * size.
+ */ enum Type { ExpressionMatrixType = 0 diff --git a/src/core/exportcorrelationmatrix.cpp b/src/core/exportcorrelationmatrix.cpp index 518d406..504513d 100644 --- a/src/core/exportcorrelationmatrix.cpp +++ b/src/core/exportcorrelationmatrix.cpp @@ -1,12 +1,27 @@ #include "exportcorrelationmatrix.h" #include "exportcorrelationmatrix_input.h" #include "datafactory.h" +#include "expressionmatrix_gene.h" +using namespace std; + + + + + + +/*! + * Return the total number of blocks this analytic must process as steps + * or blocks of work. This implementation uses a work block for writing + * each pair to the output file. + */ int ExportCorrelationMatrix::size() const { - return 1; + EDEBUG_FUNC(this); + + return _cmx->size(); } @@ -14,100 +29,111 @@ int ExportCorrelationMatrix::size() const -void ExportCorrelationMatrix::process(const EAbstractAnalytic::Block* result) +/*! + * Process the given index with a possible block of results if this analytic + * produces work blocks. This implementation uses only the index of the result + * block to determine which piece of work to do. 
+ * + * @param result + */ +void ExportCorrelationMatrix::process(const EAbstractAnalytic::Block*) { - Q_UNUSED(result); - - // initialize pair iterators - CorrelationMatrix::Pair cmxPair(_cmx); - CCMatrix::Pair ccmPair(_ccm); + EDEBUG_FUNC(this); // initialize workspace QString sampleMask(_ccm->sampleSize(), '0'); - // create text stream to output file and write until end reached - QTextStream stream(_output); - stream.setRealNumberPrecision(6); + // read next pair + _cmxPair.readNext(); + _ccmPair.read(_cmxPair.index()); - // iterate through all pairs - while ( cmxPair.hasNext() ) + // write pairwise data to output file + for ( int k = 0; k < _cmxPair.clusterSize(); k++ ) { - // read next pair - cmxPair.readNext(); - - if ( cmxPair.clusterSize() > 1 ) + float correlation = _cmxPair.at(k, 0); + int numSamples = 0; + int numMissing = 0; + int numPostOutliers = 0; + int numPreOutliers = 0; + int numThreshold = 0; + + // if cluster data exists then use it + if ( _ccmPair.clusterSize() > 0 ) { - ccmPair.read(cmxPair.index()); + // compute summary statistics + for ( int i = 0; i < _ccm->sampleSize(); i++ ) + { + switch ( _ccmPair.at(k, i) ) + { + case 1: + numSamples++; + break; + case 6: + numThreshold++; + break; + case 7: + numPreOutliers++; + break; + case 8: + numPostOutliers++; + break; + case 9: + numMissing++; + break; + } + } + + // write sample mask to string + for ( int i = 0; i < _ccm->sampleSize(); i++ ) + { + sampleMask[i] = '0' + _ccmPair.at(k, i); + } } - // write pairwise data to output file - for ( int k = 0; k < cmxPair.clusterSize(); k++ ) + // otherwise use expression data + else { - float correlation = cmxPair.at(k, 0); - int numSamples = 0; - int numMissing = 0; - int numPostOutliers = 0; - int numPreOutliers = 0; - int numThreshold = 0; - - // if there are multiple clusters then use cluster data - if ( cmxPair.clusterSize() > 1 ) + // read in gene expressions + ExpressionMatrix::Gene gene1(_emx); + ExpressionMatrix::Gene gene2(_emx); + + 
gene1.read(_cmxPair.index().getX()); + gene2.read(_cmxPair.index().getY()); + + // determine sample mask, summary statistics from expression data + for ( int i = 0; i < _emx->sampleSize(); ++i ) { - // compute summary statistics - for ( int i = 0; i < _ccm->sampleSize(); i++ ) + if ( isnan(gene1.at(i)) || isnan(gene2.at(i)) ) { - switch ( ccmPair.at(k, i) ) - { - case 1: - numSamples++; - break; - case 6: - numThreshold++; - break; - case 7: - numPreOutliers++; - break; - case 8: - numPostOutliers++; - break; - case 9: - numMissing++; - break; - } + sampleMask[i] = '9'; + numMissing++; } - - // write sample mask to string - for ( int i = 0; i < _ccm->sampleSize(); i++ ) + else { - sampleMask[i] = '0' + ccmPair.at(k, i); + sampleMask[i] = '1'; + numSamples++; } } - - // else just initialize empty sample mask - else - { - sampleMask.fill('0'); - } - - // write cluster to output file - stream - << cmxPair.index().getX() - << "\t" << cmxPair.index().getY() - << "\t" << k - << "\t" << cmxPair.clusterSize() - << "\t" << numSamples - << "\t" << numMissing - << "\t" << numPostOutliers - << "\t" << numPreOutliers - << "\t" << numThreshold - << "\t" << correlation - << "\t" << sampleMask - << "\n"; } + + // write cluster to output file + _stream + << _cmxPair.index().getX() + << "\t" << _cmxPair.index().getY() + << "\t" << k + << "\t" << _cmxPair.clusterSize() + << "\t" << numSamples + << "\t" << numMissing + << "\t" << numPostOutliers + << "\t" << numPreOutliers + << "\t" << numThreshold + << "\t" << correlation + << "\t" << sampleMask + << "\n"; } // make sure writing output file worked - if ( stream.status() != QTextStream::Ok ) + if ( _stream.status() != QTextStream::Ok ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("File IO Error")); @@ -121,8 +147,13 @@ void ExportCorrelationMatrix::process(const EAbstractAnalytic::Block* result) +/*! + * Make a new input object and return its pointer. 
+ */ EAbstractAnalytic::Input* ExportCorrelationMatrix::makeInput() { + EDEBUG_FUNC(this); + return new Input(this); } @@ -131,13 +162,28 @@ EAbstractAnalytic::Input* ExportCorrelationMatrix::makeInput() +/*! + * Initialize this analytic. This implementation checks to make sure the input + * data objects and output file have been set. + */ void ExportCorrelationMatrix::initialize() { - if ( !_ccm || !_cmx || !_output ) + EDEBUG_FUNC(this); + + // make sure input/output arguments are valid + if ( !_emx || !_ccm || !_cmx || !_output ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("Invalid Argument")); e.setDetails(tr("Did not get valid input and/or output arguments.")); throw e; } + + // initialize pairwise iterators + _ccmPair = CCMatrix::Pair(_ccm); + _cmxPair = CorrelationMatrix::Pair(_cmx); + + // initialize output file stream + _stream.setDevice(_output); + _stream.setRealNumberPrecision(8); } diff --git a/src/core/exportcorrelationmatrix.h b/src/core/exportcorrelationmatrix.h index efa16a7..9f965ca 100644 --- a/src/core/exportcorrelationmatrix.h +++ b/src/core/exportcorrelationmatrix.h @@ -2,11 +2,25 @@ #define EXPORTCORRELATIONMATRIX_H #include +#include "ccmatrix_pair.h" #include "ccmatrix.h" +#include "correlationmatrix_pair.h" #include "correlationmatrix.h" +#include "expressionmatrix.h" +/*! + * This class implements the export correlation matrix analytic. This analytic + * takes two data objects, a correlation matrix and a cluster matrix, and writes + * a text file of correlations, where each line is a correlation that includes + * the pairwise index, correlation value, and sample mask, as well as several + * other fields which are not used but are required for this format. The analytic + * attempts to recreate these fields as much as is possible. 
The expression matrix + * that was used to produce the correlation matrix must also be provided in order + * to recreate sample masks for pairs with only one cluster, as these sample masks + * are not stored in the cluster matrix. + */ class ExportCorrelationMatrix : public EAbstractAnalytic { Q_OBJECT @@ -17,8 +31,27 @@ class ExportCorrelationMatrix : public EAbstractAnalytic virtual EAbstractAnalytic::Input* makeInput() override final; virtual void initialize(); private: + /** + * Workspace variables to write to the output file + */ + QTextStream _stream; + CCMatrix::Pair _ccmPair; + CorrelationMatrix::Pair _cmxPair; + /*! + * Pointer to the input expression matrix. + */ + ExpressionMatrix* _emx {nullptr}; + /*! + * Pointer to the input cluster matrix. + */ CCMatrix* _ccm {nullptr}; + /*! + * Pointer to the input correlation matrix. + */ CorrelationMatrix* _cmx {nullptr}; + /*! + * Pointer to the output text file. + */ QFile* _output {nullptr}; }; diff --git a/src/core/exportcorrelationmatrix_input.cpp b/src/core/exportcorrelationmatrix_input.cpp index f2270a6..ad8af5f 100644 --- a/src/core/exportcorrelationmatrix_input.cpp +++ b/src/core/exportcorrelationmatrix_input.cpp @@ -3,18 +3,30 @@ +/*! + * Construct a new input object with the given analytic as its parent. + * + * @param parent + */ ExportCorrelationMatrix::Input::Input(ExportCorrelationMatrix* parent): EAbstractAnalytic::Input(parent), _base(parent) -{} +{ + EDEBUG_FUNC(this,parent); +} +/*! + * Return the total number of arguments this analytic type contains. + */ int ExportCorrelationMatrix::Input::size() const { + EDEBUG_FUNC(this); + return Total; } @@ -23,10 +35,18 @@ int ExportCorrelationMatrix::Input::size() const +/*! + * Return the argument type for a given index. 
+ * + * @param index + */ EAbstractAnalytic::Input::Type ExportCorrelationMatrix::Input::type(int index) const { + EDEBUG_FUNC(this,index); + switch (index) { + case ExpressionData: return Type::DataIn; case ClusterData: return Type::DataIn; case CorrelationData: return Type::DataIn; case OutputFile: return Type::FileOut; @@ -39,10 +59,27 @@ EAbstractAnalytic::Input::Type ExportCorrelationMatrix::Input::type(int index) c +/*! + * Return data for a given role on an argument with the given index. + * + * @param index + * @param role + */ QVariant ExportCorrelationMatrix::Input::data(int index, Role role) const { + EDEBUG_FUNC(this,index,role); + switch (index) { + case ExpressionData: + switch (role) + { + case Role::CommandLineName: return QString("emx"); + case Role::Title: return tr("Expression Matrix:"); + case Role::WhatsThis: return tr("Input expression matrix containing gene expression data."); + case Role::DataType: return DataFactory::ExpressionMatrixType; + default: return QVariant(); + } case ClusterData: switch (role) { @@ -79,10 +116,16 @@ QVariant ExportCorrelationMatrix::Input::data(int index, Role role) const -void ExportCorrelationMatrix::Input::set(int index, const QVariant& value) +/*! + * Set an argument with the given index to the given value. This analytic has + * no basic arguments so this function does nothing. + * + * @param index + * @param value + */ +void ExportCorrelationMatrix::Input::set(int, const QVariant&) { - Q_UNUSED(index); - Q_UNUSED(value); + EDEBUG_FUNC(this); } @@ -90,9 +133,21 @@ void ExportCorrelationMatrix::Input::set(int index, const QVariant& value) +/*! + * Set a data argument with the given index to the given data object pointer. 
+ * + * @param index + * @param data + */ void ExportCorrelationMatrix::Input::set(int index, EAbstractData* data) { - if ( index == ClusterData ) + EDEBUG_FUNC(this,index,data); + + if ( index == ExpressionData ) + { + _base->_emx = data->cast<ExpressionMatrix>(); + } + else if ( index == ClusterData ) { _base->_ccm = data->cast<CCMatrix>(); } @@ -107,8 +162,16 @@ void ExportCorrelationMatrix::Input::set(int index, EAbstractData* data) +/*! + * Set a file argument with the given index to the given qt file pointer. + * + * @param index + * @param file + */ void ExportCorrelationMatrix::Input::set(int index, QFile* file) { + EDEBUG_FUNC(this,index,file); + if ( index == OutputFile ) { _base->_output = file; diff --git a/src/core/exportcorrelationmatrix_input.h b/src/core/exportcorrelationmatrix_input.h index 074dd98..7b46222 100644 --- a/src/core/exportcorrelationmatrix_input.h +++ b/src/core/exportcorrelationmatrix_input.h @@ -4,13 +4,20 @@ +/*! + * This class implements the abstract input of the export correlation matrix analytic. + */ class ExportCorrelationMatrix::Input : public EAbstractAnalytic::Input { Q_OBJECT public: + /*! + * Defines all input arguments for this analytic. + */ enum Argument { - ClusterData = 0 + ExpressionData = 0 + ,ClusterData ,CorrelationData ,OutputFile ,Total @@ -23,6 +30,9 @@ class ExportCorrelationMatrix::Input : public EAbstractAnalytic::Input virtual void set(int index, EAbstractData* data) override final; virtual void set(int index, QFile* file) override final; private: + /*! + * Pointer to the base analytic for this object. + */ ExportCorrelationMatrix* _base; }; diff --git a/src/core/exportexpressionmatrix.cpp b/src/core/exportexpressionmatrix.cpp index f6bbb11..2d962e1 100644 --- a/src/core/exportexpressionmatrix.cpp +++ b/src/core/exportexpressionmatrix.cpp @@ -1,15 +1,23 @@ #include "exportexpressionmatrix.h" #include "exportexpressionmatrix_input.h" #include "datafactory.h" +#include "expressionmatrix_gene.h" +/*!
+ * Return the total number of blocks this analytic must process as steps + * or blocks of work. This implementation uses a work block for writing the + * sample names and a work block for writing each gene. + */ int ExportExpressionMatrix::size() const { - return 1; + EDEBUG_FUNC(this); + + return 1 + _input->geneSize(); } @@ -17,79 +25,74 @@ int ExportExpressionMatrix::size() const +/*! + * Process the given index with a possible block of results if this analytic + * produces work blocks. This implementation uses only the index of the result + * block to determine which piece of work to do. + * + * @param result + */ void ExportExpressionMatrix::process(const EAbstractAnalytic::Block* result) { - Q_UNUSED(result); - - // use expression declaration - using Expression = ExpressionMatrix::Expression; - using Transform = ExpressionMatrix::Transform; + EDEBUG_FUNC(this,result); - // get gene names, sample names, and transform - EMetaArray geneNames = _input->getGeneNames().toArray(); - EMetaArray sampleNames = _input->getSampleNames().toArray(); - Transform transform = _input->getTransform(); + // write the sample names in the first step + if ( result->index() == 0 ) + { + // get sample names + EMetaArray sampleNames {_input->sampleNames()}; - // create text stream to output file - QTextStream stream(_output); - stream.setRealNumberPrecision(12); + // initialize output file stream + _stream.setDevice(_output); + _stream.setRealNumberPrecision(8); - // write sample names - for ( int i = 0; i < _input->getSampleSize(); i++ ) - { - stream << sampleNames.at(i).toString() << "\t"; + // write sample names + for ( int i = 0; i < _input->sampleSize(); i++ ) + { + _stream << sampleNames.at(i).toString() << "\t"; + } + _stream << "\n"; } - stream << "\n"; - // write each gene to a line in output file - ExpressionMatrix::Gene gene(_input); - for ( int i = 0; i < _input->getGeneSize(); i++ ) + // write each gene to the output file in a separate step + else { + // get gene index + 
int i = result->index() - 1; + + // get gene name + QString geneName {_input->geneNames().at(i).toString()}; + // load gene from expression matrix + ExpressionMatrix::Gene gene(_input); gene.read(i); // write gene name - stream << geneNames.at(i).toString(); + _stream << geneName; // write expression values - for ( int j = 0; j < _input->getSampleSize(); j++ ) + for ( int j = 0; j < _input->sampleSize(); j++ ) { - Expression value {gene.at(j)}; + float value {gene.at(j)}; // if value is NAN use the no sample token if ( std::isnan(value) ) { - stream << "\t" << _noSampleToken; + _stream << "\t" << _nanToken; } // else this is a normal floating point expression else { - // apply transform and write value - switch (transform) - { - case Transform::None: - break; - case Transform::NLog: - value = exp(value); - break; - case Transform::Log2: - value = pow(2, value); - break; - case Transform::Log10: - value = pow(10, value); - break; - } - - stream << "\t" << value; + _stream << "\t" << value; } } - stream << "\n"; + _stream << "\n"; } // make sure writing output file worked - if ( stream.status() != QTextStream::Ok ) + if ( _stream.status() != QTextStream::Ok ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("File IO Error")); @@ -103,8 +106,13 @@ void ExportExpressionMatrix::process(const EAbstractAnalytic::Block* result) +/*! + * Make a new input object and return its pointer. + */ EAbstractAnalytic::Input* ExportExpressionMatrix::makeInput() { + EDEBUG_FUNC(this); + return new Input(this); } @@ -113,8 +121,14 @@ EAbstractAnalytic::Input* ExportExpressionMatrix::makeInput() +/*! + * Initialize this analytic. This implementation checks to make sure the input + * data object and output file have been set. 
+ */ void ExportExpressionMatrix::initialize() { + EDEBUG_FUNC(this); + if ( !_input || !_output ) { E_MAKE_EXCEPTION(e); diff --git a/src/core/exportexpressionmatrix.h b/src/core/exportexpressionmatrix.h index ca2923e..8e2d451 100644 --- a/src/core/exportexpressionmatrix.h +++ b/src/core/exportexpressionmatrix.h @@ -6,6 +6,13 @@ +/*! + * This class implements the export expression matrix analytic. This analytic + * writes an expression matrix to a text file as table; that is, with each row + * on a line, each value separated by whitespace, and the first row and column + * containing the row names and column names, respectively. Elements which are + * NAN in the expression matrix are written as the given NAN token. + */ class ExportExpressionMatrix : public EAbstractAnalytic { Q_OBJECT @@ -16,9 +23,22 @@ class ExportExpressionMatrix : public EAbstractAnalytic virtual EAbstractAnalytic::Input* makeInput() override final; virtual void initialize(); private: + /** + * Workspace variables to write to the output file + */ + QTextStream _stream; + /*! + * Pointer to the input expression matrix. + */ ExpressionMatrix* _input {nullptr}; + /*! + * Pointer to the output text file. + */ QFile* _output {nullptr}; - QString _noSampleToken; + /*! + * The string token used to represent NAN values. + */ + QString _nanToken {"NA"}; }; diff --git a/src/core/exportexpressionmatrix_input.cpp b/src/core/exportexpressionmatrix_input.cpp index bf38a67..02b7c58 100644 --- a/src/core/exportexpressionmatrix_input.cpp +++ b/src/core/exportexpressionmatrix_input.cpp @@ -6,18 +6,30 @@ +/*! + * Construct a new input object with the given analytic as its parent. + * + * @param parent + */ ExportExpressionMatrix::Input::Input(ExportExpressionMatrix* parent): EAbstractAnalytic::Input(parent), _base(parent) -{} +{ + EDEBUG_FUNC(this,parent); +} +/*! + * Return the total number of arguments this analytic type contains. 
+ */ int ExportExpressionMatrix::Input::size() const { + EDEBUG_FUNC(this); + return Total; } @@ -26,13 +38,20 @@ int ExportExpressionMatrix::Input::size() const +/*! + * Return the argument type for a given index. + * + * @param index + */ EAbstractAnalytic::Input::Type ExportExpressionMatrix::Input::type(int index) const { + EDEBUG_FUNC(this,index); + switch (index) { case InputData: return Type::DataIn; case OutputFile: return Type::FileOut; - case NoSampleToken: return Type::String; + case NANToken: return Type::String; default: return Type::Boolean; } } @@ -42,8 +61,16 @@ EAbstractAnalytic::Input::Type ExportExpressionMatrix::Input::type(int index) co +/*! + * Return data for a given role on an argument with the given index. + * + * @param index + * @param role + */ QVariant ExportExpressionMatrix::Input::data(int index, Role role) const { + EDEBUG_FUNC(this,index,role); + switch (index) { case InputData: @@ -64,12 +91,13 @@ QVariant ExportExpressionMatrix::Input::data(int index, Role role) const case Role::FileFilters: return tr("Text file %1").arg("(*.txt)"); default: return QVariant(); } - case NoSampleToken: + case NANToken: switch (role) { case Role::CommandLineName: return QString("nan"); - case Role::Title: return tr("No Sample Token:"); + case Role::Title: return tr("NAN Token:"); case Role::WhatsThis: return tr("Expected token for expressions that have no value."); + case Role::Default: return "NA"; default: return QVariant(); } default: return QVariant(); @@ -81,12 +109,20 @@ QVariant ExportExpressionMatrix::Input::data(int index, Role role) const +/*! + * Set an argument with the given index to the given value. 
+ * + * @param index + * @param value + */ void ExportExpressionMatrix::Input::set(int index, const QVariant& value) { + EDEBUG_FUNC(this,index,&value); + switch (index) { - case NoSampleToken: - _base->_noSampleToken = value.toString(); + case NANToken: + _base->_nanToken = value.toString(); break; } } @@ -96,8 +132,16 @@ void ExportExpressionMatrix::Input::set(int index, const QVariant& value) +/*! + * Set a data argument with the given index to the given data object pointer. + * + * @param index + * @param data + */ void ExportExpressionMatrix::Input::set(int index, EAbstractData* data) { + EDEBUG_FUNC(this,index,data); + if ( index == InputData ) { _base->_input = data->cast(); @@ -109,8 +153,16 @@ void ExportExpressionMatrix::Input::set(int index, EAbstractData* data) +/*! + * Set a file argument with the given index to the given qt file pointer. + * + * @param index + * @param file + */ void ExportExpressionMatrix::Input::set(int index, QFile* file) { + EDEBUG_FUNC(this,index,file); + if ( index == OutputFile ) { _base->_output = file; diff --git a/src/core/exportexpressionmatrix_input.h b/src/core/exportexpressionmatrix_input.h index 9c2dacb..420c6e1 100644 --- a/src/core/exportexpressionmatrix_input.h +++ b/src/core/exportexpressionmatrix_input.h @@ -4,15 +4,21 @@ +/*! + * This class implements the abstract input of the export expression matrix analytic. + */ class ExportExpressionMatrix::Input : public EAbstractAnalytic::Input { Q_OBJECT public: + /*! + * Defines all input arguments for this analytic. + */ enum Argument { InputData = 0 ,OutputFile - ,NoSampleToken + ,NANToken ,Total }; explicit Input(ExportExpressionMatrix* parent); @@ -23,6 +29,9 @@ class ExportExpressionMatrix::Input : public EAbstractAnalytic::Input virtual void set(int index, EAbstractData* data) override final; virtual void set(int index, QFile* file) override final; private: + /*! + * Pointer to the base analytic for this object. 
+ */ ExportExpressionMatrix* _base; }; diff --git a/src/core/expressionmatrix.cpp b/src/core/expressionmatrix.cpp index 188b5d6..a25bd2f 100644 --- a/src/core/expressionmatrix.cpp +++ b/src/core/expressionmatrix.cpp @@ -1,27 +1,18 @@ #include "expressionmatrix.h" +#include "expressionmatrix_model.h" +// - - - -const QStringList ExpressionMatrix::TRANSFORM_NAMES -{ - "none" - ,"natural logarithm" - ,"logarithm base 2" - ,"logarithm base 10" -}; - - - - - - +/*! + * Return the index of the first byte in this data object after the end of + * the data section. Defined as the header size plus the size of the matrix data. + */ qint64 ExpressionMatrix::dataEnd() const { - // calculate and return end of data - return DATA_OFFSET + ((qint64)_geneSize * (qint64)_sampleSize * sizeof(Expression)); + EDEBUG_FUNC(this); + + return _headerSize + (qint64)_geneSize * (qint64)_sampleSize * sizeof(float); } @@ -29,10 +20,17 @@ qint64 ExpressionMatrix::dataEnd() const +/*! + * Read in the data of an existing data object that was just opened. + */ void ExpressionMatrix::readData() { - // read header + EDEBUG_FUNC(this); + + // seek to the beginning of the data seek(0); + + // read the header stream() >> _geneSize >> _sampleSize; } @@ -41,24 +39,21 @@ void ExpressionMatrix::readData() +/*! + * Initialize this data object's data to a null state. + */ void ExpressionMatrix::writeNewData() { - // initialize metadata - setMeta(EMetadata(EMetadata::Object)); - - // initialize header - seek(0); - stream() << _geneSize << _sampleSize; -} - - - + EDEBUG_FUNC(this); + // initialize metadata object + setMeta(EMetaObject()); + // seek to the beginning of the data + seek(0); -QAbstractTableModel* ExpressionMatrix::model() -{ - return nullptr; + // write the header + stream() << _geneSize << _sampleSize; } @@ -66,10 +61,18 @@ QAbstractTableModel* ExpressionMatrix::model() +/*! + * Finalize this data object's data after the analytic that created it has + * finished giving it new data. 
+ */ void ExpressionMatrix::finish() { - // write header + EDEBUG_FUNC(this); + + // seek to the beginning of the data seek(0); + + // write the header stream() << _geneSize << _sampleSize; } @@ -78,55 +81,18 @@ void ExpressionMatrix::finish() -QVariant ExpressionMatrix::headerData(int section, Qt::Orientation orientation, int role) const +/*! + * Return a qt table model that represents this data object as a table. + */ +QAbstractTableModel* ExpressionMatrix::model() { - // if this is not display role return nothing - if ( role != Qt::DisplayRole ) - { - return QVariant(); - } + EDEBUG_FUNC(this); - // get metadata root and figure out orientation - switch (orientation) - { - case Qt::Vertical: - { - // get gene names and make sure it is array - EMetadata genes {meta().toObject().at("genes")}; - if ( genes.isArray() ) - { - // make sure section is within limits of array - if ( section >= 0 && section < genes.toArray().size() ) - { - // return gene name - return genes.toArray().at(section).toString(); - } - } - - // if no gene name found return nothing - return QVariant(); - } - case Qt::Horizontal: + if ( !_model ) { - // get sample names and make sure it is array - EMetadata samples {meta().toObject().at("samples")}; - if ( samples.isArray() ) - { - // make sure section is within limits of array - if ( section >= 0 && section < samples.toArray().size() ) - { - // return sample name - return samples.toArray().at(section).toString(); - } - } - - // if no sample name found return nothing - return QVariant(); - } - default: - // unknown orientation so return nothing - return QVariant(); + _model = new Model(this); } + return _model; } @@ -134,10 +100,13 @@ QVariant ExpressionMatrix::headerData(int section, Qt::Orientation orientation, -int ExpressionMatrix::rowCount(const QModelIndex& parent) const +/*! + * Return the number of genes (rows) in this expression matrix. 
+ */ +qint32 ExpressionMatrix::geneSize() const { - // return gene size for row count - Q_UNUSED(parent); + EDEBUG_FUNC(this); + return _geneSize; } @@ -146,10 +115,13 @@ int ExpressionMatrix::rowCount(const QModelIndex& parent) const -int ExpressionMatrix::columnCount(const QModelIndex& parent) const +/*! + * Return the number of samples (columns) in this expression matrix. + */ +qint32 ExpressionMatrix::sampleSize() const { - // return sample size for column count - Q_UNUSED(parent); + EDEBUG_FUNC(this); + return _sampleSize; } @@ -158,29 +130,14 @@ int ExpressionMatrix::columnCount(const QModelIndex& parent) const -QVariant ExpressionMatrix::data(const QModelIndex& index, int role) const +/*! + * Return the list of gene names in this expression matrix. + */ +EMetaArray ExpressionMatrix::geneNames() const { - // if role is not display return nothing - if ( role != Qt::DisplayRole ) - { - return QVariant(); - } - - // if index is out of range return nothing - if ( index.row() >= _geneSize || index.column() >= _sampleSize ) - { - return QVariant(); - } - - // make input variable and seek to position of queried expression - Expression value; - seek(DATA_OFFSET+(((index.row()*_sampleSize)+index.column())*sizeof(Expression))); + EDEBUG_FUNC(this); - // read expression from file - stream() >> value; - - // return expression - return value; + return meta().toObject().at("genes").toArray(); } @@ -188,42 +145,14 @@ QVariant ExpressionMatrix::data(const QModelIndex& index, int role) const -void ExpressionMatrix::initialize(QStringList geneNames, QStringList sampleNames) +/*! + * Return the list of sample names in this expression matrix. 
+ */ +EMetaArray ExpressionMatrix::sampleNames() const { - // create metadata array of gene names - EMetaArray metaGeneNames; - for ( auto& geneName : geneNames ) - { - metaGeneNames.append(geneName); - } - - // create metadata array of sample names - EMetaArray metaSampleNames; - for ( auto& sampleName : sampleNames ) - { - metaSampleNames.append(sampleName); - } - - // insert gene and sample names to data object's metadata - EMetaObject metaObject {meta().toObject()}; - metaObject.insert("genes", metaGeneNames); - metaObject.insert("samples", metaSampleNames); - setMeta(metaObject); - - // set gene and sample size - _geneSize = geneNames.size(); - _sampleSize = sampleNames.size(); -} - - - + EDEBUG_FUNC(this); - - -ExpressionMatrix::Transform ExpressionMatrix::getTransform() const -{ - QString transformName {meta().toObject().at("transform").toString()}; - return static_cast(TRANSFORM_NAMES.indexOf(transformName)); + return meta().toObject().at("samples").toArray(); } @@ -231,46 +160,32 @@ ExpressionMatrix::Transform ExpressionMatrix::getTransform() const -void ExpressionMatrix::setTransform(ExpressionMatrix::Transform transform) +/*! + * Return an array of this expression matrix's data in row-major order. 
+ */ +QVector ExpressionMatrix::dumpRawData() const { - auto& transformName {TRANSFORM_NAMES.at(static_cast(transform))}; - - EMetaObject metaObject {meta().toObject()}; - metaObject.insert("transform", transformName); - setMeta(metaObject); -} - - + EDEBUG_FUNC(this); - - - -qint64 ExpressionMatrix::getRawSize() const -{ - return (qint64)_geneSize * (qint64)_sampleSize; -} - - - - - - -ExpressionMatrix::Expression* ExpressionMatrix::dumpRawData() const -{ - // if there are no genes do nothing + // return empty array if expression matrix is empty if ( _geneSize == 0 ) { - return nullptr; + return QVector(); } - // create new floating point array and populate with all gene expressions - Expression* ret {new Expression[getRawSize()]}; - for (int i = 0; i < _geneSize ;++i) + // allocate an array with the same size as the expression matrix + QVector ret(_geneSize*_sampleSize); + + // seek to the beginning of the expression data + seekExpression(0,0); + + // write each expression to the array + for (float& sample: ret) { - readGene(i,&ret[i*_sampleSize]); + stream() >> sample; } - // return new float array + // return the array return ret; } @@ -279,74 +194,40 @@ ExpressionMatrix::Expression* ExpressionMatrix::dumpRawData() const -EMetadata ExpressionMatrix::getGeneNames() const -{ - return meta().toObject().at("genes"); -} - - - - - - -EMetadata ExpressionMatrix::getSampleNames() const -{ - return meta().toObject().at("samples"); -} - - - - - - -void ExpressionMatrix::readGene(int index, Expression* expressions) const +/*! + * Initialize this expression matrix with a list of gene names and a list of + * sample names. 
+ * + * @param geneNames + * @param sampleNames + */ +void ExpressionMatrix::initialize(const QStringList& geneNames, const QStringList& sampleNames) { - // seek to position of beginning of gene's expressions - seek(DATA_OFFSET + (index * _sampleSize * sizeof(Expression))); + EDEBUG_FUNC(this,&geneNames,&sampleNames); - // read in all expressions for gene as block of floats - for ( int i = 0; i < _sampleSize; ++i ) + // create a metadata array of gene names + EMetaArray metaGeneNames; + for ( auto& geneName : geneNames ) { - stream() >> expressions[i]; + metaGeneNames.append(geneName); } -} - - - - - -void ExpressionMatrix::writeGene(int index, const Expression* expressions) -{ - // seek to position of beginning of gene's expressions - seek(DATA_OFFSET + (index * _sampleSize * sizeof(Expression))); - - // overwrite all expressions for gene as block of floats - for ( int i = 0; i < _sampleSize; ++i ) + // create a metadata array of sample names + EMetaArray metaSampleNames; + for ( auto& sampleName : sampleNames ) { - stream() << expressions[i]; + metaSampleNames.append(sampleName); } -} + // save the gene names and sample names to metadata + EMetaObject metaObject {meta().toObject()}; + metaObject.insert("genes",metaGeneNames); + metaObject.insert("samples",metaSampleNames); + setMeta(metaObject); - - - - -void ExpressionMatrix::Gene::read(int index) const -{ - // make sure given gene index is within range - if ( index < 0 || index >= _matrix->_geneSize ) - { - E_MAKE_EXCEPTION(e); - e.setTitle(tr("Domain Error")); - e.setDetails(tr("Attempting to read gene %1 when maximum is %2.").arg(index) - .arg(_matrix->_geneSize-1)); - throw e; - } - - // read gene expressions from data object - _matrix->readGene(index,_expressions); + // initialize the gene size and sample size accordingly + _geneSize = geneNames.size(); + _sampleSize = sampleNames.size(); } @@ -354,39 +235,30 @@ void ExpressionMatrix::Gene::read(int index) const -void ExpressionMatrix::Gene::write(int 
index) +/*! + * Seek to a particular expression in this expression matrix given a gene index + * and a sample index. + * + * @param gene + * @param sample + */ +void ExpressionMatrix::seekExpression(int gene, int sample) const { - // make sure given gene index is within range - if ( index < 0 || index >= _matrix->_geneSize ) - { - E_MAKE_EXCEPTION(e); - e.setTitle(tr("Domain Error")); - e.setDetails(tr("Attempting to write gene %1 when maximum is %2.").arg(index) - .arg(_matrix->_geneSize-1)); - throw e; - } + EDEBUG_FUNC(this,gene,sample); - // write gene expressions to data object - _matrix->writeGene(index,_expressions); -} - - - - - - -ExpressionMatrix::Expression& ExpressionMatrix::Gene::at(int index) -{ - // make sure given sample index is within range - if ( index < 0 || index >= _matrix->_sampleSize ) + // make sure that the indices are valid + if ( gene < 0 || gene >= _geneSize || sample < 0 || sample >= _sampleSize ) { E_MAKE_EXCEPTION(e); - e.setTitle(tr("Domain Error")); - e.setDetails(tr("Attempting to access gene expression %1 when maximum is %2.").arg(index) - .arg(_matrix->_sampleSize-1)); + e.setTitle(tr("Invalid Argument")); + e.setDetails(tr("Invalid (gene,sample) index (%1,%2) with size of (%3,%4).") + .arg(gene) + .arg(sample) + .arg(_geneSize) + .arg(_sampleSize)); throw e; } - // return gene expression - return _expressions[index]; + // seek to the specified position in the data + seek(_headerSize + ((qint64)gene*(qint64)_sampleSize + (qint64)sample)*sizeof(float)); } diff --git a/src/core/expressionmatrix.h b/src/core/expressionmatrix.h index e11d9fc..6a84d99 100644 --- a/src/core/expressionmatrix.h +++ b/src/core/expressionmatrix.h @@ -1,68 +1,55 @@ #ifndef EXPRESSIONMATRIX_H #define EXPRESSIONMATRIX_H #include +// +/*! + * This class implements the expression matrix data object. An expression matrix + * is a matrix of real numbers whose rows represent genes and whose columns + * represent samples. 
The matrix data can be accessed using the gene iterator, + * which iterates through each gene (row) in the matrix. + */ class ExpressionMatrix : public EAbstractData { Q_OBJECT public: - using Expression = float; - static const QStringList TRANSFORM_NAMES; - enum class Transform - { - None - ,NLog - ,Log2 - ,Log10 - }; class Gene; +public: virtual qint64 dataEnd() const override final; virtual void readData() override final; virtual void writeNewData() override final; - virtual QAbstractTableModel* model() override final; virtual void finish() override final; - QVariant headerData(int section, Qt::Orientation orientation, int role) const; - int rowCount(const QModelIndex& parent) const; - int columnCount(const QModelIndex& parent) const; - QVariant data(const QModelIndex& index, int role) const; - void initialize(QStringList geneNames, QStringList sampleNames); - Transform getTransform() const; - void setTransform(Transform scale); - qint32 getGeneSize() const { return _geneSize; } - qint32 getSampleSize() const { return _sampleSize; } - qint64 getRawSize() const; - Expression* dumpRawData() const; - EMetadata getGeneNames() const; - EMetadata getSampleNames() const; -private: - void readGene(int index, Expression* expressions) const; - void writeGene(int index, const Expression* expressions); - static const int DATA_OFFSET {8}; - qint32 _geneSize {0}; - qint32 _sampleSize {0}; -}; - - - -class ExpressionMatrix::Gene -{ + virtual QAbstractTableModel* model() override final; public: - Gene(ExpressionMatrix* matrix): - _expressions(new Expression[matrix->_sampleSize]), - _matrix(matrix) - {} - Gene(const Gene&) = delete; - ~Gene() { delete _expressions; } - void read(int index) const; - void write(int index); - Expression& at(int index); - const Expression& at(int index) const; - Expression& operator[](int index) { return _expressions[index]; } + qint32 geneSize() const; + qint32 sampleSize() const; + EMetaArray geneNames() const; + EMetaArray sampleNames() const; + 
QVector dumpRawData() const; + void initialize(const QStringList& geneNames, const QStringList& sampleNames); +private: + class Model; private: - Expression* _expressions; - ExpressionMatrix* _matrix; + void seekExpression(int gene, int sample) const; + /*! + * The header size (in bytes) at the beginning of the file. The header + * consists of the gene size and the sample size. + */ + constexpr static const qint64 _headerSize {8}; + /*! + * The number of genes (rows) in the expression matrix. + */ + qint32 _geneSize; + /*! + * The number of samples (columns) in the expression matrix. + */ + qint32 _sampleSize; + /*! + * Pointer to a qt table model for this class. + */ + Model* _model {nullptr}; }; diff --git a/src/core/expressionmatrix_gene.cpp b/src/core/expressionmatrix_gene.cpp new file mode 100644 index 0000000..7d11e1b --- /dev/null +++ b/src/core/expressionmatrix_gene.cpp @@ -0,0 +1,222 @@ +#include "expressionmatrix_gene.h" +// + + + + + + +/*! + * Return the expression value at the given index. + * + * @param index + */ +float& ExpressionMatrix::Gene::operator[](int index) +{ + EDEBUG_FUNC(this,index); + + // make sure the index is valid + if ( index < 0 || index >= _matrix->_sampleSize ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Domain Error")); + e.setDetails(tr("Attempting to access gene expression %1 when maximum is %2.").arg(index) + .arg(_matrix->_sampleSize-1)); + throw e; + } + + // return the specified value + return _expressions[index]; +} + + + + + + +/*! + * Construct a gene iterator for an expression matrix. Additionally, if the + * matrix is already initialized, read the first gene into memory. + * + * @param matrix + * @param isInitialized + */ +ExpressionMatrix::Gene::Gene(ExpressionMatrix* matrix, bool isInitialized): + _matrix(matrix), + _expressions(new float[matrix->sampleSize()]) +{ + EDEBUG_FUNC(this,matrix,isInitialized); + + if ( isInitialized ) + { + read(_index); + } +} + + + + + + +/*! + * Destruct a gene iterator. 
+ */ +ExpressionMatrix::Gene::~Gene() +{ + EDEBUG_FUNC(this); + + delete[] _expressions; +} + + + + + + +/*! + * Read a row of the expression matrix from the data object file into memory. + * + * @param index + */ +void ExpressionMatrix::Gene::read(int index) +{ + EDEBUG_FUNC(this,index); + + // make sure the index is valid + if ( index < 0 || index >= _matrix->_geneSize ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Domain Error")); + e.setDetails(tr("Attempting to read gene %1 when maximum is %2.").arg(index) + .arg(_matrix->_geneSize-1)); + throw e; + } + + // seek to the beginning of the specified row in the data object file + _matrix->seekExpression(index,0); + + // read the entire row into memory + for (int i = 0; i < _matrix->sampleSize() ;++i) + { + _matrix->stream() >> _expressions[i]; + } + + // set the iterator's current index + _index = index; +} + + + + + + +/*! + * Read the next row of the expression matrix into memory. + */ +bool ExpressionMatrix::Gene::readNext() +{ + EDEBUG_FUNC(this); + + // make sure that there is another row in the expression matrix + if ( (_index + 1) >= _matrix->_geneSize ) + { + return false; + } + + // read the next row + read(_index + 1); + + // return success + return true; +} + + + + + + +/*! + * Write the iterator's row data to the data object file corresponding to + * the given row index. 
+ * + * @param index + */ +void ExpressionMatrix::Gene::write(int index) +{ + EDEBUG_FUNC(this,index); + + // make sure the index is valid + if ( index < 0 || index >= _matrix->_geneSize ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Domain Error")); + e.setDetails(tr("Attempting to write gene %1 when maximum is %2.").arg(index) + .arg(_matrix->_geneSize-1)); + throw e; + } + + // seek to the beginning of the specified row in the data object file + _matrix->seekExpression(index,0); + + // write the entire row to the data object + for (int i = 0; i < _matrix->sampleSize() ;++i) + { + _matrix->stream() << _expressions[i]; + } + + // set the iterator's current index + _index = index; +} + + + + + + +/*! + * Write the iterator's row data to the next row in the data object file. + */ +bool ExpressionMatrix::Gene::writeNext() +{ + EDEBUG_FUNC(this); + + // make sure there is another row in the expression matrix + if ( (_index + 1) >= _matrix->_geneSize ) + { + return false; + } + + // write to the next row + write(_index + 1); + + // return success + return true; +} + + + + + + +/*! + * Return the expression value at the given index. + * + * @param index + */ +float ExpressionMatrix::Gene::at(int index) const +{ + EDEBUG_FUNC(this,index); + + // make sure the index is valid + if ( index < 0 || index >= _matrix->_sampleSize ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Domain Error")); + e.setDetails(tr("Attempting to access gene expression %1 when maximum is %2.").arg(index) + .arg(_matrix->_sampleSize-1)); + throw e; + } + + // return the specified value + return _expressions[index]; +} diff --git a/src/core/expressionmatrix_gene.h b/src/core/expressionmatrix_gene.h new file mode 100644 index 0000000..4aeef15 --- /dev/null +++ b/src/core/expressionmatrix_gene.h @@ -0,0 +1,43 @@ +#ifndef EXPRESSIONMATRIX_GENE_H +#define EXPRESSIONMATRIX_GENE_H +#include "expressionmatrix.h" +// + + + +/*! 
+ * This class implements the gene iterator for the expression matrix data + * object. The gene iterator can read from or write to any gene (row) in the + * expression matrix, or simply iterate through each row. The iterator stores + * only one row of expression data in memory at a time. + */ +class ExpressionMatrix::Gene +{ +public: + float& operator[](int index); +public: + Gene(ExpressionMatrix* matrix, bool isInitialized = false); + ~Gene(); + void read(int index); + bool readNext(); + void write(int index); + bool writeNext(); + float at(int index) const; +private: + /*! + * Pointer to the parent expression matrix. + */ + ExpressionMatrix* _matrix; + /*! + * The iterator's current position in the expression matrix. + */ + int _index {0}; + /*! + * Pointer to the expression data of the current gene. + */ + float* _expressions; +}; + + + +#endif diff --git a/src/core/expressionmatrix_model.cpp b/src/core/expressionmatrix_model.cpp new file mode 100644 index 0000000..2a5ddee --- /dev/null +++ b/src/core/expressionmatrix_model.cpp @@ -0,0 +1,148 @@ +#include "expressionmatrix_model.h" +// + + + + + + +/*! + * Construct a table model for an expression matrix. + * + * @param matrix + */ +ExpressionMatrix::Model::Model(ExpressionMatrix* matrix): + _matrix(matrix) +{ + EDEBUG_FUNC(this,matrix); + + setParent(matrix); +} + + + + + + +/*! + * Return a header name for the table model using a given index and + * orientation (row / column). 
+ * + * @param section + * @param orientation + * @param role + */ +QVariant ExpressionMatrix::Model::headerData(int section, Qt::Orientation orientation, int role) const +{ + EDEBUG_FUNC(this,section,orientation,role); + + // make sure the role is valid + if ( role != Qt::DisplayRole ) + { + return QVariant(); + } + + // determine whether to return a row name or column name + switch (orientation) + { + case Qt::Vertical: + { + // get gene names + EMetaArray geneNames {_matrix->geneNames()}; + + // make sure the index is valid + if ( section >= 0 && section < geneNames.size() ) + { + // return the specified row name + return geneNames.at(section).toString(); + } + + // otherwise return empty string + return QVariant(); + } + case Qt::Horizontal: + { + // get sample names + EMetaArray samples {_matrix->sampleNames()}; + + // make sure the index is valid + if ( section >= 0 && section < samples.size() ) + { + // return the specified column name + return samples.at(section).toString(); + } + + // otherwise return empty string + return QVariant(); + } + default: + // return empty string if orientation is not valid + return QVariant(); + } +} + + + + + + +/*! + * Return the number of rows in the table model. + */ +int ExpressionMatrix::Model::rowCount(const QModelIndex&) const +{ + EDEBUG_FUNC(this); + + return _matrix->_geneSize; +} + + + + + + +/*! + * Return the number of columns in the table model. + */ +int ExpressionMatrix::Model::columnCount(const QModelIndex&) const +{ + EDEBUG_FUNC(this); + + return _matrix->_sampleSize; +} + + + + + + +/*! + * Return a data element in the table model using the given index. 
+ * + * @param index + * @param role + */ +QVariant ExpressionMatrix::Model::data(const QModelIndex& index, int role) const +{ + EDEBUG_FUNC(this,&index,role); + + // make sure the index and role are valid + if ( !index.isValid() || role != Qt::DisplayRole ) + { + return QVariant(); + } + + // make sure the index is within the bounds of the expression matrix + if ( index.row() >= _matrix->_geneSize || index.column() >= _matrix->_sampleSize ) + { + return QVariant(); + } + + // get the specified value from the expression matrix + float value; + _matrix->seekExpression(index.row(),index.column()); + _matrix->stream() >> value; + + // return the specified value + return value; +} diff --git a/src/core/expressionmatrix_model.h b/src/core/expressionmatrix_model.h new file mode 100644 index 0000000..a5b29e3 --- /dev/null +++ b/src/core/expressionmatrix_model.h @@ -0,0 +1,29 @@ +#ifndef EXPRESSIONMATRIX_MODEL_H +#define EXPRESSIONMATRIX_MODEL_H +#include "expressionmatrix.h" +// + + + +/*! + * This class implements the qt table model for the expression matrix + * data object, which represents the expression matrix as a table. + */ +class ExpressionMatrix::Model : public QAbstractTableModel +{ +public: + Model(ExpressionMatrix* matrix); + virtual QVariant headerData(int section, Qt::Orientation orientation, int role) const override final; + virtual int rowCount(const QModelIndex& parent) const override final; + virtual int columnCount(const QModelIndex& parent) const override final; + virtual QVariant data(const QModelIndex& index, int role) const override final; +private: + /*! + * Pointer to the data object for this table model. 
+ */ + ExpressionMatrix* _matrix; +}; + + + +#endif diff --git a/src/core/extract.cpp b/src/core/extract.cpp index c62fe2f..50684f8 100644 --- a/src/core/extract.cpp +++ b/src/core/extract.cpp @@ -1,6 +1,7 @@ #include "extract.h" #include "extract_input.h" #include "datafactory.h" +#include "expressionmatrix_gene.h" @@ -11,9 +12,16 @@ using namespace std; +/*! + * Return the total number of blocks this analytic must process as steps + * or blocks of work. This implementation uses a work block for writing + * each pair to the output file. + */ int Extract::size() const { - return 1; + EDEBUG_FUNC(this); + + return _cmx->size(); } @@ -21,251 +29,290 @@ int Extract::size() const +/*! + * Process the given index with a possible block of results if this analytic + * produces work blocks. This implementation uses only the index of the result + * block to determine which piece of work to do. + * + * @param result + */ void Extract::process(const EAbstractAnalytic::Block* result) { - Q_UNUSED(result); + EDEBUG_FUNC(this,result); + + // write pair according to the output format + switch ( _outputFormat ) + { + case OutputFormat::Text: + writeTextFormat(result->index()); + break; + case OutputFormat::GraphML: + writeGraphMLFormat(result->index()); + break; + } +} + + + + - // initialize pair iterators - CorrelationMatrix::Pair cmxPair(_cmx); - CCMatrix::Pair ccmPair(_ccm); + +/*! + * Write the next pair using the text format. 
+ * + * @param index + */ +void Extract::writeTextFormat(int index) +{ + EDEBUG_FUNC(this); // get gene names - EMetaArray geneNames {_cmx->geneNames().toArray()}; + EMetaArray geneNames {_cmx->geneNames()}; // initialize workspace QString sampleMask(_ccm->sampleSize(), '0'); - // create text stream to output file and write until end reached - QTextStream stream(_output); - stream.setRealNumberPrecision(6); - // write header to file - stream - << "Source" - << "\t" << "Target" - << "\t" << "sc" - << "\t" << "Interaction" - << "\t" << "Cluster" - << "\t" << "Num_Clusters" - << "\t" << "Cluster_Samples" - << "\t" << "Missing_Samples" - << "\t" << "Cluster_Outliers" - << "\t" << "Pair_Outliers" - << "\t" << "Too_Low" - << "\t" << "Samples" - << "\n"; - - // increment through all gene pairs - while ( cmxPair.hasNext() ) + if ( index == 0 ) { - // read next gene pair - cmxPair.readNext(); + _stream + << "Source" + << "\t" << "Target" + << "\t" << "sc" + << "\t" << "Interaction" + << "\t" << "Cluster" + << "\t" << "Num_Clusters" + << "\t" << "Cluster_Samples" + << "\t" << "Missing_Samples" + << "\t" << "Cluster_Outliers" + << "\t" << "Pair_Outliers" + << "\t" << "Too_Low" + << "\t" << "Samples" + << "\n"; + } - if ( cmxPair.clusterSize() > 1 ) + // read next pair + _cmxPair.readNext(); + _ccmPair.read(_cmxPair.index()); + + // write pairwise data to output file + for ( int k = 0; k < _cmxPair.clusterSize(); k++ ) + { + QString source {geneNames.at(_cmxPair.index().getX()).toString()}; + QString target {geneNames.at(_cmxPair.index().getY()).toString()}; + float correlation {_cmxPair.at(k, 0)}; + QString interaction {"co"}; + int numSamples {0}; + int numMissing {0}; + int numPostOutliers {0}; + int numPreOutliers {0}; + int numThreshold {0}; + + // exclude cluster if correlation is not within thresholds + if ( fabs(correlation) < _minCorrelation || _maxCorrelation < fabs(correlation) ) { - ccmPair.read(cmxPair.index()); + continue; } - // write gene pair data to output 
file - for ( int k = 0; k < cmxPair.clusterSize(); k++ ) + // if cluster data exists then use it + if ( _ccmPair.clusterSize() > 0 ) { - auto& source {geneNames.at(cmxPair.index().getX()).toString()}; - auto& target {geneNames.at(cmxPair.index().getY()).toString()}; - float correlation {cmxPair.at(k, 0)}; - QString interaction {"co"}; - int numSamples {0}; - int numMissing {0}; - int numPostOutliers {0}; - int numPreOutliers {0}; - int numThreshold {0}; - - // exclude cluster if correlation is not within thresholds - if ( fabs(correlation) < _minCorrelation || _maxCorrelation < fabs(correlation) ) - { - continue; - } - - // if there are multiple clusters then use cluster data - if ( cmxPair.clusterSize() > 1 ) + // compute summary statistics + for ( int i = 0; i < _ccm->sampleSize(); i++ ) { - // compute summary statistics - for ( int i = 0; i < _ccm->sampleSize(); i++ ) + switch ( _ccmPair.at(k, i) ) { - switch ( ccmPair.at(k, i) ) - { - case 1: - numSamples++; - break; - case 6: - numThreshold++; - break; - case 7: - numPreOutliers++; - break; - case 8: - numPostOutliers++; - break; - case 9: - numMissing++; - break; - } - } - - // write sample mask to string - for ( int i = 0; i < _ccm->sampleSize(); i++ ) - { - sampleMask[i] = '0' + ccmPair.at(k, i); + case 1: + numSamples++; + break; + case 6: + numThreshold++; + break; + case 7: + numPreOutliers++; + break; + case 8: + numPostOutliers++; + break; + case 9: + numMissing++; + break; } } - // otherwise use expression data - else + // write sample mask to string + for ( int i = 0; i < _ccm->sampleSize(); i++ ) { - // read in gene expressions - ExpressionMatrix::Gene gene1(_emx); - ExpressionMatrix::Gene gene2(_emx); + sampleMask[i] = '0' + _ccmPair.at(k, i); + } + } + + // otherwise use expression data + else + { + // read in gene expressions + ExpressionMatrix::Gene gene1(_emx); + ExpressionMatrix::Gene gene2(_emx); - gene1.read(cmxPair.index().getX()); - gene2.read(cmxPair.index().getY()); + 
gene1.read(_cmxPair.index().getX()); + gene2.read(_cmxPair.index().getY()); - // determine sample mask from expression data - for ( int i = 0; i < _emx->getSampleSize(); ++i ) + // determine sample mask, summary statistics from expression data + for ( int i = 0; i < _emx->sampleSize(); ++i ) + { + if ( isnan(gene1.at(i)) || isnan(gene2.at(i)) ) { - if ( isnan(gene1.at(i)) || isnan(gene2.at(i)) ) - { - sampleMask[i] = '9'; - } - else - { - sampleMask[i] = '1'; - } + sampleMask[i] = '9'; + numMissing++; + } + else + { + sampleMask[i] = '1'; + numSamples++; } } - - // write cluster to output file - stream - << source - << "\t" << target - << "\t" << correlation - << "\t" << interaction - << "\t" << k - << "\t" << cmxPair.clusterSize() - << "\t" << numSamples - << "\t" << numMissing - << "\t" << numPostOutliers - << "\t" << numPreOutliers - << "\t" << numThreshold - << "\t" << sampleMask - << "\n"; } + + // write cluster to output file + _stream + << source + << "\t" << target + << "\t" << correlation + << "\t" << interaction + << "\t" << k + << "\t" << _cmxPair.clusterSize() + << "\t" << numSamples + << "\t" << numMissing + << "\t" << numPostOutliers + << "\t" << numPreOutliers + << "\t" << numThreshold + << "\t" << sampleMask + << "\n"; } // make sure writing output file worked - if ( stream.status() != QTextStream::Ok ) + if ( _stream.status() != QTextStream::Ok ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("File IO Error")); e.setDetails(tr("Qt Text Stream encountered an unknown error.")); throw e; } +} - // reset gene pair iterator - cmxPair.reset(); - // create text stream to graphml file and write until end reached - stream.setDevice(_graphml); - // write header to file - stream - << "\n" - << "\n" - << " \n"; - - // write each node to file - for ( int i = 0; i < _cmx->geneSize(); i++ ) + + + +/*! + * Write the next pair using the GraphML format. 
+ * + * @param index + */ +void Extract::writeGraphMLFormat(int index) +{ + EDEBUG_FUNC(this); + + // get gene names + EMetaArray geneNames {_cmx->geneNames()}; + + // initialize workspace + QString sampleMask(_ccm->sampleSize(), '0'); + + if ( index == 0 ) { - auto& id {geneNames.at(i).toString()}; + // write header to file + _stream + << "\n" + << "\n" + << " \n"; + + // write node list to file + for ( int i = 0; i < _cmx->geneSize(); i++ ) + { + QString id {geneNames.at(i).toString()}; - stream << " \n"; + _stream << " \n"; + } + } + + // read next pair + _cmxPair.readNext(); + + if ( _cmxPair.clusterSize() > 1 ) + { + _ccmPair.read(_cmxPair.index()); } - // increment through all gene pairs - while ( cmxPair.hasNext() ) + // write pairwise data to net file + for ( int k = 0; k < _cmxPair.clusterSize(); k++ ) { - // read next gene pair - cmxPair.readNext(); + QString source {geneNames.at(_cmxPair.index().getX()).toString()}; + QString target {geneNames.at(_cmxPair.index().getY()).toString()}; + float correlation {_cmxPair.at(k, 0)}; - if ( cmxPair.clusterSize() > 1 ) + // exclude edge if correlation is not within thresholds + if ( fabs(correlation) < _minCorrelation || _maxCorrelation < fabs(correlation) ) { - ccmPair.read(cmxPair.index()); + continue; } - // write gene pair edges to file - for ( int k = 0; k < cmxPair.clusterSize(); k++ ) + // if there are multiple clusters then use cluster data + if ( _cmxPair.clusterSize() > 1 ) { - auto& source {geneNames.at(cmxPair.index().getX()).toString()}; - auto& target {geneNames.at(cmxPair.index().getY()).toString()}; - float correlation {cmxPair.at(k, 0)}; - - // exclude edge if correlation is not within thresholds - if ( fabs(correlation) < _minCorrelation || _maxCorrelation < fabs(correlation) ) + // write sample mask to string + for ( int i = 0; i < _ccm->sampleSize(); i++ ) { - continue; + sampleMask[i] = '0' + _ccmPair.at(k, i); } + } - // if there are multiple clusters then use cluster data - if ( 
cmxPair.clusterSize() > 1 ) + // otherwise use expression data + else + { + // read in gene expressions + ExpressionMatrix::Gene gene1(_emx); + ExpressionMatrix::Gene gene2(_emx); + + gene1.read(_cmxPair.index().getX()); + gene2.read(_cmxPair.index().getY()); + + // determine sample mask from expression data + for ( int i = 0; i < _emx->sampleSize(); ++i ) { - // write sample mask to string - for ( int i = 0; i < _ccm->sampleSize(); i++ ) + if ( isnan(gene1.at(i)) || isnan(gene2.at(i)) ) { - sampleMask[i] = '0' + ccmPair.at(k, i); + sampleMask[i] = '9'; } - } - - // otherwise use expression data - else - { - // read in gene expressions - ExpressionMatrix::Gene gene1(_emx); - ExpressionMatrix::Gene gene2(_emx); - - gene1.read(cmxPair.index().getX()); - gene2.read(cmxPair.index().getY()); - - // determine sample mask from expression data - for ( int i = 0; i < _emx->getSampleSize(); ++i ) + else { - if ( isnan(gene1.at(i)) || isnan(gene2.at(i)) ) - { - sampleMask[i] = '9'; - } - else - { - sampleMask[i] = '1'; - } + sampleMask[i] = '1'; } } - - // write edge to file - stream - << " \n"; } + + // write edge to file + _stream + << " \n"; } // write footer to file - stream - << " \n" - << "\n"; + if ( index == size() - 1 ) + { + _stream + << " \n" + << "\n"; + } - // make sure writing graphml file worked - if ( stream.status() != QTextStream::Ok ) + // make sure writing output file worked + if ( _stream.status() != QTextStream::Ok ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("File IO Error")); @@ -279,8 +326,13 @@ void Extract::process(const EAbstractAnalytic::Block* result) +/*! + * Make a new input object and return its pointer. + */ EAbstractAnalytic::Input* Extract::makeInput() { + EDEBUG_FUNC(this); + return new Input(this); } @@ -289,8 +341,15 @@ EAbstractAnalytic::Input* Extract::makeInput() +/*! + * Initialize this analytic. This implementation checks to make sure the input + * data objects and output file have been set. 
+ */ void Extract::initialize() { + EDEBUG_FUNC(this); + + // make sure input/output arguments are valid if ( !_emx || !_ccm || !_cmx || !_output ) { E_MAKE_EXCEPTION(e); @@ -298,4 +357,12 @@ void Extract::initialize() e.setDetails(tr("Did not get valid input and/or output arguments.")); throw e; } + + // initialize pairwise iterators + _ccmPair = CCMatrix::Pair(_ccm); + _cmxPair = CorrelationMatrix::Pair(_cmx); + + // initialize output file stream + _stream.setDevice(_output); + _stream.setRealNumberPrecision(8); } diff --git a/src/core/extract.h b/src/core/extract.h index d36e0b8..ed3f7e9 100644 --- a/src/core/extract.h +++ b/src/core/extract.h @@ -2,12 +2,23 @@ #define EXTRACT_H #include +#include "ccmatrix_pair.h" #include "ccmatrix.h" +#include "correlationmatrix_pair.h" #include "correlationmatrix.h" #include "expressionmatrix.h" +/*! + * This class implements the extract analytic. This analytic is very similar to + * the export correlation matrix analytic, except for a few differences: (1) this + * analytic uses a slightly different format for the text file, (2) this analytic + * can apply a correlation threshold, and (3) this analytic can optionally write + * a GraphML file. The key difference is that this analytic "extracts" a network + * from the correlation matrix and writes an edge list rather than a correlation + * list. + */ class Extract : public EAbstractAnalytic { Q_OBJECT @@ -18,12 +29,55 @@ class Extract : public EAbstractAnalytic virtual EAbstractAnalytic::Input* makeInput() override final; virtual void initialize(); private: + /*! + * Defines the output formats this analytic supports. + */ + enum class OutputFormat + { + /*! + * Text format + */ + Text + /*! + * GraphML format + */ + ,GraphML + }; + void writeTextFormat(int index); + void writeGraphMLFormat(int index); + /** + * Workspace variables to write to the output file + */ + QTextStream _stream; + CCMatrix::Pair _ccmPair; + CorrelationMatrix::Pair _cmxPair; + /*! 
+ * Pointer to the input expression matrix. + */ ExpressionMatrix* _emx {nullptr}; + /*! + * Pointer to the input cluster matrix. + */ CCMatrix* _ccm {nullptr}; + /*! + * Pointer to the input correlation matrix. + */ CorrelationMatrix* _cmx {nullptr}; + /*! + * The output format to use. + */ + OutputFormat _outputFormat {OutputFormat::Text}; + /*! + * Pointer to the output text file. + */ QFile* _output {nullptr}; - QFile* _graphml {nullptr}; + /*! + * The minimum (absolute) correlation threshold. + */ float _minCorrelation {0.85}; + /*! + * The maximum (absolute) correlation threshold. + */ float _maxCorrelation {1.00}; }; diff --git a/src/core/extract_input.cpp b/src/core/extract_input.cpp index 8be4809..5d264de 100644 --- a/src/core/extract_input.cpp +++ b/src/core/extract_input.cpp @@ -3,25 +3,46 @@ -using namespace std; +/*! + * String list of output formats for this analytic that correspond exactly + * to its enumeration. Used for handling the output format argument for this + * input object. + */ +const QStringList Extract::Input::FORMAT_NAMES +{ + "text" + ,"graphml" +}; +/*! + * Construct a new input object with the given analytic as its parent. + * + * @param parent + */ Extract::Input::Input(Extract* parent): EAbstractAnalytic::Input(parent), _base(parent) -{} +{ + EDEBUG_FUNC(this,parent); +} +/*! + * Return the total number of arguments this analytic type contains. + */ int Extract::Input::size() const { + EDEBUG_FUNC(this); + return Total; } @@ -30,15 +51,22 @@ int Extract::Input::size() const +/*! + * Return the argument type for a given index. 
+ * + * @param index + */ EAbstractAnalytic::Input::Type Extract::Input::type(int index) const { + EDEBUG_FUNC(this,index); + switch (index) { case ExpressionData: return Type::DataIn; case ClusterData: return Type::DataIn; case CorrelationData: return Type::DataIn; + case OutputFormatArg: return Type::Selection; case OutputFile: return Type::FileOut; - case GraphMLFile: return Type::FileOut; case MinCorrelation: return Type::Double; case MaxCorrelation: return Type::Double; default: return Type::Boolean; @@ -50,8 +78,16 @@ EAbstractAnalytic::Input::Type Extract::Input::type(int index) const +/*! + * Return data for a given role on an argument with the given index. + * + * @param index + * @param role + */ QVariant Extract::Input::data(int index, Role role) const { + EDEBUG_FUNC(this,index,role); + switch (index) { case ExpressionData: @@ -81,22 +117,22 @@ QVariant Extract::Input::data(int index, Role role) const case Role::DataType: return DataFactory::CorrelationMatrixType; default: return QVariant(); } - case OutputFile: + case OutputFormatArg: switch (role) { - case Role::CommandLineName: return QString("output"); - case Role::Title: return tr("Output File:"); - case Role::WhatsThis: return tr("Output text file that will contain network edges."); - case Role::FileFilters: return tr("Text file %1").arg("(*.txt)"); + case Role::CommandLineName: return QString("format"); + case Role::Title: return tr("Output Format:"); + case Role::WhatsThis: return tr("Format to use for the output file."); + case Role::SelectionValues: return FORMAT_NAMES; + case Role::Default: return "text"; default: return QVariant(); } - case GraphMLFile: + case OutputFile: switch (role) { - case Role::CommandLineName: return QString("graphml"); - case Role::Title: return tr("GraphML File:"); - case Role::WhatsThis: return tr("Output text file that will contain network in GraphML format."); - case Role::FileFilters: return tr("GraphML file %1").arg("(*.graphml)"); + case Role::CommandLineName: 
return QString("output"); + case Role::Title: return tr("Output File:"); + case Role::WhatsThis: return tr("Output file that will contain network in the specified format."); default: return QVariant(); } case MinCorrelation: @@ -130,10 +166,21 @@ QVariant Extract::Input::data(int index, Role role) const +/*! + * Set an argument with the given index to the given value. + * + * @param index + * @param value + */ void Extract::Input::set(int index, const QVariant& value) { + EDEBUG_FUNC(this,index,&value); + switch (index) { + case OutputFormatArg: + _base->_outputFormat = static_cast(FORMAT_NAMES.indexOf(value.toString())); + break; case MinCorrelation: _base->_minCorrelation = value.toFloat(); break; @@ -148,8 +195,16 @@ void Extract::Input::set(int index, const QVariant& value) +/*! + * Set a data argument with the given index to the given data object pointer. + * + * @param index + * @param data + */ void Extract::Input::set(int index, EAbstractData* data) { + EDEBUG_FUNC(this,index,data); + if ( index == ExpressionData ) { _base->_emx = data->cast(); @@ -169,14 +224,18 @@ void Extract::Input::set(int index, EAbstractData* data) +/*! + * Set a file argument with the given index to the given qt file pointer. + * + * @param index + * @param file + */ void Extract::Input::set(int index, QFile* file) { + EDEBUG_FUNC(this,index,file); + if ( index == OutputFile ) { _base->_output = file; } - else if ( index == GraphMLFile ) - { - _base->_graphml = file; - } } diff --git a/src/core/extract_input.h b/src/core/extract_input.h index 45803cd..eefaf0f 100644 --- a/src/core/extract_input.h +++ b/src/core/extract_input.h @@ -4,17 +4,23 @@ +/*! + * This class implements the abstract input of the extract analytic. + */ class Extract::Input : public EAbstractAnalytic::Input { Q_OBJECT public: + /*! + * Defines all input arguments for this analytic. 
+ */ enum Argument { ExpressionData = 0 ,ClusterData ,CorrelationData + ,OutputFormatArg ,OutputFile - ,GraphMLFile ,MinCorrelation ,MaxCorrelation ,Total @@ -27,6 +33,10 @@ class Extract::Input : public EAbstractAnalytic::Input virtual void set(int index, EAbstractData* data) override final; virtual void set(int index, QFile* file) override final; private: + static const QStringList FORMAT_NAMES; + /*! + * Pointer to the base analytic for this object. + */ Extract* _base; }; diff --git a/src/core/importcorrelationmatrix.cpp b/src/core/importcorrelationmatrix.cpp index d905090..e437a88 100644 --- a/src/core/importcorrelationmatrix.cpp +++ b/src/core/importcorrelationmatrix.cpp @@ -1,16 +1,22 @@ #include "importcorrelationmatrix.h" #include "importcorrelationmatrix_input.h" #include "datafactory.h" -#include "pairwise_index.h" +/*! + * Return the total number of blocks this analytic must process as steps + * or blocks of work. This implementation uses a work block for each line + * of the input file. + */ int ImportCorrelationMatrix::size() const { - return 1; + EDEBUG_FUNC(this); + + return _numLines; } @@ -19,132 +25,106 @@ int ImportCorrelationMatrix::size() const +/*! + * Process the given index with a possible block of results if this analytic + * produces work blocks. This implementation uses only the index of the result + * block to determine which piece of work to do. 
+ * + * @param result + */ void ImportCorrelationMatrix::process(const EAbstractAnalytic::Block* result) { - Q_UNUSED(result); + EDEBUG_FUNC(this,result); - // build gene name metadata - EMetaArray metaGeneNames; - for ( int i = 0; i < _geneSize; ++i ) - { - metaGeneNames.append(QString::number(i)); - } + // read a line from input file + QString line = _stream.readLine(); + auto words = line.splitRef(QRegExp("\\s+"), QString::SkipEmptyParts); - // build sample name metadata - EMetaArray metaSampleNames; - for ( int i = 0; i < _sampleSize; ++i ) + // make sure the line is valid + if ( words.size() == 11 ) { - metaSampleNames.append(QString::number(i)); - } - - // build correlation name metadata - EMetaArray metaCorrelationNames; - metaCorrelationNames.append(_correlationName); + int geneX = words[0].toInt(); + int geneY = words[1].toInt(); + float correlation = words[9].toFloat(); + QStringRef sampleMask = words[10]; - // initialize output data - _ccm->initialize(metaGeneNames, _maxClusterSize, metaSampleNames); - _cmx->initialize(metaGeneNames, _maxClusterSize, metaCorrelationNames); - - Pairwise::Index index; - CCMatrix::Pair ccmPair(_ccm); - CorrelationMatrix::Pair cmxPair(_cmx); + // make sure sample mask has correct length + if ( sampleMask.size() != _sampleSize ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Parsing Error")); + e.setDetails(tr("Encountered sample mask with invalid length %1. 
Sample size is %2.") + .arg(sampleMask.size()) + .arg(_sampleSize)); + throw e; + } - // create text stream from input file and read until end reached - QTextStream stream(_input); - while ( !stream.atEnd() ) - { - // read a line from text file - QString line = stream.readLine(); - auto words = line.splitRef(QRegExp("\\s+"), QString::SkipEmptyParts); + // save previous pair when new pair is read + Pairwise::Index nextIndex(geneX, geneY); - // make sure the line is valid - if ( words.size() == 11 ) + if ( _index != nextIndex ) { - int geneX = words[0].toInt(); - int geneY = words[1].toInt(); - float correlation = words[9].toFloat(); - QStringRef sampleMask = words[10]; - - // make sure sample mask has correct length - if ( sampleMask.size() != _sampleSize ) + // save pairs + if ( _ccmPair.clusterSize() > 1 ) { - E_MAKE_EXCEPTION(e); - e.setTitle(tr("Parsing Error")); - e.setDetails(tr("Encountered sample mask with invalid length %1. " - "Sample size is %2.") - .arg(sampleMask.size()).arg(_sampleSize)); - throw e; + _ccmPair.write(_index); } - // save previous pair when new pair is read - Pairwise::Index nextIndex(geneX, geneY); - - if ( index != nextIndex ) + if ( _cmxPair.clusterSize() > 0 ) { - // save pairs - if ( ccmPair.clusterSize() > 1 ) - { - ccmPair.write(index); - } - - if ( cmxPair.clusterSize() > 0 ) - { - cmxPair.write(index); - } - - // reset pairs - ccmPair.clearClusters(); - cmxPair.clearClusters(); - - // update index - index = nextIndex; + _cmxPair.write(_index); } - // append data to ccm pair and cmx pair - int cluster = ccmPair.clusterSize(); + // reset pairs + _ccmPair.clearClusters(); + _cmxPair.clearClusters(); - ccmPair.addCluster(); - cmxPair.addCluster(); + // update index + _index = nextIndex; + } - for ( int i = 0; i < sampleMask.size(); ++i ) - { - ccmPair.at(cluster, i) = sampleMask[i].digitValue(); - } + // append data to ccm pair and cmx pair + int cluster = _ccmPair.clusterSize(); - cmxPair.at(cluster, 0) = correlation; - } + 
_ccmPair.addCluster(); + _cmxPair.addCluster(); - // save last pair - if ( ccmPair.clusterSize() > 1 ) + for ( int i = 0; i < sampleMask.size(); ++i ) { - ccmPair.write(index); + _ccmPair.at(cluster, i) = sampleMask[i].digitValue(); } - if ( cmxPair.clusterSize() > 0 ) - { - cmxPair.write(index); - } + _cmxPair.at(cluster, 0) = correlation; + } + + // otherwise throw an error + else + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Parsing Error")); + e.setDetails(tr("Encountered line with incorrect amount of fields. " + "Read %1 fields when there should have been %2.") + .arg(words.size()) + .arg(11)); + throw e; + } - // skip empty lines and lines with '#' markers - else if ( words.size() != 1 && words.size() != 0 ) + // save last pair + if ( result->index() == _numLines - 1 ) + { + if ( _ccmPair.clusterSize() > 1 ) { - continue; + _ccmPair.write(_index); } - // otherwise throw an error - else + if ( _cmxPair.clusterSize() > 0 ) { - E_MAKE_EXCEPTION(e); - e.setTitle(tr("Parsing Error")); - e.setDetails(tr("Encountered line with incorrect amount of fields. " - "Read %1 fields when there should have been %2.") - .arg(words.size()).arg(11)); - throw e; + _cmxPair.write(_index); } } // make sure reading input file worked - if ( stream.status() != QTextStream::Ok ) + if ( _stream.status() != QTextStream::Ok ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("File IO Error")); @@ -158,8 +138,13 @@ void ImportCorrelationMatrix::process(const EAbstractAnalytic::Block* result) +/*! + * Make a new input object and return its pointer. + */ EAbstractAnalytic::Input* ImportCorrelationMatrix::makeInput() { + EDEBUG_FUNC(this); + return new Input(this); } @@ -168,8 +153,16 @@ EAbstractAnalytic::Input* ImportCorrelationMatrix::makeInput() +/*! + * Initialize this analytic. This implementation checks to make sure the input + * file and output data objects have been set, and that a correlation name was + * provided. 
+ */ void ImportCorrelationMatrix::initialize() { + EDEBUG_FUNC(this); + + // make sure input/output arguments are valid if ( !_input || !_ccm || !_cmx ) { E_MAKE_EXCEPTION(e); @@ -178,6 +171,7 @@ void ImportCorrelationMatrix::initialize() throw e; } + // make sure correlation name is valid if ( _correlationName.isEmpty() ) { E_MAKE_EXCEPTION(e); @@ -185,4 +179,54 @@ void ImportCorrelationMatrix::initialize() e.setDetails(tr("Correlation name is required.")); throw e; } + + // initialize input file stream + _stream.setDevice(_input); + + // count the number of lines in the input file + _numLines = 0; + + while ( !_stream.atEnd() ) + { + _stream.readLine(); + _numLines++; + } + + // return stream to beginning of the input file + _stream.seek(0); + + // make sure reading input file worked + if ( _stream.status() != QTextStream::Ok ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("File IO Error")); + e.setDetails(tr("Qt Text Stream encountered an unknown error.")); + throw e; + } + + // build gene name metadata + EMetaArray metaGeneNames; + for ( int i = 0; i < _geneSize; ++i ) + { + metaGeneNames.append(QString::number(i)); + } + + // build sample name metadata + EMetaArray metaSampleNames; + for ( int i = 0; i < _sampleSize; ++i ) + { + metaSampleNames.append(QString::number(i)); + } + + // build correlation name metadata + EMetaArray metaCorrelationNames; + metaCorrelationNames.append(_correlationName); + + // initialize output data + _ccm->initialize(metaGeneNames, _maxClusterSize, metaSampleNames); + _cmx->initialize(metaGeneNames, _maxClusterSize, metaCorrelationNames); + + // initialize pairwise iterators + _ccmPair = CCMatrix::Pair(_ccm); + _cmxPair = CorrelationMatrix::Pair(_cmx); } diff --git a/src/core/importcorrelationmatrix.h b/src/core/importcorrelationmatrix.h index c62aaf2..271ac52 100644 --- a/src/core/importcorrelationmatrix.h +++ b/src/core/importcorrelationmatrix.h @@ -2,11 +2,24 @@ #define IMPORTCORRELATIONMATRIX_H #include +#include "ccmatrix_pair.h" 
#include "ccmatrix.h" +#include "correlationmatrix_pair.h" #include "correlationmatrix.h" +/*! + * This class implements the import correlation matrix analytic. This analytic + * reads in a text file of correlations, where each line is a correlation that + * includes the pairwise index, correlation value, and sample mask, as well as + * several other fields which are not used. This analytic produces two data + * objects: a correlation matrix containing the pairwise correlations, and a + * cluster matrix containing the sample masks for each pairwise cluster. There + * are several fields which are not represented in the text file and therefore + * must be specified manually, including the gene size, sample size, max cluster + * size, and correlation name. + */ class ImportCorrelationMatrix : public EAbstractAnalytic { Q_OBJECT @@ -17,12 +30,42 @@ class ImportCorrelationMatrix : public EAbstractAnalytic virtual EAbstractAnalytic::Input* makeInput() override final; virtual void initialize(); private: + /** + * Workspace variables to read from the input file. + */ + QTextStream _stream; + int _numLines {0}; + Pairwise::Index _index {0}; + CCMatrix::Pair _ccmPair; + CorrelationMatrix::Pair _cmxPair; + /*! + * Pointer to the input text file. + */ QFile* _input {nullptr}; + /*! + * Pointer to the output cluster matrix. + */ CCMatrix* _ccm {nullptr}; + /*! + * Pointer to the output correlation matrix. + */ CorrelationMatrix* _cmx {nullptr}; + /*! + * The number of genes in the correlation matrix. + */ qint32 _geneSize {0}; + /*! + * The maximum number of clusters allowed in a single pair of the + * correlation matrix. + */ qint32 _maxClusterSize {1}; + /*! + * The number of samples in the sample masks of the cluster matrix. + */ qint32 _sampleSize {0}; + /*! + * The name of the correlation used in the correlation matrix. 
+ */ QString _correlationName; }; diff --git a/src/core/importcorrelationmatrix_input.cpp b/src/core/importcorrelationmatrix_input.cpp index 1601bf4..985ab9e 100644 --- a/src/core/importcorrelationmatrix_input.cpp +++ b/src/core/importcorrelationmatrix_input.cpp @@ -4,18 +4,30 @@ +/*! + * Construct a new input object with the given analytic as its parent. + * + * @param parent + */ ImportCorrelationMatrix::Input::Input(ImportCorrelationMatrix* parent): EAbstractAnalytic::Input(parent), _base(parent) -{} +{ + EDEBUG_FUNC(this,parent); +} +/*! + * Return the total number of arguments this analytic type contains. + */ int ImportCorrelationMatrix::Input::size() const { + EDEBUG_FUNC(this); + return Total; } @@ -24,8 +36,15 @@ int ImportCorrelationMatrix::Input::size() const +/*! + * Return the argument type for a given index. + * + * @param index + */ EAbstractAnalytic::Input::Type ImportCorrelationMatrix::Input::type(int index) const { + EDEBUG_FUNC(this,index); + switch (index) { case InputFile: return Type::FileIn; @@ -44,8 +63,16 @@ EAbstractAnalytic::Input::Type ImportCorrelationMatrix::Input::type(int index) c +/*! + * Return data for a given role on an argument with the given index. + * + * @param index + * @param role + */ QVariant ImportCorrelationMatrix::Input::data(int index, Role role) const { + EDEBUG_FUNC(this,index,role); + switch (index) { case InputFile: @@ -122,8 +149,16 @@ QVariant ImportCorrelationMatrix::Input::data(int index, Role role) const +/*! + * Set an argument with the given index to the given value. + * + * @param index + * @param value + */ void ImportCorrelationMatrix::Input::set(int index, const QVariant& value) { + EDEBUG_FUNC(this,index,&value); + switch (index) { case GeneSize: @@ -146,8 +181,16 @@ void ImportCorrelationMatrix::Input::set(int index, const QVariant& value) +/*! + * Set a file argument with the given index to the given qt file pointer. 
+ * + * @param index + * @param file + */ void ImportCorrelationMatrix::Input::set(int index, QFile* file) { + EDEBUG_FUNC(this,index,file); + if ( index == InputFile ) { _base->_input = file; @@ -159,8 +202,16 @@ void ImportCorrelationMatrix::Input::set(int index, QFile* file) +/*! + * Set a data argument with the given index to the given data object pointer. + * + * @param index + * @param data + */ void ImportCorrelationMatrix::Input::set(int index, EAbstractData* data) { + EDEBUG_FUNC(this,index,data); + if ( index == ClusterData ) { _base->_ccm = data->cast(); diff --git a/src/core/importcorrelationmatrix_input.h b/src/core/importcorrelationmatrix_input.h index cfd2ed3..0ae567d 100644 --- a/src/core/importcorrelationmatrix_input.h +++ b/src/core/importcorrelationmatrix_input.h @@ -4,10 +4,16 @@ +/*! + * This class implements the abstract input of the import correlation matrix analytic. + */ class ImportCorrelationMatrix::Input : public EAbstractAnalytic::Input { Q_OBJECT public: + /*! + * Defines all input arguments for this analytic. + */ enum Argument { InputFile = 0 @@ -27,6 +33,9 @@ class ImportCorrelationMatrix::Input : public EAbstractAnalytic::Input virtual void set(int index, QFile* file) override final; virtual void set(int index, EAbstractData* data) override final; private: + /*! + * Pointer to the base analytic for this object. + */ ImportCorrelationMatrix* _base; }; diff --git a/src/core/importexpressionmatrix.cpp b/src/core/importexpressionmatrix.cpp index 3c6d30c..3472ea2 100644 --- a/src/core/importexpressionmatrix.cpp +++ b/src/core/importexpressionmatrix.cpp @@ -1,15 +1,24 @@ #include "importexpressionmatrix.h" #include "importexpressionmatrix_input.h" #include "datafactory.h" +#include "expressionmatrix_gene.h" +/*! + * Return the total number of blocks this analytic must process as steps + * or blocks of work. 
This implementation uses a work block for reading + * each line of the input file, plus one work block to create the output + * data object. + */ int ImportExpressionMatrix::size() const { - return 1; + EDEBUG_FUNC(this); + + return _numLines + 1; } @@ -17,74 +26,77 @@ int ImportExpressionMatrix::size() const +/*! + * Process the given index with a possible block of results if this analytic + * produces work blocks. This implementation uses only the index of the result + * block to determine which piece of work to do. + * + * @param result + */ void ImportExpressionMatrix::process(const EAbstractAnalytic::Block* result) { - Q_UNUSED(result); - - // use expression declaration - using Expression = ExpressionMatrix::Expression; + EDEBUG_FUNC(this, result); - // structure for building list of genes - struct Gene + // read or create the sample names in the first step + if ( result->index() == 0 ) { - Gene(int size) - { - expressions = new Expression[size]; - } - ~Gene() + // seek to the beginning of the input file + _stream.seek(0); + + // if sample size is not zero then build sample name list + if ( _sampleSize != 0 ) { - delete[] expressions; + for (int i = 0; i < _sampleSize ;++i) + { + _sampleNames.append(QString::number(i)); + } } - Expression* expressions; - }; + // otherwise read sample names from first line + else + { + // read a line from the input file + QString line = _stream.readLine(); + auto words = line.splitRef(QRegExp("\\s+"), QString::SkipEmptyParts); - // initialize gene expression linked list - QList genes; + // parse the sample names + _sampleSize = words.size(); - // initialize gene and sample name lists - QStringList geneNames; - QStringList sampleNames; + for ( auto& word : words ) + { + _sampleNames.append(word.toString()); + } - // if sample size is not zero then build sample name list - if ( _sampleSize != 0 ) - { - for (int i = 0; i < _sampleSize ;++i) - { - sampleNames.append(QString::number(i)); + // make sure reading input file worked + if 
( _stream.status() != QTextStream::Ok ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("File IO Error")); + e.setDetails(tr("Qt Text Stream encountered an unknown error.")); + throw e; + } } } - // create text stream from input file and read until end reached - QTextStream stream(_input); - while ( !stream.atEnd() ) + // read each gene from the input file in a separate step + else if ( result->index() < _numLines ) { - // read a line from text file - QString line = stream.readLine(); + // read a line from the input file + QString line = _stream.readLine(); auto words = line.splitRef(QRegExp("\\s+"), QString::SkipEmptyParts); - // read sample names from first line - if ( _sampleSize == 0 ) - { - _sampleSize = words.size(); - for ( auto& word : words ) - { - sampleNames.append(word.toString()); - } - } - // make sure the number of words matches expected sample size - else if ( words.size() == _sampleSize + 1 ) + if ( words.size() == _sampleSize + 1 ) { // read row from text file into gene - Gene* gene {new Gene(_sampleSize)}; + Gene gene(_sampleSize); for ( int i = 1; i < words.size(); ++i ) { - // if word matches no sample token string set it as such - if ( words.at(i) == _noSampleToken ) + // if word matches the nan token then set it as such + if ( words.at(i) == _nanToken ) { - gene->expressions[i-1] = NAN; + gene.expressions[i-1] = NAN; } // else this is a normal floating point expression @@ -92,7 +104,7 @@ void ImportExpressionMatrix::process(const EAbstractAnalytic::Block* result) { // read in the floating point value bool ok; - Expression value = words.at(i).toDouble(&ok); + gene.expressions[i-1] = words.at(i).toDouble(&ok); // make sure reading worked if ( !ok ) @@ -100,32 +112,16 @@ void ImportExpressionMatrix::process(const EAbstractAnalytic::Block* result) E_MAKE_EXCEPTION(e); e.setTitle(tr("Parsing Error")); e.setDetails(tr("Failed to read expression value \"%1\" for gene %2.") - .arg(words.at(i).toString()).arg(words.at(0).toString())); + 
.arg(words.at(i).toString()) + .arg(words.at(0).toString())); throw e; } - - // apply transform and append expression to gene - switch (_transform) - { - case Transform::None: - gene->expressions[i-1] = value; - break; - case Transform::NLog: - gene->expressions[i-1] = log(value); - break; - case Transform::Log2: - gene->expressions[i-1] = log2(value); - break; - case Transform::Log10: - gene->expressions[i-1] = log10(value); - break; - } } } // append gene data and gene name - genes.append(gene); - geneNames.append(words.at(0).toString()); + _genes.append(gene); + _geneNames.append(words.at(0).toString()); } // otherwise throw an error @@ -135,38 +131,42 @@ void ImportExpressionMatrix::process(const EAbstractAnalytic::Block* result) e.setTitle(tr("Parsing Error")); e.setDetails(tr("Encountered gene expression line with incorrect amount of fields. " "Read in %1 fields when it should have been %2. Gene name is %3.") - .arg(words.size()-1).arg(_sampleSize).arg(words.at(0).toString())); + .arg(words.size()-1) + .arg(_sampleSize) + .arg(words.at(0).toString())); + throw e; + } + + // make sure reading input file worked + if ( _stream.status() != QTextStream::Ok ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("File IO Error")); + e.setDetails(tr("Qt Text Stream encountered an unknown error.")); throw e; } } - // make sure reading input file worked - if ( stream.status() != QTextStream::Ok ) + // create the output data object in the final step + else if ( result->index() == _numLines ) { - E_MAKE_EXCEPTION(e); - e.setTitle(tr("File IO Error")); - e.setDetails(tr("Qt Text Stream encountered an unknown error.")); - throw e; - } + // initialize expression matrix + _output->initialize(_geneNames, _sampleNames); - // initialize expression matrix - _output->initialize(geneNames, sampleNames); + // iterate through each gene + ExpressionMatrix::Gene gene(_output); - // iterate through each gene - ExpressionMatrix::Gene gene(_output); - for ( int i = 0; i < _output->getGeneSize(); 
++i ) - { - // save each gene to expression matrix - for ( int j = 0; j < _output->getSampleSize(); ++j ) + for ( int i = 0; i < _output->geneSize(); ++i ) { - gene[j] = genes[i]->expressions[j]; - } + // save each gene to expression matrix + for ( int j = 0; j < _output->sampleSize(); ++j ) + { + gene[j] = _genes[i].expressions[j]; + } - gene.write(i); + gene.write(i); + } } - - // set transform used in expression matrix - _output->setTransform(_transform); } @@ -174,8 +174,13 @@ void ImportExpressionMatrix::process(const EAbstractAnalytic::Block* result) +/*! + * Make a new input object and return its pointer. + */ EAbstractAnalytic::Input* ImportExpressionMatrix::makeInput() { + EDEBUG_FUNC(this); + return new Input(this); } @@ -184,8 +189,15 @@ EAbstractAnalytic::Input* ImportExpressionMatrix::makeInput() +/*! + * Initialize this analytic. This implementation checks to make sure the input + * file and output data object have been set. + */ void ImportExpressionMatrix::initialize() { + EDEBUG_FUNC(this); + + // make sure input/output arguments are valid if ( !_input || !_output ) { E_MAKE_EXCEPTION(e); @@ -193,4 +205,25 @@ void ImportExpressionMatrix::initialize() e.setDetails(tr("Did not get valid input and/or output arguments.")); throw e; } + + // initialize input file stream + _stream.setDevice(_input); + + // count the number of lines in the input file + _numLines = 0; + + while ( !_stream.atEnd() ) + { + _stream.readLine(); + _numLines++; + } + + // make sure reading input file worked + if ( _stream.status() != QTextStream::Ok ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("File IO Error")); + e.setDetails(tr("Qt Text Stream encountered an unknown error.")); + throw e; + } } diff --git a/src/core/importexpressionmatrix.h b/src/core/importexpressionmatrix.h index 0b8e34f..b368635 100644 --- a/src/core/importexpressionmatrix.h +++ b/src/core/importexpressionmatrix.h @@ -6,6 +6,15 @@ +/*! + * This class implements the import expression matrix analytic. 
This analytic + * reads in a text file which contains a matrix as a table; that is, with each row + * on a line, each value separated by whitespace, and the first row and column + * containing the row names and column names, respectively. Elements which have + * the given NAN token are read in as NAN. If the sample names are not in the + * input file, the user must provide the number of samples to the analytic, and + * the samples will be given integer names. + */ class ImportExpressionMatrix : public EAbstractAnalytic { Q_OBJECT @@ -16,12 +25,43 @@ class ImportExpressionMatrix : public EAbstractAnalytic virtual EAbstractAnalytic::Input* makeInput() override final; virtual void initialize(); private: - using Transform = ExpressionMatrix::Transform; + /** + * Structure used to load gene expression data + */ + struct Gene + { + Gene() = default; + Gene(int size) + { + expressions.resize(size); + } + + QVector expressions; + }; + /** + * Workspace variables to read from the input file. + */ + QTextStream _stream; + int _numLines {0}; + QVector _genes; + QStringList _geneNames; + QStringList _sampleNames; + /*! + * Pointer to the input text file. + */ QFile* _input {nullptr}; + /*! + * Pointer to the output expression matrix. + */ ExpressionMatrix* _output {nullptr}; - QString _noSampleToken; + /*! + * The string token used to represent NAN values. + */ + QString _nanToken {"NA"}; + /*! + * The number of samples to read. + */ qint32 _sampleSize {0}; - Transform _transform {Transform::None}; }; diff --git a/src/core/importexpressionmatrix_input.cpp b/src/core/importexpressionmatrix_input.cpp index 91b1f9f..3cb59c7 100644 --- a/src/core/importexpressionmatrix_input.cpp +++ b/src/core/importexpressionmatrix_input.cpp @@ -6,18 +6,30 @@ +/*! + * Construct a new input object with the given analytic as its parent. 
+ * + * @param parent + */ ImportExpressionMatrix::Input::Input(ImportExpressionMatrix* parent): EAbstractAnalytic::Input(parent), _base(parent) -{} +{ + EDEBUG_FUNC(this,parent); +} +/*! + * Return the total number of arguments this analytic type contains. + */ int ImportExpressionMatrix::Input::size() const { + EDEBUG_FUNC(this); + return Total; } @@ -26,15 +38,21 @@ int ImportExpressionMatrix::Input::size() const +/*! + * Return the argument type for a given index. + * + * @param index + */ EAbstractAnalytic::Input::Type ImportExpressionMatrix::Input::type(int index) const { + EDEBUG_FUNC(this,index); + switch (index) { case InputFile: return Type::FileIn; case OutputData: return Type::DataOut; - case NoSampleToken: return Type::String; + case NANToken: return Type::String; case SampleSize: return Type::Integer; - case TransformType: return Type::Selection; default: return Type::Boolean; } } @@ -44,8 +62,16 @@ EAbstractAnalytic::Input::Type ImportExpressionMatrix::Input::type(int index) co +/*! + * Return data for a given role on an argument with the given index. 
+ * + * @param index + * @param role + */ QVariant ImportExpressionMatrix::Input::data(int index, Role role) const { + EDEBUG_FUNC(this,index,role); + switch (index) { case InputFile: @@ -66,12 +92,13 @@ QVariant ImportExpressionMatrix::Input::data(int index, Role role) const case Role::DataType: return DataFactory::ExpressionMatrixType; default: return QVariant(); } - case NoSampleToken: + case NANToken: switch (role) { case Role::CommandLineName: return QString("nan"); - case Role::Title: return tr("No Sample Token:"); + case Role::Title: return tr("NAN Token:"); case Role::WhatsThis: return tr("Expected token for expressions that have no value."); + case Role::Default: return "NA"; default: return QVariant(); } case SampleSize: @@ -85,16 +112,6 @@ QVariant ImportExpressionMatrix::Input::data(int index, Role role) const case Role::Maximum: return std::numeric_limits::max(); default: return QVariant(); } - case TransformType: - switch (role) - { - case Role::CommandLineName: return QString("transform"); - case Role::Title: return tr("Transform:"); - case Role::WhatsThis: return tr("Element-wise transformation to apply to expression data."); - case Role::Default: return ExpressionMatrix::TRANSFORM_NAMES.first(); - case Role::SelectionValues: return ExpressionMatrix::TRANSFORM_NAMES; - default: return QVariant(); - } default: return QVariant(); } } @@ -104,18 +121,23 @@ QVariant ImportExpressionMatrix::Input::data(int index, Role role) const +/*! + * Set an argument with the given index to the given value. 
+ * + * @param index + * @param value + */ void ImportExpressionMatrix::Input::set(int index, const QVariant& value) { + EDEBUG_FUNC(this,index,&value); + switch (index) { case SampleSize: _base->_sampleSize = value.toInt(); break; - case NoSampleToken: - _base->_noSampleToken = value.toString(); - break; - case TransformType: - _base->_transform = static_cast(ExpressionMatrix::TRANSFORM_NAMES.indexOf(value.toString())); + case NANToken: + _base->_nanToken = value.toString(); break; } } @@ -125,8 +147,16 @@ void ImportExpressionMatrix::Input::set(int index, const QVariant& value) +/*! + * Set a file argument with the given index to the given qt file pointer. + * + * @param index + * @param file + */ void ImportExpressionMatrix::Input::set(int index, QFile* file) { + EDEBUG_FUNC(this,index,file); + if ( index == InputFile ) { _base->_input = file; @@ -138,8 +168,16 @@ void ImportExpressionMatrix::Input::set(int index, QFile* file) +/*! + * Set a data argument with the given index to the given data object pointer. + * + * @param index + * @param data + */ void ImportExpressionMatrix::Input::set(int index, EAbstractData* data) { + EDEBUG_FUNC(this,index,data); + if ( index == OutputData ) { _base->_output = data->cast(); diff --git a/src/core/importexpressionmatrix_input.h b/src/core/importexpressionmatrix_input.h index 45f56b4..6bae85b 100644 --- a/src/core/importexpressionmatrix_input.h +++ b/src/core/importexpressionmatrix_input.h @@ -4,17 +4,22 @@ +/*! + * This class implements the abstract input of the import expression matrix analytic. + */ class ImportExpressionMatrix::Input : public EAbstractAnalytic::Input { Q_OBJECT public: + /*! + * Defines all input arguments for this analytic. 
+ */ enum Argument { InputFile = 0 ,OutputData - ,NoSampleToken + ,NANToken ,SampleSize - ,TransformType ,Total }; explicit Input(ImportExpressionMatrix* parent); @@ -25,6 +30,9 @@ class ImportExpressionMatrix::Input : public EAbstractAnalytic::Input virtual void set(int index, QFile* file) override final; virtual void set(int index, EAbstractData* data) override final; private: + /*! + * Pointer to the base analytic for this object. + */ ImportExpressionMatrix* _base; }; diff --git a/src/core/pairwise_clustering.cpp b/src/core/pairwise_clustering.cpp deleted file mode 100644 index ef26772..0000000 --- a/src/core/pairwise_clustering.cpp +++ /dev/null @@ -1,172 +0,0 @@ -#include "pairwise_clustering.h" - - - -using namespace Pairwise; - - - - - - -void Clustering::initialize(ExpressionMatrix* input) -{ - // pre-allocate workspace - _workLabels.resize(input->getSampleSize()); -} - - - - - - -qint8 Clustering::compute( - const QVector& X, - int numSamples, - QVector& labels, - int minSamples, - qint8 minClusters, - qint8 maxClusters, - Criterion criterion, - bool removePreOutliers, - bool removePostOutliers) -{ - // remove pre-clustering outliers - if ( removePreOutliers ) - { - markOutliers(X, numSamples, 0, labels, 0, -7); - markOutliers(X, numSamples, 1, labels, 0, -7); - } - - // perform clustering only if there are enough samples - qint8 bestK = 0; - - if ( numSamples >= minSamples ) - { - float bestValue = INFINITY; - - for ( qint8 K = minClusters; K <= maxClusters; ++K ) - { - // run each clustering model - bool success = fit(X, numSamples, K, _workLabels); - - if ( !success ) - { - continue; - } - - // evaluate model - float value = INFINITY; - - switch (criterion) - { - case Criterion::BIC: - value = computeBIC(K, logLikelihood(), numSamples, 2); - break; - case Criterion::ICL: - value = computeICL(K, logLikelihood(), numSamples, 2, entropy()); - break; - } - - // save the best model - if ( value < bestValue ) - { - bestK = K; - bestValue = value; - - for ( 
int i = 0, j = 0; i < numSamples; ++i ) - { - if ( labels[i] >= 0 ) - { - labels[i] = _workLabels[j]; - ++j; - } - } - } - } - } - - if ( bestK > 1 ) - { - // remove post-clustering outliers - if ( removePostOutliers ) - { - for ( qint8 k = 0; k < bestK; ++k ) - { - markOutliers(X, numSamples, 0, labels, k, -8); - markOutliers(X, numSamples, 1, labels, k, -8); - } - } - } - - return bestK; -} - - - - - - -void Clustering::markOutliers(const QVector& X, int N, int j, QVector& labels, qint8 cluster, qint8 marker) -{ - // compute x_sorted = X[:, j], filtered and sorted - QVector x_sorted; - x_sorted.reserve(N); - - for ( int i = 0; i < N; i++ ) - { - if ( labels[i] == cluster || labels[i] == marker ) - { - x_sorted.append(X[i].s[j]); - } - } - - if ( x_sorted.size() == 0 ) - { - return; - } - - std::sort(x_sorted.begin(), x_sorted.end()); - - // compute quartiles, interquartile range, upper and lower bounds - const int n = x_sorted.size(); - - float Q1 = x_sorted[n * 1 / 4]; - float Q3 = x_sorted[n * 3 / 4]; - - float T_min = Q1 - 1.5f * (Q3 - Q1); - float T_max = Q3 + 1.5f * (Q3 - Q1); - - // mark outliers - for ( int i = 0; i < N; ++i ) - { - if ( labels[i] == cluster && (X[i].s[j] < T_min || T_max < X[i].s[j]) ) - { - labels[i] = marker; - } - } -} - - - - - - -float Clustering::computeBIC(int K, float logL, int N, int D) -{ - int p = K * (1 + D + D * D); - - return log(N) * p - 2 * logL; -} - - - - - - -float Clustering::computeICL(int K, float logL, int N, int D, float E) -{ - int p = K * (1 + D + D * D); - - return log(N) * p - 2 * logL - 2 * E; -} diff --git a/src/core/pairwise_clustering.h b/src/core/pairwise_clustering.h deleted file mode 100644 index 00aa417..0000000 --- a/src/core/pairwise_clustering.h +++ /dev/null @@ -1,48 +0,0 @@ -#ifndef PAIRWISE_CLUSTERING_H -#define PAIRWISE_CLUSTERING_H -#include - -#include "ccmatrix.h" -#include "expressionmatrix.h" -#include "pairwise_linalg.h" -#include "pairwise_index.h" - -namespace Pairwise -{ - enum class 
Criterion - { - BIC - ,ICL - }; - - class Clustering - { - public: - void initialize(ExpressionMatrix* input); - qint8 compute( - const QVector& X, - int numSamples, - QVector& labels, - int minSamples, - qint8 minClusters, - qint8 maxClusters, - Criterion criterion, - bool removePreOutliers, - bool removePostOutliers - ); - - protected: - virtual bool fit(const QVector& X, int N, int K, QVector& labels) = 0; - virtual float logLikelihood() const = 0; - virtual float entropy() const = 0; - - private: - void markOutliers(const QVector& X, int N, int j, QVector& labels, qint8 cluster, qint8 marker); - float computeBIC(int K, float logL, int N, int D); - float computeICL(int K, float logL, int N, int D, float E); - - QVector _workLabels; - }; -} - -#endif diff --git a/src/core/pairwise_clusteringmodel.cpp b/src/core/pairwise_clusteringmodel.cpp new file mode 100644 index 0000000..cd0104a --- /dev/null +++ b/src/core/pairwise_clusteringmodel.cpp @@ -0,0 +1,103 @@ +#include "pairwise_clusteringmodel.h" + + + +using namespace Pairwise; + + + + + + +/*! + * Construct an abstract pairwise clustering model. + * + * @param emx + */ +ClusteringModel::ClusteringModel(ExpressionMatrix* emx) +{ + // pre-allocate workspace + _workLabels.resize(emx->sampleSize()); +} + + + + + + +/*! + * Determine the number of clusters in a pairwise data array. Several sub-models, + * each one having a different number of clusters, are fit to the data and the + * sub-model with the best criterion value is selected. The data array should + * only contain samples that have a non-negative label. 
+ * + * @param data + * @param numSamples + * @param labels + * @param minSamples + * @param minClusters + * @param maxClusters + * @param criterion + */ +qint8 ClusteringModel::compute( + const QVector& data, + int numSamples, + QVector& labels, + int minSamples, + qint8 minClusters, + qint8 maxClusters, + Criterion criterion) +{ + // perform clustering only if there are enough samples + qint8 bestK = 0; + + if ( numSamples >= minSamples ) + { + float bestValue = INFINITY; + + for ( qint8 K = minClusters; K <= maxClusters; ++K ) + { + // run each clustering sub-model + bool success = fit(data, numSamples, K, _workLabels); + + if ( !success ) + { + continue; + } + + // compute the criterion value of the sub-model + float value = INFINITY; + + switch (criterion) + { + case Criterion::AIC: + value = computeAIC(K, 2, logLikelihood()); + break; + case Criterion::BIC: + value = computeBIC(K, 2, logLikelihood(), numSamples); + break; + case Criterion::ICL: + value = computeICL(K, 2, logLikelihood(), numSamples, entropy()); + break; + } + + // keep the sub-model with the lowest criterion value + if ( value < bestValue ) + { + bestK = K; + bestValue = value; + + for ( int i = 0, j = 0; i < labels.size(); ++i ) + { + if ( labels[i] >= 0 ) + { + labels[i] = _workLabels[j]; + ++j; + } + } + } + } + } + + return bestK; +} diff --git a/src/core/pairwise_clusteringmodel.h b/src/core/pairwise_clusteringmodel.h new file mode 100644 index 0000000..467a296 --- /dev/null +++ b/src/core/pairwise_clusteringmodel.h @@ -0,0 +1,68 @@ +#ifndef PAIRWISE_CLUSTERINGMODEL_H +#define PAIRWISE_CLUSTERINGMODEL_H +#include + +#include "ccmatrix.h" +#include "expressionmatrix.h" +#include "pairwise_linalg.h" +#include "pairwise_index.h" + +namespace Pairwise +{ + /*! + * Defines the criterion types used by the abstract clustering model. + */ + enum class Criterion + { + /*! + * Akaike information criterion + */ + AIC + /*! + * Bayesian information criterion + */ + ,BIC + /*! 
+ * Integrated completed likelihood + */ + ,ICL + }; + + /*! + * This class implements the abstract pairwise clustering model, which takes + * a pairwise data array and determines the number of clusters, as well as the + * cluster label for each sample in the data array. The number of clusters is + * determined by creating several sub-models, each with a different assumption + * of the number of clusters, and selecting the sub-model which best fits the + * data according to a criterion. The clustering sub-model must be implemented + * by the inheriting class. + */ + class ClusteringModel + { + public: + ClusteringModel(ExpressionMatrix* emx); + qint8 compute( + const QVector& data, + int numSamples, + QVector& labels, + int minSamples, + qint8 minClusters, + qint8 maxClusters, + Criterion criterion + ); + protected: + virtual bool fit(const QVector& X, int N, int K, QVector& labels) = 0; + virtual float logLikelihood() const = 0; + virtual float entropy() const = 0; + virtual float computeAIC(int K, int D, float logL) = 0; + virtual float computeBIC(int K, int D, float logL, int N) = 0; + virtual float computeICL(int K, int D, float logL, int N, float E) = 0; + private: + /*! + * Workspace for the cluster labels. 
+ */ + QVector _workLabels; + }; +} + +#endif diff --git a/src/core/pairwise_correlation.cpp b/src/core/pairwise_correlation.cpp deleted file mode 100644 index 1edb6fa..0000000 --- a/src/core/pairwise_correlation.cpp +++ /dev/null @@ -1,26 +0,0 @@ -#include "pairwise_correlation.h" - - - -using namespace Pairwise; - - - - - - -QVector Correlation::compute( - const QVector& data, - int K, - const QVector& labels, - int minSamples) -{ - QVector correlations(K); - - for ( qint8 k = 0; k < K; ++k ) - { - correlations[k] = computeCluster(data, labels, k, minSamples); - } - - return correlations; -} diff --git a/src/core/pairwise_correlationmodel.cpp b/src/core/pairwise_correlationmodel.cpp new file mode 100644 index 0000000..fd0ee85 --- /dev/null +++ b/src/core/pairwise_correlationmodel.cpp @@ -0,0 +1,36 @@ +#include "pairwise_correlationmodel.h" + + + +using namespace Pairwise; + + + + + + +/*! + * Compute the correlation of each cluster in a pairwise data array. The data array + * should only contain the clean samples that were extracted from the expression + * matrix, while the labels should contain all samples. 
+ * + * @param data + * @param K + * @param labels + * @param minSamples + */ +QVector CorrelationModel::compute( + const QVector& data, + int K, + const QVector& labels, + int minSamples) +{ + QVector correlations(K); + + for ( qint8 k = 0; k < K; ++k ) + { + correlations[k] = computeCluster(data, labels, k, minSamples); + } + + return correlations; +} diff --git a/src/core/pairwise_correlation.h b/src/core/pairwise_correlationmodel.h similarity index 54% rename from src/core/pairwise_correlation.h rename to src/core/pairwise_correlationmodel.h index 965246a..5025d10 100644 --- a/src/core/pairwise_correlation.h +++ b/src/core/pairwise_correlationmodel.h @@ -1,26 +1,26 @@ -#ifndef PAIRWISE_CORRELATION_H -#define PAIRWISE_CORRELATION_H +#ifndef PAIRWISE_CORRELATIONMODEL_H +#define PAIRWISE_CORRELATIONMODEL_H #include -#include "correlationmatrix.h" -#include "expressionmatrix.h" #include "pairwise_linalg.h" namespace Pairwise { - class Correlation + /*! + * This class implements the abstract pairwise correlation model, which + * takes a pairwise data array (with cluster labels) and computes a correlation + * for each cluster in the data. The correlation metric must be implemented by + * the inheriting class. + */ + class CorrelationModel { public: - virtual void initialize(ExpressionMatrix* input) = 0; - virtual QString getName() const = 0; - QVector compute( const QVector& data, int K, const QVector& labels, int minSamples ); - protected: virtual float computeCluster( const QVector& data, diff --git a/src/core/pairwise_gmm.cpp b/src/core/pairwise_gmm.cpp index bce9bef..bd8d2c1 100644 --- a/src/core/pairwise_gmm.cpp +++ b/src/core/pairwise_gmm.cpp @@ -8,18 +8,40 @@ using namespace Pairwise; +/*! + * Construct a Gaussian mixture model. + * + * @param emx + */ +GMM::GMM(ExpressionMatrix* emx): + ClusteringModel(emx) +{ +} + + + + + + +/*! + * Initialize a mixture component with the given mixture weight and mean. 
+ * + * @param pi + * @param mu + */ void GMM::Component::initialize(float pi, const Vector2& mu) { - // initialize pi and mu as given + // initialize mixture weight and mean _pi = pi; _mu = mu; - // Use identity covariance- assume dimensions are independent + // initialize covariance to identity matrix matrixInitIdentity(_sigma); - // Initialize zero artifacts + // initialize precision to zero matrix matrixInitZero(_sigmaInv); + // initialize normalizer term to 0 _normalizer = 0; } @@ -28,12 +50,14 @@ void GMM::Component::initialize(float pi, const Vector2& mu) -void GMM::Component::prepareCovariance() +/*! + * Pre-compute the precision matrix and normalizer term for a mixture component. + */ +void GMM::Component::prepare() { const int D = 2; - // Compute inverse of Sigma once each iteration instead of - // repeatedly for each calcLogMvNorm execution. + // compute precision (inverse of covariance) float det; matrixInverse(_sigma, _sigmaInv, &det); @@ -42,7 +66,7 @@ void GMM::Component::prepareCovariance() throw std::runtime_error("matrix inverse failed"); } - // Compute normalizer for multivariate normal distribution + // compute normalizer term for multivariate normal distribution _normalizer = -0.5f * (D * log(2.0f * M_PI) + log(det)); } @@ -51,27 +75,36 @@ void GMM::Component::prepareCovariance() -void GMM::Component::calcLogMvNorm(const QVector& X, int N, float *logP) +/*! 
+ * Compute the log of the probability density function of the multivariate normal + * distribution conditioned on a single component for each point in X: + * + * P(x|k) = exp(-0.5 * (x - mu)^T Sigma^-1 (x - mu)) / sqrt((2pi)^d det(Sigma)) + * + * Therefore the log-probability is: + * + * log(P(x|k)) = -0.5 * (x - mu)^T Sigma^-1 (x - mu) - 0.5 * (d * log(2pi) + log(det(Sigma))) + * + * @param X + * @param N + * @param logP + */ +void GMM::Component::computeLogProbNorm(const QVector& X, int N, float *logP) { - // Here we are computing the probability density function of the multivariate - // normal distribution conditioned on a single component for the set of points - // given by X. - // - // P(x|k) = exp{ -0.5 * (x - mu)^T Sigma^{-} (x - mu) } / sqrt{ (2pi)^d det(Sigma) } - for (int i = 0; i < N; ++i) { - // Let xm = (x - mu) + // compute xm = (x - mu) Vector2 xm = X[i]; vectorSubtract(xm, _mu); - // Compute xm^T Sxm = xm^T S^-1 xm + // compute Sxm = Sigma^-1 xm Vector2 Sxm; matrixProduct(_sigmaInv, xm, Sxm); + // compute xmSxm = xm^T Sigma^-1 xm float xmSxm = vectorDot(xm, Sxm); - // Compute log(P) = normalizer - 0.5 * xm^T * S^-1 * xm + // compute log(P) = normalizer - 0.5 * xm^T * Sigma^-1 * xm logP[i] = _normalizer - 0.5f * xmSxm; } } @@ -81,7 +114,14 @@ void GMM::Component::calcLogMvNorm(const QVector& X, int N, float *logP -void GMM::kmeans(const QVector& X, int N) +/*! + * Initialize the mean of each component in the mixture model using k-means + * clustering. 
+ * + * @param X + * @param N + */ +void GMM::initializeMeans(const QVector& X, int N) { const int K = _components.size(); @@ -89,48 +129,54 @@ void GMM::kmeans(const QVector& X, int N) const float TOLERANCE = 1e-3; float diff = 0; - Vector2 MP[K]; + // initialize workspace + Vector2 Mu[K]; int counts[K]; for (int t = 0; t < MAX_ITERATIONS && diff > TOLERANCE; ++t) { - memset(MP, 0, K * sizeof(Vector2)); + // compute mean and sample count for each component + memset(Mu, 0, K * sizeof(Vector2)); memset(counts, 0, K * sizeof(int)); for (int i = 0; i < N; ++i) { - // arg min - float minD = INFINITY; - int minDk = 0; + // determine the component mean which is nearest to x_i + float min_dist = INFINITY; + int min_k = 0; for (int k = 0; k < K; ++k) { float dist = vectorDiffNorm(X[i], _components[k]._mu); - if (minD > dist) + if (min_dist > dist) { - minD = dist; - minDk = k; + min_dist = dist; + min_k = k; } } - vectorAdd(MP[minDk], X[i]); - ++counts[minDk]; + // update mean and sample count + vectorAdd(Mu[min_k], X[i]); + ++counts[min_k]; } + // scale each mean by its sample count for (int k = 0; k < K; ++k) { - vectorScale(MP[k], 1.0f / counts[k]); + vectorScale(Mu[k], 1.0f / counts[k]); } + // compute the total change of all means diff = 0; for (int k = 0; k < K; ++k) { - diff += vectorDiffNorm(MP[k], _components[k]._mu); + diff += vectorDiffNorm(Mu[k], _components[k]._mu); } diff /= K; + // update component means for (int k = 0; k < K; ++k) { - _components[k]._mu = MP[k]; + _components[k]._mu = Mu[k]; } } } @@ -140,111 +186,79 @@ void GMM::kmeans(const QVector& X, int N) -void GMM::calcLogMvNorm(const QVector& X, int N, float *loggamma) +/*! + * Perform the expectation step of the EM algorithm. 
In this step we compute + * gamma, the posterior probabilities for each component in the mixture model + * and each sample in X, as well as the log-likelihood of the model: + * + * log(p(x_i)) = a + log(sum(exp(log(pi_k) + log(P(x_i|k)) - a))) + * + * gamma_ki = exp(log(pi_k) + log(P(x_i|k)) - log(p(x_i))) + * + * log(L) = sum(log(p(x_i))) + * + * @param X + * @param N + * @param gamma + */ +float GMM::computeEStep(const QVector& X, int N, float *gamma) { const int K = _components.size(); - for ( int k = 0; k < K; ++k ) + // compute logpi + float logpi[K]; + + for (int k = 0; k < K; ++k) { - _components[k].calcLogMvNorm(X, N, &loggamma[k * N]); + logpi[k] = log(_components[k]._pi); } -} - - + // compute the log-probability for each component and each point in X + float *logProb = gamma; + for ( int k = 0; k < K; ++k ) + { + _components[k].computeLogProbNorm(X, N, &logProb[k * N]); + } + // compute gamma and log-likelihood + float logL = 0.0; -void GMM::calcLogLikelihoodAndGammaNK(const float *logpi, int K, float *loggamma, int N, float *logL) -{ - *logL = 0.0; for (int i = 0; i < N; ++i) { + // compute a = max(logpi_k + logProb_ki, k) float maxArg = -INFINITY; for (int k = 0; k < K; ++k) { - const float logProbK = logpi[k] + loggamma[k * N + i]; - if (logProbK > maxArg) + float arg = logpi[k] + logProb[k * N + i]; + if (maxArg < arg) { - maxArg = logProbK; + maxArg = arg; } } + // compute logpx float sum = 0.0; for (int k = 0; k < K; ++k) { - const float logProbK = logpi[k] + loggamma[k * N + i]; - sum += exp(logProbK - maxArg); + sum += exp(logpi[k] + logProb[k * N + i] - maxArg); } - const float logpx = maxArg + log(sum); - *logL += logpx; - for (int k = 0; k < K; ++k) - { - loggamma[k * N + i] += -logpx; - } - } -} - - - - - - -void GMM::calcLogGammaK(const float *loggamma, int N, int K, float *logGamma) -{ - memset(logGamma, 0, K * sizeof(float)); - - for (int k = 0; k < K; ++k) - { - const float *loggammak = &loggamma[k * N]; - - float maxArg = -INFINITY; - 
for (int i = 0; i < N; ++i) - { - const float loggammank = loggammak[i]; - if (loggammank > maxArg) - { - maxArg = loggammank; - } - } - - float sum = 0; - for (int i = 0; i < N; ++i) - { - const float loggammank = loggammak[i]; - sum += exp(loggammank - maxArg); - } - - logGamma[k] = maxArg + log(sum); - } -} - - - - - + float logpx = maxArg + log(sum); -float GMM::calcLogGammaSum(const float *logpi, int K, const float *logGamma) -{ - float maxArg = -INFINITY; - for (int k = 0; k < K; ++k) - { - const float arg = logpi[k] + logGamma[k]; - if (arg > maxArg) + // compute gamma_ki + for (int k = 0; k < K; ++k) { - maxArg = arg; + gamma[k * N + i] += logpi[k] - logpx; + gamma[k * N + i] = exp(gamma[k * N + i]); } - } - float sum = 0; - for (int k = 0; k < K; ++k) - { - const float arg = logpi[k] + logGamma[k]; - sum += exp(arg - maxArg); + // update log-likelihood + logL += logpx; } - return maxArg + log(sum); + // return log-likelihood + return logL; } @@ -252,66 +266,74 @@ float GMM::calcLogGammaSum(const float *logpi, int K, const float *logGamma) -void GMM::performMStep(float *logpi, int K, float *loggamma, float *logGamma, float logGammaSum, const QVector& X, int N) +/*! + * Perform the maximization step of the EM algorithm. 
In this step we update the + * parameters of the mixture model using gamma, which is computed during the + * expectation step: + * + * n_k = sum(gamma_ki) + * + * pi_k = n_k / N + * + * mu_k = sum(gamma_ki * x_i) / n_k + * + * Sigma_k = sum(gamma_ki * (x_i - mu_k) * (x_i - mu_k)^T) / n_k + * + * @param X + * @param N + * @param gamma + */ +void GMM::computeMStep(const QVector& X, int N, const float *gamma) { - // update pi - for (int k = 0; k < K; ++k) - { - logpi[k] += logGamma[k] - logGammaSum; - - _components[k]._pi = exp(logpi[k]); - } + const int K = _components.size(); - // convert loggamma / logGamma to gamma / Gamma to avoid duplicate exp(x) calls for (int k = 0; k < K; ++k) { + // compute n_k = sum(gamma_ki) + float n_k = 0; + for (int i = 0; i < N; ++i) { - const int idx = k * N + i; - loggamma[idx] = exp(loggamma[idx]); + n_k += gamma[k * N + i]; } - } - for (int k = 0; k < K; ++k) - { - logGamma[k] = exp(logGamma[k]); - } + // update mixture weight + _components[k]._pi = n_k / N; - for (int k = 0; k < K; ++k) - { - // Update mu + // update mean Vector2& mu = _components[k]._mu; vectorInitZero(mu); for (int i = 0; i < N; ++i) { - vectorAdd(mu, loggamma[k * N + i], X[i]); + vectorAdd(mu, gamma[k * N + i], X[i]); } - vectorScale(mu, 1.0f / logGamma[k]); + vectorScale(mu, 1.0f / n_k); - // Update sigma + // update covariance matrix Matrix2x2& sigma = _components[k]._sigma; matrixInitZero(sigma); for (int i = 0; i < N; ++i) { - // xm = (x - mu) + // compute xm = (x_i - mu_k) Vector2 xm = X[i]; vectorSubtract(xm, mu); - // S_i = gamma_ik * (x - mu) (x - mu)^T + // compute Sigma_ki = gamma_ki * (x_i - mu_k) (x_i - mu_k)^T Matrix2x2 outerProduct; matrixOuterProduct(xm, xm, outerProduct); - matrixAdd(sigma, loggamma[k * N + i], outerProduct); + matrixAdd(sigma, gamma[k * N + i], outerProduct); } - matrixScale(sigma, 1.0f / logGamma[k]); + matrixScale(sigma, 1.0f / n_k); - _components[k].prepareCovariance(); + // pre-compute precision matrix and normalizer 
term + _components[k].prepare(); } } @@ -320,22 +342,34 @@ void GMM::performMStep(float *logpi, int K, float *loggamma, float *logGamma, fl -void GMM::calcLabels(float *loggamma, int N, int K, QVector& labels) +/*! + * Compute the cluster labels of a dataset using gamma: + * + * y_i = argmax(gamma_ki, k) + * + * @param gamma + * @param N + * @param K + * @param labels + */ +void GMM::computeLabels(const float *gamma, int N, int K, QVector& labels) { for ( int i = 0; i < N; ++i ) { + // determine the value k for which gamma_ki is highest int max_k = -1; float max_gamma = -INFINITY; for ( int k = 0; k < K; ++k ) { - if ( max_gamma < loggamma[k * N + i] ) + if ( max_gamma < gamma[k * N + i] ) { max_k = k; - max_gamma = loggamma[k * N + i]; + max_gamma = gamma[k * N + i]; } } + // assign x_i to cluster k labels[i] = max_k; } } @@ -345,7 +379,17 @@ void GMM::calcLabels(float *loggamma, int N, int K, QVector& labels) -float GMM::calcEntropy(float *loggamma, int N, const QVector& labels) +/*! + * Compute the entropy of the mixture model for a dataset using gamma + * and the given cluster labels: + * + * E = sum(sum(z_ki * log(gamma_ki))), z_ki = (y_i == k) + * + * @param gamma + * @param N + * @param labels + */ +float GMM::computeEntropy(const float *gamma, int N, const QVector& labels) { float E = 0; @@ -353,7 +397,7 @@ float GMM::calcEntropy(float *loggamma, int N, const QVector& labels) { int k = labels[i]; - E += log(loggamma[k * N + i]); + E += log(gamma[k * N + i]); } return E; @@ -364,6 +408,15 @@ float GMM::calcEntropy(float *loggamma, int N, const QVector& labels) +/*! + * Fit the mixture model to a pairwise data array and compute the output cluster + * labels for the data. The data array should only contain clean samples. 
+ * + * @param X + * @param N + * @param K + * @param labels + */ bool GMM::fit(const QVector& X, int N, int K, QVector& labels) { // initialize components @@ -371,64 +424,48 @@ bool GMM::fit(const QVector& X, int N, int K, QVector& labels) for ( int k = 0; k < K; ++k ) { - // use uniform mixture proportion and randomly sampled mean + // use uniform mixture weight and randomly sampled mean int i = rand() % N; _components[k].initialize(1.0f / K, X[i]); - _components[k].prepareCovariance(); + _components[k].prepare(); } // initialize means with k-means - kmeans(X, N); + initializeMeans(X, N); // initialize workspace - float *logpi = new float[K]; - float *loggamma = new float[K * N]; - float *logGamma = new float[K]; - - for (int k = 0; k < K; ++k) - { - logpi[k] = log(_components[k]._pi); - } + float *gamma = new float[K * N]; // run EM algorithm const int MAX_ITERATIONS = 100; const float TOLERANCE = 1e-8; float prevLogL = -INFINITY; - float currentLogL = -INFINITY; + float currLogL = -INFINITY; bool success; try { for ( int t = 0; t < MAX_ITERATIONS; ++t ) { - // E step - // compute gamma, log-likelihood - calcLogMvNorm(X, N, loggamma); - - prevLogL = currentLogL; - calcLogLikelihoodAndGammaNK(logpi, K, loggamma, N, &currentLogL); + // perform E step + prevLogL = currLogL; + currLogL = computeEStep(X, N, gamma); // check for convergence - if ( fabs(currentLogL - prevLogL) < TOLERANCE ) + if ( fabs(currLogL - prevLogL) < TOLERANCE ) { break; } - // M step - // Let Gamma[k] = \Sum_i gamma[k, i] - calcLogGammaK(loggamma, N, K, logGamma); - - float logGammaSum = calcLogGammaSum(logpi, K, logGamma); - - // Update parameters - performMStep(logpi, K, loggamma, logGamma, logGammaSum, X, N); + // perform M step + computeMStep(X, N, gamma); } // save outputs - _logL = currentLogL; - calcLabels(loggamma, N, K, labels); - _entropy = calcEntropy(loggamma, N, labels); + _logL = currLogL; + computeLabels(gamma, N, K, labels); + _entropy = computeEntropy(gamma, N, labels); success = 
true; } @@ -437,9 +474,67 @@ bool GMM::fit(const QVector& X, int N, int K, QVector& labels) success = false; } - delete[] logpi; - delete[] loggamma; - delete[] logGamma; + delete[] gamma; return success; } + + + + + + +/*! + * Compute the Akaike Information Criterion of a Gaussian mixture model. + * + * @param K + * @param D + * @param logL + */ +float GMM::computeAIC(int K, int D, float logL) +{ + int p = K * (1 + D + D * D); + + return 2 * p - 2 * logL; +} + + + + + + +/*! + * Compute the Bayesian Information Criterion of a Gaussian mixture model. + * + * @param K + * @param D + * @param logL + * @param N + */ +float GMM::computeBIC(int K, int D, float logL, int N) +{ + int p = K * (1 + D + D * D); + + return log(N) * p - 2 * logL; +} + + + + + + +/*! + * Compute the Integrated Completed Likelihood of a Gaussian mixture model. + * + * @param K + * @param D + * @param logL + * @param N + * @param E + */ +float GMM::computeICL(int K, int D, float logL, int N, float E) +{ + int p = K * (1 + D + D * D); + + return log(N) * p - 2 * logL - 2 * E; +} diff --git a/src/core/pairwise_gmm.h b/src/core/pairwise_gmm.h index e097b58..61a0de4 100644 --- a/src/core/pairwise_gmm.h +++ b/src/core/pairwise_gmm.h @@ -1,50 +1,73 @@ #ifndef PAIRWISE_GMM_H #define PAIRWISE_GMM_H -#include "pairwise_clustering.h" +#include "pairwise_clusteringmodel.h" namespace Pairwise { - class GMM : public Clustering + /*! + * This class implements the Gaussian mixture model. + */ + class GMM : public ClusteringModel { public: - GMM() = default; - + GMM(ExpressionMatrix* emx); + public: class Component { public: Component() = default; - void initialize(float pi, const Vector2& mu); - void prepareCovariance(); - void calcLogMvNorm(const QVector& X, int N, float *logP); - + void prepare(); + void computeLogProbNorm(const QVector& X, int N, float *logP); + public: + /*! + * The mixture weight. + */ float _pi; + /*! + * The mean. + */ Vector2 _mu; + /*! + * The covariance matrix. 
+ */ Matrix2x2 _sigma; - private: + /*! + * The precision matrix, or inverse of the covariance matrix. + */ Matrix2x2 _sigmaInv; + /*! + * A normalization term which is pre-computed for the multivariate + * normal distribution function. + */ float _normalizer; }; - protected: bool fit(const QVector& X, int N, int K, QVector& labels); - float logLikelihood() const { return _logL; } float entropy() const { return _entropy; } - + float computeAIC(int K, int D, float logL); + float computeBIC(int K, int D, float logL, int N); + float computeICL(int K, int D, float logL, int N, float E); private: - void kmeans(const QVector& X, int N); - void calcLogMvNorm(const QVector& X, int N, float *loggamma); - void calcLogLikelihoodAndGammaNK(const float *logpi, int K, float *loggamma, int N, float *logL); - void calcLogGammaK(const float *loggamma, int N, int K, float *logGamma); - float calcLogGammaSum(const float *logpi, int K, const float *logGamma); - void performMStep(float *logpi, int K, float *loggamma, float *logGamma, float logGammaSum, const QVector& X, int N); - void calcLabels(float *loggamma, int N, int K, QVector& labels); - float calcEntropy(float *loggamma, int N, const QVector& labels); - + void initializeMeans(const QVector& X, int N); + float computeEStep(const QVector& X, int N, float *gamma); + void computeMStep(const QVector& X, int N, const float *gamma); + void computeLabels(const float *gamma, int N, int K, QVector& labels); + float computeEntropy(const float *gamma, int N, const QVector& labels); + /*! + * The list of mixture components, which define the mean and covariance + * of each cluster in the mixture model. + */ QVector _components; + /*! + * The log-likelihood of the mixture model. + */ float _logL; + /*! + * The entropy of the mixture model. 
+ */ float _entropy; }; } diff --git a/src/core/pairwise_index.cpp b/src/core/pairwise_index.cpp index 8a45c76..094f03d 100644 --- a/src/core/pairwise_index.cpp +++ b/src/core/pairwise_index.cpp @@ -10,10 +10,19 @@ using namespace Pairwise; +/*! + * Construct a pairwise index from a row index and a column index. The row + * index must be greater than the column index. + * + * @param x + * @param y + */ Index::Index(qint32 x, qint32 y): _x(x), _y(y) { + EDEBUG_FUNC(this,x,y); + // make sure pairwise index is valid if ( x < 1 || y < 0 || x <= y ) { @@ -29,10 +38,16 @@ Index::Index(qint32 x, qint32 y): -Index::Index(qint64 index): - _x(1), - _y(0) +/*! + * Construct a pairwise index from a one-dimensional index, which corresponds + * to the i-th element in the lower triangle of a matrix using row-major order. + * + * @param index + */ +Index::Index(qint64 index) { + EDEBUG_FUNC(this,index); + // make sure index is valid if ( index < 0 ) { @@ -44,15 +59,15 @@ Index::Index(qint64 index): // compute pairwise index from scalar index qint64 pos {0}; - while ( pos <= index ) + qint64 x {0}; + + while ( pos + x <= index ) { - ++_x; - pos = _x * (_x - 1) / 2; + pos += x; + ++x; } - --_x; - pos = _x * (_x - 1) / 2; - + _x = x; _y = index - pos; } @@ -61,8 +76,15 @@ Index::Index(qint64 index): +/*! + * Return the indent value of this pairwise index with a given cluster index. + * + * @param cluster + */ qint64 Index::indent(qint8 cluster) const { + EDEBUG_FUNC(this,cluster); + // make sure cluster given is valid if ( cluster < 0 || cluster >= MAX_CLUSTER_SIZE ) { @@ -82,8 +104,13 @@ qint64 Index::indent(qint8 cluster) const +/*! + * Increment a pairwise index to the next element. 
+ */ void Index::operator++() { + EDEBUG_FUNC(this); + // increment gene y and check if it reaches gene x if ( ++_y >= _x ) { @@ -92,16 +119,3 @@ void Index::operator++() ++_x; } } - - - - - - -Index Index::operator++(int) -{ - // save index value, increment it, and return previous value - Index ret {*this}; - ++(*this); - return ret; -} diff --git a/src/core/pairwise_index.h b/src/core/pairwise_index.h index e9ab485..befaca7 100644 --- a/src/core/pairwise_index.h +++ b/src/core/pairwise_index.h @@ -6,6 +6,16 @@ namespace Pairwise { + /*! + * This class implements the pairwise index, which provides a way to order + * elements in a pairwise matrix and iterate through them. The pairwise index + * uses row-major order and uses only the lower triangle of a matrix; that is, + * it assumes that the row index is always greater than the column index. + * Additionally, the pairwise index provides an "indent" value which can be + * used to rank pairs that also have a cluster index; this value requires a + * fixed upper bound on the number of clusters, which depends on the data + * objects that use this class. + */ class Index { public: @@ -20,7 +30,6 @@ namespace Pairwise Index& operator=(const Index&) = default; Index& operator=(Index&&) = default; void operator++(); - Index operator++(int); bool operator==(const Index& object) const { return _x == object._x && _y == object._y; } bool operator!=(const Index& object) @@ -33,9 +42,21 @@ namespace Pairwise { return !(*this <= object); } bool operator>=(const Index& object) { return !(*this < object); } + /*! + * The maximum number of clusters used to compute the indent value + * of a pairwise index. Data objects which use the pairwise index should + * never attempt to store more than this number of clusters in a single + * pair. + */ constexpr static qint8 MAX_CLUSTER_SIZE {64}; private: + /*! + * The row index. + */ qint32 _x {1}; + /*! + * The column index. 
+ */ qint32 _y {0}; }; } diff --git a/src/core/pairwise_kmeans.cpp b/src/core/pairwise_kmeans.cpp deleted file mode 100644 index 34ecb8c..0000000 --- a/src/core/pairwise_kmeans.cpp +++ /dev/null @@ -1,127 +0,0 @@ -#include "pairwise_kmeans.h" - - - -using namespace Pairwise; - - - - - - -bool KMeans::fit(const QVector& X, int N, int K, QVector& labels) -{ - const int NUM_INITS = 10; - const int MAX_ITERATIONS = 300; - - // repeat with several initializations - _logL = -INFINITY; - - for ( int init = 0; init < NUM_INITS; ++init ) - { - // initialize means randomly from X - _means.resize(K); - - for ( int k = 0; k < K; ++k ) - { - int i = rand() % N; - _means[k] = X[i]; - } - - // iterate K means until convergence - QVector y(N); - QVector y_next(N); - - for ( int t = 0; t < MAX_ITERATIONS; ++t ) - { - // compute new labels - for ( int i = 0; i < N; ++i ) - { - // find k that minimizes norm(x_i - mu_k) - int min_k = -1; - float min_dist; - - for ( int k = 0; k < K; ++k ) - { - float dist = vectorDiffNorm(X[i], _means[k]); - - if ( min_k == -1 || dist < min_dist ) - { - min_k = k; - min_dist = dist; - } - } - - y_next[i] = min_k; - } - - // check for convergence - if ( y == y_next ) - { - break; - } - - // update labels - std::swap(y, y_next); - - // update means - for ( int k = 0; k < K; ++k ) - { - // compute mu_k = mean of all x_i in cluster k - int n_k = 0; - - vectorInitZero(_means[k]); - - for ( int i = 0; i < N; ++i ) - { - if ( y[i] == k ) - { - vectorAdd(_means[k], X[i]); - n_k++; - } - } - - vectorScale(_means[k], 1.0f / n_k); - } - } - - // save the run with the greatest log-likelihood - float logL = computeLogLikelihood(X, N, y); - - if ( _logL < logL ) - { - _logL = logL; - std::swap(labels, y); - } - } - - return true; -} - - - - - - -float KMeans::computeLogLikelihood(const QVector& X, int N, const QVector& y) -{ - // compute within-class scatter - float S = 0; - - for ( int k = 0; k < _means.size(); ++k ) - { - for ( int i = 0; i < N; ++i ) - { - if ( 
y[i] != k ) - { - continue; - } - - float dist = vectorDiffNorm(X[i], _means[k]); - - S += dist * dist; - } - } - - return -S; -} diff --git a/src/core/pairwise_kmeans.h b/src/core/pairwise_kmeans.h deleted file mode 100644 index 645cea6..0000000 --- a/src/core/pairwise_kmeans.h +++ /dev/null @@ -1,26 +0,0 @@ -#ifndef PAIRWISE_KMEANS_H -#define PAIRWISE_KMEANS_H -#include "pairwise_clustering.h" - -namespace Pairwise -{ - class KMeans : public Clustering - { - public: - KMeans() = default; - - protected: - bool fit(const QVector& X, int N, int K, QVector& labels); - - float logLikelihood() const { return _logL; } - float entropy() const { return 0; } - - private: - float computeLogLikelihood(const QVector& X, int N, const QVector& y); - - QVector _means; - float _logL; - }; -} - -#endif diff --git a/src/core/pairwise_linalg.cpp b/src/core/pairwise_linalg.cpp index aaf6bb2..1d63d34 100644 --- a/src/core/pairwise_linalg.cpp +++ b/src/core/pairwise_linalg.cpp @@ -9,6 +9,13 @@ namespace Pairwise { +/*! + * Return the i.j element of a matrix. + * + * @param M + * @param i + * @param j + */ inline const float& elem(const Matrix2x2& M, int i, int j) { return M.s[i * 2 + j]; @@ -19,6 +26,13 @@ inline const float& elem(const Matrix2x2& M, int i, int j) +/*! + * Return the i.j element of a matrix. + * + * @param M + * @param i + * @param j + */ inline float& elem(Matrix2x2& M, int i, int j) { return M.s[i * 2 + j]; @@ -29,6 +43,11 @@ inline float& elem(Matrix2x2& M, int i, int j) +/*! + * Initialize a vector to the zero vector. + * + * @param a + */ void vectorInitZero(Vector2& a) { a.s[0] = 0; @@ -40,6 +59,12 @@ void vectorInitZero(Vector2& a) +/*! + * Add two vectors in-place. The result is stored in a. + * + * @param a + * @param b + */ void vectorAdd(Vector2& a, const Vector2& b) { a.s[0] += b.s[0]; @@ -51,6 +76,14 @@ void vectorAdd(Vector2& a, const Vector2& b) +/*! + * Add two vectors in-place. The vector b is scaled by a constant c, and the + * result is stored in a. 
+ * + * @param a + * @param c + * @param b + */ void vectorAdd(Vector2& a, float c, const Vector2& b) { a.s[0] += c * b.s[0]; @@ -62,6 +95,12 @@ void vectorAdd(Vector2& a, float c, const Vector2& b) +/*! + * Subtract two vectors in-place. The result is stored in a. + * + * @param a + * @param b + */ void vectorSubtract(Vector2& a, const Vector2& b) { a.s[0] -= b.s[0]; @@ -73,6 +112,12 @@ void vectorSubtract(Vector2& a, const Vector2& b) +/*! + * Scale a vector by a constant. + * + * @param a + * @param c + */ void vectorScale(Vector2& a, float c) { a.s[0] *= c; @@ -84,6 +129,12 @@ void vectorScale(Vector2& a, float c) +/*! + * Return the dot product of two vectors. + * + * @param a + * @param b + */ float vectorDot(const Vector2& a, const Vector2& b) { return a.s[0] * b.s[0] + a.s[1] * b.s[1]; @@ -94,6 +145,12 @@ float vectorDot(const Vector2& a, const Vector2& b) +/*! + * Return the Euclidean distance between two vectors. + * + * @param a + * @param b + */ float vectorDiffNorm(const Vector2& a, const Vector2& b) { float dist = 0; @@ -108,6 +165,11 @@ float vectorDiffNorm(const Vector2& a, const Vector2& b) +/*! + * Initialize a matrix to the identity matrix. + * + * @param M + */ void matrixInitIdentity(Matrix2x2& M) { elem(M, 0, 0) = 1; @@ -121,6 +183,11 @@ void matrixInitIdentity(Matrix2x2& M) +/*! + * Initialize a matrix to the zero matrix. + * + * @param M + */ void matrixInitZero(Matrix2x2& M) { elem(M, 0, 0) = 0; @@ -134,6 +201,14 @@ void matrixInitZero(Matrix2x2& M) +/*! + * Add two matrices in place. The matrix B is scaled by a constant c, and the + * result is stored in A. + * + * @param A + * @param c + * @param B + */ void matrixAdd(Matrix2x2& A, float c, const Matrix2x2& B) { elem(A, 0, 0) += c * elem(B, 0, 0); @@ -147,6 +222,12 @@ void matrixAdd(Matrix2x2& A, float c, const Matrix2x2& B) +/*! + * Scale a matrix by a constant. 
+ * + * @param M + * @param c + */ void matrixScale(Matrix2x2& A, float c) { elem(A, 0, 0) *= c; @@ -160,6 +241,14 @@ void matrixScale(Matrix2x2& A, float c) +/*! + * Compute the inverse of A and store the result in B. Additionally, the + * determinant is returned as a pointer argument. + * + * @param A + * @param B + * @param p_det + */ void matrixInverse(const Matrix2x2& A, Matrix2x2& B, float *p_det) { float det = elem(A, 0, 0) * elem(A, 1, 1) - elem(A, 0, 1) * elem(A, 1, 0); @@ -177,6 +266,13 @@ void matrixInverse(const Matrix2x2& A, Matrix2x2& B, float *p_det) +/*! + * Compute the matrix-vector product A * x and store the result in b. + * + * @param A + * @param x + * @param b + */ void matrixProduct(const Matrix2x2& A, const Vector2& x, Vector2& b) { b.s[0] = elem(A, 0, 0) * x.s[0] + elem(A, 0, 1) * x.s[1]; @@ -188,6 +284,13 @@ void matrixProduct(const Matrix2x2& A, const Vector2& x, Vector2& b) +/*! + * Compute the outer product a * b^T and store the result in C. + * + * @param a + * @param b + * @param C + */ void matrixOuterProduct(const Vector2& a, const Vector2& b, Matrix2x2& C) { elem(C, 0, 0) = a.s[0] * b.s[0]; diff --git a/src/core/pairwise_linalg.h b/src/core/pairwise_linalg.h index b135e87..70109df 100644 --- a/src/core/pairwise_linalg.h +++ b/src/core/pairwise_linalg.h @@ -2,6 +2,13 @@ #define PAIRWISE_LINALG_H #include +/*! + * This file provides structure and function definitions for the Vector2 and + * Matrix2x2 types, which are vector and matrix types with fixed dimensions. + * The operations defined for these types compute outputs directly without the + * use of loops. These types are useful for any algorithm that operates on + * pairwise data. + */ namespace Pairwise { typedef union { diff --git a/src/core/pairwise_matrix.cpp b/src/core/pairwise_matrix.cpp index ae8d82c..9386730 100644 --- a/src/core/pairwise_matrix.cpp +++ b/src/core/pairwise_matrix.cpp @@ -9,9 +9,16 @@ using namespace Pairwise; +/*! 
+ * Return the index of the first byte in this data object after the end of + * the data section. Defined as the size of the header and sub-header plus the + * total size of all pairs. + */ qint64 Matrix::dataEnd() const { - return _headerSize + _offset + _clusterSize * (_dataSize + _itemHeaderSize); + EDEBUG_FUNC(this); + + return _headerSize + _subHeaderSize + _clusterSize * (_dataSize + _itemHeaderSize); } @@ -19,11 +26,20 @@ qint64 Matrix::dataEnd() const +/*! + * Read in the data of an existing data object that was just opened. + */ void Matrix::readData() { - // read header + EDEBUG_FUNC(this); + + // seek to the beginning of the data seek(0); - stream() >> _geneSize >> _maxClusterSize >> _dataSize >> _pairSize >> _clusterSize >> _offset; + + // read the header + stream() >> _geneSize >> _maxClusterSize >> _dataSize >> _pairSize >> _clusterSize >> _subHeaderSize; + + // read the sub-header readHeader(); } @@ -32,14 +48,23 @@ void Matrix::readData() +/*! + * Initialize this data object's data to a null state. + */ void Matrix::writeNewData() { + EDEBUG_FUNC(this); + // initialize metadata - setMeta(EMetadata(EMetadata::Object)); + setMeta(EMetaObject()); - // initialize header + // seek to the beginning of the data seek(0); - stream() << _geneSize << _maxClusterSize << _dataSize << _pairSize << _clusterSize << _offset; + + // write the header + stream() << _geneSize << _maxClusterSize << _dataSize << _pairSize << _clusterSize << _subHeaderSize; + + // write the sub-header writeHeader(); } @@ -48,11 +73,21 @@ void Matrix::writeNewData() +/*! + * Finalize this data object's data after the analytic that created it has + * finished giving it new data. 
+ */ void Matrix::finish() { - // initialize header + EDEBUG_FUNC(this); + + // seek to the beginning of the data seek(0); - stream() << _geneSize << _maxClusterSize << _dataSize << _pairSize << _clusterSize << _offset; + + // write the header + stream() << _geneSize << _maxClusterSize << _dataSize << _pairSize << _clusterSize << _subHeaderSize; + + // write the sub-header writeHeader(); } @@ -61,9 +96,14 @@ void Matrix::finish() -EMetadata Matrix::geneNames() const +/*! + * Return the list of gene names in this pairwise matrix. + */ +EMetaArray Matrix::geneNames() const { - return meta().toObject().at("genes"); + EDEBUG_FUNC(this); + + return meta().toObject().at("genes").toArray(); } @@ -71,19 +111,30 @@ EMetadata Matrix::geneNames() const -void Matrix::initialize(const EMetadata& geneNames, int maxClusterSize, int dataSize, int offset) +/*! + * Initialize this pairwise matrix with a list of gene names, the max cluster + * size, the pairwise data size, and the sub-header size. + * + * @param geneNames + * @param maxClusterSize + * @param dataSize + * @param subHeaderSize + */ +void Matrix::initialize(const EMetaArray& geneNames, int maxClusterSize, int dataSize, int subHeaderSize) { - // make sure gene names metadata is an array and is not empty - if ( !geneNames.isArray() || geneNames.toArray().isEmpty() ) + EDEBUG_FUNC(this,&geneNames,maxClusterSize,dataSize,subHeaderSize); + + // make sure gene names metadata is not empty + if ( geneNames.isEmpty() ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("Domain Error")); - e.setDetails(tr("Gene names metadata is not an array or is empty.")); + e.setDetails(tr("Gene names metadata is empty.")); throw e; } // make sure arguments are valid - if ( maxClusterSize < 1 || dataSize < 1 || offset < 0 ) + if ( maxClusterSize < 1 || dataSize < 1 || subHeaderSize < 0 ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("Pairwise Matrix Initialization Error")); @@ -105,10 +156,10 @@ void Matrix::initialize(const EMetadata& geneNames, int maxClusterSize, 
int data setMeta(metaObject); // initiailze new data within object - _geneSize = geneNames.toArray().size(); + _geneSize = geneNames.size(); _maxClusterSize = maxClusterSize; _dataSize = dataSize; - _offset = offset; + _subHeaderSize = subHeaderSize; _pairSize = 0; _clusterSize = 0; _lastWrite = -1; @@ -119,8 +170,16 @@ void Matrix::initialize(const EMetadata& geneNames, int maxClusterSize, int data -void Matrix::write(Index index, qint8 cluster) +/*! + * Write the header of a new pair given a pairwise index and cluster index. + * + * @param index + * @param cluster + */ +void Matrix::write(const Index& index, qint8 cluster) { + EDEBUG_FUNC(this,&index,cluster); + // make sure this is new data object that can be written to if ( _lastWrite == -2 ) { @@ -142,7 +201,7 @@ void Matrix::write(Index index, qint8 cluster) } // seek to position for next pair and write indent value - seek(_headerSize + _offset + _clusterSize * (_dataSize + _itemHeaderSize)); + seek(_headerSize + _subHeaderSize + _clusterSize * (_dataSize + _itemHeaderSize)); stream() << index.getX() << index.getY() << cluster; // increment cluster size and set new last index @@ -155,9 +214,18 @@ void Matrix::write(Index index, qint8 cluster) +/*! + * Get a pair at the given index in the data object file and return the + * pairwise index and cluster index of that pair. + * + * @param index + * @param cluster + */ Index Matrix::getPair(qint64 index, qint8* cluster) const { - // seek to pairwise index and read item header data + EDEBUG_FUNC(this,index,cluster); + + // seek to index and read item header data seekPair(index); qint32 geneX; qint32 geneY; @@ -172,8 +240,17 @@ Index Matrix::getPair(qint64 index, qint8* cluster) const +/*! + * Find a pair with a given indent value using binary search. 
+ * + * @param indent + * @param first + * @param last + */ qint64 Matrix::findPair(qint64 indent, qint64 first, qint64 last) const { + EDEBUG_FUNC(this,indent,first,last); + // calculate the midway pivot point and seek to it qint64 pivot {first + (last - first)/2}; seekPair(pivot); @@ -227,8 +304,15 @@ qint64 Matrix::findPair(qint64 indent, qint64 first, qint64 last) const +/*! + * Seek to the pair at the given index in the data object file. + * + * @param index + */ void Matrix::seekPair(qint64 index) const { + EDEBUG_FUNC(this,index); + // make sure index is within range if ( index < 0 || index >= _clusterSize ) { @@ -239,119 +323,6 @@ void Matrix::seekPair(qint64 index) const throw e; } - // seek to pairwise index requested making sure it worked - seek(_headerSize + _offset + index * (_dataSize + _itemHeaderSize)); -} - - - - - - -void Matrix::Pair::write(Index index) -{ - // make sure cluster size of pair does not exceed max - if ( clusterSize() > _matrix->_maxClusterSize ) - { - E_MAKE_EXCEPTION(e); - e.setTitle(tr("Pairwise Logical Error")); - e.setDetails(tr("Cannot write pair with cluster size %1 exceeding the max of %2.") - .arg(clusterSize()) - .arg(_matrix->_maxClusterSize)); - throw e; - } - - // go through each cluster and write it to data object - for (int i = 0; i < clusterSize() ;++i) - { - _matrix->write(index,i); - writeCluster(_matrix->stream(),i); - } - - // increment pair size of data object - ++(_matrix->_pairSize); -} - - - - - - -void Matrix::Pair::read(Index index) const -{ - // clear any existing clusters - clearClusters(); - - // attempt to find cluster index within data object - qint64 clusterIndex; - if ( _cMatrix->_clusterSize > 0 - && (clusterIndex = _cMatrix->findPair(index.indent(0),0,_cMatrix->_clusterSize - 1)) != -1 ) - { - // pair found, read in all clusters - _rawIndex = clusterIndex; - readNext(); - } -} - - - - - - -void Matrix::Pair::readNext() const -{ - // make sure read next index is not already at end of data object - 
if ( _rawIndex < _cMatrix->_clusterSize ) - { - // clear any existing clusters - clearClusters(); - - // get to first cluster - qint8 cluster; - Index index {_cMatrix->getPair(_rawIndex++,&cluster)}; - - // make sure this is cluster 0 - if ( cluster != 0 ) - { - E_MAKE_EXCEPTION(e); - e.setTitle(tr("File IO Error")); - e.setDetails(tr("Reading pair failed because first cluster is not 0.")); - throw e; - } - - // add first cluster, read it in, and save pairwise index - addCluster(); - readCluster(_cMatrix->stream(),0); - _index = index; - - // read in remaining clusters for pair - qint8 count {1}; - while ( _rawIndex < _cMatrix->_clusterSize ) - { - // get next pair cluster - _cMatrix->getPair(_rawIndex++,&cluster); - - // if cluster is zero this is the next pair so break from loop - if ( cluster == 0 ) - { - --_rawIndex; - break; - } - - // make sure max cluster size has not been exceeded - if ( ++count > _cMatrix->_maxClusterSize ) - { - E_MAKE_EXCEPTION(e); - e.setTitle(tr("Pairwise Logical Error")); - e.setDetails(tr("Cannot read pair with cluster size %1 exceeding the max of %2.") - .arg(count) - .arg(_matrix->_maxClusterSize)); - throw e; - } - - // add new cluster and read it in - addCluster(); - readCluster(_cMatrix->stream(),cluster); - } - } + // seek to the specified index + seek(_headerSize + _subHeaderSize + index * (_dataSize + _itemHeaderSize)); } diff --git a/src/core/pairwise_matrix.h b/src/core/pairwise_matrix.h index f8c8ab7..21d0e59 100644 --- a/src/core/pairwise_matrix.h +++ b/src/core/pairwise_matrix.h @@ -1,5 +1,5 @@ -#ifndef PAIRWISE_BASE_H -#define PAIRWISE_BASE_H +#ifndef PAIRWISE_MATRIX_H +#define PAIRWISE_MATRIX_H #include #include "pairwise_index.h" @@ -8,76 +8,80 @@ namespace Pairwise { + /*! + * This class implements the abstract pairwise matrix data object, which can + * be extended to represent any pairwise matrix. 
Both the rows and columns + * correspond to genes, and each element (i, j) in the matrix contains + * pairwise data for genes i and j. This pairwise data can have multiple clusters, + * and the structure of a "pair-cluster" is defined by the inheriting class. + * This class stores matrix data as an ordered list of indexed pairs; therefore, + * pairwise data must be written in order and it should be sparse for the + * storage format to be efficient. + */ class Matrix : public EAbstractData { public: class Pair; + public: virtual qint64 dataEnd() const override final; virtual void readData() override final; virtual void writeNewData() override final; virtual void finish() override final; + public: int geneSize() const { return _geneSize; } int maxClusterSize() const { return _maxClusterSize; } qint64 size() const { return _pairSize; } - EMetadata geneNames() const; + EMetaArray geneNames() const; protected: virtual void writeHeader() = 0; virtual void readHeader() = 0; - void initialize(const EMetadata& geneNames, int maxClusterSize, int dataSize, int offset); + void initialize(const EMetaArray& geneNames, int maxClusterSize, int dataSize, int offset); private: - void write(Index index, qint8 cluster); + void write(const Index& index, qint8 cluster); Index getPair(qint64 index, qint8* cluster) const; qint64 findPair(qint64 indent, qint64 first, qint64 last) const; void seekPair(qint64 index) const; + /*! + * The size (in bytes) of the header at the beginning of the file. The header + * consists of the gene size, max cluster size, pairwise data size, total + * number of pairs, total number of clusters, and sub-header offset. + */ constexpr static int _headerSize {30}; + /*! + * The size (in bytes) of the pairwise header. The item header size consists + * of the row and column index of the pair. + */ constexpr static int _itemHeaderSize {9}; + /*! + * The number of genes in the pairwise matrix. + */ qint32 _geneSize {0}; + /*! 
+ * The maximum number of clusters allowed for each pair in the matrix. + */ qint32 _maxClusterSize {0}; + /*! + * The size (in bytes) of a pairwise data element. + */ qint32 _dataSize {0}; + /*! + * The total number of pairs in the matrix. + */ qint64 _pairSize {0}; + /*! + * The total number of clusters (across all pairs) in the matrix. + */ qint64 _clusterSize {0}; - qint16 _offset {0}; + /*! + * The size (in bytes) of the sub-header, which occurs after the header + * and can be used by an inheriting class. + */ + qint16 _subHeaderSize {0}; + /*! + * The index of the last pair that was written to the matrix. + */ qint64 _lastWrite {-2}; }; - - - - class Matrix::Pair - { - public: - Pair(Matrix* matrix): - _matrix(matrix), - _cMatrix(matrix), - _index({matrix->_geneSize,0}) - {} - Pair(const Matrix* matrix): - _cMatrix(matrix), - _index({matrix->_geneSize,0}) - {} - Pair() = default; - Pair(const Pair&) = default; - Pair(Pair&&) = default; - virtual void clearClusters() const = 0; - virtual void addCluster(int amount = 1) const = 0; - virtual int clusterSize() const = 0; - virtual bool isEmpty() const = 0; - void write(Index index); - void read(Index index) const; - void reset() const { _rawIndex = 0; }; - void readNext() const; - bool hasNext() const { return _rawIndex != _cMatrix->_clusterSize; } - const Index& index() const { return _index; } - Pair& operator=(const Pair&) = default; - Pair& operator=(Pair&&) = default; - protected: - virtual void writeCluster(EDataStream& stream, int cluster) = 0; - virtual void readCluster(const EDataStream& stream, int cluster) const = 0; - private: - Matrix* _matrix {nullptr}; - const Matrix* _cMatrix; - mutable qint64 _rawIndex {0}; - mutable Index _index; - }; } diff --git a/src/core/pairwise_matrix_pair.cpp b/src/core/pairwise_matrix_pair.cpp new file mode 100644 index 0000000..823832a --- /dev/null +++ b/src/core/pairwise_matrix_pair.cpp @@ -0,0 +1,138 @@ +#include "pairwise_matrix_pair.h" + + + +using namespace 
Pairwise; + + + + + + +/*! + * Write the iterator's pairwise data to the data object file with the given + * pairwise index. + * + * @param index + */ +void Matrix::Pair::write(const Index& index) +{ + EDEBUG_FUNC(this,&index); + + // make sure cluster size of pair does not exceed max + if ( clusterSize() > _matrix->_maxClusterSize ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Pairwise Logical Error")); + e.setDetails(tr("Cannot write pair with cluster size %1 exceeding the max of %2.") + .arg(clusterSize()) + .arg(_matrix->_maxClusterSize)); + throw e; + } + + // go through each cluster and write it to data object + for (int i = 0; i < clusterSize() ;++i) + { + _matrix->write(index,i); + writeCluster(_matrix->stream(),i); + } + + // increment pair size of data object + ++(_matrix->_pairSize); +} + + + + + + +/*! + * Read the pair with the given pairwise index from the data object file. + * + * @param index + */ +void Matrix::Pair::read(const Index& index) const +{ + EDEBUG_FUNC(this,&index); + + // clear any existing clusters + clearClusters(); + + // attempt to find cluster index within data object + qint64 clusterIndex; + if ( _cMatrix->_clusterSize > 0 + && (clusterIndex = _cMatrix->findPair(index.indent(0),0,_cMatrix->_clusterSize - 1)) != -1 ) + { + // pair found, read in all clusters + _rawIndex = clusterIndex; + readNext(); + } +} + + + + + + +/*! + * Read the next pair in the data object file. 
+ */ +void Matrix::Pair::readNext() const +{ + EDEBUG_FUNC(this); + + // make sure read next index is not already at end of data object + if ( _rawIndex < _cMatrix->_clusterSize ) + { + // clear any existing clusters + clearClusters(); + + // get to first cluster + qint8 cluster; + Index index {_cMatrix->getPair(_rawIndex++,&cluster)}; + + // make sure this is cluster 0 + if ( cluster != 0 ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("File IO Error")); + e.setDetails(tr("Reading pair failed because first cluster is not 0.")); + throw e; + } + + // add first cluster, read it in, and save pairwise index + addCluster(); + readCluster(_cMatrix->stream(),0); + _index = index; + + // read in remaining clusters for pair + qint8 count {1}; + while ( _rawIndex < _cMatrix->_clusterSize ) + { + // get next pair cluster + _cMatrix->getPair(_rawIndex++,&cluster); + + // if cluster is zero this is the next pair so break from loop + if ( cluster == 0 ) + { + --_rawIndex; + break; + } + + // make sure max cluster size has not been exceeded + if ( ++count > _cMatrix->_maxClusterSize ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Pairwise Logical Error")); + e.setDetails(tr("Cannot read pair with cluster size %1 exceeding the max of %2.") + .arg(count) + .arg(_matrix->_maxClusterSize)); + throw e; + } + + // add new cluster and read it in + addCluster(); + readCluster(_cMatrix->stream(),cluster); + } + } +} diff --git a/src/core/pairwise_matrix_pair.h b/src/core/pairwise_matrix_pair.h new file mode 100644 index 0000000..e116e49 --- /dev/null +++ b/src/core/pairwise_matrix_pair.h @@ -0,0 +1,65 @@ +#ifndef PAIRWISE_MATRIX_PAIR_H +#define PAIRWISE_MATRIX_PAIR_H +#include "pairwise_matrix.h" + + + +namespace Pairwise +{ + /*! + * This class implements the pairwise iterator for the pairwise matrix + * data object. The pairwise iterator can read from or write to any pair in + * the pairwise matrix, or it can simply iterate through each pair. 
The + * iterator stores only one pair in memory at a time. + */ + class Matrix::Pair + { + public: + Pair(Matrix* matrix): + _matrix(matrix), + _cMatrix(matrix) + {} + Pair(const Matrix* matrix): + _cMatrix(matrix) + {} + Pair() = default; + Pair(const Pair&) = default; + Pair(Pair&&) = default; + virtual void clearClusters() const = 0; + virtual void addCluster(int amount = 1) const = 0; + virtual int clusterSize() const = 0; + virtual bool isEmpty() const = 0; + void write(const Index& index); + void read(const Index& index) const; + void reset() const { _rawIndex = 0; }; + void readNext() const; + bool hasNext() const { return _rawIndex != _cMatrix->_clusterSize; } + const Index& index() const { return _index; } + Pair& operator=(const Pair&) = default; + Pair& operator=(Pair&&) = default; + protected: + virtual void writeCluster(EDataStream& stream, int cluster) = 0; + virtual void readCluster(const EDataStream& stream, int cluster) const = 0; + private: + /*! + * Pointer to the parent pairwise matrix. + */ + Matrix* _matrix {nullptr}; + /*! + * Constant pointer to the parent pairwise matrix. + */ + const Matrix* _cMatrix; + /*! + * The iterator's current position in the pairwise matrix. + */ + mutable qint64 _rawIndex {0}; + /*! + * Pairwise index corresponding to the iterator's position. + */ + mutable Index _index; + }; +} + + + +#endif diff --git a/src/core/pairwise_pearson.cpp b/src/core/pairwise_pearson.cpp index 871587b..163569b 100644 --- a/src/core/pairwise_pearson.cpp +++ b/src/core/pairwise_pearson.cpp @@ -9,6 +9,15 @@ using namespace Pairwise; +/*! + * Compute the Pearson correlation of a cluster in a pairwise data array. The + * data array should only contain samples that have a non-negative label. 
+ * + * @param data + * @param labels + * @param cluster + * @param minSamples + */ float Pearson::computeCluster( const QVector& data, const QVector& labels, @@ -23,20 +32,25 @@ float Pearson::computeCluster( float sumy2 = 0; float sumxy = 0; - for ( int i = 0; i < labels.size(); ++i ) + for ( int i = 0, j = 0; i < labels.size(); ++i ) { - if ( labels[i] == cluster ) + if ( labels[i] >= 0 ) { - float x_i = data[i].s[0]; - float y_i = data[i].s[1]; + if ( labels[i] == cluster ) + { + float x_i = data[j].s[0]; + float y_i = data[j].s[1]; - sumx += x_i; - sumy += y_i; - sumx2 += x_i * x_i; - sumy2 += y_i * y_i; - sumxy += x_i * y_i; + sumx += x_i; + sumy += y_i; + sumx2 += x_i * x_i; + sumy2 += y_i * y_i; + sumxy += x_i * y_i; - ++n; + ++n; + } + + ++j; } } diff --git a/src/core/pairwise_pearson.h b/src/core/pairwise_pearson.h index e3a4529..72e25bc 100644 --- a/src/core/pairwise_pearson.h +++ b/src/core/pairwise_pearson.h @@ -1,15 +1,14 @@ #ifndef PAIRWISE_PEARSON_H #define PAIRWISE_PEARSON_H -#include "pairwise_correlation.h" +#include "pairwise_correlationmodel.h" namespace Pairwise { - class Pearson : public Correlation + /*! + * This class implements the Pearson correlation model. + */ + class Pearson : public CorrelationModel { - public: - void initialize(ExpressionMatrix* /*input*/) {} - QString getName() const { return "pearson"; } - protected: float computeCluster( const QVector& data, diff --git a/src/core/pairwise_spearman.cpp b/src/core/pairwise_spearman.cpp index a71ddcd..b85387f 100644 --- a/src/core/pairwise_spearman.cpp +++ b/src/core/pairwise_spearman.cpp @@ -9,10 +9,36 @@ using namespace Pairwise; -void Spearman::initialize(ExpressionMatrix* input) +/*! + * Compute the next power of 2 which occurs after a number. + * + * @param n + */ +int Spearman::nextPower2(int n) +{ + int pow2 = 2; + while ( pow2 < n ) + { + pow2 *= 2; + } + + return pow2; +} + + + + + + +/*! + * Construct a Spearman correlation model. 
+ * + * @param emx + */ +Spearman::Spearman(ExpressionMatrix* emx) { // pre-allocate workspace - int workSize = nextPower2(input->getSampleSize()); + int workSize = nextPower2(emx->sampleSize()); _x.resize(workSize); _y.resize(workSize); @@ -24,13 +50,22 @@ void Spearman::initialize(ExpressionMatrix* input) +/*! + * Compute the Spearman correlation of a cluster in a pairwise data array. The + * data array should only contain samples that have a non-negative label. + * + * @param data + * @param labels + * @param cluster + * @param minSamples + */ float Spearman::computeCluster( const QVector& data, const QVector& labels, qint8 cluster, int minSamples) { - // extract samples in gene pair cluster + // extract samples in pairwise cluster int N_pow2 = nextPower2(labels.size()); int n = 0; @@ -90,22 +125,15 @@ float Spearman::computeCluster( -int Spearman::nextPower2(int n) -{ - int pow2 = 2; - while ( pow2 < n ) - { - pow2 *= 2; - } - - return pow2; -} - - - - - - +/*! + * Sort a list using bitonic sort, while also applying the same swap operations + * to a second list of the same size. The lists should have a size which is a + * power of two. + * + * @param size + * @param sortList + * @param extraList + */ void Spearman::bitonicSort(int size, QVector& sortList, QVector& extraList) { // initialize all variables @@ -138,6 +166,15 @@ void Spearman::bitonicSort(int size, QVector& sortList, QVector& e +/*! + * Sort a list using bitonic sort, while also applying the same swap operations + * to a second list of the same size. The lists should have a size which is a + * power of two. 
+ * + * @param size + * @param sortList + * @param extraList + */ void Spearman::bitonicSort(int size, QVector& sortList, QVector& extraList) { // initialize all variables diff --git a/src/core/pairwise_spearman.h b/src/core/pairwise_spearman.h index 3c27581..394046c 100644 --- a/src/core/pairwise_spearman.h +++ b/src/core/pairwise_spearman.h @@ -1,15 +1,18 @@ #ifndef PAIRWISE_SPEARMAN_H #define PAIRWISE_SPEARMAN_H -#include "pairwise_correlation.h" +#include "pairwise_correlationmodel.h" +#include "expressionmatrix.h" namespace Pairwise { - class Spearman : public Correlation + /*! + * This class implements the Spearman correlation model. + */ + class Spearman : public CorrelationModel { public: - void initialize(ExpressionMatrix* input); - QString getName() const { return "spearman"; } - + static int nextPower2(int n); + Spearman(ExpressionMatrix* emx); protected: float computeCluster( const QVector& data, @@ -17,14 +20,20 @@ namespace Pairwise qint8 cluster, int minSamples ); - private: - int nextPower2(int n); void bitonicSort(int size, QVector& sortList, QVector& extraList); void bitonicSort(int size, QVector& sortList, QVector& extraList); - + /*! + * Workspace for the x data. + */ QVector _x; + /*! + * Workspace for the y data. + */ QVector _y; + /*! + * Workspace for the rank data. + */ QVector _rank; }; } diff --git a/src/core/powerlaw.cpp b/src/core/powerlaw.cpp new file mode 100644 index 0000000..d695e97 --- /dev/null +++ b/src/core/powerlaw.cpp @@ -0,0 +1,365 @@ +#include "powerlaw.h" +#include "powerlaw_input.h" +#include "correlationmatrix.h" + + + +using namespace std; +using RawPair = CorrelationMatrix::RawPair; + + + + + + +/*! + * Return the total number of blocks this analytic must process as steps + * or blocks of work. + */ +int PowerLaw::size() const +{ + EDEBUG_FUNC(this); + + return 1; +} + + + + + + +/*! + * Process the given index with a possible block of results if this analytic + * produces work blocks. 
This analytic implementation has no work blocks. + * + * @param result + */ +void PowerLaw::process(const EAbstractAnalytic::Block*) +{ + EDEBUG_FUNC(this); + + // initialize log text stream + QTextStream stream(_logfile); + + // load raw correlation data, row-wise maximums + QVector pairs {_input->dumpRawData()}; + QVector maximums {computeMaximums(pairs)}; + + // continue until network is sufficiently scale-free + float threshold {_thresholdStart}; + + while ( true ) + { + qInfo("\n"); + qInfo("threshold: %8.3f", threshold); + + // compute adjacency matrix based on threshold + int size; + QVector adjacencyMatrix {computeAdjacencyMatrix(pairs, maximums, threshold, &size)}; + + qInfo("adjacency matrix: %d", size); + + // make sure that adjacency matrix is not empty + float correlation {0}; + + if ( size > 0 ) + { + // compute degree distribution of matrix + QVector histogram {computeDegreeDistribution(adjacencyMatrix, size)}; + + // compute correlation of degree distribution + correlation = computeCorrelation(histogram); + + qInfo("correlation: %8.3f", correlation); + } + + // output to log file + stream << threshold << "\t" << size << "\t" << correlation << "\n"; + + // TODO: break if network is sufficiently scale-free + + // decrement threshold and fail if minimum threshold is reached + threshold -= _thresholdStep; + if ( threshold < _thresholdStop ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Power-law Threshold Error")); + e.setDetails(tr("Could not find scale-free network above stopping threshold.")); + throw e; + } + } + + // write final threshold + stream << threshold << "\n"; +} + + + + + + +/*! + * Make a new input object and return its pointer. + */ +EAbstractAnalytic::Input* PowerLaw::makeInput() +{ + EDEBUG_FUNC(this); + + return new Input(this); +} + + + + + + +/*! + * Initialize this analytic. This implementation checks to make sure the input + * data object and output log file have been set, and that the threshold + * arguments are valid.
+ */ +void PowerLaw::initialize() +{ + EDEBUG_FUNC(this); + + // make sure input and output were set properly + if ( !_input || !_logfile ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Invalid Argument")); + e.setDetails(tr("Did not get valid input or logfile arguments.")); + throw e; + } + + // make sure threshold arguments are valid + if ( _thresholdStart <= _thresholdStop ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Invalid Argument")); + e.setDetails(tr("Starting threshold must be greater than stopping threshold.")); + throw e; + } +} + + + + + + +/*! + * Compute the row-wise maximums of a correlation matrix. + * + * @param pairs + */ +QVector PowerLaw::computeMaximums(const QVector& pairs) +{ + EDEBUG_FUNC(this,&pairs); + + // initialize elements to minimum value + QVector maximums(_input->geneSize(), 0); + + // compute maximum correlation of each row + for ( auto& pair : pairs ) + { + int i = pair.index.getX(); + + for ( int k = 0; k < pair.correlations.size(); ++k ) + { + float correlation = fabs(pair.correlations[k]); + + if ( maximums[i] < correlation ) + { + maximums[i] = correlation; + } + } + } + + // return row-wise maximums + return maximums; +} + + + + + + +/*! + * Compute the adjacency matrix of a correlation matrix with a given threshold. + * This function uses the pre-computed row-wise maximums for faster computation. + * Additionally, all zero-columns are removed. The number of rows in the adjacency + * matrix is returned through a pointer argument.
+ * + * @param pairs + * @param maximums + * @param threshold + * @param size + */ +QVector PowerLaw::computeAdjacencyMatrix(const QVector& pairs, const QVector& maximums, float threshold, int* size) +{ + EDEBUG_FUNC(this,&pairs,&maximums,threshold,size); + + // generate vector of row indices that have a correlation above threshold + QVector indices(_input->geneSize(), -1); + int pruneSize = 0; + + for ( int i = 0; i < maximums.size(); ++i ) + { + if ( maximums[i] >= threshold ) + { + indices[i] = pruneSize; + pruneSize++; + } + } + + // extract adjacency matrix from correlation matrix + QVector adjacencyMatrix(pruneSize * pruneSize); + + // initialize diagonal + for ( int i = 0; i < pruneSize; ++i ) + { + adjacencyMatrix[i * pruneSize + i] = 1; + } + + // iterate through all pairs + for ( auto& pair : pairs ) + { + // get indices into pruned matrix + int i = indices[pair.index.getX()]; + int j = indices[pair.index.getY()]; + + // skip pair if it was pruned + if ( i == -1 || j == -1 ) + { + continue; + } + + // select correlation from pair + float correlation = pair.correlations[0]; + + // save correlation if it is above threshold + if ( fabs(correlation) >= threshold ) + { + adjacencyMatrix[i * pruneSize + j] = 1; + adjacencyMatrix[j * pruneSize + i] = 1; + } + } + + // save size of adjacency matrix + *size = pruneSize; + + // return adjacency matrix + return adjacencyMatrix; +} + + + + + + +/*! + * Compute the degree distribution of an adjacency matrix. 
+ * + * @param matrix + * @param size + */ +QVector PowerLaw::computeDegreeDistribution(const QVector& matrix, int size) +{ + EDEBUG_FUNC(this,&matrix,size); + + // compute degree of each node + QVector degrees(size); + + for ( int i = 0; i < size; i++ ) + { + for ( int j = 0; j < size; j++ ) + { + degrees[i] += matrix[i * size + j]; + } + } + + // compute max degree + int max {0}; + + for ( int i = 0; i < degrees.size(); i++ ) + { + if ( max < degrees[i] ) + { + max = degrees[i]; + } + } + + // compute histogram of degrees + QVector histogram(max); + + for ( int i = 0; i < degrees.size(); i++ ) + { + if ( degrees[i] > 0 ) + { + histogram[degrees[i] - 1]++; + } + } + + return histogram; +} + + + + + + +/*! + * Compare a degree distribution to a power-law distribution. The goodness-of-fit + * is measured by the Pearson correlation of the log-transformed histogram. + * + * @param histogram + */ +float PowerLaw::computeCorrelation(const QVector& histogram) +{ + EDEBUG_FUNC(this,&histogram); + + // compute log-log transform of histogram data + const int n = histogram.size(); + QVector x(n); + QVector y(n); + + for ( int i = 0; i < n; i++ ) + { + x[i] = log(i + 1); + y[i] = log(histogram[i] + 1); + } + + // visualize log-log histogram + qInfo("histogram:"); + + for ( int i = 0; i < 10; i++ ) + { + float sum {0}; + for ( int j = i * n / 10; j < (i + 1) * n / 10; j++ ) + { + sum += y[j]; + } + + int len {(int)(sum / log((float) _input->geneSize()))}; + QString bin(len, '#'); + + qInfo(" | %s", bin.toStdString().c_str()); + } + + // compute Pearson correlation of x, y + float sumx = 0; + float sumy = 0; + float sumx2 = 0; + float sumy2 = 0; + float sumxy = 0; + + for ( int i = 0; i < n; ++i ) + { + sumx += x[i]; + sumy += y[i]; + sumx2 += x[i] * x[i]; + sumy2 += y[i] * y[i]; + sumxy += x[i] * y[i]; + } + + return (n*sumxy - sumx*sumy) / sqrt((n*sumx2 - sumx*sumx) * (n*sumy2 - sumy*sumy)); +} diff --git a/src/core/powerlaw.h b/src/core/powerlaw.h new file mode 100644 index 
0000000..ec4de3c --- /dev/null +++ b/src/core/powerlaw.h @@ -0,0 +1,55 @@ +#ifndef POWERLAW_H +#define POWERLAW_H +#include +#include "correlationmatrix.h" + + + +/*! + * This class implements the Power-law thresholding analytic. This analytic takes + * a correlation matrix and attempts to find a threshold which, when applied to + * the correlation matrix, produces a scale-free network. Each thresholded network + * is evaluated by comparing the degree distribution of the network to a power-law + * distribution. This process is repeated at each threshold step from the starting + * threshold. + */ +class PowerLaw : public EAbstractAnalytic +{ + Q_OBJECT +public: + class Input; + virtual int size() const override final; + virtual void process(const EAbstractAnalytic::Block* result) override final; + virtual EAbstractAnalytic::Input* makeInput() override final; + virtual void initialize(); +private: + QVector computeMaximums(const QVector& pairs); + QVector computeAdjacencyMatrix(const QVector& pairs, const QVector& maximums, float threshold, int* size); + QVector computeDegreeDistribution(const QVector& matrix, int size); + float computeCorrelation(const QVector& histogram); + /*! + * Pointer to the input correlation matrix. + */ + CorrelationMatrix* _input {nullptr}; + /*! + * Pointer to the output log file. + */ + QFile* _logfile {nullptr}; + /*! + * The starting threshold. + */ + float _thresholdStart {0.99}; + /*! + * The threshold decrement. + */ + float _thresholdStep {0.01}; + /*! + * The stopping threshold. The analytic will fail if it cannot find a + * proper threshold before reaching the stopping threshold. + */ + float _thresholdStop {0.5}; +}; + + + +#endif diff --git a/src/core/powerlaw_input.cpp b/src/core/powerlaw_input.cpp new file mode 100644 index 0000000..3471f6d --- /dev/null +++ b/src/core/powerlaw_input.cpp @@ -0,0 +1,203 @@ +#include "powerlaw_input.h" +#include "correlationmatrix.h" +#include "datafactory.h" + + + + + + +/*!
+ * Construct a new input object with the given analytic as its parent. + * + * @param parent + */ +PowerLaw::Input::Input(PowerLaw* parent): + EAbstractAnalytic::Input(parent), + _base(parent) +{ + EDEBUG_FUNC(this,parent); +} + + + + + + +/*! + * Return the total number of arguments this analytic type contains. + */ +int PowerLaw::Input::size() const +{ + EDEBUG_FUNC(this); + + return Total; +} + + + + + + +/*! + * Return the argument type for a given index. + * + * @param index + */ +EAbstractAnalytic::Input::Type PowerLaw::Input::type(int index) const +{ + EDEBUG_FUNC(this,index); + + switch (index) + { + case InputData: return Type::DataIn; + case LogFile: return Type::FileOut; + case ThresholdStart: return Type::Double; + case ThresholdStep: return Type::Double; + case ThresholdStop: return Type::Double; + default: return Type::Boolean; + } +} + + + + + + +/*! + * Return data for a given role on an argument with the given index. + * + * @param index + * @param role + */ +QVariant PowerLaw::Input::data(int index, Role role) const +{ + EDEBUG_FUNC(this,index,role); + + switch (index) + { + case InputData: + switch (role) + { + case Role::CommandLineName: return QString("input"); + case Role::Title: return tr("Input:"); + case Role::WhatsThis: return tr("Correlation matrix for which an appropriate correlation threshold will be found."); + case Role::DataType: return DataFactory::CorrelationMatrixType; + default: return QVariant(); + } + case LogFile: + switch (role) + { + case Role::CommandLineName: return QString("log"); + case Role::Title: return tr("Log File:"); + case Role::WhatsThis: return tr("Output text file that logs all results."); + case Role::FileFilters: return tr("Text file %1").arg("(*.txt)"); + default: return QVariant(); + } + case ThresholdStart: + switch (role) + { + case Role::CommandLineName: return QString("tstart"); + case Role::Title: return tr("Threshold Start:"); + case Role::WhatsThis: return tr("Starting threshold."); + case 
Role::Default: return 0.99; + case Role::Minimum: return 0; + case Role::Maximum: return 1; + default: return QVariant(); + } + case ThresholdStep: + switch (role) + { + case Role::CommandLineName: return QString("tstep"); + case Role::Title: return tr("Threshold Step:"); + case Role::WhatsThis: return tr("Threshold step size."); + case Role::Default: return 0.01; + case Role::Minimum: return 0; + case Role::Maximum: return 1; + default: return QVariant(); + } + case ThresholdStop: + switch (role) + { + case Role::CommandLineName: return QString("tstop"); + case Role::Title: return tr("Threshold Stop:"); + case Role::WhatsThis: return tr("Stopping threshold."); + case Role::Default: return 0.5; + case Role::Minimum: return 0; + case Role::Maximum: return 1; + default: return QVariant(); + } + default: return QVariant(); + } +} + + + + + + +/*! + * Set an argument with the given index to the given value. + * + * @param index + * @param value + */ +void PowerLaw::Input::set(int index, const QVariant& value) +{ + EDEBUG_FUNC(this,index,&value); + + switch (index) + { + case ThresholdStart: + _base->_thresholdStart = value.toDouble(); + break; + case ThresholdStep: + _base->_thresholdStep = value.toDouble(); + break; + case ThresholdStop: + _base->_thresholdStop = value.toDouble(); + break; + } +} + + + + + + +/*! + * Set a file argument with the given index to the given qt file pointer. + * + * @param index + * @param file + */ +void PowerLaw::Input::set(int index, QFile* file) +{ + EDEBUG_FUNC(this,index,file); + + if ( index == LogFile ) + { + _base->_logfile = file; + } +} + + + + + + +/*! + * Set a data argument with the given index to the given data object pointer. 
+ * + * @param index + * @param data + */ +void PowerLaw::Input::set(int index, EAbstractData* data) +{ + EDEBUG_FUNC(this,index,data); + + if ( index == InputData ) + { + _base->_input = data->cast(); + } +} diff --git a/src/core/powerlaw_input.h b/src/core/powerlaw_input.h new file mode 100644 index 0000000..3516c38 --- /dev/null +++ b/src/core/powerlaw_input.h @@ -0,0 +1,42 @@ +#ifndef POWERLAW_INPUT_H +#define POWERLAW_INPUT_H +#include "powerlaw.h" + + + +/*! + * This class implements the abstract input of the PowerLaw analytic. + */ +class PowerLaw::Input : public EAbstractAnalytic::Input +{ + Q_OBJECT +public: + /*! + * Defines all input arguments for this analytic. + */ + enum Argument + { + InputData = 0 + ,LogFile + ,ThresholdStart + ,ThresholdStep + ,ThresholdStop + ,Total + }; + explicit Input(PowerLaw* parent); + virtual int size() const override final; + virtual EAbstractAnalytic::Input::Type type(int index) const override final; + virtual QVariant data(int index, Role role) const override final; + virtual void set(int index, const QVariant& value) override final; + virtual void set(int index, QFile* file) override final; + virtual void set(int index, EAbstractData* data) override final; +private: + /*! + * Pointer to the base analytic for this object. + */ + PowerLaw* _base; +}; + + + +#endif diff --git a/src/core/rmt.cpp b/src/core/rmt.cpp index 4ab2eea..88e6817 100644 --- a/src/core/rmt.cpp +++ b/src/core/rmt.cpp @@ -1,27 +1,28 @@ -#include -#include #include #include -#include -#include -#include +#include #include "rmt.h" #include "rmt_input.h" -#include "correlationmatrix.h" -#include "datafactory.h" using namespace std; +using RawPair = CorrelationMatrix::RawPair; +/*! + * Return the total number of blocks this analytic must process as steps + * or blocks of work. + */ int RMT::size() const { + EDEBUG_FUNC(this); + return 1; } @@ -30,9 +31,15 @@ int RMT::size() const -void RMT::process(const EAbstractAnalytic::Block* result) +/*! 
+ * Process the given index with a possible block of results if this analytic + * produces work blocks. This analytic implementation has no work blocks. + * + * @param result + */ +void RMT::process(const EAbstractAnalytic::Block*) { - Q_UNUSED(result); + EDEBUG_FUNC(this); // initialize log text stream QTextStream stream(_logfile); @@ -45,18 +52,18 @@ void RMT::process(const EAbstractAnalytic::Block* result) float threshold {_thresholdStart}; // load raw correlation data, row-wise maximums - QVector matrix {_input->dumpRawData()}; - QVector maximums {computeMaximums(matrix)}; + QVector pairs {_input->dumpRawData()}; + QVector maximums {computeMaximums(pairs)}; // continue while max chi is less than final threshold while ( maxChi < _chiSquareThreshold2 ) { qInfo("\n"); - qInfo("threshold: %g", threshold); + qInfo("threshold: %8.3f", threshold); // compute pruned matrix based on threshold int size; - QVector pruneMatrix {computePruneMatrix(matrix, maximums, threshold, &size)}; + QVector pruneMatrix {computePruneMatrix(pairs, maximums, threshold, &size)}; qInfo("prune matrix: %d", size); @@ -70,23 +77,28 @@ void RMT::process(const EAbstractAnalytic::Block* result) qInfo("eigenvalues: %d", eigens.size()); - // compute chi-square value from NNSD of eigenvalues - chi = computeChiSquare(eigens); + // compute unique eigenvalues + QVector unique {computeUnique(eigens)}; + + qInfo("unique eigenvalues: %d", unique.size()); - qInfo("chi-square: %g", chi); + // compute chi-squared value from NNSD of eigenvalues + chi = computeChiSquare(unique); + + qInfo("chi-squared: %g", chi); } - // make sure that chi-square test succeeded + // make sure that chi-squared test succeeded if ( chi != -1 ) { - // save the most recent chi-square value less than critical value + // save the most recent chi-squared value less than critical value if ( chi < _chiSquareThreshold1 ) { finalChi = chi; finalThreshold = threshold; } - // save the largest chi-square value which occurs after finalChi + // 
save the largest chi-squared value which occurs after finalChi if ( finalChi < _chiSquareThreshold1 && chi > finalChi ) { maxChi = chi; @@ -116,8 +128,13 @@ void RMT::process(const EAbstractAnalytic::Block* result) +/*! + * Make a new input object and return its pointer. + */ EAbstractAnalytic::Input* RMT::makeInput() { + EDEBUG_FUNC(this); + return new Input(this); } @@ -126,8 +143,15 @@ EAbstractAnalytic::Input* RMT::makeInput() +/*! + * Initialize this analytic. This implementation checks to make sure the input + * data object and output log file have been set, and that various integer + * arguments are valid. + */ void RMT::initialize() { + EDEBUG_FUNC(this); + // make sure input and output were set properly if ( !_input || !_logfile ) { @@ -147,11 +171,11 @@ void RMT::initialize() } // make sure pace arguments are valid - if ( _minUnfoldingPace >= _maxUnfoldingPace ) + if ( _minSplinePace >= _maxSplinePace ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("Invalid Argument")); - e.setDetails(tr("Minimum unfolding pace must be less than maximum unfolding pace.")); + e.setDetails(tr("Minimum spline pace must be less than maximum spline pace.")); throw e; } } @@ -161,36 +185,35 @@ void RMT::initialize() -QVector RMT::computeMaximums(const QVector& matrix) +/*! + * Compute the row-wise maximums of a correlation matrix. 
+ * + * @param pairs + */ +QVector RMT::computeMaximums(const QVector& pairs) { - const int N {_input->geneSize()}; - const int K {_input->maxClusterSize()}; + EDEBUG_FUNC(this,&pairs); // initialize elements to minimum value - QVector maximums(N * K, 0); + QVector maximums(_input->geneSize(), 0); - // compute maximum of each row/column - for ( int i = 0; i < N; ++i ) + // compute maximum correlation of each row + for ( auto& pair : pairs ) { - for ( int j = 0; j < i; ++j ) - { - for ( int k = 0; k < K; ++k ) - { - float correlation = fabs(matrix[i * N * K + j * K + k]); + int i = pair.index.getX(); - if ( maximums[i * K + k] < correlation ) - { - maximums[i * K + k] = correlation; - } + for ( int k = 0; k < pair.correlations.size(); ++k ) + { + float correlation = fabs(pair.correlations[k]); - if ( maximums[j * K + k] < correlation ) - { - maximums[j * K + k] = correlation; - } + if ( maximums[i] < correlation ) + { + maximums[i] = correlation; } } } + // return row-wise maximums return maximums; } @@ -199,48 +222,107 @@ QVector RMT::computeMaximums(const QVector& matrix) -QVector RMT::computePruneMatrix(const QVector& matrix, const QVector& maximums, float threshold, int* size) +/*! + * Compute the pruned matrix of a correlation matrix with a given threshold. This + * function uses the pre-computed row-wise maximums for faster computation. The + * returned matrix is equivalent to the correlation matrix with all correlations + * below the given threshold removed, and all zero-columns removed. Additionally, + * the number of rows in the pruned matrix is returned as a pointer argument. 
+ * + * @param pairs + * @param maximums + * @param threshold + * @param size + */ +QVector RMT::computePruneMatrix(const QVector& pairs, const QVector& maximums, float threshold, int* size) { - const int N {_input->geneSize()}; - const int K {_input->maxClusterSize()}; + EDEBUG_FUNC(this,&pairs,&maximums,threshold,size); - // generate vector of row/column indices that have a correlation above threshold - QVector indices; + // generate vector of row indices that have a correlation above threshold + QVector indices(_input->geneSize(), -1); + int pruneSize = 0; for ( int i = 0; i < maximums.size(); ++i ) { if ( maximums[i] >= threshold ) { - indices.append(i); + indices[i] = pruneSize; + pruneSize++; } } // extract pruned matrix from correlation matrix - QVector pruneMatrix(indices.size() * indices.size()); + QVector pruneMatrix(pruneSize * pruneSize); - for ( int i = 0; i < indices.size(); ++i ) + // initialize diagonal + for ( int i = 0; i < pruneSize; ++i ) { - for ( int j = 0; j < i; ++j ) + pruneMatrix[i * pruneSize + i] = 1; + } + + // iterate through all pairs + for ( auto& pair : pairs ) + { + // get indices into pruned matrix + int i = indices[pair.index.getX()]; + int j = indices[pair.index.getY()]; + + // skip pair if it was pruned + if ( i == -1 || j == -1 ) { - if ( indices[i] % K != indices[j] % K ) - { - continue; - } + continue; + } - float correlation = matrix[indices[i]/K * N * K + indices[j]/K * K + indices[i] % K]; + // select correlation from pair using reduction method + float correlation = 0; - if ( fabs(correlation) >= threshold ) + switch ( _reductionMethod ) + { + case ReductionMethod::First: + { + correlation = pair.correlations[0]; + break; + } + case ReductionMethod::MaximumCorrelation: + { + for ( int k = 0; k < pair.correlations.size(); k++ ) { - pruneMatrix[i * indices.size() + j] = correlation; + float r = fabs(pair.correlations[k]); + + if ( correlation < r ) + { + correlation = r; + } } + break; + } + case 
ReductionMethod::MaximumSize: + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Unsupported Option")); + e.setDetails(tr("Pairwise reduction by maximum size is not yet supported.")); + throw e; } + case ReductionMethod::Random: + { + int k = qrand() % pair.correlations.size(); + correlation = pair.correlations[k]; + break; + } + }; - pruneMatrix[i * indices.size() + i] = 1; + // save correlation if it is above threshold + if ( fabs(correlation) >= threshold ) + { + pruneMatrix[i * pruneSize + j] = correlation; + pruneMatrix[j * pruneSize + i] = correlation; + } } // save size of pruned matrix - *size = indices.size(); + *size = pruneSize; + // return pruned matrix return pruneMatrix; } @@ -249,36 +331,35 @@ QVector RMT::computePruneMatrix(const QVector& matrix, const QVect -QVector RMT::computeEigenvalues(QVector* pruneMatrix, int size) +/*! + * Compute the eigenvalues of a correlation matrix. + * + * @param matrix + * @param size + */ +QVector RMT::computeEigenvalues(QVector* matrix, int size) { - // using declarations for gsl resources - using gsl_vector_ptr = unique_ptr; - using gsl_matrix_ptr = unique_ptr; - using gsl_eigen_symmv_workspace_ptr = unique_ptr; - - QVector temp; - for (auto val: *pruneMatrix) temp.append(val); - - // make and initialize gsl eigen resources - gsl_matrix_view view = gsl_matrix_view_array(temp.data(),size,size); - gsl_vector_ptr eval (gsl_vector_alloc(size),&gsl_vector_free); - gsl_matrix_ptr evec (gsl_matrix_alloc(size,size),&gsl_matrix_free); - gsl_eigen_symmv_workspace_ptr work (gsl_eigen_symmv_alloc(size),&gsl_eigen_symmv_free); - - // have gsl compute eigen values for the pruned matrix - gsl_eigen_symmv(&view.matrix,eval.get(),evec.get(),work.get()); - gsl_eigen_symmv_sort(eval.get(),evec.get(),GSL_EIGEN_SORT_ABS_ASC); - - // create return vector and get eigen values from gsl - QVector ret(size); - for (int i = 0; i < size ;i++) + EDEBUG_FUNC(this,matrix,size); + + // initialize eigenvalues and workspace + QVector eigens(size); + 
QVector work(5 * size); + + // compute eigenvalues + int info = LAPACKE_ssyev_work( + LAPACK_COL_MAJOR, 'N', 'U', + size, matrix->data(), size, + eigens.data(), + work.data(), work.size()); + + // print warning if LAPACKE returned error code + if ( info != 0 ) { - ret[i] = gsl_vector_get(eval.get(),i); + qInfo("warning: LAPACKE ssyev returned %d", info); } - // return eigen values vector - return ret; + // return eigenvalues + return eigens; } @@ -286,40 +367,93 @@ QVector RMT::computeEigenvalues(QVector* pruneMatrix, int size) -float RMT::computeChiSquare(const QVector& eigens) +/*! + * Return the unique values of a sorted list of real numbers. Two real numbers + * are unique if their absolute difference is greater than some small value + * epsilon. + * + * @param values + */ +QVector RMT::computeUnique(const QVector& values) { - // compute unique eigenvalues - QVector unique {degenerate(eigens)}; + EDEBUG_FUNC(this,&values); - qInfo("unique eigenvalues: %d", unique.size()); + const float EPSILON {1e-6}; + QVector unique; - // make sure there are enough unique eigenvalues - if ( unique.size() < _minEigenvalueSize ) + for ( int i = 1; i < values.size(); ++i ) { - return -1; + if ( unique.isEmpty() || fabs(values.at(i) - unique.last()) > EPSILON ) + { + unique.append(values.at(i)); + } } - // perform several chi-square tests by varying the pace - float chi {0.0}; - int chiTestCount {0}; + return unique; +} + + - for ( int pace = _minUnfoldingPace; pace <= _maxUnfoldingPace; ++pace ) + + + + +/*! + * Compute the chi-squared test for the nearest-neighbor spacing distribution + * (NNSD) of a list of eigenvalues. The list should be sorted and should contain + * only unique values. If spline interpolation is enabled, the chi-squared value + * is an average of several chi-squared tests, in which splines of varying pace + * are applied to the eigenvalues. Otherwise, a single chi-squared test is + * performed directly on the eigenvalues. 
+ * + * @param eigens + */ +float RMT::computeChiSquare(const QVector& eigens) +{ + EDEBUG_FUNC(this,&eigens); + + // make sure there are enough eigenvalues + if ( eigens.size() < _minEigenvalueSize ) { - // perform test only if there are enough eigenvalues for pace - if ( unique.size() / pace < 5 ) + return -1; + } + + // determine whether spline interpolation is enabled + if ( _splineInterpolation ) + { + // perform several chi-squared tests with spline interpolation by varying the pace + float chi {0.0}; + int chiTestCount {0}; + + for ( int pace = _minSplinePace; pace <= _maxSplinePace; ++pace ) { - break; - } + // perform test only if there are enough eigenvalues for pace + if ( eigens.size() / pace < 5 ) + { + break; + } - chi += computePaceChiSquare(unique, pace); - ++chiTestCount; - } + // compute spline-interpolated eigenvalues + QVector splineEigens {computeSpline(eigens, pace)}; - // compute average of chi-square tests - chi /= chiTestCount; + // compute chi-squared value + float chiPace {computeChiSquareHelper(splineEigens)}; - // return chi value - return chi; + qInfo("pace: %d, chi-squared: %g", pace, chiPace); + + // append chi-squared value to running sum + chi += chiPace; + ++chiTestCount; + } + + // return average of chi-squared tests + return chi / chiTestCount; + } + else + { + // perform a single chi-squared test without spline interpolation + return computeChiSquareHelper(eigens); + } } @@ -327,12 +461,21 @@ float RMT::computeChiSquare(const QVector& eigens) -float RMT::computePaceChiSquare(const QVector& eigens, int pace) +/*! + * Compute the chi-squared test for the nearest-neighbor spacing distribution + * (NNSD) of a list of values. The list should be sorted and should contain only + * unique values. 
+ * + * @param values + */ +float RMT::computeChiSquareHelper(const QVector& values) { - // compute eigenvalue spacings - QVector spacings {unfold(eigens, pace)}; + EDEBUG_FUNC(this,&values); + + // compute spacings + QVector spacings {computeSpacings(values)}; - // compute nearest-neighbor spacing distribution + // compute histogram of spacings const float histogramMin {0}; const float histogramMax {3}; const float histogramBinWidth {(histogramMax - histogramMin) / _histogramBinSize}; @@ -346,7 +489,7 @@ float RMT::computePaceChiSquare(const QVector& eigens, int pace) } } - // compute chi-square value from nearest-neighbor spacing distribution + // compute chi-squared value from the histogram float chi {0.0}; for ( int i = 0; i < histogram.size(); ++i ) @@ -355,14 +498,12 @@ float RMT::computePaceChiSquare(const QVector& eigens, int pace) float O_i {histogram[i]}; // compute E_i, the expected value of Poisson distribution for bin i - float E_i {(exp(-i * histogramBinWidth) - exp(-(i + 1) * histogramBinWidth)) * eigens.size()}; + float E_i {(exp(-i * histogramBinWidth) - exp(-(i + 1) * histogramBinWidth)) * values.size()}; - // update chi-square value based on difference between O_i and E_i + // update chi-squared value based on difference between O_i and E_i chi += (O_i - E_i) * (O_i - E_i) / E_i; } - qInfo("pace: %d, chi: %g", pace, chi); - return chi; } @@ -371,44 +512,34 @@ float RMT::computePaceChiSquare(const QVector& eigens, int pace) -QVector RMT::degenerate(const QVector& eigens) +/*! + * Compute a spline interpolation of a list of values using the given pace. The + * list should be sorted and should contain only unique values. The pace determines + * the ratio of values which are used as points to create the spline; for example, + * a pace of 10 means that every 10th value is used to create the spline. 
+ * + * @param values + * @param pace + */ +QVector RMT::computeSpline(const QVector& values, int pace) { - const float EPSILON {1e-6}; - QVector unique; - - for ( int i = 1; i < eigens.size(); ++i ) - { - if ( unique.isEmpty() || fabs(eigens.at(i) - unique.last()) > EPSILON ) - { - unique.append(eigens.at(i)); - } - } - - return unique; -} - - - + EDEBUG_FUNC(this,&values,pace); - - -QVector RMT::unfold(const QVector& eigens, int pace) -{ // using declarations for gsl resource pointers using gsl_interp_accel_ptr = unique_ptr; using gsl_spline_ptr = unique_ptr; // extract eigenvalues for spline based on pace - int splineSize {eigens.size() / pace}; + int splineSize {values.size() / pace}; unique_ptr x(new double[splineSize]); unique_ptr y(new double[splineSize]); for ( int i = 0; i < splineSize; ++i ) { - x[i] = (double)eigens.at(i*pace); - y[i] = (double)(i*pace + 1) / eigens.size(); + x[i] = (double)values.at(i*pace); + y[i] = (double)(i*pace + 1) / values.size(); } - x[splineSize - 1] = eigens.back(); + x[splineSize - 1] = values.back(); y[splineSize - 1] = 1.0; // initialize gsl spline @@ -417,22 +548,41 @@ QVector RMT::unfold(const QVector& eigens, int pace) gsl_spline_init(spline.get(), x.get(), y.get(), splineSize); // extract interpolated eigenvalues from spline - QVector splineEigens(eigens.size()); + QVector splineValues(values.size()); - splineEigens[0] = 0.0; - splineEigens[eigens.size() - 1] = 1.0; + splineValues[0] = 0.0; + splineValues[values.size() - 1] = 1.0; - for ( int i = 1; i < eigens.size() - 1; ++i ) + for ( int i = 1; i < values.size() - 1; ++i ) { - splineEigens[i] = gsl_spline_eval(spline.get(), eigens.at(i), interp.get()); + splineValues[i] = gsl_spline_eval(spline.get(), values.at(i), interp.get()); } + // return interpolated values + return splineValues; +} + + + + + + +/*! + * Compute the spacings of a list of values. The list should be sorted and should + * contain only unique values. 
+ * + * @param values + */ +QVector RMT::computeSpacings(const QVector& values) +{ + EDEBUG_FUNC(this,&values); + // compute spacings between interpolated eigenvalues - QVector spacings(eigens.size() - 1); + QVector spacings(values.size() - 1); for ( int i = 0; i < spacings.size(); ++i ) { - spacings[i] = (splineEigens.at(i + 1) - splineEigens.at(i)) * eigens.size(); + spacings[i] = (values.at(i + 1) - values.at(i)) * values.size(); } return spacings; diff --git a/src/core/rmt.h b/src/core/rmt.h index 7ebe305..02d9af6 100644 --- a/src/core/rmt.h +++ b/src/core/rmt.h @@ -1,13 +1,24 @@ #ifndef RMT_H #define RMT_H #include +#include "correlationmatrix.h" -class CorrelationMatrix; - - - +/*! + * This class implements the RMT analytic. This analytic takes a correlation + * matrix and attempts to find a threshold which, when applied to the correlation + * matrix, produces a scale-free network. This analytic uses Random Matrix Theory + * (RMT), which involves computing the eigenvalues of a thresholded correlation + * matrix, computing the nearest-neighbor spacing distribution (NNSD) of the eigenvalues, + * and comparing the distribution to a Poisson distribution using a chi-squared + * test. This process is repeated at each threshold step from the starting threshold; + * as the threshold decreases, the NNSD changes from a Poisson distribution to + * a Gaussian orthogonal ensemble (GOE) distribution, so the chi-squared value + * decreases. When the threshold approaches the scale-free threshold, the chi-squared + * value increases sharply, and the final threshold is chosen as the lowest threshold + * which produced a chi-squared value below the critical value. 
+ */ class RMT : public EAbstractAnalytic { Q_OBJECT @@ -18,24 +29,105 @@ class RMT : public EAbstractAnalytic virtual EAbstractAnalytic::Input* makeInput() override final; virtual void initialize(); private: - QVector computeMaximums(const QVector& matrix); - QVector computePruneMatrix(const QVector& matrix, const QVector& maximums, float threshold, int* size); + /*! + * Defines the reduction methods this analytic supports. + */ + enum class ReductionMethod + { + /*! + * Select the first cluster + */ + First + /*! + * Select the cluster with the highest absolute correlation + */ + ,MaximumCorrelation + /*! + * Select the cluster with the largest sample size + */ + ,MaximumSize + /*! + * Select a random cluster + */ + ,Random + }; +private: + QVector computeMaximums(const QVector& pairs); + QVector computePruneMatrix(const QVector& pairs, const QVector& maximums, float threshold, int* size); QVector computeEigenvalues(QVector* pruneMatrix, int size); + QVector computeUnique(const QVector& values); float computeChiSquare(const QVector& eigens); - float computePaceChiSquare(const QVector& eigens, int pace); - QVector degenerate(const QVector& eigens); - QVector unfold(const QVector& eigens, int pace); - + float computeChiSquareHelper(const QVector& values); + QVector computeSpline(const QVector& values, int pace); + QVector computeSpacings(const QVector& values); + /*! + * Pointer to the input correlation matrix. + */ CorrelationMatrix* _input {nullptr}; + /*! + * Pointer to the output log file. + */ QFile* _logfile {nullptr}; + /*! + * The reduction method to use. Pairwise reduction is used to select pairwise + * correlations when there are multiple correlations per pair. By default, the + * first cluster is selected from each pair. + */ + ReductionMethod _reductionMethod {ReductionMethod::First}; + /*! + * The starting threshold. + */ float _thresholdStart {0.99}; + /*! + * The threshold decrement. + */ float _thresholdStep {0.001}; + /*! 
+ * The stopping threshold. The analytic will fail if it cannot find a + * proper threshold before reaching the stopping threshold. + */ float _thresholdStop {0.5}; + /*! + * The critical value for the chi-squared test, which is dependent on the + * degrees of freedom and the alpha-value of the test. This particular + * value is based on df = 60 and alpha = 0.001. Note that since the degrees + * of freedom corresponds to the number of histogram bins, this value + * must be re-calculated if the number of histogram bins is changed. + */ float _chiSquareThreshold1 {99.607}; + /*! + * The final chi-squared threshold. Once the chi-squared test goes below the + * chi-squared critical value, it must go above this value in order for the + * analytic to find a proper threshold. + */ float _chiSquareThreshold2 {200}; + /*! + * The minimum number of unique eigenvalues which must exist in a pruned matrix + * for the analytic to compute the NNSD of the eigenvalues. If the number of + * unique eigenvalues is less, the chi-squared test for that threshold is skipped. + */ int _minEigenvalueSize {50}; - int _minUnfoldingPace {10}; - int _maxUnfoldingPace {40}; + /*! + * Whether to perform spline interpolation on each set of eigenvalues before + * computing the spacings. If this option is enabled then the chi-squared value + * for each set of eigenvalues will be the average of multiple tests in which + * the spline pace is varied (according to the minimum and maximum spline pace); + * otherwise, only one test is performed for each set of eigenvalues. + */ + bool _splineInterpolation {true}; + /*! + * The minimum pace of the spline interpolation. + */ + int _minSplinePace {10}; + /*! + * The maximum pace of the spline interpolation. + */ + int _maxSplinePace {40}; + /*! + * The number of histogram bins in the NNSD of eigenvalues. This value + * corresponds to the degrees of freedom in the chi-squared test, therefore + * it affects the setting of the chi-squared critical value. 
+ */ int _histogramBinSize {60}; }; diff --git a/src/core/rmt_input.cpp b/src/core/rmt_input.cpp index 57233cd..87c3f3b 100644 --- a/src/core/rmt_input.cpp +++ b/src/core/rmt_input.cpp @@ -7,18 +7,48 @@ +/*! + * String list of reduction methods for this analytic that correspond exactly + * to its enumeration. Used for handling the reduction method argument for this + * input object. + */ +const QStringList RMT::Input::REDUCTION_NAMES +{ + "first" + ,"maxcorr" + ,"maxsize" + ,"random" +}; + + + + + + +/*! + * Construct a new input object with the given analytic as its parent. + * + * @param parent + */ RMT::Input::Input(RMT* parent): EAbstractAnalytic::Input(parent), _base(parent) -{} +{ + EDEBUG_FUNC(this,parent); +} +/*! + * Return the total number of arguments this analytic type contains. + */ int RMT::Input::size() const { + EDEBUG_FUNC(this); + return Total; } @@ -27,17 +57,26 @@ int RMT::Input::size() const +/*! + * Return the argument type for a given index. + * + * @param index + */ EAbstractAnalytic::Input::Type RMT::Input::type(int index) const { + EDEBUG_FUNC(this,index); + switch (index) { case InputData: return Type::DataIn; case LogFile: return Type::FileOut; + case ReductionType: return Type::Selection; case ThresholdStart: return Type::Double; case ThresholdStep: return Type::Double; case ThresholdStop: return Type::Double; - case MinUnfoldingPace: return Type::Integer; - case MaxUnfoldingPace: return Type::Integer; + case SplineInterpolation: return Type::Boolean; + case MinSplinePace: return Type::Integer; + case MaxSplinePace: return Type::Integer; case HistogramBinSize: return Type::Integer; default: return Type::Boolean; } @@ -48,8 +87,16 @@ EAbstractAnalytic::Input::Type RMT::Input::type(int index) const +/*! + * Return data for a given role on an argument with the given index. 
+ * + * @param index + * @param role + */ QVariant RMT::Input::data(int index, Role role) const { + EDEBUG_FUNC(this,index,role); + switch (index) { case InputData: @@ -70,6 +117,16 @@ QVariant RMT::Input::data(int index, Role role) const case Role::FileFilters: return tr("Text file %1").arg("(*.txt)"); default: return QVariant(); } + case ReductionType: + switch (role) + { + case Role::CommandLineName: return QString("reduction"); + case Role::Title: return tr("Reduction Method:"); + case Role::WhatsThis: return tr("Method to use for pairwise reduction."); + case Role::SelectionValues: return REDUCTION_NAMES; + case Role::Default: return "first"; + default: return QVariant(); + } case ThresholdStart: switch (role) { @@ -103,23 +160,32 @@ QVariant RMT::Input::data(int index, Role role) const case Role::Maximum: return 1; default: return QVariant(); } - case MinUnfoldingPace: + case SplineInterpolation: + switch (role) + { + case Role::CommandLineName: return QString("spline"); + case Role::Title: return tr("Use Spline Interpolation:"); + case Role::WhatsThis: return tr("Whether to perform spline interpolation on each set of eigenvalues."); + case Role::Default: return true; + default: return QVariant(); + } + case MinSplinePace: switch (role) { case Role::CommandLineName: return QString("minpace"); - case Role::Title: return tr("Minimum Unfolding Pace:"); - case Role::WhatsThis: return tr("The minimum pace with which to perform unfolding."); + case Role::Title: return tr("Minimum Spline Pace:"); + case Role::WhatsThis: return tr("The minimum pace of the spline interpolation."); case Role::Default: return 10; case Role::Minimum: return 1; case Role::Maximum: return std::numeric_limits<int>::max(); default: return QVariant(); } - case MaxUnfoldingPace: + case MaxSplinePace: switch (role) { case Role::CommandLineName: return QString("maxpace"); - case Role::Title: return tr("Maximum Unfolding Pace:"); - case Role::WhatsThis: return tr("The maximum pace with which to
perform unfolding."); + case Role::Title: return tr("Maximum Spline Pace:"); + case Role::WhatsThis: return tr("The maximum pace of the spline interpolation."); case Role::Default: return 40; case Role::Minimum: return 1; case Role::Maximum: return std::numeric_limits<int>::max(); @@ -145,10 +211,21 @@ QVariant RMT::Input::data(int index, Role role) const +/*! + * Set an argument with the given index to the given value. + * + * @param index + * @param value + */ void RMT::Input::set(int index, const QVariant& value) { + EDEBUG_FUNC(this,index,&value); + switch (index) { + case ReductionType: + _base->_reductionMethod = static_cast<ReductionMethod>(REDUCTION_NAMES.indexOf(value.toString())); + break; case ThresholdStart: _base->_thresholdStart = value.toDouble(); break; @@ -158,11 +235,14 @@ void RMT::Input::set(int index, const QVariant& value) case ThresholdStop: _base->_thresholdStop = value.toDouble(); break; - case MinUnfoldingPace: - _base->_minUnfoldingPace = value.toInt(); + case SplineInterpolation: + _base->_splineInterpolation = value.toBool(); break; - case MaxUnfoldingPace: - _base->_maxUnfoldingPace = value.toInt(); + case MinSplinePace: + _base->_minSplinePace = value.toInt(); + break; + case MaxSplinePace: + _base->_maxSplinePace = value.toInt(); break; case HistogramBinSize: _base->_histogramBinSize = value.toInt(); @@ -175,8 +255,16 @@ void RMT::Input::set(int index, const QVariant& value) +/*! + * Set a file argument with the given index to the given qt file pointer. + * + * @param index + * @param file + */ void RMT::Input::set(int index, QFile* file) { + EDEBUG_FUNC(this,index,file); + if ( index == LogFile ) { _base->_logfile = file; @@ -188,8 +276,16 @@ void RMT::Input::set(int index, QFile* file) +/*! + * Set a data argument with the given index to the given data object pointer.
+ * + * @param index + * @param data + */ void RMT::Input::set(int index, EAbstractData* data) { + EDEBUG_FUNC(this,index,data); + if ( index == InputData ) { _base->_input = data->cast(); diff --git a/src/core/rmt_input.h b/src/core/rmt_input.h index 4ac2873..ac4e519 100644 --- a/src/core/rmt_input.h +++ b/src/core/rmt_input.h @@ -4,19 +4,27 @@ +/*! + * This class implements the abstract input of the RMT analytic. + */ class RMT::Input : public EAbstractAnalytic::Input { Q_OBJECT public: + /*! + * Defines all input arguments for this analytic. + */ enum Argument { InputData = 0 ,LogFile + ,ReductionType ,ThresholdStart ,ThresholdStep ,ThresholdStop - ,MinUnfoldingPace - ,MaxUnfoldingPace + ,SplineInterpolation + ,MinSplinePace + ,MaxSplinePace ,HistogramBinSize ,Total }; @@ -28,6 +36,10 @@ class RMT::Input : public EAbstractAnalytic::Input virtual void set(int index, QFile* file) override final; virtual void set(int index, EAbstractData* data) override final; private: + static const QStringList REDUCTION_NAMES; + /*! + * Pointer to the base analytic for this object. + */ RMT* _base; }; diff --git a/src/core/similarity.cpp b/src/core/similarity.cpp index 8155fbf..11bf97b 100644 --- a/src/core/similarity.cpp +++ b/src/core/similarity.cpp @@ -4,6 +4,10 @@ #include "similarity_serial.h" #include "similarity_workblock.h" #include "similarity_opencl.h" +#include "ccmatrix_pair.h" +#include "correlationmatrix_pair.h" +#include +#include @@ -14,12 +18,32 @@ using namespace std; +/*! + * Return the total number of pairs that must be processed for a given + * expression matrix. + * + * @param emx + */ +qint64 Similarity::totalPairs(const ExpressionMatrix* emx) const +{ + EDEBUG_FUNC(this,emx); + + return (qint64) emx->geneSize() * (emx->geneSize() - 1) / 2; +} + + + + + + +/*! + * Return the total number of work blocks this analytic must process. 
+ */ int Similarity::size() const { - const qint64 totalPairs {(qint64) _input->getGeneSize() * (_input->getGeneSize() - 1) / 2}; - const qint64 WORK_BLOCK_SIZE { 32 * 1024 }; + EDEBUG_FUNC(this); - return (totalPairs + WORK_BLOCK_SIZE - 1) / WORK_BLOCK_SIZE; + return (totalPairs(_input) + _workBlockSize - 1) / _workBlockSize; } @@ -27,13 +51,24 @@ int Similarity::size() const +/*! + * Create and return a work block for this analytic with the given index. This + * implementation creates a work block with a start index and size denoting the + * number of pairs to process. + * + * @param index + */ std::unique_ptr Similarity::makeWork(int index) const { - const qint64 totalPairs {(qint64) _input->getGeneSize() * (_input->getGeneSize() - 1) / 2}; - const qint64 WORK_BLOCK_SIZE { 32 * 1024 }; + EDEBUG_FUNC(this,index); - qint64 start {index * WORK_BLOCK_SIZE}; - qint64 size {min(totalPairs - start, WORK_BLOCK_SIZE)}; + if ( ELog::isActive() ) + { + ELog() << tr("Making work index %1 of %2.\n").arg(index).arg(size()); + } + + qint64 start {index * (qint64) _workBlockSize}; + qint64 size {min(totalPairs(_input) - start, (qint64) _workBlockSize)}; return unique_ptr(new WorkBlock(index, start, size)); } @@ -43,8 +78,13 @@ std::unique_ptr Similarity::makeWork(int index) const +/*! + * Create an empty and uninitialized work block. + */ std::unique_ptr Similarity::makeWork() const { + EDEBUG_FUNC(this); + return unique_ptr(new WorkBlock); } @@ -53,8 +93,13 @@ std::unique_ptr Similarity::makeWork() const +/*! + * Create an empty and uninitialized result block. + */ std::unique_ptr Similarity::makeResult() const { + EDEBUG_FUNC(this); + return unique_ptr(new ResultBlock); } @@ -63,8 +108,22 @@ std::unique_ptr Similarity::makeResult() const +/*! + * Read in a block of results made from a block of work with the corresponding + * index. This implementation takes the Pair objects in the result block and + * saves them to the output correlation matrix and cluster matrix. 
+ * + * @param result + */ void Similarity::process(const EAbstractAnalytic::Block* result) { + EDEBUG_FUNC(this,result); + + if ( ELog::isActive() ) + { + ELog() << tr("Processing result %1 of %2.\n").arg(result->index()).arg(size()); + } + const ResultBlock* resultBlock {result->cast()}; // iterate through all pairs in result block @@ -85,7 +144,7 @@ void Similarity::process(const EAbstractAnalytic::Block* result) { ccmPair.addCluster(); - for ( int i = 0; i < _input->getSampleSize(); ++i ) + for ( int i = 0; i < _input->sampleSize(); ++i ) { ccmPair.at(ccmPair.clusterSize() - 1, i) = (pair.labels[i] >= 0) ? (k == pair.labels[i]) @@ -131,8 +190,13 @@ void Similarity::process(const EAbstractAnalytic::Block* result) +/*! + * Make a new input object and return its pointer. + */ EAbstractAnalytic::Input* Similarity::makeInput() { + EDEBUG_FUNC(this); + return new Input(this); } @@ -141,8 +205,13 @@ EAbstractAnalytic::Input* Similarity::makeInput() +/*! + * Make a new serial object and return its pointer. + */ EAbstractAnalytic::Serial* Similarity::makeSerial() { + EDEBUG_FUNC(this); + return new Serial(this); } @@ -151,8 +220,13 @@ EAbstractAnalytic::Serial* Similarity::makeSerial() +/*! + * Make a new OpenCL object and return its pointer. + */ EAbstractAnalytic::OpenCL* Similarity::makeOpenCL() { + EDEBUG_FUNC(this); + return new OpenCL(this); } @@ -161,19 +235,29 @@ EAbstractAnalytic::OpenCL* Similarity::makeOpenCL() +/*! + * Initialize this analytic. This implementation checks to make sure that valid + * arguments were provided. 
+ */ void Similarity::initialize() { - if ( !isMaster() ) + EDEBUG_FUNC(this); + + // get MPI instance + auto& mpi {Ace::QMPI::instance()}; + + // only the master process needs to validate arguments + if ( !mpi.isMaster() ) { return; } - // make sure input and output are valid - if ( !_input || !_ccm || !_cmx ) + // make sure input data is valid + if ( !_input ) { E_MAKE_EXCEPTION(e); e.setTitle(tr("Invalid Argument")); - e.setDetails(tr("Did not get valid input and/or output arguments.")); + e.setDetails(tr("Did not get a valid input data object.")); throw e; } @@ -186,12 +270,42 @@ void Similarity::initialize() throw e; } + // initialize work block size + if ( _workBlockSize == 0 ) + { + int numWorkers = max(1, mpi.size() - 1); + + _workBlockSize = min((qint64) 32768, totalPairs(_input) / numWorkers); + } +} + + + + + + +/*! + * Initialize the output data objects of this analytic. + */ +void Similarity::initializeOutputs() +{ + EDEBUG_FUNC(this); + + // make sure output data is valid + if ( !_ccm || !_cmx ) + { + E_MAKE_EXCEPTION(e); + e.setTitle(tr("Invalid Argument")); + e.setDetails(tr("Did not get valid output data objects.")); + throw e; + } + // initialize cluster matrix - _ccm->initialize(_input->getGeneNames(), _maxClusters, _input->getSampleNames()); + _ccm->initialize(_input->geneNames(), _maxClusters, _input->sampleNames()); // initialize correlation matrix EMetaArray correlations; - correlations.append(_corrModel->getName()); + correlations.append(_corrName); - _cmx->initialize(_input->getGeneNames(), _maxClusters, correlations); + _cmx->initialize(_input->geneNames(), _maxClusters, correlations); } diff --git a/src/core/similarity.h b/src/core/similarity.h index 9543b7e..52fa5be 100644 --- a/src/core/similarity.h +++ b/src/core/similarity.h @@ -5,29 +5,55 @@ #include "ccmatrix.h" #include "correlationmatrix.h" #include "expressionmatrix.h" -#include "pairwise_clustering.h" -#include "pairwise_correlation.h" -#include "pairwise_gmm.h" -#include 
"pairwise_pearson.h" +#include "pairwise_clusteringmodel.h" +/*! + * This class implements the similarity analytic. This analytic takes an + * expression matrix and computes a similarity matrix, where each element is + * a similarity measure of two genes in the expression matrix. The similarity + * is computed using a correlation measure. The similarity matrix can also have + * multiple modes within a pair; these modes can be optionally computed using a + * clustering method. This analytic produces two data objects: a correlation + * matrix containing the pairwise correlations, and a cluster matrix containing + * sample masks of the pairwise clusters. Sample masks for unimodal pairs are not + * saved to the cluster matrix. If clustering is not used, an empty cluster matrix + * is created. This analytic can also perform pairwise outlier removal before and + * after clustering, if clustering is used. + * + * This analytic can use MPI and it has both CPU and GPU implementations, as the + * pairwise clustering significantly increases the amount of computations required + * for a large expression matrix. + */ class Similarity : public EAbstractAnalytic { Q_OBJECT public: + /*! + * Defines the pair structure used to send results in result blocks. + */ struct Pair { + /*! + * The number of clusters in a pair. + */ qint8 K; + /*! + * The cluster labels for a pair. + */ QVector labels; + /*! + * The correlation for each cluster in a pair. 
+ */ QVector correlations; }; - class Input; class WorkBlock; class ResultBlock; class Serial; class OpenCL; +public: virtual int size() const override final; virtual std::unique_ptr makeWork(int index) const override final; virtual std::unique_ptr makeWork() const override final; @@ -37,38 +63,110 @@ class Similarity : public EAbstractAnalytic virtual EAbstractAnalytic::Serial* makeSerial() override final; virtual EAbstractAnalytic::OpenCL* makeOpenCL() override final; virtual void initialize() override final; - + virtual void initializeOutputs() override final; private: + /*! + * Defines the clustering methods this analytic supports. + */ enum class ClusteringMethod { + /*! + * No clustering + */ None + /*! + * Gaussian mixture models + */ ,GMM - ,KMeans }; - + /*! + * Defines the correlation methods this analytic supports. + */ enum class CorrelationMethod { + /*! + * Pearson correlation + */ Pearson + /*! + * Spearman rank correlation + */ ,Spearman }; - +private: + qint64 totalPairs(const ExpressionMatrix* emx) const; + /*! + * Pointer to the input expression matrix. + */ ExpressionMatrix* _input {nullptr}; + /*! + * Pointer to the output cluster matrix. + */ CCMatrix* _ccm {nullptr}; + /*! + * Pointer to the output correlation matrix. + */ CorrelationMatrix* _cmx {nullptr}; + /*! + * The clustering method to use. + */ ClusteringMethod _clusMethod {ClusteringMethod::None}; + /*! + * The correlation method to use. + */ CorrelationMethod _corrMethod {CorrelationMethod::Pearson}; - Pairwise::Clustering* _clusModel {nullptr}; - Pairwise::Correlation* _corrModel {new Pairwise::Pearson()}; + /*! + * The name of the correlation method. + */ + QString _corrName; + /*! + * The minimum number of clean samples required to consider a pair. + */ int _minSamples {30}; + /*! + * The minimum expression value required to include a sample. + */ float _minExpression {-std::numeric_limits::infinity()}; + /*! + * The minimum number of clusters to use in the clustering model. 
+ */ qint8 _minClusters {1}; + /*! + * The maximum number of clusters to use in the clustering model. + */ qint8 _maxClusters {5}; + /*! + * The model selection criterion to use in the clustering model. + */ Pairwise::Criterion _criterion {Pairwise::Criterion::ICL}; + /*! + * Whether to remove outliers before clustering. + */ bool _removePreOutliers {false}; + /*! + * Whether to remove outliers after clustering. + */ bool _removePostOutliers {false}; + /*! + * The minimum (absolute) correlation threshold to save a correlation. + */ float _minCorrelation {0.5}; + /*! + * The maximum (absolute) correlation threshold to save a correlation. + */ float _maxCorrelation {1.0}; - int _kernelSize {4096}; + /*! + * The number of pairs to process in each work block. + */ + int _workBlockSize {0}; + /*! + * The global work size for each OpenCL worker. + */ + int _globalWorkSize {4096}; + /*! + * The local work size for each OpenCL worker. + */ + int _localWorkSize {0}; }; diff --git a/src/core/similarity_input.cpp b/src/core/similarity_input.cpp index ac2140c..757c270 100644 --- a/src/core/similarity_input.cpp +++ b/src/core/similarity_input.cpp @@ -1,20 +1,20 @@ #include "similarity_input.h" #include "datafactory.h" -#include "pairwise_gmm.h" -#include "pairwise_kmeans.h" -#include "pairwise_pearson.h" -#include "pairwise_spearman.h" +/*! + * String list of clustering methods for this analytic that correspond exactly + * to its enumeration. Used for handling the clustering method argument for this + * input object. + */ const QStringList Similarity::Input::CLUSTERING_NAMES { "none" ,"gmm" - ,"kmeans" }; @@ -22,6 +22,11 @@ const QStringList Similarity::Input::CLUSTERING_NAMES +/*! + * String list of correlation methods for this analytic that correspond exactly + * to its enumeration. Used for handling the correlation method argument for this + * input object. 
+ */ const QStringList Similarity::Input::CORRELATION_NAMES { "pearson" @@ -33,9 +38,15 @@ const QStringList Similarity::Input::CORRELATION_NAMES +/*! + * String list of criterion options for this analytic that correspond exactly + * to its enumeration. Used for handling the criterion argument for this input + * object. + */ const QStringList Similarity::Input::CRITERION_NAMES { - "BIC" + "AIC" + ,"BIC" ,"ICL" }; @@ -44,10 +55,16 @@ const QStringList Similarity::Input::CRITERION_NAMES +/*! + * Construct a new input object with the given analytic as its parent. + * + * @param parent + */ Similarity::Input::Input(Similarity* parent): EAbstractAnalytic::Input(parent), _base(parent) { + EDEBUG_FUNC(this,parent); } @@ -55,8 +72,13 @@ Similarity::Input::Input(Similarity* parent): +/*! + * Return the total number of arguments this analytic type contains. + */ int Similarity::Input::size() const { + EDEBUG_FUNC(this); + return Total; } @@ -65,8 +87,15 @@ int Similarity::Input::size() const +/*! + * Return the argument type for a given index. + * + * @param index + */ EAbstractAnalytic::Input::Type Similarity::Input::type(int index) const { + EDEBUG_FUNC(this,index); + switch (index) { case InputData: return Type::DataIn; @@ -83,7 +112,9 @@ EAbstractAnalytic::Input::Type Similarity::Input::type(int index) const case RemovePostOutliers: return Type::Boolean; case MinCorrelation: return Type::Double; case MaxCorrelation: return Type::Double; - case KernelSize: return Type::Integer; + case WorkBlockSize: return Type::Integer; + case GlobalWorkSize: return Type::Integer; + case LocalWorkSize: return Type::Integer; default: return Type::Boolean; } } @@ -93,8 +124,16 @@ EAbstractAnalytic::Input::Type Similarity::Input::type(int index) const +/*! + * Return data for a given role on an argument with the given index. 
+ * + * @param index + * @param role + */ QVariant Similarity::Input::data(int index, Role role) const { + EDEBUG_FUNC(this,index,role); + switch (index) { case InputData: @@ -238,17 +277,39 @@ QVariant Similarity::Input::data(int index, Role role) const case Role::Maximum: return 1; default: return QVariant(); } - case KernelSize: + case WorkBlockSize: switch (role) { - case Role::CommandLineName: return QString("ksize"); - case Role::Title: return tr("Kernel Size:"); - case Role::WhatsThis: return tr("(OpenCL) Total number of kernels to run per block."); + case Role::CommandLineName: return QString("bsize"); + case Role::Title: return tr("Work Block Size:"); + case Role::WhatsThis: return tr("Number of pairs to process in each work block."); + case Role::Default: return 0; + case Role::Minimum: return 0; + case Role::Maximum: return std::numeric_limits<int>::max(); + default: return QVariant(); + } + case GlobalWorkSize: + switch (role) + { + case Role::CommandLineName: return QString("gsize"); + case Role::Title: return tr("Global Work Size:"); + case Role::WhatsThis: return tr("The global work size for each OpenCL worker."); case Role::Default: return 4096; case Role::Minimum: return 1; case Role::Maximum: return std::numeric_limits<int>::max(); default: return QVariant(); } + case LocalWorkSize: + switch (role) + { + case Role::CommandLineName: return QString("lsize"); + case Role::Title: return tr("Local Work Size:"); + case Role::WhatsThis: return tr("The local work size for each OpenCL worker."); + case Role::Default: return 0; + case Role::Minimum: return 0; + case Role::Maximum: return std::numeric_limits<int>::max(); + default: return QVariant(); + } default: return QVariant(); } } @@ -258,38 +319,24 @@ QVariant Similarity::Input::data(int index, Role role) const +/*! + * Set an argument with the given index to the given value.
+ * + * @param index + * @param value + */ void Similarity::Input::set(int index, const QVariant& value) { + EDEBUG_FUNC(this,index,&value); + switch (index) { case ClusteringType: _base->_clusMethod = static_cast<ClusteringMethod>(CLUSTERING_NAMES.indexOf(value.toString())); - - switch ( _base->_clusMethod ) - { - case ClusteringMethod::None: - _base->_clusModel = nullptr; - break; - case ClusteringMethod::GMM: - _base->_clusModel = new Pairwise::GMM(); - break; - case ClusteringMethod::KMeans: - _base->_clusModel = new Pairwise::KMeans(); - break; - } break; case CorrelationType: _base->_corrMethod = static_cast<CorrelationMethod>(CORRELATION_NAMES.indexOf(value.toString())); - - switch ( _base->_corrMethod ) - { - case CorrelationMethod::Pearson: - _base->_corrModel = new Pairwise::Pearson(); - break; - case CorrelationMethod::Spearman: - _base->_corrModel = new Pairwise::Spearman(); - break; - } + _base->_corrName = value.toString(); break; case MinExpression: _base->_minExpression = value.toDouble(); @@ -318,8 +365,14 @@ void Similarity::Input::set(int index, const QVariant& value) case MaxCorrelation: _base->_maxCorrelation = value.toDouble(); break; - case KernelSize: - _base->_kernelSize = value.toInt(); + case WorkBlockSize: + _base->_workBlockSize = value.toInt(); + break; + case GlobalWorkSize: + _base->_globalWorkSize = value.toInt(); + break; + case LocalWorkSize: + _base->_localWorkSize = value.toInt(); break; } } @@ -329,10 +382,16 @@ void Similarity::Input::set(int index, const QVariant& value) -void Similarity::Input::set(int index, QFile* file) +/*! + * Set a file argument with the given index to the given qt file pointer. This + * implementation does nothing because this analytic has no file arguments. + * + * @param index + * @param file + */ +void Similarity::Input::set(int, QFile*) { - Q_UNUSED(index) - Q_UNUSED(file) + EDEBUG_FUNC(this); } @@ -340,8 +399,16 @@ void Similarity::Input::set(int index, QFile* file) +/*!
+ * Set a data argument with the given index to the given data object pointer. + * + * @param index + * @param data + */ void Similarity::Input::set(int index, EAbstractData *data) { + EDEBUG_FUNC(this,index,data); + switch (index) { case InputData: diff --git a/src/core/similarity_input.h b/src/core/similarity_input.h index f5ea425..90ffe7f 100644 --- a/src/core/similarity_input.h +++ b/src/core/similarity_input.h @@ -4,10 +4,16 @@ +/*! + * This class implements the abstract input of the similarity analytic. + */ class Similarity::Input : public EAbstractAnalytic::Input { Q_OBJECT public: + /*! + * Defines all arguments for its parent analytic. + */ enum Argument { InputData = 0 @@ -24,7 +30,9 @@ class Similarity::Input : public EAbstractAnalytic::Input ,RemovePostOutliers ,MinCorrelation ,MaxCorrelation - ,KernelSize + ,WorkBlockSize + ,GlobalWorkSize + ,LocalWorkSize ,Total }; explicit Input(Similarity* parent); @@ -34,12 +42,13 @@ class Similarity::Input : public EAbstractAnalytic::Input virtual void set(int index, const QVariant& value) override final; virtual void set(int index, QFile* file) override final; virtual void set(int index, EAbstractData* data) override final; - private: static const QStringList CLUSTERING_NAMES; static const QStringList CORRELATION_NAMES; static const QStringList CRITERION_NAMES; - + /*! + * Pointer to the base analytic for this object. + */ Similarity* _base; }; diff --git a/src/core/similarity_opencl.cpp b/src/core/similarity_opencl.cpp index 41a6833..678b466 100644 --- a/src/core/similarity_opencl.cpp +++ b/src/core/similarity_opencl.cpp @@ -1,4 +1,5 @@ #include "similarity_opencl.h" +#include #include "similarity_opencl_worker.h" @@ -10,10 +11,16 @@ using namespace std; +/*! + * Construct a new OpenCL object with the given analytic as its parent. 
+ * + * @param parent + */ Similarity::OpenCL::OpenCL(Similarity* parent): EAbstractAnalytic::OpenCL(parent), _base(parent) { + EDEBUG_FUNC(this,parent); } @@ -21,8 +28,13 @@ Similarity::OpenCL::OpenCL(Similarity* parent): +/*! + * Create and return a new OpenCL worker for the analytic. + */ std::unique_ptr Similarity::OpenCL::makeWorker() { + EDEBUG_FUNC(this); + return unique_ptr(new Worker(_base, this, _context, _program)); } @@ -31,8 +43,15 @@ std::unique_ptr Similarity::OpenCL::makeWorke +/*! + * Initializes all OpenCL resources used by this object's implementation. + * + * @param context + */ void Similarity::OpenCL::initialize(::OpenCL::Context* context) { + EDEBUG_FUNC(this,context); + // create list of opencl source files QStringList paths { ":/opencl/linalg.cl", @@ -40,7 +59,6 @@ void Similarity::OpenCL::initialize(::OpenCL::Context* context) ":/opencl/sort.cl", ":/opencl/outlier.cl", ":/opencl/gmm.cl", - ":/opencl/kmeans.cl", ":/opencl/pearson.cl", ":/opencl/spearman.cl" }; @@ -53,17 +71,15 @@ void Similarity::OpenCL::initialize(::OpenCL::Context* context) _queue = new ::OpenCL::CommandQueue(context, context->devices().first(), this); // create buffer for expression data - _expressions = ::OpenCL::Buffer(context, _base->_input->getRawSize()); - - unique_ptr rawData(_base->_input->dumpRawData()); - ExpressionMatrix::Expression* rawDataRef {rawData.get()}; + QVector rawData = _base->_input->dumpRawData(); + _expressions = ::OpenCL::Buffer(context,rawData.size()); // copy expression data to device _expressions.mapWrite(_queue).wait(); - for ( int i = 0; i < _base->_input->getRawSize(); ++i ) + for (int i = 0; i < rawData.size() ; ++i ) { - _expressions[i] = rawDataRef[i]; + _expressions[i] = rawData[i]; } _expressions.unmap(_queue).wait(); diff --git a/src/core/similarity_opencl.h b/src/core/similarity_opencl.h index f5aab68..ff4e18b 100644 --- a/src/core/similarity_opencl.h +++ b/src/core/similarity_opencl.h @@ -6,13 +6,16 @@ +/*! 
+ * This class implements the base OpenCL class of the similarity analytic. + */ class Similarity::OpenCL : public EAbstractAnalytic::OpenCL { Q_OBJECT public: class FetchPair; class GMM; - class KMeans; + class Outlier; class Pearson; class Spearman; class Worker; @@ -20,11 +23,25 @@ class Similarity::OpenCL : public EAbstractAnalytic::OpenCL virtual std::unique_ptr makeWorker() override final; virtual void initialize(::OpenCL::Context* context) override final; private: + /*! + * Pointer to the base analytic for this object. + */ Similarity* _base; + /*! + * Pointer to this object's base OpenCL context used to create all other resources. + */ ::OpenCL::Context* _context {nullptr}; + /*! + * Pointer to this object's OpenCL program. + */ ::OpenCL::Program* _program {nullptr}; + /*! + * Pointer to this object's OpenCL command queue. + */ ::OpenCL::CommandQueue* _queue {nullptr}; - + /*! + * Pointer to this object's OpenCL buffer for the expression matrix. + */ ::OpenCL::Buffer _expressions; }; diff --git a/src/core/similarity_opencl_fetchpair.cpp b/src/core/similarity_opencl_fetchpair.cpp index b752165..4f53068 100644 --- a/src/core/similarity_opencl_fetchpair.cpp +++ b/src/core/similarity_opencl_fetchpair.cpp @@ -9,9 +9,17 @@ using namespace std; +/*! + * Construct a new fetch-pair kernel object with the given OpenCL program and + * qt parent. + * + * @param program + * @param parent + */ Similarity::OpenCL::FetchPair::FetchPair(::OpenCL::Program* program, QObject* parent): ::OpenCL::Kernel(program, "fetchPair", parent) { + EDEBUG_FUNC(this,program,parent); } @@ -19,22 +27,52 @@ Similarity::OpenCL::FetchPair::FetchPair(::OpenCL::Program* program, QObject* pa +/*! + * Execute this kernel object's OpenCL kernel using the given OpenCL command + * queue and kernel arguments, returning the OpenCL event associated with the + * kernel execution. 
+ * + * @param queue + * @param globalWorkSize + * @param localWorkSize + * @param expressions + * @param sampleSize + * @param in_index + * @param minExpression + * @param out_X + * @param out_N + * @param out_labels + */ ::OpenCL::Event Similarity::OpenCL::FetchPair::execute( ::OpenCL::CommandQueue* queue, - int kernelSize, + int globalWorkSize, + int localWorkSize, ::OpenCL::Buffer* expressions, cl_int sampleSize, ::OpenCL::Buffer* in_index, cl_int minExpression, - ::OpenCL::Buffer* out_X, + ::OpenCL::Buffer* out_X, ::OpenCL::Buffer* out_N, ::OpenCL::Buffer* out_labels ) { + EDEBUG_FUNC(this, + queue, + globalWorkSize, + localWorkSize, + expressions, + sampleSize, + in_index, + minExpression, + out_X, + out_N, + out_labels); + // acquire lock for this kernel Locker locker {lock()}; // set kernel arguments + setArgument(GlobalWorkSize, globalWorkSize); setBuffer(Expressions, expressions); setArgument(SampleSize, sampleSize); setBuffer(InIndex, in_index); @@ -43,8 +81,15 @@ ::OpenCL::Event Similarity::OpenCL::FetchPair::execute( setBuffer(OutN, out_N); setBuffer(OutLabels, out_labels); - // set kernel sizes - setSizes(0, kernelSize, min(kernelSize, maxWorkGroupSize(queue->device()))); + // set work sizes + if ( localWorkSize == 0 ) + { + localWorkSize = min(globalWorkSize, maxWorkGroupSize(queue->device())); + } + + int numWorkgroups = (globalWorkSize + localWorkSize - 1) / localWorkSize; + + setSizes(0, numWorkgroups * localWorkSize, localWorkSize); // execute kernel return ::OpenCL::Kernel::execute(queue); diff --git a/src/core/similarity_opencl_fetchpair.h b/src/core/similarity_opencl_fetchpair.h index bdab41e..04de1fd 100644 --- a/src/core/similarity_opencl_fetchpair.h +++ b/src/core/similarity_opencl_fetchpair.h @@ -4,13 +4,22 @@ +/*! + * This class implements the fetch-pair kernel for the similarity analytic. 
This + * kernel takes a list of pairwise indices and computes the pairwise data, the + * number of clean samples, and the initial sample labels for each pair. + */ class Similarity::OpenCL::FetchPair : public ::OpenCL::Kernel { Q_OBJECT public: + /*! + * Defines the arguments passed to the OpenCL kernel. + */ enum Argument { - Expressions + GlobalWorkSize + ,Expressions ,SampleSize ,InIndex ,MinExpression @@ -21,12 +30,13 @@ class Similarity::OpenCL::FetchPair : public ::OpenCL::Kernel explicit FetchPair(::OpenCL::Program* program, QObject* parent = nullptr); ::OpenCL::Event execute( ::OpenCL::CommandQueue* queue, - int kernelSize, + int globalWorkSize, + int localWorkSize, ::OpenCL::Buffer* expressions, cl_int sampleSize, ::OpenCL::Buffer* in_index, cl_int minExpression, - ::OpenCL::Buffer* out_X, + ::OpenCL::Buffer* out_X, ::OpenCL::Buffer* out_N, ::OpenCL::Buffer* out_labels ); diff --git a/src/core/similarity_opencl_gmm.cpp b/src/core/similarity_opencl_gmm.cpp index 31c330d..e18f562 100644 --- a/src/core/similarity_opencl_gmm.cpp +++ b/src/core/similarity_opencl_gmm.cpp @@ -9,9 +9,16 @@ using namespace std; +/*! + * Construct a new GMM kernel object with the given OpenCL program and qt parent. + * + * @param program + * @param parent + */ Similarity::OpenCL::GMM::GMM(::OpenCL::Program* program, QObject* parent): ::OpenCL::Kernel(program, "GMM_compute", parent) { + EDEBUG_FUNC(this,program,parent); } @@ -19,42 +26,81 @@ Similarity::OpenCL::GMM::GMM(::OpenCL::Program* program, QObject* parent): +/*! + * Execute this kernel object's OpenCL kernel using the given OpenCL command + * queue and kernel arguments, returning the OpenCL event associated with the + * kernel execution. 
+ * + * @param queue + * @param globalWorkSize + * @param localWorkSize + * @param sampleSize + * @param minSamples + * @param minClusters + * @param maxClusters + * @param criterion + * @param work_X + * @param work_N + * @param work_labels + * @param work_components + * @param work_MP + * @param work_counts + * @param work_logpi + * @param work_gamma + * @param out_K + * @param out_labels + */ ::OpenCL::Event Similarity::OpenCL::GMM::execute( ::OpenCL::CommandQueue* queue, - int kernelSize, - ::OpenCL::Buffer* expressions, + int globalWorkSize, + int localWorkSize, cl_int sampleSize, cl_int minSamples, cl_char minClusters, cl_char maxClusters, - Pairwise::Criterion criterion, - cl_int removePreOutliers, - cl_int removePostOutliers, - ::OpenCL::Buffer* work_X, + cl_int criterion, + ::OpenCL::Buffer* work_X, ::OpenCL::Buffer* work_N, ::OpenCL::Buffer* work_labels, - ::OpenCL::Buffer* work_components, - ::OpenCL::Buffer* work_MP, + ::OpenCL::Buffer* work_components, + ::OpenCL::Buffer* work_MP, ::OpenCL::Buffer* work_counts, ::OpenCL::Buffer* work_logpi, - ::OpenCL::Buffer* work_loggamma, - ::OpenCL::Buffer* work_logGamma, + ::OpenCL::Buffer* work_gamma, ::OpenCL::Buffer* out_K, ::OpenCL::Buffer* out_labels ) { + EDEBUG_FUNC(this, + queue, + globalWorkSize, + localWorkSize, + sampleSize, + minSamples, + minClusters, + maxClusters, + &criterion, + work_X, + work_N, + work_labels, + work_components, + work_MP, + work_counts, + work_logpi, + work_gamma, + out_K, + out_labels); + // acquire lock for this kernel Locker locker {lock()}; // set kernel arguments - setBuffer(Expressions, expressions); + setArgument(GlobalWorkSize, globalWorkSize); setArgument(SampleSize, sampleSize); setArgument(MinSamples, minSamples); setArgument(MinClusters, minClusters); setArgument(MaxClusters, maxClusters); setArgument(Criterion, criterion); - setArgument(RemovePreOutliers, removePreOutliers); - setArgument(RemovePostOutliers, removePostOutliers); setBuffer(WorkX, work_X); 
setBuffer(WorkN, work_N); setBuffer(WorkLabels, work_labels); @@ -62,13 +108,19 @@ ::OpenCL::Event Similarity::OpenCL::GMM::execute( setBuffer(WorkMP, work_MP); setBuffer(WorkCounts, work_counts); setBuffer(WorkLogPi, work_logpi); - setBuffer(WorkLoggamma, work_loggamma); - setBuffer(WorkLogGamma, work_logGamma); + setBuffer(WorkGamma, work_gamma); setBuffer(OutK, out_K); setBuffer(OutLabels, out_labels); - // set kernel sizes - setSizes(0, kernelSize, min(kernelSize, maxWorkGroupSize(queue->device()))); + // set work sizes + if ( localWorkSize == 0 ) + { + localWorkSize = min(globalWorkSize, maxWorkGroupSize(queue->device())); + } + + int numWorkgroups = (globalWorkSize + localWorkSize - 1) / localWorkSize; + + setSizes(0, numWorkgroups * localWorkSize, localWorkSize); // execute kernel return ::OpenCL::Kernel::execute(queue); diff --git a/src/core/similarity_opencl_gmm.h b/src/core/similarity_opencl_gmm.h index e9a537c..16a5809 100644 --- a/src/core/similarity_opencl_gmm.h +++ b/src/core/similarity_opencl_gmm.h @@ -4,20 +4,40 @@ +typedef struct +{ + cl_float pi; + cl_float2 mu; + cl_float4 sigma; + cl_float4 sigmaInv; + cl_float normalizer; +} cl_component; + + + + + + +/*! + * This class implements the GMM kernel for the similarity analytic. This + * kernel takes a list of pairwise data arrays and computes the number of + * clusters and a list of cluster labels for each pair. + */ class Similarity::OpenCL::GMM : public ::OpenCL::Kernel { Q_OBJECT public: + /*! + * Defines the arguments passed to the OpenCL kernel. 
+ */ enum Argument { - Expressions + GlobalWorkSize ,SampleSize ,MinSamples ,MinClusters ,MaxClusters ,Criterion - ,RemovePreOutliers - ,RemovePostOutliers ,WorkX ,WorkN ,WorkLabels @@ -25,32 +45,28 @@ class Similarity::OpenCL::GMM : public ::OpenCL::Kernel ,WorkMP ,WorkCounts ,WorkLogPi - ,WorkLoggamma - ,WorkLogGamma + ,WorkGamma ,OutK ,OutLabels }; explicit GMM(::OpenCL::Program* program, QObject* parent = nullptr); ::OpenCL::Event execute( ::OpenCL::CommandQueue* queue, - int kernelSize, - ::OpenCL::Buffer* expressions, + int globalWorkSize, + int localWorkSize, cl_int sampleSize, cl_int minSamples, cl_char minClusters, cl_char maxClusters, - Pairwise::Criterion criterion, - cl_int removePreOutliers, - cl_int removePostOutliers, - ::OpenCL::Buffer* work_X, + cl_int criterion, + ::OpenCL::Buffer* work_X, ::OpenCL::Buffer* work_N, ::OpenCL::Buffer* work_labels, - ::OpenCL::Buffer* work_components, - ::OpenCL::Buffer* work_MP, + ::OpenCL::Buffer* work_components, + ::OpenCL::Buffer* work_MP, ::OpenCL::Buffer* work_counts, ::OpenCL::Buffer* work_logpi, - ::OpenCL::Buffer* work_loggamma, - ::OpenCL::Buffer* work_logGamma, + ::OpenCL::Buffer* work_gamma, ::OpenCL::Buffer* out_K, ::OpenCL::Buffer* out_labels ); diff --git a/src/core/similarity_opencl_kmeans.cpp b/src/core/similarity_opencl_kmeans.cpp deleted file mode 100644 index 7f43642..0000000 --- a/src/core/similarity_opencl_kmeans.cpp +++ /dev/null @@ -1,65 +0,0 @@ -#include "similarity_opencl_kmeans.h" - - - -using namespace std; - - - - - - -Similarity::OpenCL::KMeans::KMeans(::OpenCL::Program* program, QObject* parent): - ::OpenCL::Kernel(program, "KMeans_compute", parent) -{ -} - - - - - - -::OpenCL::Event Similarity::OpenCL::KMeans::execute( - ::OpenCL::CommandQueue* queue, - int kernelSize, - ::OpenCL::Buffer* expressions, - cl_int sampleSize, - cl_int minSamples, - cl_char minClusters, - cl_char maxClusters, - cl_int removePreOutliers, - cl_int removePostOutliers, - ::OpenCL::Buffer* work_X, - 
::OpenCL::Buffer* work_N, - ::OpenCL::Buffer* work_outlier, - ::OpenCL::Buffer* work_labels, - ::OpenCL::Buffer* work_means, - ::OpenCL::Buffer* out_K, - ::OpenCL::Buffer* out_labels -) -{ - // acquire lock for this kernel - Locker locker {lock()}; - - // set kernel arguments - setBuffer(Expressions, expressions); - setArgument(SampleSize, sampleSize); - setArgument(MinSamples, minSamples); - setArgument(MinClusters, minClusters); - setArgument(MaxClusters, maxClusters); - setArgument(RemovePreOutliers, removePreOutliers); - setArgument(RemovePostOutliers, removePostOutliers); - setBuffer(WorkX, work_X); - setBuffer(WorkN, work_N); - setBuffer(WorkOutlier, work_outlier); - setBuffer(WorkLabels, work_labels); - setBuffer(WorkMeans, work_means); - setBuffer(OutK, out_K); - setBuffer(OutLabels, out_labels); - - // set kernel sizes - setSizes(0, kernelSize, min(kernelSize, maxWorkGroupSize(queue->device()))); - - // execute kernel - return ::OpenCL::Kernel::execute(queue); -} diff --git a/src/core/similarity_opencl_kmeans.h b/src/core/similarity_opencl_kmeans.h deleted file mode 100644 index 019624c..0000000 --- a/src/core/similarity_opencl_kmeans.h +++ /dev/null @@ -1,51 +0,0 @@ -#ifndef SIMILARITY_OPENCL_KMEANS_H -#define SIMILARITY_OPENCL_KMEANS_H -#include "similarity_opencl.h" - - - -class Similarity::OpenCL::KMeans : public ::OpenCL::Kernel -{ - Q_OBJECT -public: - enum Argument - { - Expressions - ,SampleSize - ,MinSamples - ,MinClusters - ,MaxClusters - ,RemovePreOutliers - ,RemovePostOutliers - ,WorkX - ,WorkN - ,WorkOutlier - ,WorkLabels - ,WorkMeans - ,OutK - ,OutLabels - }; - explicit KMeans(::OpenCL::Program* program, QObject* parent = nullptr); - ::OpenCL::Event execute( - ::OpenCL::CommandQueue* queue, - int kernelSize, - ::OpenCL::Buffer* expressions, - cl_int sampleSize, - cl_int minSamples, - cl_char minClusters, - cl_char maxClusters, - cl_int removePreOutliers, - cl_int removePostOutliers, - ::OpenCL::Buffer* work_X, - ::OpenCL::Buffer* work_N, - 
::OpenCL::Buffer* work_outlier, - ::OpenCL::Buffer* work_labels, - ::OpenCL::Buffer* work_means, - ::OpenCL::Buffer* out_K, - ::OpenCL::Buffer* out_labels - ); -}; - - - -#endif diff --git a/src/core/similarity_opencl_outlier.cpp b/src/core/similarity_opencl_outlier.cpp new file mode 100644 index 0000000..908ec62 --- /dev/null +++ b/src/core/similarity_opencl_outlier.cpp @@ -0,0 +1,99 @@ +#include "similarity_opencl_outlier.h" + + + +using namespace std; + + + + + + +/*! + * Construct a new Outlier kernel object with the given OpenCL program and qt parent. + * + * @param program + * @param parent + */ +Similarity::OpenCL::Outlier::Outlier(::OpenCL::Program* program, QObject* parent): + ::OpenCL::Kernel(program, "removeOutliers", parent) +{ + EDEBUG_FUNC(this,program,parent); +} + + + + + + +/*! + * Execute this kernel object's OpenCL kernel using the given OpenCL command + * queue and kernel arguments, returning the OpenCL event associated with the + * kernel execution. + * + * @param queue + * @param globalWorkSize + * @param localWorkSize + * @param in_data + * @param in_N + * @param in_labels + * @param sampleSize + * @param in_K + * @param marker + * @param work_x + * @param work_y + */ +::OpenCL::Event Similarity::OpenCL::Outlier::execute( + ::OpenCL::CommandQueue* queue, + int globalWorkSize, + int localWorkSize, + ::OpenCL::Buffer* in_data, + ::OpenCL::Buffer* in_N, + ::OpenCL::Buffer* in_labels, + cl_int sampleSize, + ::OpenCL::Buffer* in_K, + cl_char marker, + ::OpenCL::Buffer* work_x, + ::OpenCL::Buffer* work_y +) +{ + EDEBUG_FUNC(this, + queue, + globalWorkSize, + localWorkSize, + in_data, + in_N, + in_labels, + sampleSize, + in_K, + marker, + work_x, + work_y); + + // acquire lock for this kernel + Locker locker {lock()}; + + // set kernel arguments + setArgument(GlobalWorkSize, globalWorkSize); + setBuffer(InData, in_data); + setBuffer(InN, in_N); + setBuffer(InLabels, in_labels); + setArgument(SampleSize, sampleSize); + setBuffer(InK, in_K); + 
setArgument(Marker, marker); + setBuffer(WorkX, work_x); + setBuffer(WorkY, work_y); + + // set work sizes + if ( localWorkSize == 0 ) + { + localWorkSize = min(globalWorkSize, maxWorkGroupSize(queue->device())); + } + + int numWorkgroups = (globalWorkSize + localWorkSize - 1) / localWorkSize; + + setSizes(0, numWorkgroups * localWorkSize, localWorkSize); + + // execute kernel + return ::OpenCL::Kernel::execute(queue); +} diff --git a/src/core/similarity_opencl_outlier.h b/src/core/similarity_opencl_outlier.h new file mode 100644 index 0000000..26894d6 --- /dev/null +++ b/src/core/similarity_opencl_outlier.h @@ -0,0 +1,47 @@ +#ifndef SIMILARITY_OPENCL_OUTLIER_H +#define SIMILARITY_OPENCL_OUTLIER_H +#include "similarity_opencl.h" + + + +/*! + * This class implements the outlier removal kernel for the similarity analytic. + */ +class Similarity::OpenCL::Outlier : public ::OpenCL::Kernel +{ + Q_OBJECT +public: + /*! + * Defines the arguments passed to the OpenCL kernel. + */ + enum Argument + { + GlobalWorkSize + ,InData + ,InN + ,InLabels + ,SampleSize + ,InK + ,Marker + ,WorkX + ,WorkY + }; + explicit Outlier(::OpenCL::Program* program, QObject* parent = nullptr); + ::OpenCL::Event execute( + ::OpenCL::CommandQueue* queue, + int globalWorkSize, + int localWorkSize, + ::OpenCL::Buffer* in_data, + ::OpenCL::Buffer* in_N, + ::OpenCL::Buffer* in_labels, + cl_int sampleSize, + ::OpenCL::Buffer* in_K, + cl_char marker, + ::OpenCL::Buffer* work_x, + ::OpenCL::Buffer* work_y + ); +}; + + + +#endif diff --git a/src/core/similarity_opencl_pearson.cpp b/src/core/similarity_opencl_pearson.cpp index 640993c..3caff15 100644 --- a/src/core/similarity_opencl_pearson.cpp +++ b/src/core/similarity_opencl_pearson.cpp @@ -9,9 +9,17 @@ using namespace std; +/*! + * Construct a new Pearson kernel object with the given OpenCL program and + * qt parent. 
+ * + * @param program + * @param parent + */ Similarity::OpenCL::Pearson::Pearson(::OpenCL::Program* program, QObject* parent): ::OpenCL::Kernel(program, "Pearson_compute", parent) { + EDEBUG_FUNC(this,program,parent); } @@ -19,10 +27,26 @@ Similarity::OpenCL::Pearson::Pearson(::OpenCL::Program* program, QObject* parent +/*! + * Execute this kernel object's OpenCL kernel using the given OpenCL command + * queue and kernel arguments, returning the OpenCL event associated with the + * kernel execution. + * + * @param queue + * @param globalWorkSize + * @param localWorkSize + * @param in_data + * @param clusterSize + * @param in_labels + * @param sampleSize + * @param minSamples + * @param out_correlations + */ ::OpenCL::Event Similarity::OpenCL::Pearson::execute( ::OpenCL::CommandQueue* queue, - int kernelSize, - ::OpenCL::Buffer* in_data, + int globalWorkSize, + int localWorkSize, + ::OpenCL::Buffer* in_data, cl_char clusterSize, ::OpenCL::Buffer* in_labels, cl_int sampleSize, @@ -30,10 +54,22 @@ ::OpenCL::Event Similarity::OpenCL::Pearson::execute( ::OpenCL::Buffer* out_correlations ) { + EDEBUG_FUNC(this, + queue, + globalWorkSize, + localWorkSize, + in_data, + clusterSize, + in_labels, + sampleSize, + minSamples, + out_correlations); + // acquire lock for this kernel Locker locker {lock()}; // set kernel arguments + setArgument(GlobalWorkSize, globalWorkSize); setBuffer(InData, in_data); setArgument(ClusterSize, clusterSize); setBuffer(InLabels, in_labels); @@ -41,8 +77,15 @@ ::OpenCL::Event Similarity::OpenCL::Pearson::execute( setArgument(MinSamples, minSamples); setBuffer(OutCorrelations, out_correlations); - // set kernel sizes - setSizes(0, kernelSize, min(kernelSize, maxWorkGroupSize(queue->device()))); + // set work sizes + if ( localWorkSize == 0 ) + { + localWorkSize = min(globalWorkSize, maxWorkGroupSize(queue->device())); + } + + int numWorkgroups = (globalWorkSize + localWorkSize - 1) / localWorkSize; + + setSizes(0, numWorkgroups * localWorkSize, 
localWorkSize); // execute kernel return ::OpenCL::Kernel::execute(queue); diff --git a/src/core/similarity_opencl_pearson.h b/src/core/similarity_opencl_pearson.h index f54bc0f..93824c6 100644 --- a/src/core/similarity_opencl_pearson.h +++ b/src/core/similarity_opencl_pearson.h @@ -4,13 +4,22 @@ +/*! + * This class implements the Pearson kernel for the similarity analytic. This + * kernel takes a list of pairwise data arrays (with cluster labels) and computes + * the Pearson correlation for each cluster in each pair. + */ class Similarity::OpenCL::Pearson : public ::OpenCL::Kernel { Q_OBJECT public: + /*! + * Defines the arguments passed to the OpenCL kernel. + */ enum Argument { - InData + GlobalWorkSize + ,InData ,ClusterSize ,InLabels ,SampleSize @@ -20,8 +29,9 @@ class Similarity::OpenCL::Pearson : public ::OpenCL::Kernel explicit Pearson(::OpenCL::Program* program, QObject* parent = nullptr); ::OpenCL::Event execute( ::OpenCL::CommandQueue* queue, - int kernelSize, - ::OpenCL::Buffer* in_data, + int globalWorkSize, + int localWorkSize, + ::OpenCL::Buffer* in_data, cl_char clusterSize, ::OpenCL::Buffer* in_labels, cl_int sampleSize, diff --git a/src/core/similarity_opencl_spearman.cpp b/src/core/similarity_opencl_spearman.cpp index 44bf263..24c5903 100644 --- a/src/core/similarity_opencl_spearman.cpp +++ b/src/core/similarity_opencl_spearman.cpp @@ -9,9 +9,17 @@ using namespace std; +/*! + * Construct a new Spearman kernel object with the given OpenCL program and + * qt parent. + * + * @param program + * @param parent + */ Similarity::OpenCL::Spearman::Spearman(::OpenCL::Program* program, QObject* parent): ::OpenCL::Kernel(program, "Spearman_compute", parent) { + EDEBUG_FUNC(this,program,parent); } @@ -19,10 +27,29 @@ Similarity::OpenCL::Spearman::Spearman(::OpenCL::Program* program, QObject* pare +/*! 
+ * Execute this kernel object's OpenCL kernel using the given OpenCL command + * queue and kernel arguments, returning the OpenCL event associated with the + * kernel execution. + * + * @param queue + * @param globalWorkSize + * @param localWorkSize + * @param in_data + * @param clusterSize + * @param in_labels + * @param sampleSize + * @param minSamples + * @param work_x + * @param work_y + * @param work_rank + * @param out_correlations + */ ::OpenCL::Event Similarity::OpenCL::Spearman::execute( ::OpenCL::CommandQueue* queue, - int kernelSize, - ::OpenCL::Buffer* in_data, + int globalWorkSize, + int localWorkSize, + ::OpenCL::Buffer* in_data, cl_char clusterSize, ::OpenCL::Buffer* in_labels, cl_int sampleSize, @@ -33,10 +60,25 @@ ::OpenCL::Event Similarity::OpenCL::Spearman::execute( ::OpenCL::Buffer* out_correlations ) { + EDEBUG_FUNC(this, + queue, + globalWorkSize, + localWorkSize, + in_data, + clusterSize, + in_labels, + sampleSize, + minSamples, + work_x, + work_y, + work_rank, + out_correlations); + // acquire lock for this kernel Locker locker {lock()}; // set kernel arguments + setArgument(GlobalWorkSize, globalWorkSize); setBuffer(InData, in_data); setArgument(ClusterSize, clusterSize); setBuffer(InLabels, in_labels); @@ -47,8 +89,15 @@ ::OpenCL::Event Similarity::OpenCL::Spearman::execute( setBuffer(WorkRank, work_rank); setBuffer(OutCorrelations, out_correlations); - // set kernel sizes - setSizes(0, kernelSize, min(kernelSize, maxWorkGroupSize(queue->device()))); + // set work sizes + if ( localWorkSize == 0 ) + { + localWorkSize = min(globalWorkSize, maxWorkGroupSize(queue->device())); + } + + int numWorkgroups = (globalWorkSize + localWorkSize - 1) / localWorkSize; + + setSizes(0, numWorkgroups * localWorkSize, localWorkSize); // execute kernel return ::OpenCL::Kernel::execute(queue); diff --git a/src/core/similarity_opencl_spearman.h b/src/core/similarity_opencl_spearman.h index d1a198e..e1d6693 100644 --- a/src/core/similarity_opencl_spearman.h 
+++ b/src/core/similarity_opencl_spearman.h @@ -4,13 +4,22 @@ +/*! + * This class implements the Spearman kernel for the similarity analytic. This + * kernel takes a list of pairwise data arrays (with cluster labels) and computes + * the Spearman correlation for each cluster in each pair. + */ class Similarity::OpenCL::Spearman : public ::OpenCL::Kernel { Q_OBJECT public: + /*! + * Defines the arguments passed to the OpenCL kernel. + */ enum Argument { - InData + GlobalWorkSize + ,InData ,ClusterSize ,InLabels ,SampleSize @@ -23,8 +32,9 @@ class Similarity::OpenCL::Spearman : public ::OpenCL::Kernel explicit Spearman(::OpenCL::Program* program, QObject* parent = nullptr); ::OpenCL::Event execute( ::OpenCL::CommandQueue* queue, - int kernelSize, - ::OpenCL::Buffer* in_data, + int globalWorkSize, + int localWorkSize, + ::OpenCL::Buffer* in_data, cl_char clusterSize, ::OpenCL::Buffer* in_labels, cl_int sampleSize, diff --git a/src/core/similarity_opencl_worker.cpp b/src/core/similarity_opencl_worker.cpp index 994175f..171f9e3 100644 --- a/src/core/similarity_opencl_worker.cpp +++ b/src/core/similarity_opencl_worker.cpp @@ -1,11 +1,8 @@ #include "similarity_opencl_worker.h" -#include "similarity_opencl_fetchpair.h" -#include "similarity_opencl_gmm.h" -#include "similarity_opencl_kmeans.h" -#include "similarity_opencl_pearson.h" -#include "similarity_opencl_spearman.h" #include "similarity_resultblock.h" #include "similarity_workblock.h" +#include +#include "pairwise_spearman.h" @@ -16,73 +13,50 @@ using namespace std; - -int nextPower2(int n) -{ - int pow2 = 2; - while ( pow2 < n ) - { - pow2 *= 2; - } - - return pow2; -} - - - - - - -template -QVector createVector(const T* data, int size) -{ - QVector v(size); - - memcpy(v.data(), data, size * sizeof(T)); - return v; -} - - - - - - +/*! + * Construct a new OpenCL worker with the given parent analytic, OpenCL object, + * OpenCL context, and OpenCL program. 
+ * + * @param base + * @param baseOpenCL + * @param context + * @param program + */ Similarity::OpenCL::Worker::Worker(Similarity* base, Similarity::OpenCL* baseOpenCL, ::OpenCL::Context* context, ::OpenCL::Program* program): _base(base), _baseOpenCL(baseOpenCL), _queue(new ::OpenCL::CommandQueue(context, context->devices().first(), this)) { + EDEBUG_FUNC(this,base,baseOpenCL,context,program); + // initialize kernels _kernels.fetchPair = new OpenCL::FetchPair(program, this); _kernels.gmm = new OpenCL::GMM(program, this); - _kernels.kmeans = new OpenCL::KMeans(program, this); + _kernels.outlier = new OpenCL::Outlier(program, this); _kernels.pearson = new OpenCL::Pearson(program, this); _kernels.spearman = new OpenCL::Spearman(program, this); // initialize buffers - int kernelSize {_base->_kernelSize}; - int N {_base->_input->getSampleSize()}; - int N_pow2 {nextPower2(N)}; + int W {_base->_globalWorkSize}; + int N {_base->_input->sampleSize()}; + int N_pow2 {Pairwise::Spearman::nextPower2(N)}; int K {_base->_maxClusters}; - _buffers.in_index = ::OpenCL::Buffer(context, 1 * kernelSize); - - _buffers.work_X = ::OpenCL::Buffer(context, N * kernelSize); - _buffers.work_N = ::OpenCL::Buffer(context, 1 * kernelSize); - _buffers.work_labels = ::OpenCL::Buffer(context, N * kernelSize); - _buffers.work_components = ::OpenCL::Buffer(context, K * kernelSize); - _buffers.work_MP = ::OpenCL::Buffer(context, K * kernelSize); - _buffers.work_counts = ::OpenCL::Buffer(context, K * kernelSize); - _buffers.work_logpi = ::OpenCL::Buffer(context, K * kernelSize); - _buffers.work_loggamma = ::OpenCL::Buffer(context, N * K * kernelSize); - _buffers.work_logGamma = ::OpenCL::Buffer(context, K * kernelSize); - _buffers.out_K = ::OpenCL::Buffer(context, 1 * kernelSize); - _buffers.out_labels = ::OpenCL::Buffer(context, N * kernelSize); - - _buffers.work_x = ::OpenCL::Buffer(context, N_pow2 * kernelSize); - _buffers.work_y = ::OpenCL::Buffer(context, N_pow2 * kernelSize); - 
_buffers.work_rank = ::OpenCL::Buffer(context, N_pow2 * kernelSize); - _buffers.out_correlations = ::OpenCL::Buffer(context, K * kernelSize); + _buffers.in_index = ::OpenCL::Buffer(context, 1 * W); + _buffers.work_X = ::OpenCL::Buffer(context, N * W); + _buffers.work_N = ::OpenCL::Buffer(context, 1 * W); + _buffers.work_x = ::OpenCL::Buffer(context, N_pow2 * W); + _buffers.work_y = ::OpenCL::Buffer(context, N_pow2 * W); + _buffers.work_labels = ::OpenCL::Buffer(context, N * W); + _buffers.work_components = ::OpenCL::Buffer(context, K * W); + _buffers.work_MP = ::OpenCL::Buffer(context, K * W); + _buffers.work_counts = ::OpenCL::Buffer(context, K * W); + _buffers.work_logpi = ::OpenCL::Buffer(context, K * W); + _buffers.work_gamma = ::OpenCL::Buffer(context, N * K * W); + _buffers.work_rank = ::OpenCL::Buffer(context, N_pow2 * W); + _buffers.out_K = ::OpenCL::Buffer(context, 1 * W); + _buffers.out_labels = ::OpenCL::Buffer(context, N * W); + _buffers.out_correlations = ::OpenCL::Buffer(context, K * W); } @@ -90,8 +64,22 @@ Similarity::OpenCL::Worker::Worker(Similarity* base, Similarity::OpenCL* baseOpe +/*! + * Read in the given work block, execute the algorithms necessary to produce + * results using OpenCL acceleration, and save those results in a new result + * block whose pointer is returned. 
+ * + * @param block + */ std::unique_ptr Similarity::OpenCL::Worker::execute(const EAbstractAnalytic::Block* block) { + EDEBUG_FUNC(this,block); + + if ( ELog::isActive() ) + { + ELog() << tr("Executing(OpenCL) work index %1.\n").arg(block->index()); + } + // cast block to work block const WorkBlock* workBlock {block->cast()}; @@ -101,32 +89,28 @@ std::unique_ptr Similarity::OpenCL::Worker::execute(co // iterate through all pairs Pairwise::Index index {workBlock->start()}; - for ( int i = 0; i < workBlock->size(); i += _base->_kernelSize ) + for ( int i = 0; i < workBlock->size(); i += _base->_globalWorkSize ) { // write input buffers to device - int steps {min(_base->_kernelSize, (int)workBlock->size() - i)}; + int globalWorkSize {(int) min((qint64)_base->_globalWorkSize, workBlock->size() - i)}; _buffers.in_index.mapWrite(_queue).wait(); - for ( int j = 0; j < steps; ++j ) + for ( int j = 0; j < globalWorkSize; ++j ) { _buffers.in_index[j] = { index.getX(), index.getY() }; ++index; } - for ( int j = steps; j < _base->_kernelSize; ++j ) - { - _buffers.in_index[j] = { 0, 0 }; - } - _buffers.in_index.unmap(_queue).wait(); // execute fetch-pair kernel _kernels.fetchPair->execute( _queue, - _base->_kernelSize, + globalWorkSize, + _base->_localWorkSize, &_baseOpenCL->_expressions, - _base->_input->getSampleSize(), + _base->_input->sampleSize(), &_buffers.in_index, _base->_minExpression, &_buffers.work_X, @@ -134,50 +118,44 @@ std::unique_ptr Similarity::OpenCL::Worker::execute(co &_buffers.out_labels ).wait(); - // execute clustering kernel - if ( _base->_clusMethod == ClusteringMethod::GMM ) + // execute outlier kernel (pre-clustering) + if ( _base->_removePreOutliers ) { - _kernels.gmm->execute( + _kernels.outlier->execute( _queue, - _base->_kernelSize, - &_baseOpenCL->_expressions, - _base->_input->getSampleSize(), - _base->_minSamples, - _base->_minClusters, - _base->_maxClusters, - _base->_criterion, - _base->_removePreOutliers, - _base->_removePostOutliers, + 
globalWorkSize, + _base->_localWorkSize, &_buffers.work_X, &_buffers.work_N, - &_buffers.work_labels, - &_buffers.work_components, - &_buffers.work_MP, - &_buffers.work_counts, - &_buffers.work_logpi, - &_buffers.work_loggamma, - &_buffers.work_logGamma, + &_buffers.out_labels, + _base->_input->sampleSize(), &_buffers.out_K, - &_buffers.out_labels - ).wait(); + -7, + &_buffers.work_x, + &_buffers.work_y + ); } - else if ( _base->_clusMethod == ClusteringMethod::KMeans ) + + // execute clustering kernel + if ( _base->_clusMethod == ClusteringMethod::GMM ) { - _kernels.kmeans->execute( + _kernels.gmm->execute( _queue, - _base->_kernelSize, - &_baseOpenCL->_expressions, - _base->_input->getSampleSize(), + globalWorkSize, + _base->_localWorkSize, + _base->_input->sampleSize(), _base->_minSamples, _base->_minClusters, _base->_maxClusters, - _base->_removePreOutliers, - _base->_removePostOutliers, + (cl_int) _base->_criterion, &_buffers.work_X, &_buffers.work_N, - &_buffers.work_loggamma, &_buffers.work_labels, + &_buffers.work_components, &_buffers.work_MP, + &_buffers.work_counts, + &_buffers.work_logpi, + &_buffers.work_gamma, &_buffers.out_K, &_buffers.out_labels ).wait(); @@ -187,7 +165,7 @@ std::unique_ptr Similarity::OpenCL::Worker::execute(co // set cluster size to 1 if clustering is disabled _buffers.out_K.mapWrite(_queue).wait(); - for ( int i = 0; i < _base->_kernelSize; ++i ) + for ( int i = 0; i < globalWorkSize; ++i ) { _buffers.out_K[i] = 1; } @@ -195,16 +173,35 @@ std::unique_ptr Similarity::OpenCL::Worker::execute(co _buffers.out_K.unmap(_queue).wait(); } + // execute outlier kernel (post-clustering) + if ( _base->_removePostOutliers ) + { + _kernels.outlier->execute( + _queue, + globalWorkSize, + _base->_localWorkSize, + &_buffers.work_X, + &_buffers.work_N, + &_buffers.out_labels, + _base->_input->sampleSize(), + &_buffers.out_K, + -8, + &_buffers.work_x, + &_buffers.work_y + ); + } + // execute correlation kernel if ( _base->_corrMethod == 
CorrelationMethod::Pearson ) { _kernels.pearson->execute( _queue, - _base->_kernelSize, + globalWorkSize, + _base->_localWorkSize, &_buffers.work_X, _base->_maxClusters, &_buffers.out_labels, - _base->_input->getSampleSize(), + _base->_input->sampleSize(), _base->_minSamples, &_buffers.out_correlations ); @@ -213,11 +210,12 @@ std::unique_ptr Similarity::OpenCL::Worker::execute(co { _kernels.spearman->execute( _queue, - _base->_kernelSize, + globalWorkSize, + _base->_localWorkSize, &_buffers.work_X, _base->_maxClusters, &_buffers.out_labels, - _base->_input->getSampleSize(), + _base->_input->sampleSize(), _base->_minSamples, &_buffers.work_x, &_buffers.work_y, @@ -236,22 +234,27 @@ std::unique_ptr Similarity::OpenCL::Worker::execute(co e3.wait(); // save results - for ( int j = 0; j < steps; ++j ) + for ( int j = 0; j < globalWorkSize; ++j ) { - const qint8 *labels = &_buffers.out_labels.at(j * _base->_input->getSampleSize()); + // get pointers to the cluster labels and correlations for this pair + const qint8 *labels = &_buffers.out_labels.at(j * _base->_input->sampleSize()); const float *correlations = &_buffers.out_correlations.at(j * _base->_maxClusters); Pair pair; + + // save the number of clusters pair.K = _buffers.out_K.at(j); + // save the cluster labels (if more than one cluster was found) if ( pair.K > 1 ) { - pair.labels = createVector(labels, _base->_input->getSampleSize()); + pair.labels = ResultBlock::makeVector(labels, _base->_input->sampleSize()); } + // save the correlations (if the pair was able to be processed) if ( pair.K > 0 ) { - pair.correlations = createVector(correlations, _base->_maxClusters); + pair.correlations = ResultBlock::makeVector(correlations, _base->_maxClusters); } resultBlock->append(pair); diff --git a/src/core/similarity_opencl_worker.h b/src/core/similarity_opencl_worker.h index 8e68aeb..c084363 100644 --- a/src/core/similarity_opencl_worker.h +++ b/src/core/similarity_opencl_worker.h @@ -1,9 +1,17 @@ #ifndef 
SIMILARITY_OPENCL_WORKER_H #define SIMILARITY_OPENCL_WORKER_H #include "similarity_opencl.h" +#include "similarity_opencl_fetchpair.h" +#include "similarity_opencl_gmm.h" +#include "similarity_opencl_outlier.h" +#include "similarity_opencl_pearson.h" +#include "similarity_opencl_spearman.h" +/*! + * This class implements the OpenCL worker of the similarity analytic. + */ class Similarity::OpenCL::Worker : public EAbstractAnalytic::OpenCL::Worker { Q_OBJECT @@ -11,41 +19,48 @@ class Similarity::OpenCL::Worker : public EAbstractAnalytic::OpenCL::Worker explicit Worker(Similarity* base, Similarity::OpenCL* baseOpenCL, ::OpenCL::Context* context, ::OpenCL::Program* program); virtual std::unique_ptr execute(const EAbstractAnalytic::Block* block) override final; private: + /*! + * Pointer to the base analytic. + */ Similarity* _base; + /*! + * Pointer to the base OpenCL object. + */ Similarity::OpenCL* _baseOpenCL; + /*! + * Pointer to this worker's unique and private command queue. + */ ::OpenCL::CommandQueue* _queue; - + /*! + * Structure of this worker's kernels. + */ struct { OpenCL::FetchPair* fetchPair; OpenCL::GMM* gmm; - OpenCL::KMeans* kmeans; + OpenCL::Outlier* outlier; OpenCL::Pearson* pearson; OpenCL::Spearman* spearman; } _kernels; - + /*! + * Structure of this worker's buffers. 
+ */ struct { - // input buffers ::OpenCL::Buffer in_index; - - // clustering buffers - ::OpenCL::Buffer work_X; + ::OpenCL::Buffer work_X; ::OpenCL::Buffer work_N; + ::OpenCL::Buffer work_x; + ::OpenCL::Buffer work_y; ::OpenCL::Buffer work_labels; - ::OpenCL::Buffer work_components; - ::OpenCL::Buffer work_MP; + ::OpenCL::Buffer work_components; + ::OpenCL::Buffer work_MP; ::OpenCL::Buffer work_counts; ::OpenCL::Buffer work_logpi; - ::OpenCL::Buffer work_loggamma; - ::OpenCL::Buffer work_logGamma; + ::OpenCL::Buffer work_gamma; + ::OpenCL::Buffer work_rank; ::OpenCL::Buffer out_K; ::OpenCL::Buffer out_labels; - - // correlation buffers - ::OpenCL::Buffer work_x; - ::OpenCL::Buffer work_y; - ::OpenCL::Buffer work_rank; ::OpenCL::Buffer out_correlations; } _buffers; }; diff --git a/src/core/similarity_resultblock.cpp b/src/core/similarity_resultblock.cpp index b24eca8..c76ede6 100644 --- a/src/core/similarity_resultblock.cpp +++ b/src/core/similarity_resultblock.cpp @@ -5,10 +5,17 @@ +/*! + * Construct a new block with the given index and starting pairwise index. + * + * @param index + * @param start + */ Similarity::ResultBlock::ResultBlock(int index, qint64 start): EAbstractAnalytic::Block(index), _start(start) { + EDEBUG_FUNC(this,index,start); } @@ -16,8 +23,15 @@ Similarity::ResultBlock::ResultBlock(int index, qint64 start): +/*! + * Append a pair to the result block's list of pairs. + * + * @param pair + */ void Similarity::ResultBlock::append(const Pair& pair) { + EDEBUG_FUNC(this,&pair); + _pairs.append(pair); } @@ -26,8 +40,15 @@ void Similarity::ResultBlock::append(const Pair& pair) +/*! + * Write this block's data to the given data stream. + * + * @param stream + */ void Similarity::ResultBlock::write(QDataStream& stream) const { + EDEBUG_FUNC(this,&stream); + stream << _start; stream << _pairs.size(); @@ -44,8 +65,15 @@ void Similarity::ResultBlock::write(QDataStream& stream) const +/*! + * Read this block's data from the given data stream. 
+ * + * @param stream + */ void Similarity::ResultBlock::read(QDataStream& stream) { + EDEBUG_FUNC(this,&stream); + stream >> _start; int size; diff --git a/src/core/similarity_resultblock.h b/src/core/similarity_resultblock.h index add5811..caf7353 100644 --- a/src/core/similarity_resultblock.h +++ b/src/core/similarity_resultblock.h @@ -4,12 +4,19 @@ +/*! + * This class implements the result block of the similarity analytic. + */ class Similarity::ResultBlock : public EAbstractAnalytic::Block { Q_OBJECT public: + /*! + * Construct a new result block in an uninitialized null state. + */ explicit ResultBlock() = default; explicit ResultBlock(int index, qint64 start); + template static QVector makeVector(const T* data, int size); qint64 start() const { return _start; } const QVector& pairs() const { return _pairs; } QVector& pairs() { return _pairs; } @@ -18,10 +25,37 @@ class Similarity::ResultBlock : public EAbstractAnalytic::Block virtual void write(QDataStream& stream) const override final; virtual void read(QDataStream& stream) override final; private: + /*! + * The pairwise index of the first pair in the result block. + */ qint64 _start; + /*! + * The list of pairs that were processed. + */ QVector _pairs; }; + + + +/*! + * Create a vector from the given pointer and size. The contents of the + * pointer are copied into the vector. 
+ * + * @param data + * @param size + */ +template +QVector Similarity::ResultBlock::makeVector(const T* data, int size) +{ + QVector v(size); + + memcpy(v.data(), data, size * sizeof(T)); + return v; +} + + + #endif diff --git a/src/core/similarity_serial.cpp b/src/core/similarity_serial.cpp index 92c7349..558a18f 100644 --- a/src/core/similarity_serial.cpp +++ b/src/core/similarity_serial.cpp @@ -1,6 +1,11 @@ #include "similarity_serial.h" #include "similarity_resultblock.h" #include "similarity_workblock.h" +#include "expressionmatrix_gene.h" +#include "pairwise_gmm.h" +#include "pairwise_pearson.h" +#include "pairwise_spearman.h" +#include @@ -11,18 +16,38 @@ using namespace std; +/*! + * Construct a new serial object with the given analytic as its parent. + * + * @param parent + */ Similarity::Serial::Serial(Similarity* parent): EAbstractAnalytic::Serial(parent), _base(parent) { + EDEBUG_FUNC(this,parent); + // initialize clustering model - if ( _base->_clusMethod != ClusteringMethod::None ) + switch ( _base->_clusMethod ) { - _base->_clusModel->initialize(_base->_input); + case ClusteringMethod::None: + _clusModel = nullptr; + break; + case ClusteringMethod::GMM: + _clusModel = new Pairwise::GMM(_base->_input); + break; } // initialize correlation model - _base->_corrModel->initialize(_base->_input); + switch ( _base->_corrMethod ) + { + case CorrelationMethod::Pearson: + _corrModel = new Pairwise::Pearson(); + break; + case CorrelationMethod::Spearman: + _corrModel = new Pairwise::Spearman(_base->_input); + break; + } } @@ -30,8 +55,22 @@ Similarity::Serial::Serial(Similarity* parent): +/*! + * Read in the given work block and save the results in a new result block. This + * implementation takes the starting pairwise index and pair size from the work + * block and processes those pairs. 
+ * + * @param block + */ std::unique_ptr Similarity::Serial::execute(const EAbstractAnalytic::Block* block) { + EDEBUG_FUNC(this,block); + + if ( ELog::isActive() ) + { + ELog() << tr("Executing(serial) work index %1.\n").arg(block->index()); + } + // cast block to work block const WorkBlock* workBlock {block->cast()}; @@ -39,8 +78,8 @@ std::unique_ptr Similarity::Serial::execute(const EAbs ResultBlock* resultBlock {new ResultBlock(workBlock->index(), workBlock->start())}; // initialize workspace - QVector X(_base->_input->getSampleSize()); - QVector labels(_base->_input->getSampleSize()); + QVector data(_base->_input->sampleSize()); + QVector labels(_base->_input->sampleSize()); // iterate through all pairs Pairwise::Index index {workBlock->start()}; @@ -48,29 +87,39 @@ std::unique_ptr Similarity::Serial::execute(const EAbs for ( int i = 0; i < workBlock->size(); ++i ) { // fetch pairwise input data - int numSamples = fetchPair(index, X, labels); + int numSamples = fetchPair(index, data, labels); + + // remove pre-clustering outliers + if ( _base->_removePreOutliers ) + { + numSamples = removeOutliers(data, numSamples, labels, 1, -7); + } // compute clusters qint8 K {1}; if ( _base->_clusMethod != ClusteringMethod::None ) { - K = _base->_clusModel->compute( - X, + K = _clusModel->compute( + data, numSamples, labels, _base->_minSamples, _base->_minClusters, _base->_maxClusters, - _base->_criterion, - _base->_removePreOutliers, - _base->_removePostOutliers + _base->_criterion ); } + // remove post-clustering outliers + if ( _base->_removePostOutliers ) + { + numSamples = removeOutliers(data, numSamples, labels, K, -8); + } + // compute correlations - QVector correlations = _base->_corrModel->compute( - X, + QVector correlations = _corrModel->compute( + data, K, labels, _base->_minSamples @@ -105,8 +154,19 @@ std::unique_ptr Similarity::Serial::execute(const EAbs -int Similarity::Serial::fetchPair(Pairwise::Index index, QVector& X, QVector& labels) +/*! 
+ * Extract pairwise data from an expression matrix given a pairwise index. Samples + * with missing values and samples that fall below the expression threshold are + * excluded. The number of extracted samples is returned. + * + * @param index + * @param data + * @param labels + */ +int Similarity::Serial::fetchPair(const Pairwise::Index& index, QVector& data, QVector& labels) { + EDEBUG_FUNC(this,&index,&data,&labels); + // read in gene expressions ExpressionMatrix::Gene gene1(_base->_input); ExpressionMatrix::Gene gene2(_base->_input); @@ -114,28 +174,159 @@ int Similarity::Serial::fetchPair(Pairwise::Index index, QVector_input->getSampleSize(); ++i ) + for ( int i = 0; i < _base->_input->sampleSize(); ++i ) { + // exclude samples with missing values if ( std::isnan(gene1.at(i)) || std::isnan(gene2.at(i)) ) { labels[i] = -9; } + + // exclude samples which fall below the expression threshold else if ( gene1.at(i) < _base->_minExpression || gene2.at(i) < _base->_minExpression ) { labels[i] = -6; } + + // include any remaining samples else { - X[numSamples] = { gene1.at(i), gene2.at(i) }; + data[numSamples] = { gene1.at(i), gene2.at(i) }; numSamples++; labels[i] = 0; } } - // return size of X + // return number of extracted samples + return numSamples; +} + + + + + + +/*! + * Remove outliers from a vector of pairwise data. Outliers are detected independently + * on each axis using the Tukey method, and marked with the given marker. Only the + * samples in the given cluster are used in outlier detection. For unclustered data, + * all samples are labeled as 0, so a cluster value of 0 should be used. The data + * array should only contain samples that have a non-negative label. 
+ * + * @param data + * @param labels + * @param cluster + * @param marker + */ +int Similarity::Serial::removeOutliersCluster(QVector& data, QVector& labels, qint8 cluster, qint8 marker) +{ + EDEBUG_FUNC(this,&data,&labels,cluster,marker); + + // extract univariate data from the given cluster + QVector x_sorted; + QVector y_sorted; + + x_sorted.reserve(labels.size()); + y_sorted.reserve(labels.size()); + + for ( int i = 0, j = 0; i < labels.size(); i++ ) + { + if ( labels[i] >= 0 ) + { + if ( labels[i] == cluster ) + { + x_sorted.append(data[j].s[0]); + y_sorted.append(data[j].s[1]); + } + + j++; + } + } + + // return if the given cluster is empty + if ( x_sorted.size() == 0 || y_sorted.size() == 0 ) + { + return 0; + } + + // sort samples for each axis + std::sort(x_sorted.begin(), x_sorted.end()); + std::sort(y_sorted.begin(), y_sorted.end()); + + // compute quartiles and thresholds for each axis + const int n = x_sorted.size(); + + float Q1_x = x_sorted[n * 1 / 4]; + float Q3_x = x_sorted[n * 3 / 4]; + float T_x_min = Q1_x - 1.5f * (Q3_x - Q1_x); + float T_x_max = Q3_x + 1.5f * (Q3_x - Q1_x); + + float Q1_y = y_sorted[n * 1 / 4]; + float Q3_y = y_sorted[n * 3 / 4]; + float T_y_min = Q1_y - 1.5f * (Q3_y - Q1_y); + float T_y_max = Q3_y + 1.5f * (Q3_y - Q1_y); + + // remove outliers + int numSamples = 0; + + for ( int i = 0, j = 0; i < labels.size(); i++ ) + { + if ( labels[i] >= 0 ) + { + // mark samples in the given cluster that are outliers on either axis + if ( labels[i] == cluster && (data[j].s[0] < T_x_min || T_x_max < data[j].s[0] || data[j].s[1] < T_y_min || T_y_max < data[j].s[1]) ) + { + labels[i] = marker; + } + + // preserve all other non-outlier samples in the data array + else + { + data[numSamples] = data[j]; + numSamples++; + } + + j++; + } + } + + // return number of remaining samples + return numSamples; +} + + + + + + +/*! + * Perform outlier removal on each cluster in a pairwise data array. 
+ * + * @param data + * @param numSamples + * @param labels + * @param clusterSize + * @param marker + */ +int Similarity::Serial::removeOutliers(QVector& data, int numSamples, QVector& labels, qint8 clusterSize, qint8 marker) +{ + EDEBUG_FUNC(this,&data,numSamples,&labels,clusterSize,marker); + + // do not perform post-clustering outlier removal if there is only one cluster + if ( marker == -8 && clusterSize <= 1 ) + { + return numSamples; + } + + // perform outlier removal on each cluster + for ( qint8 k = 0; k < clusterSize; ++k ) + { + numSamples = removeOutliersCluster(data, labels, k, marker); + } + return numSamples; } diff --git a/src/core/similarity_serial.h b/src/core/similarity_serial.h index d1f2d85..d94f66c 100644 --- a/src/core/similarity_serial.h +++ b/src/core/similarity_serial.h @@ -1,9 +1,14 @@ #ifndef SIMILARITY_SERIAL_H #define SIMILARITY_SERIAL_H #include "similarity.h" +#include "pairwise_clusteringmodel.h" +#include "pairwise_correlationmodel.h" +/*! + * This class implements the serial working class of the similarity analytic. + */ class Similarity::Serial : public EAbstractAnalytic::Serial { Q_OBJECT @@ -11,9 +16,21 @@ class Similarity::Serial : public EAbstractAnalytic::Serial explicit Serial(Similarity* parent); virtual std::unique_ptr execute(const EAbstractAnalytic::Block* block) override final; private: - int fetchPair(Pairwise::Index index, QVector& X, QVector& labels); - + int fetchPair(const Pairwise::Index& index, QVector& data, QVector& labels); + int removeOutliersCluster(QVector& data, QVector& labels, qint8 cluster, qint8 marker); + int removeOutliers(QVector& data, int numSamples, QVector& labels, qint8 clusterSize, qint8 marker); + /*! + * Pointer to the base analytic for this object. + */ Similarity* _base; + /*! + * Pointer to the clustering model to use. + */ + Pairwise::ClusteringModel* _clusModel {nullptr}; + /*! + * Pointer to the correlation model to use. 
+ */ + Pairwise::CorrelationModel* _corrModel {nullptr}; }; diff --git a/src/core/similarity_workblock.cpp b/src/core/similarity_workblock.cpp index de40777..85b80c4 100644 --- a/src/core/similarity_workblock.cpp +++ b/src/core/similarity_workblock.cpp @@ -5,11 +5,20 @@ +/*! + * Construct a new block with the given index, starting pairwise index, + * and pair size. + * + * @param index + * @param start + * @param size + */ Similarity::WorkBlock::WorkBlock(int index, qint64 start, qint64 size): EAbstractAnalytic::Block(index), _start(start), _size(size) { + EDEBUG_FUNC(this,index,start,size); } @@ -17,8 +26,15 @@ Similarity::WorkBlock::WorkBlock(int index, qint64 start, qint64 size): +/*! + * Write this block's data to the given data stream. + * + * @param stream + */ void Similarity::WorkBlock::write(QDataStream& stream) const { + EDEBUG_FUNC(this,&stream); + stream << _start << _size; } @@ -27,7 +43,14 @@ void Similarity::WorkBlock::write(QDataStream& stream) const +/*! + * Read this block's data from the given data stream. + * + * @param stream + */ void Similarity::WorkBlock::read(QDataStream& stream) { + EDEBUG_FUNC(this,&stream); + stream >> _start >> _size; } diff --git a/src/core/similarity_workblock.h b/src/core/similarity_workblock.h index 38f885f..4d1aa5b 100644 --- a/src/core/similarity_workblock.h +++ b/src/core/similarity_workblock.h @@ -4,10 +4,16 @@ +/*! + * This class implements the work block of the similarity analytic. + */ class Similarity::WorkBlock : public EAbstractAnalytic::Block { Q_OBJECT public: + /*! + * Construct a new work block in an uninitialized null state. + */ explicit WorkBlock() = default; explicit WorkBlock(int index, qint64 start, qint64 size); qint64 start() const { return _start; } @@ -16,7 +22,13 @@ class Similarity::WorkBlock : public EAbstractAnalytic::Block virtual void write(QDataStream& stream) const override final; virtual void read(QDataStream& stream) override final; private: + /*! 
+ * The pairwise index of the first pair to process. + */ qint64 _start; + /*! + * The number of pairs to process. + */ qint64 _size; }; diff --git a/src/gui/gui.pro b/src/gui/gui.pro index ccce99e..185401a 100644 --- a/src/gui/gui.pro +++ b/src/gui/gui.pro @@ -5,6 +5,7 @@ include (../KINC.pri) # Basic settings QT += gui widgets TARGET = qkinc +TEMPLATE = app # External libraries LIBS += -lacegui @@ -12,6 +13,10 @@ LIBS += -lacegui # Compiler defines DEFINES += GUI=1 +# Source files +SOURCES += \ + ../main.cpp + # Installation instructions isEmpty(PREFIX) { PREFIX = /usr/local } program.path = $${PREFIX}/bin diff --git a/src/main.cpp b/src/main.cpp index a905bb0..9b244a2 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -17,7 +17,7 @@ using namespace std; int main(int argc, char *argv[]) { - EApplication application("" + EApplication application("SystemsGenetics" ,"kinc" ,MAJOR_VERSION ,MINOR_VERSION diff --git a/src/opencl.qrc b/src/opencl.qrc index c8ffac2..611bd7d 100644 --- a/src/opencl.qrc +++ b/src/opencl.qrc @@ -2,7 +2,6 @@ opencl/fetchpair.cl opencl/gmm.cl - opencl/kmeans.cl opencl/linalg.cl opencl/outlier.cl opencl/pearson.cl diff --git a/src/opencl/fetchpair.cl b/src/opencl/fetchpair.cl index 426926b..4ba2de3 100644 --- a/src/opencl/fetchpair.cl +++ b/src/opencl/fetchpair.cl @@ -6,10 +6,12 @@ -/** - * Fetch pairwise data for a pair of genes. Samples which are nan or are - * below a threshold are excluded. +/*! + * Extract pairwise data from an expression matrix given a pairwise index. Samples + * with missing values and samples that fall below the expression threshold are + * excluded. The number of extracted samples is returned. 
* + * @param globalWorkSize * @param expressions * @param sampleSize * @param in_index @@ -19,6 +21,7 @@ * @param out_labels */ __kernel void fetchPair( + int globalWorkSize, __global const float *expressions, int sampleSize, __global const int2 *in_index, @@ -29,23 +32,23 @@ __kernel void fetchPair( { int i = get_global_id(0); + if ( i >= globalWorkSize ) + { + return; + } + // initialize variables int2 index = in_index[i]; __global Vector2 *X = &out_X[i * sampleSize]; __global char *labels = &out_labels[i * sampleSize]; - __global int *p_N = &out_N[i]; - - if ( index.x == 0 && index.y == 0 ) - { - return; - } + __global int *p_numSamples = &out_N[i]; // index into gene expressions __global const float *gene1 = &expressions[index.x * sampleSize]; __global const float *gene2 = &expressions[index.y * sampleSize]; // populate X with shared expressions of gene pair - int N = 0; + int numSamples = 0; for ( int i = 0; i < sampleSize; ++i ) { @@ -59,13 +62,13 @@ __kernel void fetchPair( } else { - X[N].v2 = (float2) ( gene1[i], gene2[i] ); - N++; + X[numSamples] = (float2) ( gene1[i], gene2[i] ); + numSamples++; labels[i] = 0; } } // return size of X - *p_N = N; + *p_numSamples = numSamples; } diff --git a/src/opencl/gmm.cl b/src/opencl/gmm.cl index 970e549..ca4cfb2 100644 --- a/src/opencl/gmm.cl +++ b/src/opencl/gmm.cl @@ -22,21 +22,61 @@ typedef struct +typedef struct +{ + __global Component *components; + int K; + float logL; + float entropy; + __global Vector2 *_Mu; + __global int *_counts; + __global float *_logpi; + __global float *_gamma; +} GMM; + + + + + + +/*! + * Implementation of rand(), taken from POSIX example. + * + * @param state + */ +int rand(ulong *state) +{ + *state = (*state) * 1103515245 + 12345; + return ((unsigned)((*state)/65536) % 32768); +} + + + + + +/*! + * Initialize a mixture component with the given mixture weight and mean. 
+ * + * @param component + * @param pi + * @param mu + */ void GMM_Component_initialize( __global Component *component, float pi, __global const Vector2 *mu) { - // initialize pi and mu as given + // initialize mixture weight and mean component->pi = pi; component->mu = *mu; - // Use identity covariance- assume dimensions are independent + // initialize covariance to identity matrix matrixInitIdentity(&component->sigma); - // Initialize zero artifacts + // initialize precision to zero matrix matrixInitZero(&component->sigmaInv); + // initialize normalizer term to 0 component->normalizer = 0; } @@ -45,23 +85,29 @@ void GMM_Component_initialize( -bool GMM_Component_prepareCovariance(__global Component *component) +/*! + * Pre-compute the precision matrix and normalizer term for a mixture component. + * + * @param component + */ +bool GMM_Component_prepare(__global Component *component) { const int D = 2; - // Compute inverse of Sigma once each iteration instead of - // repeatedly for each calcLogMvNorm execution. + // compute precision (inverse of covariance) float det; matrixInverse(&component->sigma, &component->sigmaInv, &det); - if ( fabs(det) <= 0 ) + // return failure if matrix inverse failed + if ( det <= 0 ) { return false; } - // Compute normalizer for multivariate normal distribution + // compute normalizer term for multivariate normal distribution component->normalizer = -0.5f * (D * log(2.0f * M_PI) + log(det)); + // return success return true; } @@ -70,30 +116,40 @@ bool GMM_Component_prepareCovariance(__global Component *component) -void GMM_Component_calcLogMvNorm( +/*! 
+ * Compute the log of the probability density function of the multivariate normal + * distribution conditioned on a single component for each point in X: + * + * P(x|k) = exp(-0.5 * (x - mu)^T Sigma^-1 (x - mu)) / sqrt((2pi)^d det(Sigma)) + * + * Therefore the log-probability is: + * + * log(P(x|k)) = -0.5 * (x - mu)^T Sigma^-1 (x - mu) - 0.5 * (d * log(2pi) + log(det(Sigma))) + * + * @param component + * @param X + * @param N + * @param logP + */ +void GMM_Component_computeLogProbNorm( __global const Component *component, __global const Vector2 *X, int N, __global float *logP) { - // Here we are computing the probability density function of the multivariate - // normal distribution conditioned on a single component for the set of points - // given by X. - // - // P(x|k) = exp{ -0.5 * (x - mu)^T Sigma^{-} (x - mu) } / sqrt{ (2pi)^d det(Sigma) } - for (int i = 0; i < N; ++i) { - // Let xm = (x - mu) + // compute xm = (x - mu) Vector2 xm = X[i]; vectorSubtract(&xm, &component->mu); - // Compute xm^T Sxm = xm^T S^-1 xm + // compute Sxm = Sigma^-1 xm Vector2 Sxm; matrixProduct(&component->sigmaInv, &xm, &Sxm); + // compute xmSxm = xm^T Sigma^-1 xm float xmSxm = vectorDot(&xm, &Sxm); - // Compute log(P) = normalizer - 0.5 * xm^T * S^-1 * xm + // compute log(P) = normalizer - 0.5 * xm^T * Sigma^-1 * xm logP[i] = component->normalizer - 0.5f * xmSxm; } } @@ -103,61 +159,73 @@ void GMM_Component_calcLogMvNorm( -void GMM_kmeans( - __global Component *components, int K, - __global const Vector2 *X, int N, - __global Vector2 *MP, - __global int *counts) +/*! + * Initialize the mean of each component in the mixture model using k-means + * clustering. 
+ * + * @param gmm + * @param X + * @param N + */ +void GMM_initializeMeans(GMM *gmm, __global const Vector2 *X, int N) { + const int K = gmm->K; + const int MAX_ITERATIONS = 20; const float TOLERANCE = 1e-3; float diff = 0; + // initialize workspace + __global Vector2 *Mu = gmm->_Mu; + __global int *counts = gmm->_counts; + for (int t = 0; t < MAX_ITERATIONS && diff > TOLERANCE; ++t) { - // initialize old means + // compute mean and sample count for each component for (int k = 0; k < K; ++k) { - vectorInitZero(&MP[k]); + vectorInitZero(&Mu[k]); counts[k] = 0; } - // compute new means for (int i = 0; i < N; ++i) { - float minD = INFINITY; - int minDk = 0; + // determine the component mean which is nearest to x_i + float min_dist = INFINITY; + int min_k = 0; for (int k = 0; k < K; ++k) { - float dist = vectorDiffNorm(&X[i], &components[k].mu); - if (minD > dist) + float dist = vectorDiffNorm(&X[i], &gmm->components[k].mu); + if (min_dist > dist) { - minD = dist; - minDk = k; + min_dist = dist; + min_k = k; } } - vectorAdd(&MP[minDk], &X[i]); - ++counts[minDk]; + // update mean and sample count + vectorAdd(&Mu[min_k], &X[i]); + ++counts[min_k]; } + // scale each mean by its sample count for (int k = 0; k < K; ++k) { - vectorScale(&MP[k], 1.0f / counts[k]); + vectorScale(&Mu[k], 1.0f / counts[k]); } - // check for convergence + // compute the total change of all means diff = 0; for (int k = 0; k < K; ++k) { - diff += vectorDiffNorm(&MP[k], &components[k].mu); + diff += vectorDiffNorm(&Mu[k], &gmm->components[k].mu); } diff /= K; - // copy new means to components + // update component means for (int k = 0; k < K; ++k) { - components[k].mu = MP[k]; + gmm->components[k].mu = Mu[k]; } } } @@ -167,117 +235,77 @@ void GMM_kmeans( -void GMM_calcLogMvNorm( - __global const Component *components, int K, - __global const Vector2 *X, int N, - __global float *loggamma) +/*! + * Perform the expectation step of the EM algorithm. 
In this step we compute + * gamma, the posterior probabilities for each component in the mixture model + * and each sample in X, as well as the log-likelihood of the model: + * + * log(p(x_i)) = a + log(sum(exp(log(pi_k) + log(P(x_i|k)) - a))) + * + * gamma_ki = exp(log(pi_k) + log(P(x_i|k)) - log(p(x_i))) + * + * log(L) = sum(log(p(x_i))) + * + * @param gmm + * @param X + * @param N + */ +float GMM_computeEStep(GMM *gmm, __global const Vector2 *X, int N) { - for ( int k = 0; k < K; ++k ) + const int K = gmm->K; + + // compute logpi + for (int k = 0; k < K; ++k) { - GMM_Component_calcLogMvNorm(&components[k], X, N, &loggamma[k * N]); + gmm->_logpi[k] = log(gmm->components[k].pi); } -} - - + // compute the log-probability for each component and each point in X + __global float *logProb = gmm->_gamma; + for ( int k = 0; k < K; ++k ) + { + GMM_Component_computeLogProbNorm(&gmm->components[k], X, N, &logProb[k * N]); + } + // compute gamma and log-likelihood + float logL = 0.0; -void GMM_calcLogLikelihoodAndGammaNK( - __global const float *logpi, int K, - __global float *loggamma, int N, - float *logL) -{ - *logL = 0.0; for (int i = 0; i < N; ++i) { + // compute a = max(logpi_k + logProb_ki, k) float maxArg = -INFINITY; for (int k = 0; k < K; ++k) { - const float logProbK = logpi[k] + loggamma[k * N + i]; - if (logProbK > maxArg) + float arg = gmm->_logpi[k] + logProb[k * N + i]; + if (maxArg < arg) { - maxArg = logProbK; + maxArg = arg; } } + // compute logpx float sum = 0.0; for (int k = 0; k < K; ++k) { - const float logProbK = logpi[k] + loggamma[k * N + i]; - sum += exp(logProbK - maxArg); - } - - const float logpx = maxArg + log(sum); - *logL += logpx; - for (int k = 0; k < K; ++k) - { - loggamma[k * N + i] += -logpx; - } - } -} - - - - - - -void GMM_calcLogGammaK( - __global const float *loggamma, int N, int K, - __global float *logGamma) -{ - for (int k = 0; k < K; ++k) - { - __global const float *loggammak = &loggamma[k * N]; - - float maxArg = -INFINITY; - 
for (int i = 0; i < N; ++i) - { - const float loggammank = loggammak[i]; - if (loggammank > maxArg) - { - maxArg = loggammank; - } - } - - float sum = 0; - for (int i = 0; i < N; ++i) - { - const float loggammank = loggammak[i]; - sum += exp(loggammank - maxArg); + sum += exp(gmm->_logpi[k] + logProb[k * N + i] - maxArg); } - logGamma[k] = maxArg + log(sum); - } -} - - - + float logpx = maxArg + log(sum); - - -float GMM_calcLogGammaSum( - __global const float *logpi, int K, - __global const float *logGamma) -{ - float maxArg = -INFINITY; - for (int k = 0; k < K; ++k) - { - const float arg = logpi[k] + logGamma[k]; - if (arg > maxArg) + // compute gamma_ki + for (int k = 0; k < K; ++k) { - maxArg = arg; + gmm->_gamma[k * N + i] += gmm->_logpi[k] - logpx; + gmm->_gamma[k * N + i] = exp(gmm->_gamma[k * N + i]); } - } - float sum = 0; - for (int k = 0; k < K; ++k) - { - const float arg = logpi[k] + logGamma[k]; - sum += exp(arg - maxArg); + // update log-likelihood + logL += logpx; } - return maxArg + log(sum); + // return log-likelihood + return logL; } @@ -285,79 +313,83 @@ float GMM_calcLogGammaSum( -bool GMM_performMStep( - __global Component *components, int K, - __global float *logpi, - __global float *loggamma, - __global float *logGamma, - float logGammaSum, - __global const Vector2 *X, int N) +/*! + * Perform the maximization step of the EM algorithm. 
In this step we update the + * parameters of the mixture model using gamma, which is computed during the + * expectation step: + * + * n_k = sum(gamma_ki) + * + * pi_k = n_k / N + * + * mu_k = sum(gamma_ki * x_i) / n_k + * + * Sigma_k = sum(gamma_ki * (x_i - mu_k) * (x_i - mu_k)^T) / n_k + * + * @param gmm + * @param X + * @param N + */ +bool GMM_computeMStep(GMM *gmm, __global const Vector2 *X, int N) { - // update pi - for (int k = 0; k < K; ++k) - { - logpi[k] += logGamma[k] - logGammaSum; - - components[k].pi = exp(logpi[k]); - } + const int K = gmm->K; - // convert loggamma / logGamma to gamma / Gamma to avoid duplicate exp(x) calls for (int k = 0; k < K; ++k) { + // compute n_k = sum(gamma_ki) + float n_k = 0; + for (int i = 0; i < N; ++i) { - const int idx = k * N + i; - loggamma[idx] = exp(loggamma[idx]); + n_k += gmm->_gamma[k * N + i]; } - } - for (int k = 0; k < K; ++k) - { - logGamma[k] = exp(logGamma[k]); - } + // update mixture weight + gmm->components[k].pi = n_k / N; - for (int k = 0; k < K; ++k) - { - // Update mu - __global Vector2 *mu = &components[k].mu; + // update mean + __global Vector2 *mu = &gmm->components[k].mu; vectorInitZero(mu); for (int i = 0; i < N; ++i) { - vectorAddScaled(mu, loggamma[k * N + i], &X[i]); + vectorAddScaled(mu, gmm->_gamma[k * N + i], &X[i]); } - vectorScale(mu, 1.0f / logGamma[k]); + vectorScale(mu, 1.0f / n_k); - // Update sigma - __global Matrix2x2 *sigma = &components[k].sigma; + // update covariance matrix + __global Matrix2x2 *sigma = &gmm->components[k].sigma; matrixInitZero(sigma); for (int i = 0; i < N; ++i) { - // xm = (x - mu) + // compute xm = (x_i - mu_k) Vector2 xm = X[i]; vectorSubtract(&xm, mu); - // S_i = gamma_ik * (x - mu) (x - mu)^T + // compute Sigma_ki = gamma_ki * (x_i - mu_k) (x_i - mu_k)^T Matrix2x2 outerProduct; matrixOuterProduct(&xm, &xm, &outerProduct); - matrixAddScaled(sigma, loggamma[k * N + i], &outerProduct); + matrixAddScaled(sigma, gmm->_gamma[k * N + i], &outerProduct); } -
matrixScale(sigma, 1.0f / logGamma[k]); + matrixScale(sigma, 1.0f / n_k); - bool success = GMM_Component_prepareCovariance(&components[k]); + // pre-compute precision matrix and normalizer term + bool success = GMM_Component_prepare(&gmm->components[k]); + // return failure if matrix inverse failed if ( !success ) { return false; } } + // return success return true; } @@ -366,24 +398,36 @@ bool GMM_performMStep( -void GMM_calcLabels( - __global const float *loggamma, int N, int K, +/*! + * Compute the cluster labels of a dataset using gamma: + * + * y_i = argmax(gamma_ki, k) + * + * @param gamma + * @param N + * @param K + * @param labels + */ +void GMM_computeLabels( + __global const float *gamma, int N, int K, __global char *labels) { for ( int i = 0; i < N; ++i ) { + // determine the value k for which gamma_ki is highest int max_k = -1; float max_gamma = -INFINITY; for ( int k = 0; k < K; ++k ) { - if ( max_gamma < loggamma[k * N + i] ) + if ( max_gamma < gamma[k * N + i] ) { max_k = k; - max_gamma = loggamma[k * N + i]; + max_gamma = gamma[k * N + i]; } } + // assign x_i to cluster k labels[i] = max_k; } } @@ -393,8 +437,18 @@ void GMM_calcLabels( -float GMM_calcEntropy( - __global const float *loggamma, int N, +/*! + * Compute the entropy of the mixture model for a dataset using gamma + * and the given cluster labels: + * + * E = sum(sum(z_ki * log(gamma_ki))), z_ki = (y_i == k) + * + * @param gamma + * @param N + * @param labels + */ +float GMM_computeEntropy( + __global const float *gamma, int N, __global const char *labels) { float E = 0; @@ -403,7 +457,7 @@ float GMM_calcEntropy( { int k = labels[i]; - E += log(loggamma[k * N + i]); + E += log(gamma[k * N + i]); } return E; @@ -414,72 +468,60 @@ float GMM_calcEntropy( -/** - * Compute a Gaussian mixture model from a dataset. +/*! + * Fit the mixture model to a pairwise data array and compute the output cluster + * labels for the data. The data array should only contain clean samples. 
+ * + * @param gmm + * @param X + * @param N + * @param K + * @param labels */ bool GMM_fit( + GMM *gmm, __global const Vector2 *X, int N, int K, - __global char *labels, - float *logL, - float *entropy, - __global Component *components, - __global Vector2 *MP, - __global int *counts, - __global float *logpi, - __global float *loggamma, - __global float *logGamma) + __global char *labels) { ulong state = 1; // initialize components + gmm->K = K; + for ( int k = 0; k < K; ++k ) { - // use uniform mixture proportion and randomly sampled mean + // use uniform mixture weight and randomly sampled mean int i = rand(&state) % N; - GMM_Component_initialize(&components[k], 1.0f / K, &X[i]); - GMM_Component_prepareCovariance(&components[k]); + GMM_Component_initialize(&gmm->components[k], 1.0f / K, &X[i]); + GMM_Component_prepare(&gmm->components[k]); } // initialize means with k-means - GMM_kmeans(components, K, X, N, MP, counts); - - // initialize workspace - for (int k = 0; k < K; ++k) - { - logpi[k] = log(components[k].pi); - } + GMM_initializeMeans(gmm, X, N); // run EM algorithm const int MAX_ITERATIONS = 100; const float TOLERANCE = 1e-8; float prevLogL = -INFINITY; - float currentLogL = -INFINITY; + float currLogL = -INFINITY; for ( int t = 0; t < MAX_ITERATIONS; ++t ) { - // E step - // compute gamma, log-likelihood - GMM_calcLogMvNorm(components, K, X, N, loggamma); - - prevLogL = currentLogL; - GMM_calcLogLikelihoodAndGammaNK(logpi, K, loggamma, N, &currentLogL); + // perform E step + prevLogL = currLogL; + currLogL = GMM_computeEStep(gmm, X, N); // check for convergence - if ( fabs(currentLogL - prevLogL) < TOLERANCE ) + if ( fabs(currLogL - prevLogL) < TOLERANCE ) { break; } - // M step - // Let Gamma[k] = \Sum_i gamma[k, i] - GMM_calcLogGammaK(loggamma, N, K, logGamma); - - float logGammaSum = GMM_calcLogGammaSum(logpi, K, logGamma); - - // Update parameters - bool success = GMM_performMStep(components, K, logpi, loggamma, logGamma, logGammaSum, X, N); + // perform M
step + bool success = GMM_computeMStep(gmm, X, N); + // return failure if M-step failed (due to matrix inverse) if ( !success ) { return false; @@ -487,9 +529,9 @@ bool GMM_fit( } // save outputs - *logL = currentLogL; - GMM_calcLabels(loggamma, N, K, labels); - *entropy = GMM_calcEntropy(loggamma, N, labels); + gmm->logL = currLogL; + GMM_computeLabels(gmm->_gamma, N, K, labels); + gmm->entropy = GMM_computeEntropy(gmm->_gamma, N, labels); return true; } @@ -501,6 +543,7 @@ bool GMM_fit( typedef enum { + AIC, BIC, ICL } Criterion; @@ -510,10 +553,34 @@ typedef enum -/** - * Compute the Bayes Information Criterion of a GMM. +/*! + * Compute the Akaike Information Criterion of a Gaussian mixture model. + * + * @param K + * @param D + * @param logL + */ +float GMM_computeAIC(int K, int D, float logL) +{ + int p = K * (1 + D + D * D); + + return 2 * p - 2 * logL; +} + + + + + + +/*! + * Compute the Bayesian Information Criterion of a Gaussian mixture model. + * + * @param K + * @param D + * @param logL + * @param N */ -float GMM_computeBIC(int K, float logL, int N, int D) +float GMM_computeBIC(int K, int D, float logL, int N) { int p = K * (1 + D + D * D); @@ -525,10 +592,16 @@ float GMM_computeBIC(int K, float logL, int N, int D) -/** - * Compute the Integrated Completed Likelihood of a GMM. +/*! + * Compute the Integrated Completed Likelihood of a Gaussian mixture model. + * + * @param K + * @param D + * @param logL + * @param N + * @param E */ -float GMM_computeICL(int K, float logL, int N, int D, float E) +float GMM_computeICL(int K, int D, float logL, int N, float E) { int p = K * (1 + D + D * D); @@ -540,22 +613,28 @@ float GMM_computeICL(int K, float logL, int N, int D, float E) -/** - * Compute a block of GMMs given a block of gene pairs. +/*! + * Determine the number of clusters in a pairwise data array. Several sub-models, + * each one having a different number of clusters, are fit to the data and the + * sub-model with the best criterion value is selected. 
The data array should + * only contain samples that have a non-negative label. * - * For each gene pair, several models are computed and the best model - * is selected according to a criterion (BIC). The selected K and the - * resulting sample mask for each pair is returned. + * @param globalWorkSize + * @param sampleSize + * @param minSamples + * @param minClusters + * @param maxClusters + * @param criterion + * @param out_K + * @param out_labels */ __kernel void GMM_compute( - __global const float *expressions, + int globalWorkSize, int sampleSize, int minSamples, char minClusters, char maxClusters, Criterion criterion, - int removePreOutliers, - int removePostOutliers, __global Vector2 *work_X, __global int *work_N, __global char *work_labels, @@ -563,71 +642,68 @@ __kernel void GMM_compute( __global Vector2 *work_MP, __global int *work_counts, __global float *work_logpi, - __global float *work_loggamma, - __global float *work_logGamma, + __global float *work_gamma, __global char *out_K, __global char *out_labels) { int i = get_global_id(0); + if ( i >= globalWorkSize ) + { + return; + } + // initialize workspace variables - __global Vector2 *X = &work_X[i * sampleSize]; - int N = work_N[i]; + __global Vector2 *data = &work_X[i * sampleSize]; + int numSamples = work_N[i]; __global char *labels = &work_labels[i * sampleSize]; __global Component *components = &work_components[i * maxClusters]; - __global Vector2 *MP = &work_MP[i * maxClusters]; + __global Vector2 *Mu = &work_MP[i * maxClusters]; __global int *counts = &work_counts[i * maxClusters]; __global float *logpi = &work_logpi[i * maxClusters]; - __global float *loggamma = &work_loggamma[i * maxClusters * sampleSize]; - __global float *logGamma = &work_logGamma[i * maxClusters]; + __global float *gamma = &work_gamma[i * maxClusters * sampleSize]; __global char *bestK = &out_K[i]; __global char *bestLabels = &out_labels[i * sampleSize]; - // remove pre-clustering outliers - __global float *work = loggamma; - 
- if ( removePreOutliers ) - { - markOutliers(X, N, 0, bestLabels, 0, -7, work); - markOutliers(X, N, 1, bestLabels, 0, -7, work); - } + // initialize GMM struct + GMM gmm = { + .components = components, + ._Mu = Mu, + ._counts = counts, + ._logpi = logpi, + ._gamma = gamma + }; // perform clustering only if there are enough samples *bestK = 0; - if ( N >= minSamples ) + if ( numSamples >= minSamples ) { float bestValue = INFINITY; for ( char K = minClusters; K <= maxClusters; ++K ) { - // run each clustering model - float logL; - float entropy; - - bool success = GMM_fit( - X, N, K, - labels, &logL, &entropy, - components, - MP, counts, - logpi, loggamma, logGamma - ); + // run each clustering sub-model + bool success = GMM_fit(&gmm, data, numSamples, K, labels); if ( !success ) { continue; } - // evaluate model + // compute the criterion value of the sub-model float value = INFINITY; switch (criterion) { + case AIC: + value = GMM_computeAIC(K, 2, gmm.logL); + break; case BIC: - value = GMM_computeBIC(K, logL, N, 2); + value = GMM_computeBIC(K, 2, gmm.logL, numSamples); break; case ICL: - value = GMM_computeICL(K, logL, N, 2, entropy); + value = GMM_computeICL(K, 2, gmm.logL, numSamples, gmm.entropy); break; } @@ -637,7 +713,7 @@ __kernel void GMM_compute( *bestK = K; bestValue = value; - for ( int i = 0, j = 0; i < N; ++i ) + for ( int i = 0, j = 0; i < sampleSize; ++i ) { if ( bestLabels[i] >= 0 ) { @@ -648,17 +724,4 @@ __kernel void GMM_compute( } } } - - if ( *bestK > 1 ) - { - // remove post-clustering outliers - if ( removePostOutliers ) - { - for ( char k = 0; k < *bestK; ++k ) - { - markOutliers(X, N, 0, bestLabels, k, -8, work); - markOutliers(X, N, 1, bestLabels, k, -8, work); - } - } - } } diff --git a/src/opencl/kmeans.cl b/src/opencl/kmeans.cl deleted file mode 100644 index 305e196..0000000 --- a/src/opencl/kmeans.cl +++ /dev/null @@ -1,277 +0,0 @@ - -// #include "fetchpair.cl" -// #include "linalg.cl" -// #include "outlier.cl" - - - - - - -/** - * 
Compute the log-likelihood of a K-means model given data X. - * - * @param X - * @param N - * @param y - * @param means - * @param K - */ -float KMeans_computeLogLikelihood( - __global const Vector2 *X, int N, - __global const char *y, - __global const Vector2 *means, int K) -{ - // compute within-class scatter - float S = 0; - - for ( int k = 0; k < K; ++k ) - { - for ( int i = 0; i < N; ++i ) - { - if ( y[i] != k ) - { - continue; - } - - float dist = vectorDiffNorm(&X[i], &means[k]); - - S += dist * dist; - } - } - - return -S; -} - - - - - - -/** - * Compute a K-means clustering model from a dataset. - */ -void KMeans_fit( - __global const Vector2 *X, int N, int K, - float *logL, - __global char *labels, - __global Vector2 *means, - __global char *y, - __global char *y_next) -{ - ulong state = 1; - - const int NUM_INITS = 10; - const int MAX_ITERATIONS = 300; - - // repeat with several initializations - *logL = -INFINITY; - - for ( int init = 0; init < NUM_INITS; ++init ) - { - // initialize means randomly from X - for ( int k = 0; k < K; ++k ) - { - int i = rand(&state) % N; - means[k] = X[i]; - } - - // iterate K means until convergence - for ( int t = 0; t < MAX_ITERATIONS; ++t ) - { - // compute new labels - for ( int i = 0; i < N; ++i ) - { - // find k that minimizes norm(x_i - mu_k) - int min_k = -1; - float min_dist; - - for ( int k = 0; k < K; ++k ) - { - float dist = vectorDiffNorm(&X[i], &means[k]); - - if ( min_k == -1 || dist < min_dist ) - { - min_k = k; - min_dist = dist; - } - } - - y_next[i] = min_k; - } - - // check for convergence - bool converged = true; - - for ( int i = 0; i < N; ++i ) - { - if ( y[i] != y_next[i] ) - { - converged = false; - break; - } - } - - if ( converged ) - { - break; - } - - // update labels - for ( int i = 0; i < N; ++i ) - { - y[i] = y_next[i]; - } - - // update means - for ( int k = 0; k < K; ++k ) - { - // compute mu_k = mean of all x_i in cluster k - int n_k = 0; - - vectorInitZero(&means[k]); - - for ( int i = 
0; i < N; ++i ) - { - if ( y[i] == k ) - { - vectorAdd(&means[k], &X[i]); - n_k++; - } - } - - vectorScale(&means[k], 1.0f / n_k); - } - } - - // save the run with the greatest log-likelihood - float nextLogL = KMeans_computeLogLikelihood(X, N, y, means, K); - - if ( *logL < nextLogL ) - { - *logL = nextLogL; - - for ( int i = 0; i < N; ++i ) - { - labels[i] = y[i]; - } - } - } -} - - - - - - -/** - * Compute the Bayes information criterion of a K-means model. - * - * @param K - * @param logL - * @param N - * @param D - */ -float KMeans_computeBIC(int K, float logL, int N, int D) -{ - int p = K * D; - - return log((float) N) * p - 2 * logL; -} - - - - - - -/** - * Compute a block of K-means models given a block of gene pairs. - * - * For each gene pair, several models are computed and the best model - * is selected according to a criterion (BIC). The selected K and the - * resulting sample mask for each pair is returned. - */ -__kernel void KMeans_compute( - __global const float *expressions, - int sampleSize, - int minSamples, - char minClusters, - char maxClusters, - int removePreOutliers, - int removePostOutliers, - __global Vector2 *work_X, - __global int *work_N, - __global float *work_outlier, - __global char *work_labels, - __global Vector2 *work_means, - __global char *out_K, - __global char *out_labels) -{ - int i = get_global_id(0); - - // initialize workspace variables - __global Vector2 *X = &work_X[i * sampleSize]; - int N = work_N[i]; - __global char *labels = &work_labels[(3*i+0) * sampleSize]; - __global Vector2 *means = &work_means[i * maxClusters]; - __global char *y = &work_labels[(3*i+1) * sampleSize]; - __global char *y_next = &work_labels[(3*i+2) * sampleSize]; - __global char *bestK = &out_K[i]; - __global char *bestLabels = &out_labels[i * sampleSize]; - - // remove pre-clustering outliers - __global float *work = &work_outlier[i * sampleSize]; - - if ( removePreOutliers ) - { - markOutliers(X, N, 0, bestLabels, 0, -7, work); - 
markOutliers(X, N, 1, bestLabels, 0, -7, work); - } - - // perform clustering only if there are enough samples - *bestK = 0; - - if ( N >= minSamples ) - { - float bestValue = INFINITY; - - for ( char K = minClusters; K <= maxClusters; ++K ) - { - // run each clustering model - float logL; - KMeans_fit(X, N, K, &logL, labels, means, y, y_next); - - // evaluate model - float value = KMeans_computeBIC(K, logL, N, 2); - - // save the best model - if ( value < bestValue ) - { - *bestK = K; - bestValue = value; - - for ( int i = 0, j = 0; i < N; ++i ) - { - if ( bestLabels[i] >= 0 ) - { - bestLabels[i] = y[j]; - ++j; - } - } - } - } - } - - if ( *bestK > 1 ) - { - // remove post-clustering outliers - if ( removePostOutliers ) - { - for ( char k = 0; k < *bestK; ++k ) - { - markOutliers(X, N, 0, bestLabels, k, -8, work); - markOutliers(X, N, 1, bestLabels, k, -8, work); - } - } - } -} diff --git a/src/opencl/linalg.cl b/src/opencl/linalg.cl index ce44046..504b194 100644 --- a/src/opencl/linalg.cl +++ b/src/opencl/linalg.cl @@ -1,22 +1,22 @@ -typedef union -{ - float s[2]; - float2 v2; -} Vector2; - -typedef union -{ - float s[4]; - float4 v4; -} Matrix2x2; - - - - - - -#define ELEM(M, i, j) ((M)->s[(i) * 2 + (j)]) +/*! + * This file provides structure and function definitions for the Vector2 and + * Matrix2x2 types, which are vector and matrix types with fixed dimensions. + * The operations defined for these types compute outputs directly without the + * use of loops. These types are useful for any algorithm that operates on + * pairwise data. + * + * Since OpenCL provides built-in vector types, Vector2 and Matrix2x2 are + * defined in terms of these types. 
The following mapping is used to map + * indices to xyzw: + * + * ELEM(M, 0, 0) = M->x + * ELEM(M, 0, 1) = M->y + * ELEM(M, 1, 0) = M->z + * ELEM(M, 1, 1) = M->w + */ +typedef float2 Vector2; +typedef float4 Matrix2x2; @@ -24,8 +24,8 @@ typedef union #define vectorInitZero(a) \ - (a)->s[0] = 0; \ - (a)->s[1] = 0; + (a)->x = 0; \ + (a)->y = 0; @@ -33,8 +33,8 @@ typedef union #define vectorAdd(a, b) \ - (a)->s[0] += (b)->s[0]; \ - (a)->s[1] += (b)->s[1]; + (a)->x += (b)->x; \ + (a)->y += (b)->y; @@ -42,8 +42,8 @@ typedef union #define vectorAddScaled(a, c, b) \ - (a)->s[0] += (c) * (b)->s[0]; \ - (a)->s[1] += (c) * (b)->s[1]; + (a)->x += (c) * (b)->x; \ + (a)->y += (c) * (b)->y; @@ -51,8 +51,8 @@ typedef union #define vectorSubtract(a, b) \ - (a)->s[0] -= (b)->s[0]; \ - (a)->s[1] -= (b)->s[1]; + (a)->x -= (b)->x; \ + (a)->y -= (b)->y; @@ -60,8 +60,8 @@ typedef union #define vectorScale(a, c) \ - (a)->s[0] *= (c); \ - (a)->s[1] *= (c); + (a)->x *= (c); \ + (a)->y *= (c); @@ -69,7 +69,7 @@ typedef union #define vectorDot(a, b) \ - ((a)->s[0] * (b)->s[0] + (a)->s[1] * (b)->s[1]) + ((a)->x * (b)->x + (a)->y * (b)->y) @@ -78,7 +78,7 @@ typedef union #define SQR(x) ((x)*(x)) #define vectorDiffNorm(a, b) \ - sqrt(SQR((a)->s[0] - (b)->s[0]) + SQR((a)->s[1] - (b)->s[1])) + sqrt(SQR((a)->x - (b)->x) + SQR((a)->y - (b)->y)) @@ -86,10 +86,10 @@ typedef union #define matrixInitIdentity(M) \ - ELEM(M, 0, 0) = 1; \ - ELEM(M, 0, 1) = 0; \ - ELEM(M, 1, 0) = 0; \ - ELEM(M, 1, 1) = 1; + (M)->x = 1; \ + (M)->y = 0; \ + (M)->z = 0; \ + (M)->w = 1; @@ -97,10 +97,10 @@ typedef union #define matrixInitZero(M) \ - ELEM(M, 0, 0) = 0; \ - ELEM(M, 0, 1) = 0; \ - ELEM(M, 1, 0) = 0; \ - ELEM(M, 1, 1) = 0; + (M)->x = 0; \ + (M)->y = 0; \ + (M)->z = 0; \ + (M)->w = 0; @@ -108,10 +108,10 @@ typedef union #define matrixAddScaled(A, c, B) \ - ELEM(A, 0, 0) += (c) * ELEM(B, 0, 0); \ - ELEM(A, 0, 1) += (c) * ELEM(B, 0, 1); \ - ELEM(A, 1, 0) += (c) * ELEM(B, 1, 0); \ - ELEM(A, 1, 1) += (c) * ELEM(B, 1, 
1); + (A)->x += (c) * (B)->x; \ + (A)->y += (c) * (B)->y; \ + (A)->z += (c) * (B)->z; \ + (A)->w += (c) * (B)->w; @@ -119,10 +119,10 @@ typedef union #define matrixScale(A, c) \ - ELEM(A, 0, 0) *= (c); \ - ELEM(A, 0, 1) *= (c); \ - ELEM(A, 1, 0) *= (c); \ - ELEM(A, 1, 1) *= (c); + (A)->x *= (c); \ + (A)->y *= (c); \ + (A)->z *= (c); \ + (A)->w *= (c); @@ -130,20 +130,20 @@ typedef union #define matrixInverse(A, B, det) \ - *det = ELEM(A, 0, 0) * ELEM(A, 1, 1) - ELEM(A, 0, 1) * ELEM(A, 1, 0); \ - ELEM(B, 0, 0) = +ELEM(A, 1, 1) / (*det); \ - ELEM(B, 0, 1) = -ELEM(A, 0, 1) / (*det); \ - ELEM(B, 1, 0) = -ELEM(A, 1, 0) / (*det); \ - ELEM(B, 1, 1) = +ELEM(A, 0, 0) / (*det); + *det = (A)->x * (A)->w - (A)->y * (A)->z; \ + (B)->x = +(A)->w / (*det); \ + (B)->y = -(A)->y / (*det); \ + (B)->z = -(A)->z / (*det); \ + (B)->w = +(A)->x / (*det); -#define matrixProduct(A, x, b) \ - (b)->s[0] = ELEM(A, 0, 0) * (x)->s[0] + ELEM(A, 0, 1) * (x)->s[1]; \ - (b)->s[1] = ELEM(A, 1, 0) * (x)->s[0] + ELEM(A, 1, 1) * (x)->s[1]; +#define matrixProduct(A, x_, b) \ + (b)->x = (A)->x * (x_)->x + (A)->y * (x_)->y; \ + (b)->y = (A)->z * (x_)->x + (A)->w * (x_)->y; @@ -151,7 +151,7 @@ typedef union #define matrixOuterProduct(a, b, C) \ - ELEM(C, 0, 0) = (a)->s[0] * (b)->s[0]; \ - ELEM(C, 0, 1) = (a)->s[0] * (b)->s[1]; \ - ELEM(C, 1, 0) = (a)->s[1] * (b)->s[0]; \ - ELEM(C, 1, 1) = (a)->s[1] * (b)->s[1]; + (C)->x = (a)->x * (b)->x; \ + (C)->y = (a)->x * (b)->y; \ + (C)->z = (a)->y * (b)->x; \ + (C)->w = (a)->y * (b)->y; diff --git a/src/opencl/outlier.cl b/src/opencl/outlier.cl index e8e29e6..761693b 100644 --- a/src/opencl/outlier.cl +++ b/src/opencl/outlier.cl @@ -6,69 +6,153 @@ -/** - * Implementation of rand(), taken from POSIX example. +/*! + * Remove outliers from a vector of pairwise data. Outliers are detected independently + * on each axis using the Tukey method, and marked with the given marker. Only the + * samples in the given cluster are used in outlier detection. 
For unclustered data, + * all samples are labeled as 0, so a cluster value of 0 should be used. The data + * array should only contain samples that have a non-negative label. * - * @param state - */ -int rand(ulong *state) -{ - *state = (*state) * 1103515245 + 12345; - return ((unsigned)((*state)/65536) % 32768); -} - - - - - -/** - * Remove outliers from a gene in a gene pair. - * - * @param X - * @param N - * @param j + * @param data * @param labels + * @param sampleSize * @param cluster * @param marker + * @param x_sorted + * @param y_sorted */ -void markOutliers( - __global const Vector2 *X, int N, int j, - __global char *labels, char cluster, +int removeOutliersCluster( + __global Vector2 *data, + __global char *labels, + int sampleSize, + char cluster, char marker, - __global float *x_sorted) + __global float *x_sorted, + __global float *y_sorted) { - // compute x_sorted = X[:, j], filtered and sorted + // extract univariate data from the given cluster int n = 0; - for ( int i = 0; i < N; i++ ) + for ( int i = 0, j = 0; i < sampleSize; i++ ) { - if ( labels[i] == cluster || labels[i] == marker ) + if ( labels[i] >= 0 ) { - x_sorted[n] = X[i].s[j]; - n++; + if ( labels[i] == cluster ) + { + x_sorted[n] = data[j].x; + y_sorted[n] = data[j].y; + n++; + } + + j++; } } + // return if the given cluster is empty if ( n == 0 ) { - return; + return 0; } + // sort samples for each axis heapSort(x_sorted, n); + heapSort(y_sorted, n); - // compute quartiles, interquartile range, upper and lower bounds - float Q1 = x_sorted[n * 1 / 4]; - float Q3 = x_sorted[n * 3 / 4]; + // compute interquartile range and thresholds for each axis + float Q1_x = x_sorted[n * 1 / 4]; + float Q3_x = x_sorted[n * 3 / 4]; + float T_x_min = Q1_x - 1.5f * (Q3_x - Q1_x); + float T_x_max = Q3_x + 1.5f * (Q3_x - Q1_x); - float T_min = Q1 - 1.5f * (Q3 - Q1); - float T_max = Q3 + 1.5f * (Q3 - Q1); + float Q1_y = y_sorted[n * 1 / 4]; + float Q3_y = y_sorted[n * 3 / 4]; + float T_y_min = Q1_y - 1.5f * 
(Q3_y - Q1_y); + float T_y_max = Q3_y + 1.5f * (Q3_y - Q1_y); - // mark outliers - for ( int i = 0; i < N; ++i ) + // remove outliers + int numSamples = 0; + + for ( int i = 0, j = 0; i < sampleSize; i++ ) { - if ( labels[i] == cluster && (X[i].s[j] < T_min || T_max < X[i].s[j]) ) + if ( labels[i] >= 0 ) { - labels[i] = marker; + // mark samples in the given cluster that are outliers on either axis + if ( labels[i] == cluster && (data[j].x < T_x_min || T_x_max < data[j].x || data[j].y < T_y_min || T_y_max < data[j].y) ) + { + labels[i] = marker; + } + + // preserve all other non-outlier samples in the data array + else + { + data[numSamples] = data[j]; + numSamples++; + } + + j++; } } + + // return number of remaining samples + return numSamples; +} + + + + + + +/*! + * Perform outlier removal on each cluster in a pairwise data array. + * + * @param globalWorkSize + * @param in_data + * @param in_N + * @param in_labels + * @param sampleSize + * @param in_K + * @param marker + */ +__kernel void removeOutliers( + int globalWorkSize, + __global Vector2 *in_data, + __global int *in_N, + __global char *in_labels, + int sampleSize, + __global char *in_K, + char marker, + __global float *work_x, + __global float *work_y) +{ + int i = get_global_id(0); + + if ( i >= globalWorkSize ) + { + return; + } + + // initialize workspace variables + __global Vector2 *data = &in_data[i * sampleSize]; + __global int *numSamples = &in_N[i]; + __global char *labels = &in_labels[i * sampleSize]; + char clusterSize = in_K[i]; + __global float *x_sorted = &work_x[i * sampleSize]; + __global float *y_sorted = &work_y[i * sampleSize]; + + if ( marker == -7 ) + { + clusterSize = 1; + } + + // do not perform post-clustering outlier removal if there is only one cluster + if ( marker == -8 && clusterSize <= 1 ) + { + return; + } + + // perform outlier removal on each cluster + for ( char k = 0; k < clusterSize; ++k ) + { + *numSamples = removeOutliersCluster(data, labels, sampleSize, k, marker,
x_sorted, y_sorted); + } } diff --git a/src/opencl/pearson.cl b/src/opencl/pearson.cl index 1da3b85..c08f143 100644 --- a/src/opencl/pearson.cl +++ b/src/opencl/pearson.cl @@ -4,9 +4,20 @@ +/*! + * Compute the Pearson correlation of a cluster in a pairwise data array. The + * data array should only contain samples that have a non-negative label. + * + * @param data + * @param labels + * @param sampleSize + * @param cluster + * @param minSamples + */ float Pearson_computeCluster( __global const float2 *data, - __global const char *labels, int N, + __global const char *labels, + int sampleSize, char cluster, int minSamples) { @@ -18,20 +29,25 @@ float Pearson_computeCluster( float sumy2 = 0; float sumxy = 0; - for ( int i = 0; i < N; ++i ) + for ( int i = 0, j = 0; i < sampleSize; ++i ) { - if ( labels[i] == cluster ) + if ( labels[i] >= 0 ) { - float x_i = data[i].x; - float y_i = data[i].y; + if ( labels[i] == cluster ) + { + float x_i = data[j].x; + float y_i = data[j].y; + + sumx += x_i; + sumy += y_i; + sumx2 += x_i * x_i; + sumy2 += y_i * y_i; + sumxy += x_i * y_i; - sumx += x_i; - sumy += y_i; - sumx2 += x_i * x_i; - sumy2 += y_i * y_i; - sumxy += x_i * y_i; + ++n; + } - ++n; + ++j; } } @@ -51,7 +67,21 @@ float Pearson_computeCluster( +/*! + * Compute the correlation of each cluster in a pairwise data array. The data array + * should only contain the clean samples that were extracted from the expression + * matrix, while the labels should contain all samples. 
+ * + * @param globalWorkSize + * @param in_data + * @param clusterSize + * @param in_labels + * @param sampleSize + * @param minSamples + * @param out_correlations + */ __kernel void Pearson_compute( + int globalWorkSize, __global const float2 *in_data, char clusterSize, __global const char *in_labels, @@ -61,6 +91,12 @@ __kernel void Pearson_compute( { int i = get_global_id(0); + if ( i >= globalWorkSize ) + { + return; + } + + // initialize workspace variables __global const float2 *data = &in_data[i * sampleSize]; __global const char *labels = &in_labels[i * sampleSize]; __global float *correlations = &out_correlations[i * clusterSize]; diff --git a/src/opencl/sort.cl b/src/opencl/sort.cl index 318906b..77584d7 100644 --- a/src/opencl/sort.cl +++ b/src/opencl/sort.cl @@ -4,7 +4,7 @@ -/** +/*! * Swap two values * * @param a @@ -22,7 +22,7 @@ void swapF(__global float* a, __global float* b) -/** +/*! * Swap two values * * @param a @@ -92,7 +92,7 @@ void heapify(__global float *array, int n) -/** +/*! * Sort an array using heapsort. * * @param array @@ -117,11 +117,10 @@ void heapSort(__global float *array, int n) -/** - * Sort a list using the bitonic algorithm. Additionally, - * rearrange a second list with the same operations that are - * done to the sorted list. The size of each list must be a - * power of 2. +/*! + * Sort a list using bitonic sort, while also applying the same swap operations + * to a second list of the same size. The lists should have a size which is a + * power of two. * * @param size * @param sortList @@ -160,11 +159,10 @@ void bitonicSortFF(int size, __global float* sortList, __global float* extraList -/** - * Sort a list using the bitonic algorithm. Additionally, - * rearrange a second list with the same operations that are - * done to the sorted list. The size of each list must be a - * power of 2. +/*! + * Sort a list using bitonic sort, while also applying the same swap operations + * to a second list of the same size. 
The lists should have a size which is a + * power of two. * * @param size * @param sortList diff --git a/src/opencl/spearman.cl b/src/opencl/spearman.cl index 7bd05b2..15faf38 100644 --- a/src/opencl/spearman.cl +++ b/src/opencl/spearman.cl @@ -6,15 +6,20 @@ +/*! + * Compute the next power of 2 which occurs after a number. + * + * @param n + */ int nextPower2(int n) { - int pow2 = 2; - while ( pow2 < n ) - { - pow2 *= 2; - } + int pow2 = 2; + while ( pow2 < n ) + { + pow2 *= 2; + } - return pow2; + return pow2; } @@ -22,20 +27,34 @@ int nextPower2(int n) +/*! + * Compute the Spearman correlation of a cluster in a pairwise data array. The + * data array should only contain samples that have a non-negative label. + * + * @param data + * @param labels + * @param sampleSize + * @param cluster + * @param minSamples + * @param x + * @param y + * @param rank + */ float Spearman_computeCluster( __global const float2 *data, - __global const char *labels, int N, + __global const char *labels, + int sampleSize, char cluster, int minSamples, __global float *x, __global float *y, __global int *rank) { - // extract samples in gene pair cluster - int N_pow2 = nextPower2(N); - int n = 0; + // extract samples in pairwise cluster + int N_pow2 = nextPower2(sampleSize); + int n = 0; - for ( int i = 0, j = 0; i < N; ++i ) + for ( int i = 0, j = 0; i < sampleSize; ++i ) { if ( labels[i] >= 0 ) { @@ -43,7 +62,7 @@ float Spearman_computeCluster( { x[n] = data[j].x; y[n] = data[j].y; - rank[n] = n + 1; + rank[n] = n + 1; ++n; } @@ -91,11 +110,25 @@ float Spearman_computeCluster( +/*! + * Compute the correlation of each cluster in a pairwise data array. The data array + * should only contain the clean samples that were extracted from the expression + * matrix, while the labels should contain all samples. 
+ * + * @param globalWorkSize + * @param in_data + * @param clusterSize + * @param in_labels + * @param sampleSize + * @param minSamples + * @param out_correlations + */ __kernel void Spearman_compute( + int globalWorkSize, __global const float2 *in_data, char clusterSize, __global const char *in_labels, - int sampleSize, + int sampleSize, int minSamples, __global float *work_x, __global float *work_y, @@ -103,8 +136,14 @@ __kernel void Spearman_compute( __global float *out_correlations) { int i = get_global_id(0); - int N_pow2 = nextPower2(sampleSize); + if ( i >= globalWorkSize ) + { + return; + } + + // initialize workspace variables + int N_pow2 = nextPower2(sampleSize); __global const float2 *data = &in_data[i * sampleSize]; __global const char *labels = &in_labels[i * sampleSize]; __global float *x = &work_x[i * N_pow2]; diff --git a/tests/main.cpp b/src/tests/main.cpp similarity index 95% rename from tests/main.cpp rename to src/tests/main.cpp index 6e979a8..cc70098 100644 --- a/tests/main.cpp +++ b/src/tests/main.cpp @@ -1,5 +1,5 @@ -#include "analyticfactory.h" -#include "datafactory.h" +#include "../core/analyticfactory.h" +#include "../core/datafactory.h" #include "testclustermatrix.h" #include "testcorrelationmatrix.h" #include "testexportcorrelationmatrix.h" diff --git a/tests/testclustermatrix.cpp b/src/tests/testclustermatrix.cpp similarity index 93% rename from tests/testclustermatrix.cpp rename to src/tests/testclustermatrix.cpp index f7513c3..7106370 100644 --- a/tests/testclustermatrix.cpp +++ b/src/tests/testclustermatrix.cpp @@ -2,8 +2,9 @@ #include #include "testclustermatrix.h" -#include "ccmatrix.h" -#include "datafactory.h" +#include "../core/ccmatrix.h" +#include "../core/ccmatrix_pair.h" +#include "../core/datafactory.h" @@ -56,7 +57,7 @@ void TestClusterMatrix::test() // create data object QString path {QDir::tempPath() + "/test.ccm"}; - std::unique_ptr dataRef {new Ace::DataObject(path, DataFactory::CCMatrixType, 
EMetadata(EMetadata::Object))}; + std::unique_ptr dataRef {new Ace::DataObject(path, DataFactory::CCMatrixType, EMetaObject())}; CCMatrix* matrix {dataRef->data()->cast()}; // write data to file diff --git a/tests/testclustermatrix.h b/src/tests/testclustermatrix.h similarity index 88% rename from tests/testclustermatrix.h rename to src/tests/testclustermatrix.h index 5168101..61d9d3c 100644 --- a/tests/testclustermatrix.h +++ b/src/tests/testclustermatrix.h @@ -2,7 +2,7 @@ #define TESTCLUSTERMATRIX_H #include -#include "pairwise_index.h" +#include "../core/pairwise_index.h" diff --git a/tests/testcorrelationmatrix.cpp b/src/tests/testcorrelationmatrix.cpp similarity index 91% rename from tests/testcorrelationmatrix.cpp rename to src/tests/testcorrelationmatrix.cpp index 10e1f6e..4f5a1ae 100644 --- a/tests/testcorrelationmatrix.cpp +++ b/src/tests/testcorrelationmatrix.cpp @@ -2,8 +2,9 @@ #include #include "testcorrelationmatrix.h" -#include "correlationmatrix.h" -#include "datafactory.h" +#include "../core/correlationmatrix.h" +#include "../core/correlationmatrix_pair.h" +#include "../core/datafactory.h" @@ -47,7 +48,7 @@ void TestCorrelationMatrix::test() // create data object QString path {QDir::tempPath() + "/test.cmx"}; - std::unique_ptr dataRef {new Ace::DataObject(path, DataFactory::CorrelationMatrixType, EMetadata(EMetadata::Object))}; + std::unique_ptr dataRef {new Ace::DataObject(path, DataFactory::CorrelationMatrixType, EMetaObject())}; CorrelationMatrix* matrix {dataRef->data()->cast()}; // write data to file diff --git a/tests/testcorrelationmatrix.h b/src/tests/testcorrelationmatrix.h similarity index 88% rename from tests/testcorrelationmatrix.h rename to src/tests/testcorrelationmatrix.h index 48e0d80..1caac53 100644 --- a/tests/testcorrelationmatrix.h +++ b/src/tests/testcorrelationmatrix.h @@ -2,7 +2,7 @@ #define TESTCORRELATIONMATRIX_H #include -#include "pairwise_index.h" +#include "../core/pairwise_index.h" diff --git 
a/tests/testexportcorrelationmatrix.cpp b/src/tests/testexportcorrelationmatrix.cpp similarity index 94% rename from tests/testexportcorrelationmatrix.cpp rename to src/tests/testexportcorrelationmatrix.cpp index e82aacd..cee4337 100644 --- a/tests/testexportcorrelationmatrix.cpp +++ b/src/tests/testexportcorrelationmatrix.cpp @@ -3,9 +3,11 @@ #include #include "testexportcorrelationmatrix.h" -#include "analyticfactory.h" -#include "datafactory.h" -#include "exportcorrelationmatrix_input.h" +#include "../core/analyticfactory.h" +#include "../core/datafactory.h" +#include "../core/exportcorrelationmatrix_input.h" +#include "../core/ccmatrix_pair.h" +#include "../core/correlationmatrix_pair.h" @@ -167,7 +169,7 @@ void TestExportCorrelationMatrix::test() { for ( int i = 0; i < sampleMask.size(); ++i ) { - QCOMPARE(sampleMask[i].digitValue(), testPair.sampleMasks[k][i]); + QCOMPARE((qint8) sampleMask[i].digitValue(), testPair.sampleMasks[k][i]); } } diff --git a/tests/testexportcorrelationmatrix.h b/src/tests/testexportcorrelationmatrix.h similarity index 90% rename from tests/testexportcorrelationmatrix.h rename to src/tests/testexportcorrelationmatrix.h index 9866ad1..23351cb 100644 --- a/tests/testexportcorrelationmatrix.h +++ b/src/tests/testexportcorrelationmatrix.h @@ -2,7 +2,7 @@ #define TESTEXPORTCORRELATIONMATRIX_H #include -#include "pairwise_index.h" +#include "../core/pairwise_index.h" diff --git a/tests/testexportexpressionmatrix.cpp b/src/tests/testexportexpressionmatrix.cpp similarity index 86% rename from tests/testexportexpressionmatrix.cpp rename to src/tests/testexportexpressionmatrix.cpp index f96fafb..da43371 100644 --- a/tests/testexportexpressionmatrix.cpp +++ b/src/tests/testexportexpressionmatrix.cpp @@ -3,9 +3,10 @@ #include #include "testexportexpressionmatrix.h" -#include "analyticfactory.h" -#include "datafactory.h" -#include "exportexpressionmatrix_input.h" +#include "../core/analyticfactory.h" +#include "../core/datafactory.h" +#include 
"../core/exportexpressionmatrix_input.h" +#include "../core/expressionmatrix_gene.h" @@ -24,7 +25,7 @@ void TestExportExpressionMatrix::test() // create metadata QStringList geneNames; QStringList sampleNames; - QString noSampleToken {"NA"}; + QString nanToken {"NA"}; for ( int i = 0; i < numGenes; ++i ) { @@ -50,9 +51,9 @@ void TestExportExpressionMatrix::test() matrix->initialize(geneNames, sampleNames); ExpressionMatrix::Gene gene(matrix); - for ( int i = 0; i < matrix->getGeneSize(); ++i ) + for ( int i = 0; i < matrix->geneSize(); ++i ) { - for ( int j = 0; j < matrix->getSampleSize(); ++j ) + for ( int j = 0; j < matrix->sampleSize(); ++j ) { gene[j] = testExpressions[i * numSamples + j]; } @@ -60,8 +61,6 @@ void TestExportExpressionMatrix::test() gene.write(i); } - matrix->setTransform(ExpressionMatrix::Transform::None); - dataRef->data()->finish(); dataRef->finalize(); @@ -70,7 +69,7 @@ void TestExportExpressionMatrix::test() auto manager = qobject_cast(abstractManager.release()); manager->set(ExportExpressionMatrix::Input::InputData, emxPath); manager->set(ExportExpressionMatrix::Input::OutputFile, txtPath); - manager->set(ExportExpressionMatrix::Input::NoSampleToken, noSampleToken); + manager->set(ExportExpressionMatrix::Input::NANToken, nanToken); // run analytic manager->initialize(); @@ -101,7 +100,7 @@ void TestExportExpressionMatrix::test() for ( int j = 1; j < words.size(); ++j ) { - if ( words.at(j) == noSampleToken ) + if ( words.at(j) == nanToken ) { expressions[(i - 1) * numSamples + (j - 1)] = NAN; } diff --git a/tests/testexportexpressionmatrix.h b/src/tests/testexportexpressionmatrix.h similarity index 100% rename from tests/testexportexpressionmatrix.h rename to src/tests/testexportexpressionmatrix.h diff --git a/tests/testexpressionmatrix.cpp b/src/tests/testexpressionmatrix.cpp similarity index 72% rename from tests/testexpressionmatrix.cpp rename to src/tests/testexpressionmatrix.cpp index 411d174..964ea41 100644 --- 
a/tests/testexpressionmatrix.cpp +++ b/src/tests/testexpressionmatrix.cpp @@ -2,8 +2,9 @@ #include #include "testexpressionmatrix.h" -#include "datafactory.h" -#include "expressionmatrix.h" +#include "../core/datafactory.h" +#include "../core/expressionmatrix.h" +#include "../core/expressionmatrix_gene.h" @@ -21,12 +22,13 @@ void TestExpressionMatrix::test() // create metadata QStringList geneNames; + QStringList sampleNames; + for ( int i = 0; i < numGenes; ++i ) { geneNames.append(QString::number(i)); } - QStringList sampleNames; for ( int i = 0; i < numSamples; ++i ) { sampleNames.append(QString::number(i)); @@ -35,16 +37,16 @@ void TestExpressionMatrix::test() // create data object QString path {QDir::tempPath() + "/test.emx"}; - std::unique_ptr dataRef {new Ace::DataObject(path, DataFactory::ExpressionMatrixType, EMetadata(EMetadata::Object))}; + std::unique_ptr dataRef {new Ace::DataObject(path, DataFactory::ExpressionMatrixType, EMetaObject())}; ExpressionMatrix* matrix {dataRef->data()->cast()}; // write data to file matrix->initialize(geneNames, sampleNames); ExpressionMatrix::Gene gene(matrix); - for ( int i = 0; i < matrix->getGeneSize(); ++i ) + for ( int i = 0; i < matrix->geneSize(); ++i ) { - for ( int j = 0; j < matrix->getSampleSize(); ++j ) + for ( int j = 0; j < matrix->sampleSize(); ++j ) { gene[j] = testExpressions[i * numSamples + j]; } @@ -55,8 +57,8 @@ void TestExpressionMatrix::test() matrix->finish(); // read expression data from file - std::unique_ptr expressions {matrix->dumpRawData()}; + QVector expressions {matrix->dumpRawData()}; // verify expression data - QVERIFY(!memcmp(testExpressions.data(), expressions.get(), testExpressions.size() * sizeof(float))); + QVERIFY(!memcmp(testExpressions.data(), expressions.data(), testExpressions.size() * sizeof(float))); } diff --git a/tests/testexpressionmatrix.h b/src/tests/testexpressionmatrix.h similarity index 100% rename from tests/testexpressionmatrix.h rename to 
src/tests/testexpressionmatrix.h diff --git a/tests/testimportcorrelationmatrix.cpp b/src/tests/testimportcorrelationmatrix.cpp similarity index 96% rename from tests/testimportcorrelationmatrix.cpp rename to src/tests/testimportcorrelationmatrix.cpp index 5ace99b..bdab41d 100644 --- a/tests/testimportcorrelationmatrix.cpp +++ b/src/tests/testimportcorrelationmatrix.cpp @@ -3,9 +3,9 @@ #include #include "testimportcorrelationmatrix.h" -#include "analyticfactory.h" -#include "datafactory.h" -#include "importcorrelationmatrix_input.h" +#include "../core/analyticfactory.h" +#include "../core/datafactory.h" +#include "../core/importcorrelationmatrix_input.h" diff --git a/tests/testimportcorrelationmatrix.h b/src/tests/testimportcorrelationmatrix.h similarity index 90% rename from tests/testimportcorrelationmatrix.h rename to src/tests/testimportcorrelationmatrix.h index af0b47f..59851a6 100644 --- a/tests/testimportcorrelationmatrix.h +++ b/src/tests/testimportcorrelationmatrix.h @@ -2,7 +2,7 @@ #define TESTIMPORTCORRELATIONMATRIX_H #include -#include "pairwise_index.h" +#include "../core/pairwise_index.h" diff --git a/tests/testimportexpressionmatrix.cpp b/src/tests/testimportexpressionmatrix.cpp similarity index 86% rename from tests/testimportexpressionmatrix.cpp rename to src/tests/testimportexpressionmatrix.cpp index 5d2c166..f50a2bd 100644 --- a/tests/testimportexpressionmatrix.cpp +++ b/src/tests/testimportexpressionmatrix.cpp @@ -3,9 +3,9 @@ #include #include "testimportexpressionmatrix.h" -#include "analyticfactory.h" -#include "datafactory.h" -#include "importexpressionmatrix_input.h" +#include "../core/analyticfactory.h" +#include "../core/datafactory.h" +#include "../core/importexpressionmatrix_input.h" @@ -24,7 +24,7 @@ void TestImportExpressionMatrix::test() // create metadata QStringList geneNames; QStringList sampleNames; - QString noSampleToken {"NA"}; + QString nanToken {"NA"}; for ( int i = 0; i < numGenes; ++i ) { @@ -66,7 +66,7 @@ void 
TestImportExpressionMatrix::test() if ( std::isnan(value) ) { - stream << "\t" << noSampleToken; + stream << "\t" << nanToken; } else { @@ -84,7 +84,7 @@ void TestImportExpressionMatrix::test() auto manager = qobject_cast(abstractManager.release()); manager->set(ImportExpressionMatrix::Input::InputFile, txtPath); manager->set(ImportExpressionMatrix::Input::OutputData, emxPath); - manager->set(ImportExpressionMatrix::Input::NoSampleToken, noSampleToken); + manager->set(ImportExpressionMatrix::Input::NANToken, nanToken); // run analytic manager->initialize(); @@ -94,14 +94,14 @@ void TestImportExpressionMatrix::test() // read expression data from file std::unique_ptr dataRef {new Ace::DataObject(emxPath)}; ExpressionMatrix* matrix {dataRef->data()->cast()}; - std::unique_ptr expressions {matrix->dumpRawData()}; + QVector expressions {matrix->dumpRawData()}; // verify expression data float error = 0; for ( int i = 0; i < testExpressions.size(); ++i ) { - error += fabs(testExpressions[i] - expressions.get()[i]); + error += fabs(testExpressions[i] - expressions[i]); } error /= testExpressions.size(); diff --git a/tests/testimportexpressionmatrix.h b/src/tests/testimportexpressionmatrix.h similarity index 100% rename from tests/testimportexpressionmatrix.h rename to src/tests/testimportexpressionmatrix.h diff --git a/tests/testrmt.cpp b/src/tests/testrmt.cpp similarity index 91% rename from tests/testrmt.cpp rename to src/tests/testrmt.cpp index b6cbc67..0809fb6 100644 --- a/tests/testrmt.cpp +++ b/src/tests/testrmt.cpp @@ -3,10 +3,11 @@ #include #include "testrmt.h" -#include "analyticfactory.h" -#include "datafactory.h" -#include "rmt_input.h" -#include "correlationmatrix.h" +#include "../core/analyticfactory.h" +#include "../core/datafactory.h" +#include "../core/rmt_input.h" +#include "../core/correlationmatrix.h" +#include "../core/correlationmatrix_pair.h" diff --git a/tests/testrmt.h b/src/tests/testrmt.h similarity index 86% rename from tests/testrmt.h rename to 
src/tests/testrmt.h index f9ac81c..e85e020 100644 --- a/tests/testrmt.h +++ b/src/tests/testrmt.h @@ -2,7 +2,7 @@ #define TESTRMT_H #include -#include "pairwise_index.h" +#include "../core/pairwise_index.h" diff --git a/src/tests/tests.pro b/src/tests/tests.pro new file mode 100644 index 0000000..b725070 --- /dev/null +++ b/src/tests/tests.pro @@ -0,0 +1,39 @@ + +# Include common settings +include (../KINC.pri) + +# Basic settings +QT += testlib +TARGET = kinc-tests +TEMPLATE = app +CONFIG += debug + +# Source files +SOURCES += \ + testclustermatrix.cpp \ + testcorrelationmatrix.cpp \ + testexportcorrelationmatrix.cpp \ + testexportexpressionmatrix.cpp \ + testexpressionmatrix.cpp \ + testimportcorrelationmatrix.cpp \ + testimportexpressionmatrix.cpp \ + testrmt.cpp \ + testsimilarity.cpp \ + main.cpp + +HEADERS += \ + testclustermatrix.h \ + testcorrelationmatrix.h \ + testexportcorrelationmatrix.h \ + testexportexpressionmatrix.h \ + testexpressionmatrix.h \ + testimportcorrelationmatrix.h \ + testimportexpressionmatrix.h \ + testrmt.h \ + testsimilarity.h + +# Installation instructions +isEmpty(PREFIX) { PREFIX = /usr/local } +program.path = $${PREFIX}/bin +program.files = $${PWD}/../../build/tests/$${TARGET} +INSTALLS += program diff --git a/tests/testsimilarity.cpp b/src/tests/testsimilarity.cpp similarity index 88% rename from tests/testsimilarity.cpp rename to src/tests/testsimilarity.cpp index dda376b..beee434 100644 --- a/tests/testsimilarity.cpp +++ b/src/tests/testsimilarity.cpp @@ -3,9 +3,10 @@ #include #include "testsimilarity.h" -#include "analyticfactory.h" -#include "datafactory.h" -#include "similarity_input.h" +#include "../core/analyticfactory.h" +#include "../core/datafactory.h" +#include "../core/similarity_input.h" +#include "../core/expressionmatrix_gene.h" @@ -24,7 +25,6 @@ void TestSimilarity::test() // create metadata QStringList geneNames; QStringList sampleNames; - QString noSampleToken {"NA"}; for ( int i = 0; i < numGenes; ++i ) { @@ 
-52,9 +52,9 @@ void TestSimilarity::test() emx->initialize(geneNames, sampleNames); ExpressionMatrix::Gene gene(emx); - for ( int i = 0; i < emx->getGeneSize(); ++i ) + for ( int i = 0; i < emx->geneSize(); ++i ) { - for ( int j = 0; j < emx->getSampleSize(); ++j ) + for ( int j = 0; j < emx->sampleSize(); ++j ) { gene[j] = testExpressions[i * numSamples + j]; } @@ -62,8 +62,6 @@ void TestSimilarity::test() gene.write(i); } - emx->setTransform(ExpressionMatrix::Transform::None); - emxDataRef->data()->finish(); emxDataRef->finalize(); diff --git a/tests/testsimilarity.h b/src/tests/testsimilarity.h similarity index 89% rename from tests/testsimilarity.h rename to src/tests/testsimilarity.h index 74137b9..9b531c4 100644 --- a/tests/testsimilarity.h +++ b/src/tests/testsimilarity.h @@ -2,7 +2,7 @@ #define TESTSIMILARITY_H #include -#include "pairwise_index.h" +#include "../core/pairwise_index.h" diff --git a/tests/tests.pro b/tests/tests.pro deleted file mode 100644 index 7d22829..0000000 --- a/tests/tests.pro +++ /dev/null @@ -1,118 +0,0 @@ -# General build variables -TARGET = tests -TEMPLATE = app -CONFIG += c++11 debug - -# Qt libraries -QT += core testlib - -# external libraries -LIBS += -lOpenCL -L/usr/local/lib64/ -L$$(HOME)/software/lib -lacecore -lgsl -lgslcblas -llapack -llapacke -INCLUDEPATH += $$(HOME)/software/include -INCLUDEPATH += ../src - -# HACK -INCLUDEPATH += $$(HOME)/software/include/ace - -# Preprocessor defines -DEFINES += QT_DEPRECATED_WARNINGS - -# Source files -SOURCES += \ - ../src/analyticfactory.cpp \ - ../src/ccmatrix.cpp \ - ../src/correlationmatrix.cpp \ - ../src/datafactory.cpp \ - ../src/exportcorrelationmatrix_input.cpp \ - ../src/exportcorrelationmatrix.cpp \ - ../src/exportexpressionmatrix_input.cpp \ - ../src/exportexpressionmatrix.cpp \ - ../src/expressionmatrix.cpp \ - ../src/extract_input.cpp \ - ../src/extract.cpp \ - ../src/importcorrelationmatrix_input.cpp \ - ../src/importcorrelationmatrix.cpp \ - 
../src/importexpressionmatrix_input.cpp \ - ../src/importexpressionmatrix.cpp \ - ../src/pairwise_clustering.cpp \ - ../src/pairwise_correlation.cpp \ - ../src/pairwise_gmm.cpp \ - ../src/pairwise_index.cpp \ - ../src/pairwise_kmeans.cpp \ - ../src/pairwise_linalg.cpp \ - ../src/pairwise_matrix.cpp \ - ../src/pairwise_pearson.cpp \ - ../src/pairwise_spearman.cpp \ - ../src/rmt_input.cpp \ - ../src/rmt.cpp \ - ../src/similarity_input.cpp \ - ../src/similarity_opencl_fetchpair.cpp \ - ../src/similarity_opencl_gmm.cpp \ - ../src/similarity_opencl_kmeans.cpp \ - ../src/similarity_opencl_pearson.cpp \ - ../src/similarity_opencl_spearman.cpp \ - ../src/similarity_opencl_worker.cpp \ - ../src/similarity_opencl.cpp \ - ../src/similarity_resultblock.cpp \ - ../src/similarity_serial.cpp \ - ../src/similarity_workblock.cpp \ - ../src/similarity.cpp \ - testclustermatrix.cpp \ - testcorrelationmatrix.cpp \ - testexportcorrelationmatrix.cpp \ - testexportexpressionmatrix.cpp \ - testexpressionmatrix.cpp \ - testimportcorrelationmatrix.cpp \ - testimportexpressionmatrix.cpp \ - testrmt.cpp \ - testsimilarity.cpp \ - main.cpp - -HEADERS += \ - ../src/analyticfactory.h \ - ../src/ccmatrix.h \ - ../src/correlationmatrix.h \ - ../src/datafactory.h \ - ../src/expressionmatrix.h \ - ../src/extract_input.h \ - ../src/extract.h \ - ../src/exportcorrelationmatrix_input.h \ - ../src/exportcorrelationmatrix.h \ - ../src/exportexpressionmatrix_input.h \ - ../src/exportexpressionmatrix.h \ - ../src/importcorrelationmatrix_input.h \ - ../src/importcorrelationmatrix.h \ - ../src/importexpressionmatrix_input.h \ - ../src/importexpressionmatrix.h \ - ../src/pairwise_clustering.h \ - ../src/pairwise_correlation.h \ - ../src/pairwise_gmm.h \ - ../src/pairwise_index.h \ - ../src/pairwise_kmeans.h \ - ../src/pairwise_linalg.h \ - ../src/pairwise_matrix.h \ - ../src/pairwise_pearson.h \ - ../src/pairwise_spearman.h \ - ../src/rmt_input.h \ - ../src/rmt.h \ - ../src/similarity_input.h \ - 
../src/similarity_opencl_fetchpair.h \ - ../src/similarity_opencl_gmm.h \ - ../src/similarity_opencl_kmeans.h \ - ../src/similarity_opencl_pearson.h \ - ../src/similarity_opencl_spearman.h \ - ../src/similarity_opencl_worker.h \ - ../src/similarity_opencl.h \ - ../src/similarity_resultblock.h \ - ../src/similarity_serial.h \ - ../src/similarity_workblock.h \ - ../src/similarity.h \ - testclustermatrix.h \ - testcorrelationmatrix.h \ - testexportcorrelationmatrix.h \ - testexportexpressionmatrix.h \ - testexpressionmatrix.h \ - testimportcorrelationmatrix.h \ - testimportexpressionmatrix.h \ - testrmt.h \ - testsimilarity.h