Merge branch 'release/0.0.0.9006'

epurdom · Mar 18, 2016 · d9764a5 · d9764a5
2 parents cc9fd6e + 0d63a6a
commit d9764a5
Show file tree

Hide file tree

Showing 15 changed files with 1,113 additions and 25 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,4 +1,10 @@
-Unreleased:
+0.0.0.9006:
+
+* fixed so that mergeClusters, clusterHclust, and getBestGenes will appropriately convert if the input of clustering vector is a factor rather than numeric (with warning).
+* fixed mergeClusters to have option to indicate that input matrix is a count matrix (in which case will create dendrogram with log(counts+1) and will do getBestGenes with the voom correction) 
+* added more tutoral-oriented vignette (old vignette is now the documentation vignette with more detail about the internal workings of package). Currently is just simulated data, but will be updated to real single-cell sequencing dataset.
+
+0.0.0.9005
 
 * Changed simulated data so load all with data(simData) rather than separate calls for simData and simCount. Also added 'trueCluster' vector to give true cluster assignments of simulated data
 * added dendro example to getBestGenes

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: clusterCells
 Title: Compare clusterings for single-cell sequencing
-Version: 0.0.0.9005
+Version: 0.0.0.9006
 Description: This package provides functions for running and comparing many
     different clusterings of single-cell sequencing data.
 Authors@R: c(person("Elizabeth", "Purdom", email = "epurdom@stat.berkeley.edu",

diff --git a/R/clusterHclust.R b/R/clusterHclust.R
@@ -4,13 +4,15 @@
 #'
 #'
 #' @param dat data to define the mediods from. Samples on rows, features on columns
-#' @param cl A vector with cluster assignments to compare to clRef.``-1'' indicates the sample was not assigned to a cluster.
+#' @param cl A numeric vector with cluster assignments to compare to clRef.``-1'' indicates the sample was not assigned to a cluster.
 #' @param full Logical as to whether to expand to include all samples as tips (useful for heatmap), or to have dendrogram just of the clusters (useful for getBestGenes)
 #' @param unassigned what to do with with unassigned ("-1") cluster; only relevant if full=TRUE. Default is to remove those samples, so that only get heatmap of clustered observations. ``outgroup'' puts the unassigned into an outgroup on the dendrogram. ``cluster'' clusters them with the mediods. 
 #' @param ... arguments passed to hclust
 #' @details If full=TRUE, then the expanded dendrogram is created by giving all of the samples the mediod of the cluster and applying hclust to it; if further unassigned=="cluster", then the expanded dendrogram is created by hclust of the expanded mediod data plus the original unclustered observations.
 #'
 #'
+#' @return A dendrogram for the clusters. If full=TRUE, the number of leaves is equal to the number of samples. If full=FALSE, the number of leaves is equal to the number of clusters.
+
 #' @examples
 #' data(simData)
 #' #create a clustering, for 8 clusters (truth was 3)
@@ -29,6 +31,9 @@ clusterHclust<-function(dat,cl,full=TRUE,unassigned=c("remove","outgroup","clust
 	unassigned<-match.arg(unassigned)
 	if(nrow(dat)!=length(cl)) stop("cl must be the same length as the number of rows of dat")
 	if(is.null(rownames(dat))) rownames(dat)<-as.character(1:nrow(dat))
+	if(is.factor(cl)){warning("cl is a factor. Converting to numeric, which may not result in valid conversion")
+			cl<-as.numeric(as.character(cl))
+	}
 	whRm<- which(cl!=-1)
 	clFactor<-factor(cl[whRm])
 	mediods<-do.call("rbind",by(dat[whRm,],clFactor,function(x){apply(x,2,median)}))
@@ -86,36 +91,59 @@ clusterHclust<-function(dat,cl,full=TRUE,unassigned=c("remove","outgroup","clust
 #'
 #'
 #' @param dat data to perform the test on. Samples on rows, features on columns
-#' @param cl A vector with cluster assignments to compare to clRef.``-1'' indicates the sample was not assigned to a cluster.
+#' @param cl A numeric vector with cluster assignments to compare to clRef.``-1'' indicates the sample was not assigned to a cluster.
+#' @param dendro If provided, is dendrogram providing hierarchical clustering of clusters in cl; mainly useful to speed up calculations if clusterHclust has already been called on the (cl,dat) pair. The default (NULL) is to recalculate it with the given (cl,dat) pair.
 #' @param mergeMethod method for calculating proportion of non-null that will be used to merge clusters (if 'none', no merging will be done). See details for description of methods. 
 #' @param cutoff cutoff for merging clusters based on the proportion of non-nulls in the comparison of the clusters (i.e. value between 0,1, where lower values will make it harder to merge clusters).
 #' @param plotType what type of plotting of dendrogram. If 'all', then all the estimates of proportion non-null will be plotted; if 'mergeMethod', then only the value used in the merging is plotted for each node. 
+#' @param countData logical as to whether input data is a count matrix (in which case log(count+1) will be used for \code{\link{clusterHclust}} and voom correction will be used in \code{\link{getBestGenes}}).
 #' @param ... arguments passed to the \code{\link{plot.phylo}} function of \code{ade4} that plots the dendrogram. 
 #' 
 #' @details "JC" refers to the method of Ji and Cai (2007), and implementation of "JC" method is copied from code available on Jiashin Ji's website, December 16, 2015 (http://www.stat.cmu.edu/~jiashun/Research/software/NullandProp/). "locfdr" refers to the method of Efron (2004) and is implemented in the package \code{\link{locfdr}}. "MB" refers to the method of Meinshausen and Buhlmann (2005) and is implemented in the package \code{\link{howmany}}. "adjP" refers to the proportion of genes that are found significant based on a FDR adjusted p-values (method "BH") and a cutoff of 0.05. 
 #'
 #' @details If \code{mergeMethod} is not equal to 'none' then the plotting will indicate the merged clusters (assuming \code{plotType} is not 'none'). 
+#' @return Returns (invisibly) a list with elements
+#' \itemize{
+
+#' \item{\code{clustering}}{a vector of length equal to nrows(dat) giving the integer-valued cluster ids for each sample. "-1" indicates the sample was not clustered.}
+#' \item{\code{oldClToNew}}{A table of the old cluster labels to the new cluster labels}
+#' \item{\code{propDE}}{A table of the proportions that are DE on each node}
+#' \item{\code{originalClusterDendro}}{The dendrogram on which the merging was based (based on the original clustering)}
+#' }
 #' @examples
 #' data(simData)
 #'  #create a clustering, for 8 clusters (truth was 3)
 #'  cl<-clusterAll(simData,clusterFunction="pam",subsample=FALSE,
 #'  sequential=FALSE, clusterDArgs=list(k=8))$cl
 #' #merge clusters with plotting. Note argument 'use.edge.length' can improve readability
 #' mergeResults<-mergeClusters(simData,cl=cl,plot=TRUE,plotType="all",
-#' mergeMethod="adjP",use.edge.length=FALSE)
+#' mergeMethod="adjP",use.edge.length=FALSE,countData=FALSE)
 #' #compare merged to original on heatmap
 #' hclData<-clusterHclust(dat=simData,cl,full=TRUE)
 #' dualHeatmap(cl,heatData=simCount,clusterData=hclData,colorScale=seqPal5,
 #'	annCol=data.frame(Original=cl,Merged=mergeResults$cl,Truth=trueCluster))
 
 
-mergeClusters<-function(dat,cl,mergeMethod=c("none","adjP","locfdr","MB","JC"),cutoff=0.1,plotType=c("none","all","mergeMethod"),...){
-	dendro<-clusterHclust(dat=dat,cl,full=FALSE)
+mergeClusters<-function(dat,cl,dendro=NULL,mergeMethod=c("none","adjP","locfdr","MB","JC"),cutoff=0.1,plotType=c("none","all","mergeMethod"),countData=TRUE,...){
+	if(is.factor(cl)){warning("cl is a factor. Converting to numeric, which may not result in valid conversion")
+			cl<-as.numeric(as.character(cl))
+	}
+	if(!is.null(dendro)){
+		#check valid
+		ncluster<-length(table(cl[cl>0]))
+		if(nobs(dendro)!=ncluster){
+			warning("Not a valid input dendrogram (not equal to the number of non -1 clusters in cl). Will recompute dendrogram")
+			dendro<-NULL
+		}
+	}
+	if(is.null(dendro)){
+		dendro<-if(countData) clusterHclust(dat=log(dat+1),cl,full=FALSE) else clusterHclust(dat=dat,cl,full=FALSE)		
+	}
   	mergeMethod<-match.arg(mergeMethod)
   	plotType<-match.arg(plotType)
 	if(plotType=="mergeValue" & mergeMethod=="none") stop("can only plot merge method values if one method is selected")
 	#get test-statistics for the contrasts corresponding to each node (and return all)
-	sigTable<-getBestGenes(cl,dat,type=c("Dendro"),dendro=dendro,returnType=c("Table"),contrastAdj=c("All"),number=ncol(dat),p.value=1)
+	sigTable<-getBestGenes(cl,dat,type=c("Dendro"),dendro=dendro,returnType=c("Table"),contrastAdj=c("All"),number=ncol(dat),p.value=1,voomCorrection=countData)
 	#divide table into each node.
 	sigByNode<-by(sigTable,sigTable$ContrastName,function(x){
 		mb<-.myTryFunc(pvalues=x$P.Value,FUN=.m1_MB)
@@ -187,7 +215,7 @@ mergeClusters<-function(dat,cl,mergeMethod=c("none","adjP","locfdr","MB","JC"),c
 	nodePropTable<-do.call("rbind",sigByNode)
 	nodePropTable<-data.frame("Node"=names(sigByNode),"Contrast"=sigTable$Contrast[match(names(sigByNode),sigTable$ContrastName)],nodePropTable)
 
-	invisible(list(cl=newcl,oldClToNew=table(Original=cl,New=newcl),propDE=nodePropTable,originalClusterDendro=dendro))
+	invisible(list(clustering=newcl,oldClToNew=table(Original=cl,New=newcl),propDE=nodePropTable,originalClusterDendro=dendro))
 }
 
 .myTryFunc<-function(FUN,...){

diff --git a/R/dualHeatmap.R b/R/dualHeatmap.R
@@ -14,7 +14,7 @@
 #' @name dualHeatmap
 #' @aliases dualHeatmap
 #' @docType methods
-#' @param clusterVector A vector with cluster assignments to show at top of heatmap
+#' @param clusterVector A numeric vector with cluster assignments to show at top of heatmap
 #' with cells. ``-1'' indicates the sample was not assigned to a cluster and
 #' gets color `white'.
 #' @param heatData matrix to define the color scale (i.e. the counts). Will be

diff --git a/R/getGenes.R b/R/getGenes.R
@@ -21,14 +21,14 @@
 
 #' @title Function for finding best genes associated with clusters
 #' @description Calls limma on input data to determine genes most associated with found clusters (either pairwise comparisons, or following a tree that clusters the clusters).
-#' @param cl A vector with cluster assignments to compare to clRef.``-1'' indicates the sample was not assigned to a cluster.
+#' @param cl A numeric vector with cluster assignments to compare to clRef.``-1'' indicates the sample was not assigned to a cluster.
 #' @param dat data for the test, samples in rows
 #' @param type What type of test to do. `F' gives the omnibus F-statistic, `Dendro' traverses the given dendrogram and does contrasts of the samples in each side,  `Pairs' does pair-wise contrasts based on the pairs given in pairMat (if pairMat=NULL, does all pairwise), and `OneAgainstAll' compares each cluster to the average of all others. 
 #' @param dendro The dendrogram to traverse if type="Dendro". Note that this should be the dendrogram of the clusters, not of the individual samples.
 #' @param pairMat matrix giving the pairs of clusters for which to do pair-wise contrasts (must match to elements of cl). If NULL, will do all pairwise of the clusters in \code{cl} (excluding "-1" categories). Each row is a pair to be compared and must match the names of the clusters in the vector \code{cl}. 
 #' @param returnType Whether to return the index of genes, or the full table given by topTable or topTableF.
 #' @param contrastAdj What type of FDR correction to do for contrasts tests (i.e. if type='Dendro' or 'Pairs'). 
-#' @param voomCorrection Whether to perform voom correction on counts. If input to dat is count matrix, should be set to TRUE. Otherwise, dat should be log of counts (which is not preferable). Currently default is set to FALSE simply because it has not been heavily tested, but should be the default for RNA-Seq data.
+#' @param voomCorrection Whether to perform voom correction to data, e.g. if input data matrix is counts/TPM/FPKM. If input to dat consists of counts/TPM/FPKM, this argument should be set to TRUE. Otherwise, dat should be something like log of counts (which is not preferable for count data) or some other kind of similar input data that does not need a variance stabilization (e.g. microarray data). Currently the default is set to FALSE, simply because the voomCorrection has not been heavily tested. But TRUE with dat being counts/TPM/FPKM really should be the default for RNA-Seq data.
 #' @param ... options to pass to \code{\link{topTable}} or \code{\link{topTableF}} (see \code{\link{limma}} package)
 #' @details getBestGenes returns best genes corresponding to a cluster assignment. It uses limma to fit the models, and limma's functions topTable or topTableF to find the best genes.  See the options of these functions to put better control on what gets returned (e.g. only if significant, only if log-fc is above a certain amount, etc.). In particular, set 'number=' to define how many significant genes to return (where number is per contrast for the 'Pairs' or 'Dendro' option)
 #'
@@ -38,7 +38,16 @@
 #'
 #' @details  Note that the default option for \code{\link{topTable}} is to not filter based on adjusted p-values (\code{p.value=1}) and return only the top 10 most significant (\code{number=10}) -- these are options the user can change (these arguments are passed via the \code{...} in getBestGenes). In particular, it only makes sense to set \code{requireF=TRUE} if \code{p.value} is meaningful (e.g. 0.1 or 0.05); the default value of \code{p.value=1} will not result in any effect on the adjusted p-value otherwise. 
 #'
-#' @return A data.frame in the same format as \code{\link{topTable}}, except that the column name \code{ProbeID} is changed to \code{Gene} and a column \code{IndexInOriginal} is provided to give the row index of the gene in the original dataset.
+#' @return A data.frame in the same format as \code{\link{topTable}}, except for the following additional or changed columns:
+#' \itemize{
+
+#' \item{\code{Gene}}{This is the column called 'ProbeID' by \code{\link{topTable}}}
+
+#' \item{\code{IndexInOriginal}}{Gives the index of the gene to the original input dataset, \code{dat}}
+
+#' \item{\code{Contrast}}{The contrast that the results corresponds to (if applicable, depends on \code{type} argument)}
+#' \item{\code{ContrastName}}{The name of the contrast that the results corresponds to. For dendrogram searches, this will be the node of the tree of the dendrogram.}
+#' }
 #'
 #' @examples
 #' data(simData)
@@ -79,6 +88,9 @@
 getBestGenes<-function(cl,dat,type=c("F","Dendro","Pairs","OneAgainstAll"),dendro=NULL,pairMat=NULL,returnType=c("Table","Index"),contrastAdj=c("All","PerContrast","AfterF"),voomCorrection=FALSE,...){
 	#... is always sent to topTable, and nothing else
 #	require(limma)
+	if(is.factor(cl)){warning("cl is a factor. Converting to numeric, which may not result in valid conversion")
+			cl<-as.numeric(as.character(cl))
+	}
 	dat<-t(data.matrix(dat))
 	type<-match.arg(type)
 	contrastAdj<-match.arg(contrastAdj)
@@ -283,7 +295,9 @@ getBestGenes<-function(cl,dat,type=c("F","Dendro","Pairs","OneAgainstAll"),dendr
 		colnames(tt)[colnames(tt)=="ID"]<-"Gene"
 		if(nrow(tt)>0){
 			tt<-data.frame("Contrast"=unname(contrastNames[ii]),tt,row.names=NULL)
-			if(!is.null(names(contrastNames))) tt<-data.frame("ContrastName"=names(contrastNames)[ii],tt,row.names=NULL)
+			if(!is.null(names(contrastNames))){
+				 tt<-data.frame("ContrastName"=names(contrastNames)[ii],tt,row.names=NULL)
+			}
 		}
 		return(tt)
 		}))

diff --git a/man/clusterHclust.Rd b/man/clusterHclust.Rd
@@ -10,14 +10,17 @@ clusterHclust(dat, cl, full = TRUE, unassigned = c("remove", "outgroup",
 \arguments{
 \item{dat}{data to define the mediods from. Samples on rows, features on columns}
 
-\item{cl}{A vector with cluster assignments to compare to clRef.``-1'' indicates the sample was not assigned to a cluster.}
+\item{cl}{A numeric vector with cluster assignments to compare to clRef.``-1'' indicates the sample was not assigned to a cluster.}
 
 \item{full}{Logical as to whether to expand to include all samples as tips (useful for heatmap), or to have dendrogram just of the clusters (useful for getBestGenes)}
 
 \item{unassigned}{what to do with with unassigned ("-1") cluster; only relevant if full=TRUE. Default is to remove those samples, so that only get heatmap of clustered observations. ``outgroup'' puts the unassigned into an outgroup on the dendrogram. ``cluster'' clusters them with the mediods.}
 
 \item{...}{arguments passed to hclust}
 }
+\value{
+A dendrogram for the clusters. If full=TRUE, the number of leaves is equal to the number of samples. If full=FALSE, the number of leaves is equal to the number of clusters.
+}
 \description{
 Makes a dendrogram of a set of clusters based on hclust on the mediods of the cluster.
 }

diff --git a/man/compareChoices.Rd b/man/compareChoices.Rd
@@ -79,7 +79,8 @@ ks=2:4,findBestK=c(TRUE,FALSE),run=FALSE)
 checkParams <- compareChoices(lapply(ps,function(p){pcaData$x[,1:p]}), clusterMethod="pam",
 ks=2:4,findBestK=c(TRUE,FALSE),run=FALSE,subsampleArgs=list("k"=3))
 #Now actually run it
-cl <- compareChoices(lapply(ps,function(p){pcaData$x[,1:p]}), clusterMethod="pam",ks=2:4,findBestK=c(TRUE,FALSE),
+cl <- compareChoices(lapply(ps,function(p){pcaData$x[,1:p]}),
+clusterMethod="pam",ks=2:4,findBestK=c(TRUE,FALSE),
 subsampleArgs=list("k"=3))
 colnames(cl$clMat)
 #make names shorter for plotting

diff --git a/man/dualHeatmap.Rd b/man/dualHeatmap.Rd
@@ -13,7 +13,7 @@ dualHeatmap(clusterVector, heatData, clusterData = heatData, eps = 1,
   unassignedColor = "white", missingColor = "grey", ...)
 }
 \arguments{
-\item{clusterVector}{A vector with cluster assignments to show at top of heatmap
+\item{clusterVector}{A numeric vector with cluster assignments to show at top of heatmap
 with cells. ``-1'' indicates the sample was not assigned to a cluster and
 gets color `white'.}
 

diff --git a/man/getBestGenes.Rd b/man/getBestGenes.Rd
@@ -10,7 +10,7 @@ getBestGenes(cl, dat, type = c("F", "Dendro", "Pairs", "OneAgainstAll"),
   ...)
 }
 \arguments{
-\item{cl}{A vector with cluster assignments to compare to clRef.``-1'' indicates the sample was not assigned to a cluster.}
+\item{cl}{A numeric vector with cluster assignments to compare to clRef.``-1'' indicates the sample was not assigned to a cluster.}
 
 \item{dat}{data for the test, samples in rows}
 
@@ -24,12 +24,18 @@ getBestGenes(cl, dat, type = c("F", "Dendro", "Pairs", "OneAgainstAll"),
 
 \item{contrastAdj}{What type of FDR correction to do for contrasts tests (i.e. if type='Dendro' or 'Pairs').}
 
-\item{voomCorrection}{Whether to perform voom correction on counts. If input to dat is count matrix, should be set to TRUE. Otherwise, dat should be log of counts (which is not preferable). Currently default is set to FALSE simply because it has not been heavily tested, but should be the default for RNA-Seq data.}
+\item{voomCorrection}{Whether to perform voom correction to data, e.g. if input data matrix is counts/TPM/FPKM. If input to dat consists of counts/TPM/FPKM, this argument should be set to TRUE. Otherwise, dat should be something like log of counts (which is not preferable for count data) or some other kind of similar input data that does not need a variance stabilization (e.g. microarray data). Currently the default is set to FALSE, simply because the voomCorrection has not been heavily tested. But TRUE with dat being counts/TPM/FPKM really should be the default for RNA-Seq data.}
 
 \item{...}{options to pass to \code{\link{topTable}} or \code{\link{topTableF}} (see \code{\link{limma}} package)}
 }
 \value{
-A data.frame in the same format as \code{\link{topTable}}, except that the column name \code{ProbeID} is changed to \code{Gene} and a column \code{IndexInOriginal} is provided to give the row index of the gene in the original dataset.
+A data.frame in the same format as \code{\link{topTable}}, except for the following additional or changed columns:
+\itemize{
+\item{\code{Gene}}{This is the column called 'ProbeID' by \code{\link{topTable}}}
+\item{\code{IndexInOriginal}}{Gives the index of the gene to the original input dataset, \code{dat}}
+\item{\code{Contrast}}{The contrast that the results corresponds to (if applicable, depends on \code{type} argument)}
+\item{\code{ContrastName}}{The name of the contrast that the results corresponds to. For dendrogram searches, this will be the node of the tree of the dendrogram.}
+}
 }
 \description{
 Calls limma on input data to determine genes most associated with found clusters (either pairwise comparisons, or following a tree that clusters the clusters).