Merge pull request #2028 from jairideout/issue-1933

Fixes #1933
biocore · May 22, 2015 · 8c5a051 · 8c5a051
2 parents 45ea5cd + 2c21109
commit 8c5a051
Show file tree

Hide file tree

Showing 7 changed files with 53 additions and 325 deletions.
diff --git a/ChangeLog.md b/ChangeLog.md
@@ -4,15 +4,18 @@ QIIME 1.9.0-dev
 Bug fixes
 ---------
 
-* Updated minimum required version of the [qiime-default-reference](http://github.com/biocore/qiime-default-reference) package to 0.1.2. This release includes an important bug fix described in more detail in [this QIIME blog post](https://qiime.wordpress.com/2015/04/15/qiime-1-9-0-bug-affecting-pynast-alignment-of-16s-amplicons-generated-with-non-515f806r-primers/) and in [biocore/qiime-default-reference#14](https://github.com/biocore/qiime-default-reference/issues/14).
+* Updated minimum required version of the [qiime-default-reference](http://github.com/biocore/qiime-default-reference) package to 0.1.2. **This release includes an important bug fix described in more detail in [this QIIME blog post](https://qiime.wordpress.com/2015/04/15/qiime-1-9-0-bug-affecting-pynast-alignment-of-16s-amplicons-generated-with-non-515f806r-primers/) and in [biocore/qiime-default-reference#14](https://github.com/biocore/qiime-default-reference/issues/14).**
+* Fixed bug in ``differential_abundance.py`` fitZIG algorithm ([#1960](https://github.com/biocore/qiime/pull/1960)). **This was a serious bug that was encountered when users would call ``differential_abundance.py -a metagenomeSeq_fitZIG``. Any results previosuly generated with that command should be re-run.**
+* Fixed serious bug in ``observation_metadata_correlation.py``, described in [#2009](https://github.com/biocore/qiime/issues/2009). **All previous output generated with ``observation_metadata_correlation.py`` was incorrect, and analyses using those results should be re-run.** This most commonly would have resulted in massive Type 2 error (false negatives), where observations whose abundance is correlated with metadata are not reported, though Type 1 error (false positives) are also possible.
 * ``count_seqs.py`` no longer fails on empty files. [#1991](https://github.com/biocore/qiime/issues/1991)
-* Fixed bug in ``differential_abundance.py`` fitZIG algorithm ([#1960](https://github.com/biocore/qiime/pull/1960)). This was a serious bug that was encountered when users would call ``differential_abundance.py -a metagenomeSeq_fitZIG``. Any results generated with that command in QIIME 1.9.0 should be re-run.
 * Updated minimum required version of [biom-format](http://github.com/biocore/biom-format) package to 2.1.4. This is a bug fix release. Details are available in the [biom-format ChangeLog](https://github.com/biocore/biom-format/blob/master/ChangeLog.md).
 * Updated minimum required version of [Emperor](http://github.com/biocore/emperor) package to 0.9.51.
 * Forced BIOM table type to "OTU table" for all tables written with QIIME. This fixes [#1928](https://github.com/biocore/qiime/issues/1928).
 * The ``--similarity`` option in ``pick_otus.py`` now only accepts sequence similarity thresholds between 0.0 and 1.0 (inclusive). Previous behavior would allow values outside this range, which would cause uninformative error messages to be raised by the external tools that ``pick_otus.py`` wraps ([#1979](https://github.com/biocore/qiime/issues/1979)).
 * ``split_libraries_fastq.py`` now explicitly disallows ``-p 0``. This could lead to empty sequences being written to the resulting output file ([#1984](https://github.com/biocore/qiime/issues/1984)).
-* Fixed issued where ``filter_samples_from_otu_table.py`` could only filter the mapping file when ``--valid_states`` was passed as the filtering method ([#2003](https://github.com/biocore/qiime/issues/2003)). 
+* Fixed issued where ``filter_samples_from_otu_table.py`` could only filter the mapping file when ``--valid_states`` was passed as the filtering method ([#2003](https://github.com/biocore/qiime/issues/2003)).
+* Fixed bug where distance matrix files generated by QIIME (e.g., using ``beta_diversity.py``) could have diagonals with values that were close to zero in rare cases (depending on input data, machine architecture, installed dependencies, etc.). These files could not be loaded by QIIME scripts that accepted distance matrix files as input (e.g., ``principal_coordinates.py``) and would result in an error message stating that the distance matrix was not hollow. Values on the diagonal that are close to zero are now set to 0.0 ([#1933](https://github.com/biocore/qiime/issues/1933)).
+
 
 Usability enhancements
 ----------------------

diff --git a/qiime/beta_diversity.py b/qiime/beta_diversity.py
@@ -122,74 +122,6 @@ def list_known_metrics():
         list_known_phylogenetic_metrics()
 
 
-class BetaDiversityCalc(FunctionWithParams):
-
-    """A BetaDiversityCalc takes taxon x sample counts, returns distance mat.
-
-    Some BetaDiversityCalcs also require a tree.
-
-    typical usage is:
-    calc = BetaDiversityCalc(my_metric, 'metric name', \
-        IsPhylogenetic=False)
-    calc(data_path=stuff, result_path=stuff2)
-    """
-    _tracked_properties = ['IsPhylogenetic', 'Metric']
-
-    def __init__(self, metric, name, is_phylogenetic=False, params=None):
-        """Return new BetaDiversityCalc object with specified params.
-
-        Note: metric must conform to the following interface:
-            must return 2d sample distance matrix with sample order preserved.
-            * phylogenetic methods: take samples by taxa array, taxon_names,
-            cogent PhyloNode tree
-            * nonphylo methods: take only samples by taxa array, numpy 2d
-
-        Not sure what params does
-        """
-        self.Metric = metric  # should be f(table, tree) -> dist matrix
-        self.Name = name
-        self.IsPhylogenetic = is_phylogenetic
-        self.Params = params or {}
-
-    def getResult(self, data_path, tree_path=None):
-        """Returns distance matrix from (indcidence matrix and optionally tree).
-
-        Parameters:
-
-        data_path: path to data file, matrix (samples = cols, taxa = rows)
-        in tab-delimited text format
-
-        tree_path: path or object.
-        if method is phylogenetic, must supply tree_path.
-        if path, path to
-        Newick-format tree file where taxon ids match taxon ids in the
-        input data file.
-
-        returns 2d dist matrix, list of sample names ordered as in dist mtx
-        """
-        # if it's a phylogenetic metric, read the tree
-        if self.IsPhylogenetic:
-            tree = self.getTree(tree_path)
-        else:
-            tree = None
-
-        otu_table = load_table(data_path)
-        otumtx = asarray([v for v in otu_table.iter_data(axis='sample')])
-
-        # get the 2d dist matrix from beta diversity analysis
-        if self.IsPhylogenetic:
-            return (self.Metric(otumtx, otu_table.ids(axis='observation'),
-                                tree, otu_table.ids()),
-                    list(otu_table.ids()))
-        else:
-            return self.Metric(otumtx), list(otu_table.ids())
-
-    def formatResult(self, result):
-        """Generate formatted distance matrix. result is (data, sample_names)"""
-        data, sample_names = result
-        return format_distance_matrix(sample_names, data)
-
-
 def single_file_beta(input_path, metrics, tree_path, output_dir,
                      rowids=None, full_tree=False):
     """ does beta diversity calc on a single otu table
@@ -293,114 +225,10 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,
                                              make_subtree=(not full_tree))
                         row_dissims.append(dissims)
 
-            # rows_outfilepath = os.path.join(output_dir, metric + '_' +\
-            #     '_'.join(rowids_list) + '_' + os.path.split(input_path)[1])
-            f = open(outfilepath, 'w')
-            f.write(
-                format_matrix(
-                    row_dissims,
-                    rowids_list,
-                    otu_table.ids()))
-            f.close()
-
-
-def single_object_beta(otu_table, metrics, tr, rowids=None,
-                       full_tree=False):
-    """mod of single_file_beta to recieve and return otu obj, tree str
-
-    uses name in metrics to name output beta diversity files
-    assumes input tree is already trimmed to contain only otus present
-    in otu_table, doesn't call getSubTree()
-    inputs:
-                otu_table -- a otu_table in the biom format
-                metrics -- metrics (str, comma delimited if more than 1 metric)
-                tr -- a phylonode cogent tree object if needed by the chosen beta
-                                        diversity metric
-                rowids -- comma seperated string
-    """
-    otumtx = asarray([v for v in otu_table.iter_sample_data()])
-
-    if tr:
-        tree = tr
-    else:
-        tree = None
-
-    metrics_list = metrics.split(',')
-
-    for metric in metrics_list:
-        try:
-            metric_f = get_nonphylogenetic_metric(metric)
-            is_phylogenetic = False
-        except AttributeError:
-            try:
-                metric_f = get_phylogenetic_metric(metric)
-                is_phylogenetic = True
-                if tree is None:
-                    stderr.write("metric %s requires a tree, but none found\n"
-                                 % (metric,))
-                    exit(1)
-            except AttributeError:
-                stderr.write("Could not find metric %s.\n\nKnown metrics are: %s\n"
-                             % (metric, ', '.join(list_known_metrics())))
-                exit(1)
-        if rowids is None:
-            # standard, full way
-            if is_phylogenetic:
-                dissims = metric_f(otumtx, otu_table.ids(axis='observation'),
-                                   tree, otu_table.ids(),
-                                   make_subtree=(not full_tree))
-            else:
-                dissims = metric_f(otumtx)
-
-            return (
-                format_distance_matrix(
-                    otu_table.ids(),
-                    dissims).split(
-                    '\n')
-            )
-        else:
-            # only calc d(rowid1, *) for each rowid
-            rowids_list = rowids.split(',')
-            row_dissims = []  # same order as rowids_list
-            for rowid in rowids_list:
-                rowidx = otu_table.index(rowid)
-
-                # first test if we can the dissim is a fn of only the pair
-                # if not, just calc the whole matrix
-                if metric_f.__name__ == 'dist_chisq' or \
-                        metric_f.__name__ == 'dist_gower' or \
-                        metric_f.__name__ == 'dist_hellinger' or\
-                        metric_f.__name__ == 'binary_dist_chisq':
-                    warnings.warn('dissimilarity ' + metric_f.__name__ +
-                                  ' is not parallelized, calculating the whole matrix...')
-                    row_dissims.append(metric_f(otumtx)[rowidx])
-                else:
-                    try:
-                        row_metric = get_phylogenetic_row_metric(metric)
-                    except AttributeError:
-                        # do element by element
-                        dissims = []
-                        sample_ids = otu_table.ids()
-                        observation_ids = otu_table.ids(axis='observation')
-                        for i in range(len(sample_ids)):
-                            if is_phylogenetic:
-                                dissim = metric_f(
-                                    otumtx[[rowidx, i], :], observation_ids,
-                                    tree, [sample_ids[rowidx], sample_ids[i]],
-                                    make_subtree=(not full_tree))[0, 1]
-                            else:
-                                dissim = metric_f(otumtx[[rowidx, i], :])[0, 1]
-                            dissims.append(dissim)
-                        row_dissims.append(dissims)
-                    else:
-                        # do whole row at once
-                        dissims = row_metric(otumtx,
-                                             otu_table.ids(axis='observation'),
-                                             tree, otu_table.ids(), rowid,
-                                             make_subtree=(not full_tree))
-                        row_dissims.append(dissims)
-
-            return format_matrix(row_dissims, rowids_list, otu_table.ids())
+            with open(outfilepath, 'w') as f:
+                f.write(format_matrix(row_dissims, rowids_list,
+                                      otu_table.ids(),
+                                      convert_matching_names_to_zero=True))
 
 
 def multiple_file_beta(input_path, output_dir, metrics, tree_path,

diff --git a/qiime/format.py b/qiime/format.py
@@ -87,7 +87,7 @@ def format_qiime_parameters(params, header="#QIIME parameters"):
 
             qiime_params.append(full_line)
     return qiime_params
-    
+
 
 def format_add_taxa_summary_mapping(summary, tax_order, mapping, header,
                                     delimiter=';'):
@@ -275,16 +275,30 @@ def format_otu_map(otu_map, otu_id_prefix):
 
 def format_distance_matrix(labels, data):
     """Writes distance matrix as tab-delimited text, uses format_matrix"""
-    return format_matrix(data, labels, labels)
+    return format_matrix(data, labels, labels,
+                         convert_matching_names_to_zero=True)
 
 
-def format_matrix(data, row_names, col_names):
+def format_matrix(data, row_names, col_names,
+                  convert_matching_names_to_zero=False):
     """Writes matrix as tab-delimited text.
 
     format is rows: samples,
     cols: metrics/ confidence levels, etc
 
     data: array or 2d list.
+
+    convert_matching_names_to_zero : bool
+        If ``True``, each element in `data` with a matching row name and column
+        name will be converted to 0.0 if the original value is close to zero
+        (as determined by ``numpy.isclose(value, 0.0)``). This is a necessary
+        evil to handle distance matrices with diagonals that are close to zero.
+        We have to perform this conversion here instead of simply checking the
+        distance matrix's diagonal because parallelized beta diversity
+        calculations will compute subsets of the entire distance matrix and we
+        don't know where the true diagonal will fall. See
+        https://github.com/biocore/qiime/issues/1933 for details.
+
     """
     len_col = len(col_names)
     try:
@@ -308,8 +322,14 @@ def format_matrix(data, row_names, col_names):
     col_names = map(str, col_names)
     # just in case they weren't strings initially
     lines.append('\t'.join([''] + col_names))
-    for sam, vals in zip(row_names, data):
-        lines.append('\t'.join([sam] + map(str, vals)))
+    for row_name, values in zip(row_names, data):
+        line = [row_name]
+        for col_name, value in zip(col_names, values):
+            if convert_matching_names_to_zero and col_name == row_name and \
+                    numpy.isclose(value, 0.0):
+                value = 0.0
+            line.append(str(value))
+        lines.append('\t'.join(line))
     return '\n'.join(lines)
 
 

diff --git a/qiime/parallel/beta_diversity.py b/qiime/parallel/beta_diversity.py
@@ -244,7 +244,6 @@ def assemble_distance_matrix(dm_components):
     """ assemble distance matrix components into a complete dm string
 
     """
-    print "I get called."
     data = {}
     # iterate over compenents
     for c in dm_components:
@@ -269,7 +268,7 @@ def assemble_distance_matrix(dm_components):
     dm = []
     # construct the dm one row at a time
     for l1 in labels:
-        dm.append([data[l1][l2] for l2 in labels])
+        dm.append([float(data[l1][l2]) for l2 in labels])
     # create the dm string and return it
     dm = format_distance_matrix(labels, dm)
     return dm
diff --git a/scripts/observation_metadata_correlation.py b/scripts/observation_metadata_correlation.py
@@ -199,7 +199,7 @@ def main():
 
     # sort the biom table so that feature values are retrieved in the same
     # order as the metadata in the samples they correspond to
-    bt.sort(sort_f = lambda _: samples_to_correlate, axis='sample')
+    bt = bt.sort(sort_f = lambda _: samples_to_correlate, axis='sample')
 
     if bt.shape[1] <= 3:
         option_parser.error(filtration_error_text)