Skip to content
This repository has been archived by the owner on Nov 9, 2023. It is now read-only.

Commit

Permalink
Merge pull request #2028 from jairideout/issue-1933
Browse files Browse the repository at this point in the history
Fixes #1933
  • Loading branch information
gregcaporaso committed May 22, 2015
2 parents 45ea5cd + 2c21109 commit 8c5a051
Show file tree
Hide file tree
Showing 7 changed files with 53 additions and 325 deletions.
9 changes: 6 additions & 3 deletions ChangeLog.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,18 @@ QIIME 1.9.0-dev
Bug fixes
---------

* Updated minimum required version of the [qiime-default-reference](http://github.com/biocore/qiime-default-reference) package to 0.1.2. This release includes an important bug fix described in more detail in [this QIIME blog post](https://qiime.wordpress.com/2015/04/15/qiime-1-9-0-bug-affecting-pynast-alignment-of-16s-amplicons-generated-with-non-515f806r-primers/) and in [biocore/qiime-default-reference#14](https://github.com/biocore/qiime-default-reference/issues/14).
* Updated minimum required version of the [qiime-default-reference](http://github.com/biocore/qiime-default-reference) package to 0.1.2. **This release includes an important bug fix described in more detail in [this QIIME blog post](https://qiime.wordpress.com/2015/04/15/qiime-1-9-0-bug-affecting-pynast-alignment-of-16s-amplicons-generated-with-non-515f806r-primers/) and in [biocore/qiime-default-reference#14](https://github.com/biocore/qiime-default-reference/issues/14).**
* Fixed bug in ``differential_abundance.py`` fitZIG algorithm ([#1960](https://github.com/biocore/qiime/pull/1960)). **This was a serious bug that was encountered when users would call ``differential_abundance.py -a metagenomeSeq_fitZIG``. Any results previosuly generated with that command should be re-run.**
* Fixed serious bug in ``observation_metadata_correlation.py``, described in [#2009](https://github.com/biocore/qiime/issues/2009). **All previous output generated with ``observation_metadata_correlation.py`` was incorrect, and analyses using those results should be re-run.** This most commonly would have resulted in massive Type 2 error (false negatives), where observations whose abundance is correlated with metadata are not reported, though Type 1 error (false positives) are also possible.
* ``count_seqs.py`` no longer fails on empty files. [#1991](https://github.com/biocore/qiime/issues/1991)
* Fixed bug in ``differential_abundance.py`` fitZIG algorithm ([#1960](https://github.com/biocore/qiime/pull/1960)). This was a serious bug that was encountered when users would call ``differential_abundance.py -a metagenomeSeq_fitZIG``. Any results generated with that command in QIIME 1.9.0 should be re-run.
* Updated minimum required version of [biom-format](http://github.com/biocore/biom-format) package to 2.1.4. This is a bug fix release. Details are available in the [biom-format ChangeLog](https://github.com/biocore/biom-format/blob/master/ChangeLog.md).
* Updated minimum required version of [Emperor](http://github.com/biocore/emperor) package to 0.9.51.
* Forced BIOM table type to "OTU table" for all tables written with QIIME. This fixes [#1928](https://github.com/biocore/qiime/issues/1928).
* The ``--similarity`` option in ``pick_otus.py`` now only accepts sequence similarity thresholds between 0.0 and 1.0 (inclusive). Previous behavior would allow values outside this range, which would cause uninformative error messages to be raised by the external tools that ``pick_otus.py`` wraps ([#1979](https://github.com/biocore/qiime/issues/1979)).
* ``split_libraries_fastq.py`` now explicitly disallows ``-p 0``. This could lead to empty sequences being written to the resulting output file ([#1984](https://github.com/biocore/qiime/issues/1984)).
* Fixed issued where ``filter_samples_from_otu_table.py`` could only filter the mapping file when ``--valid_states`` was passed as the filtering method ([#2003](https://github.com/biocore/qiime/issues/2003)).
* Fixed issued where ``filter_samples_from_otu_table.py`` could only filter the mapping file when ``--valid_states`` was passed as the filtering method ([#2003](https://github.com/biocore/qiime/issues/2003)).
* Fixed bug where distance matrix files generated by QIIME (e.g., using ``beta_diversity.py``) could have diagonals with values that were close to zero in rare cases (depending on input data, machine architecture, installed dependencies, etc.). These files could not be loaded by QIIME scripts that accepted distance matrix files as input (e.g., ``principal_coordinates.py``) and would result in an error message stating that the distance matrix was not hollow. Values on the diagonal that are close to zero are now set to 0.0 ([#1933](https://github.com/biocore/qiime/issues/1933)).


Usability enhancements
----------------------
Expand Down
180 changes: 4 additions & 176 deletions qiime/beta_diversity.py
Original file line number Diff line number Diff line change
Expand Up @@ -122,74 +122,6 @@ def list_known_metrics():
list_known_phylogenetic_metrics()


class BetaDiversityCalc(FunctionWithParams):

"""A BetaDiversityCalc takes taxon x sample counts, returns distance mat.
Some BetaDiversityCalcs also require a tree.
typical usage is:
calc = BetaDiversityCalc(my_metric, 'metric name', \
IsPhylogenetic=False)
calc(data_path=stuff, result_path=stuff2)
"""
_tracked_properties = ['IsPhylogenetic', 'Metric']

def __init__(self, metric, name, is_phylogenetic=False, params=None):
"""Return new BetaDiversityCalc object with specified params.
Note: metric must conform to the following interface:
must return 2d sample distance matrix with sample order preserved.
* phylogenetic methods: take samples by taxa array, taxon_names,
cogent PhyloNode tree
* nonphylo methods: take only samples by taxa array, numpy 2d
Not sure what params does
"""
self.Metric = metric # should be f(table, tree) -> dist matrix
self.Name = name
self.IsPhylogenetic = is_phylogenetic
self.Params = params or {}

def getResult(self, data_path, tree_path=None):
"""Returns distance matrix from (indcidence matrix and optionally tree).
Parameters:
data_path: path to data file, matrix (samples = cols, taxa = rows)
in tab-delimited text format
tree_path: path or object.
if method is phylogenetic, must supply tree_path.
if path, path to
Newick-format tree file where taxon ids match taxon ids in the
input data file.
returns 2d dist matrix, list of sample names ordered as in dist mtx
"""
# if it's a phylogenetic metric, read the tree
if self.IsPhylogenetic:
tree = self.getTree(tree_path)
else:
tree = None

otu_table = load_table(data_path)
otumtx = asarray([v for v in otu_table.iter_data(axis='sample')])

# get the 2d dist matrix from beta diversity analysis
if self.IsPhylogenetic:
return (self.Metric(otumtx, otu_table.ids(axis='observation'),
tree, otu_table.ids()),
list(otu_table.ids()))
else:
return self.Metric(otumtx), list(otu_table.ids())

def formatResult(self, result):
"""Generate formatted distance matrix. result is (data, sample_names)"""
data, sample_names = result
return format_distance_matrix(sample_names, data)


def single_file_beta(input_path, metrics, tree_path, output_dir,
rowids=None, full_tree=False):
""" does beta diversity calc on a single otu table
Expand Down Expand Up @@ -293,114 +225,10 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,
make_subtree=(not full_tree))
row_dissims.append(dissims)

# rows_outfilepath = os.path.join(output_dir, metric + '_' +\
# '_'.join(rowids_list) + '_' + os.path.split(input_path)[1])
f = open(outfilepath, 'w')
f.write(
format_matrix(
row_dissims,
rowids_list,
otu_table.ids()))
f.close()


def single_object_beta(otu_table, metrics, tr, rowids=None,
full_tree=False):
"""mod of single_file_beta to recieve and return otu obj, tree str
uses name in metrics to name output beta diversity files
assumes input tree is already trimmed to contain only otus present
in otu_table, doesn't call getSubTree()
inputs:
otu_table -- a otu_table in the biom format
metrics -- metrics (str, comma delimited if more than 1 metric)
tr -- a phylonode cogent tree object if needed by the chosen beta
diversity metric
rowids -- comma seperated string
"""
otumtx = asarray([v for v in otu_table.iter_sample_data()])

if tr:
tree = tr
else:
tree = None

metrics_list = metrics.split(',')

for metric in metrics_list:
try:
metric_f = get_nonphylogenetic_metric(metric)
is_phylogenetic = False
except AttributeError:
try:
metric_f = get_phylogenetic_metric(metric)
is_phylogenetic = True
if tree is None:
stderr.write("metric %s requires a tree, but none found\n"
% (metric,))
exit(1)
except AttributeError:
stderr.write("Could not find metric %s.\n\nKnown metrics are: %s\n"
% (metric, ', '.join(list_known_metrics())))
exit(1)
if rowids is None:
# standard, full way
if is_phylogenetic:
dissims = metric_f(otumtx, otu_table.ids(axis='observation'),
tree, otu_table.ids(),
make_subtree=(not full_tree))
else:
dissims = metric_f(otumtx)

return (
format_distance_matrix(
otu_table.ids(),
dissims).split(
'\n')
)
else:
# only calc d(rowid1, *) for each rowid
rowids_list = rowids.split(',')
row_dissims = [] # same order as rowids_list
for rowid in rowids_list:
rowidx = otu_table.index(rowid)

# first test if we can the dissim is a fn of only the pair
# if not, just calc the whole matrix
if metric_f.__name__ == 'dist_chisq' or \
metric_f.__name__ == 'dist_gower' or \
metric_f.__name__ == 'dist_hellinger' or\
metric_f.__name__ == 'binary_dist_chisq':
warnings.warn('dissimilarity ' + metric_f.__name__ +
' is not parallelized, calculating the whole matrix...')
row_dissims.append(metric_f(otumtx)[rowidx])
else:
try:
row_metric = get_phylogenetic_row_metric(metric)
except AttributeError:
# do element by element
dissims = []
sample_ids = otu_table.ids()
observation_ids = otu_table.ids(axis='observation')
for i in range(len(sample_ids)):
if is_phylogenetic:
dissim = metric_f(
otumtx[[rowidx, i], :], observation_ids,
tree, [sample_ids[rowidx], sample_ids[i]],
make_subtree=(not full_tree))[0, 1]
else:
dissim = metric_f(otumtx[[rowidx, i], :])[0, 1]
dissims.append(dissim)
row_dissims.append(dissims)
else:
# do whole row at once
dissims = row_metric(otumtx,
otu_table.ids(axis='observation'),
tree, otu_table.ids(), rowid,
make_subtree=(not full_tree))
row_dissims.append(dissims)

return format_matrix(row_dissims, rowids_list, otu_table.ids())
with open(outfilepath, 'w') as f:
f.write(format_matrix(row_dissims, rowids_list,
otu_table.ids(),
convert_matching_names_to_zero=True))


def multiple_file_beta(input_path, output_dir, metrics, tree_path,
Expand Down
30 changes: 25 additions & 5 deletions qiime/format.py
Original file line number Diff line number Diff line change
Expand Up @@ -87,7 +87,7 @@ def format_qiime_parameters(params, header="#QIIME parameters"):

qiime_params.append(full_line)
return qiime_params


def format_add_taxa_summary_mapping(summary, tax_order, mapping, header,
delimiter=';'):
Expand Down Expand Up @@ -275,16 +275,30 @@ def format_otu_map(otu_map, otu_id_prefix):

def format_distance_matrix(labels, data):
"""Writes distance matrix as tab-delimited text, uses format_matrix"""
return format_matrix(data, labels, labels)
return format_matrix(data, labels, labels,
convert_matching_names_to_zero=True)


def format_matrix(data, row_names, col_names):
def format_matrix(data, row_names, col_names,
convert_matching_names_to_zero=False):
"""Writes matrix as tab-delimited text.
format is rows: samples,
cols: metrics/ confidence levels, etc
data: array or 2d list.
convert_matching_names_to_zero : bool
If ``True``, each element in `data` with a matching row name and column
name will be converted to 0.0 if the original value is close to zero
(as determined by ``numpy.isclose(value, 0.0)``). This is a necessary
evil to handle distance matrices with diagonals that are close to zero.
We have to perform this conversion here instead of simply checking the
distance matrix's diagonal because parallelized beta diversity
calculations will compute subsets of the entire distance matrix and we
don't know where the true diagonal will fall. See
https://github.com/biocore/qiime/issues/1933 for details.
"""
len_col = len(col_names)
try:
Expand All @@ -308,8 +322,14 @@ def format_matrix(data, row_names, col_names):
col_names = map(str, col_names)
# just in case they weren't strings initially
lines.append('\t'.join([''] + col_names))
for sam, vals in zip(row_names, data):
lines.append('\t'.join([sam] + map(str, vals)))
for row_name, values in zip(row_names, data):
line = [row_name]
for col_name, value in zip(col_names, values):
if convert_matching_names_to_zero and col_name == row_name and \
numpy.isclose(value, 0.0):
value = 0.0
line.append(str(value))
lines.append('\t'.join(line))
return '\n'.join(lines)


Expand Down
3 changes: 1 addition & 2 deletions qiime/parallel/beta_diversity.py
Original file line number Diff line number Diff line change
Expand Up @@ -244,7 +244,6 @@ def assemble_distance_matrix(dm_components):
""" assemble distance matrix components into a complete dm string
"""
print "I get called."
data = {}
# iterate over compenents
for c in dm_components:
Expand All @@ -269,7 +268,7 @@ def assemble_distance_matrix(dm_components):
dm = []
# construct the dm one row at a time
for l1 in labels:
dm.append([data[l1][l2] for l2 in labels])
dm.append([float(data[l1][l2]) for l2 in labels])
# create the dm string and return it
dm = format_distance_matrix(labels, dm)
return dm
2 changes: 1 addition & 1 deletion scripts/observation_metadata_correlation.py
Original file line number Diff line number Diff line change
Expand Up @@ -199,7 +199,7 @@ def main():

# sort the biom table so that feature values are retrieved in the same
# order as the metadata in the samples they correspond to
bt.sort(sort_f = lambda _: samples_to_correlate, axis='sample')
bt = bt.sort(sort_f = lambda _: samples_to_correlate, axis='sample')

if bt.shape[1] <= 3:
option_parser.error(filtration_error_text)
Expand Down
Loading

0 comments on commit 8c5a051

Please sign in to comment.