Skip to content

Commit

Permalink
Merge pull request #10 from JLSteenwyk/new_python_and_biopython_versions
Browse files Browse the repository at this point in the history
created a new argument to report how inparalogs are handled
  • Loading branch information
JLSteenwyk authored Dec 18, 2023
2 parents 04405fb + c6bed68 commit e70c4d7
Show file tree
Hide file tree
Showing 16 changed files with 793 additions and 544 deletions.
4 changes: 4 additions & 0 deletions docs/change_log/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,10 @@ Change log

Major changes to OrthoSNAP are summarized here.

**1.2.0**
Added the -rih (\-\-report_inparalog_handling) function, which creates
a summary file of which inparalogs where kept compared to trimmed

**0.1.0**
Added -r/\-\-rooted, -st/\-\-snap_trees, and -ip/\-\-inparalog_to_keep functions

Expand Down
62 changes: 40 additions & 22 deletions docs/usage/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -81,29 +81,47 @@ or the inparalog with the median sequence length can be kept using the following
Again, following transcriptomics, the default is to keep the longest sequence because (at least in theory)
it is the most complete gene annotation.

Report inparalog handling
-------------------------
To report inparalogs and specify which was kept per SNAP-OG, use the -rih, \-\-report_inparalog_handling
argument. The resulting file, which will have the suffix ".inparalog_report.txt," will have three columns: |br|
- col 1 is the orthogroup file |br|
- col 2 is the inparalog that was kept |br|
- col 3 is/are the inparalog/s that were trimmed separated by a semi-colon ";" |br|

To generate this file, use the following command:

.. code-block:: shell
$ orthosnap -f orthogroup_of_genes.faa -t phylogeny_of_orthogroup_of_genes.tre -rih
|
All options
-----------

+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Usage and meaning |
+=============================+==============================================================================================================================================+
| -h/\-\-help | Print help message |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -v/\-\-version | Print software version |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -t/\-\-tree | Input tree file (format: newick) |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -s/\-\-support | Bipartition support threshold for collapsing uncertain branches (default: 80) |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -o/\-\-occupancy | Occupancy threshold for identifying a subgroup of interest (default: 50%) |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -r/\-\-roooted | boolean argument for whether the input phylogeny is already rooted (default: false) |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -st/\-\-snap_trees | boolean argument for whether trees of SNAP-OGs should be outputted (default: false) |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -ip/\-\-inparalog_to_keep | determine which sequence to keep in the case of species-specific inparalogs using sequence- or tree-based options (default: longest_seq_len) |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -op/\-\-output_path | pathway for output files to be written (default: same as -f input) |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Usage and meaning |
+=====================================+==============================================================================================================================================+
| -h/\-\-help | Print help message |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -v/\-\-version | Print software version |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -t/\-\-tree | Input tree file (format: newick) |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -s/\-\-support | Bipartition support threshold for collapsing uncertain branches (default: 80) |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -o/\-\-occupancy | Occupancy threshold for identifying a subgroup of interest (default: 50%) |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -r/\-\-roooted | boolean argument for whether the input phylogeny is already rooted (default: false) |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -st/\-\-snap_trees | boolean argument for whether trees of SNAP-OGs should be outputted (default: false) |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -ip/\-\-inparalog_to_keep | determine which sequence to keep in the case of species-specific inparalogs using sequence- or tree-based options (default: longest_seq_len) |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -op/\-\-output_path | pathway for output files to be written (default: same as -f input) |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| -rih, \-\-report_inparalog_handling | create a summary file of which inparalogs where kept compared to trimmed |
+-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
*For genome-scale analyses, we recommend changing the -o/\-\-occupancy parameter to be the same for all large gene families so that the minimum SNAP-OG occupancy is the same
for all SNAP-OGs.
for all SNAP-OGs.
2 changes: 2 additions & 0 deletions orthosnap/args_processing.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@ def process_args(args) -> dict:

rooted = args.rooted
snap_trees = args.snap_trees
report_inparalog_handling = args.report_inparalog_handling

if args.output_path:
output_path = args.output_path
Expand All @@ -71,6 +72,7 @@ def process_args(args) -> dict:
rooted=rooted,
snap_trees=snap_trees,
inparalog_to_keep=inparalog_to_keep,
report_inparalog_handling=report_inparalog_handling,
output_path=output_path,
)

Expand Down
101 changes: 83 additions & 18 deletions orthosnap/helper.py
Original file line number Diff line number Diff line change
Expand Up @@ -126,10 +126,6 @@ def get_subtree_tips(terms: list, name: str, tree):
temp.append(term.name)
subtree_tips.append(temp)

# print(name)
# import sys
# sys.exit()

return subtree_tips, dups


Expand All @@ -147,6 +143,8 @@ def handle_multi_copy_subtree(
snap_trees: bool,
inparalog_to_keep: InparalogToKeep,
output_path: str,
inparalog_handling: dict,
inparalog_handling_summary: dict,
):
"""
handling case where subtree contains all single copy genes
Expand All @@ -172,17 +170,22 @@ def handle_multi_copy_subtree(
# if duplicate sequences are sister, get the longest sequence
if are_sisters:
# trim short sequences and keep long sequences in newtree
newtree, terms = inparalog_to_keep_determination(
newtree, fasta_dict, dups, terms, inparalog_to_keep
)
newtree, terms, inparalog_handling = \
inparalog_to_keep_determination(
newtree, fasta_dict, dups, terms,
inparalog_to_keep, inparalog_handling
)

# if the resulting subtree has only single copy genes
# create a fasta file with sequences from tip labels
_, _, _, counts = get_tips_and_taxa_names_and_taxa_counts_from_subtrees(newtree)
_, _, _, counts = \
get_tips_and_taxa_names_and_taxa_counts_from_subtrees(newtree)

if set(counts) == set([1]):
(
subgroup_counter,
assigned_tips,
inparalog_handling_summary
) = write_output_fasta_and_account_for_assigned_tips_single_copy_case(
fasta,
subgroup_counter,
Expand All @@ -192,9 +195,13 @@ def handle_multi_copy_subtree(
snap_trees,
newtree,
output_path,
inparalog_handling,
inparalog_handling_summary,
)

return subgroup_counter, assigned_tips
return \
subgroup_counter, assigned_tips, \
inparalog_handling, inparalog_handling_summary


def handle_single_copy_subtree(
Expand All @@ -208,6 +215,8 @@ def handle_single_copy_subtree(
assigned_tips: list,
snap_trees: bool,
output_path: str,
inparalog_handling: dict,
inparalog_handling_summary: dict,
):
"""
handling case where subtree contains all single copy genes
Expand All @@ -223,6 +232,7 @@ def handle_single_copy_subtree(
(
subgroup_counter,
assigned_tips,
inparalog_handling_summary
) = write_output_fasta_and_account_for_assigned_tips_single_copy_case(
fasta,
subgroup_counter,
Expand All @@ -232,9 +242,13 @@ def handle_single_copy_subtree(
snap_trees,
newtree,
output_path,
inparalog_handling,
inparalog_handling_summary,
)

return subgroup_counter, assigned_tips
return \
subgroup_counter, assigned_tips, \
inparalog_handling, inparalog_handling_summary


def inparalog_to_keep_determination(
Expand All @@ -243,6 +257,7 @@ def inparalog_to_keep_determination(
dups: list,
terms: list,
inparalog_to_keep: InparalogToKeep,
inparalog_handling: dict,
):
"""
remove_short_sequences_among_duplicates_that_are_sister
Expand All @@ -261,22 +276,27 @@ def inparalog_to_keep_determination(
seq_to_keep = min(lengths, key=lengths.get)
elif len(lengths) > 2 and inparalog_to_keep.value == "median_seq_len":
median_len = stat.median(lengths, key=lengths.get)
seq_to_keep = [key for key, value in lengths if value == median_len]
seq_to_keep = [
key for key, value in lengths if value == median_len
]
elif len(lengths) == 2 and inparalog_to_keep.value == "median_seq_len":
seq_to_keep = max(lengths, key=lengths.get)
elif inparalog_to_keep.value == "longest_seq_len":
seq_to_keep = max(lengths, key=lengths.get)

# keep inparalog based on tip to root length
else:
for dup in dups:
lengths[dup] = TreeMixin.distance(newtree, dup)
if inparalog_to_keep.value == "shortest_branch_len":
seq_to_keep = min(lengths, key=lengths.get)
elif len(lengths) > 2 and inparalog_to_keep.value == "median_branch_len":
elif len(lengths) > 2 and \
inparalog_to_keep.value == "median_branch_len":
median_len = stat.median(lengths, key=lengths.get)
seq_to_keep = [key for key, value in lengths if value == median_len]
elif len(lengths) == 2 and inparalog_to_keep.value == "median_branch_len":
seq_to_keep = [
key for key, value in lengths if value == median_len
]
elif len(lengths) == 2 and \
inparalog_to_keep.value == "median_branch_len":
seq_to_keep = max(lengths, key=lengths.get)
elif inparalog_to_keep.value == "longest_branch_len":
seq_to_keep = max(lengths, key=lengths.get)
Expand All @@ -288,15 +308,20 @@ def inparalog_to_keep_determination(
newtree.prune(seq_name)
terms.remove(seq_name)

return newtree, terms
dups.remove(seq_to_keep)
inparalog_handling[seq_to_keep] = dups

return newtree, terms, inparalog_handling


def prune_subtree(all_tips: list, terms: list, newtree):
"""
prune tips not of interest for subtree
"""

tips_to_prune = [i for i in all_tips + terms if i not in all_tips or i not in terms]
tips_to_prune = [
i for i in all_tips + terms if i not in all_tips or i not in terms
]

for tip in tips_to_prune:
newtree.prune(tip)
Expand Down Expand Up @@ -328,6 +353,8 @@ def write_output_fasta_and_account_for_assigned_tips_single_copy_case(
snap_tree: bool,
newtree,
output_path: str,
inparalog_handling: dict,
inparalog_handling_summary: dict
):
# write output
fasta_path_stripped = re.sub("^.*/", "", fasta)
Expand All @@ -345,6 +372,44 @@ def write_output_fasta_and_account_for_assigned_tips_single_copy_case(
)
Phylo.write(newtree, output_file_name, "newick")

write_summary_file_with_inparalog_handling(
inparalog_handling, fasta,
output_path, subgroup_counter,
assigned_tips
)
subgroup_counter += 1

return subgroup_counter, assigned_tips
return subgroup_counter, assigned_tips, inparalog_handling_summary


def write_summary_file_with_inparalog_handling(
inparalog_handling: dict,
fasta: str,
output_path: str,
subgroup_count: int,
assigned_tips: list
):
res_arr = []

in_file_handle = re.sub("^.*/", "", fasta)

for k, v in inparalog_handling.items():
temp = []
temp.append(in_file_handle+".orthosnap."+str(subgroup_count))
temp.append(k)
temp.append(';'.join(v))
res_arr.append(temp)
inparalog_report_output_name = in_file_handle + ".inparalog_report.txt"

fasta_path_stripped = re.sub("^.*/", "", fasta)
output_fasta_file_name = (
f"{output_path}/{fasta_path_stripped}.orthosnap.{subgroup_count}.fa"
)

if res_arr:
try:
if res_arr[0][1] in open(output_fasta_file_name).read():
with open(f"{output_path}{inparalog_report_output_name}", "a") as file:
file.writelines('\t'.join(i) + '\n' for i in res_arr)
except FileNotFoundError:
1
Loading

0 comments on commit e70c4d7

Please sign in to comment.