-
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* feat: large update fix: info to metadata.update style: switch warning to info fix: perc and fixed instead of just fixed thresholds fix: perc and fixed instead of just fixed thresholds fix: hmmscan fix: these were modified because... these were modified because now the hmm file is auto check whether the GA threshold exists and to use it if it does fix: these were modified because these were modified because now the hmm file is auto check whether the GA threshold exists and to use it if it does fix: these were modified because... fix: normalize search filters to n queries fix: make id attributes consistent across classes feat: this is a breaking change for previously built database fix: bump ci/cv feat: add gene cluster obj and move cmap serialize feat: add bypass for protein id when no locus tag fix: single and multi group results fix: clustermap feat: serialize to gbk * fix: W605 invalid escape sequence
- Loading branch information
Showing
136 changed files
with
11,125 additions
and
1,854 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# a | ||
|
||
1. Take an input BGC | ||
2. Parse all the proteins in the input BGC | ||
3. Annotate every input BGC proteins with HMM models, either pulling from Neo4j if the exact protein exists, or with HMMER | ||
4. Get the list of all HMM models annotating input BGC proteins | ||
5. Query the database to find the outdegree (calculate/populate if necessary) for each HMM model. (:hmm)-[:ANNOTATES]->(:protein) | ||
6. Prioritize whcih input proteins to search based on the outdegree of its HMM annotations | ||
- max_query_proteins (float): Max proteins to return. If >0 and <1, will return X% of input proteins | ||
- max_domains_per_protein (int): Max domains to retain for each individual protein (highest outdegree dropped first) | ||
- max_outdegree (int): HMM model annotations with an outdegree higher than this will be dropped | ||
- scatter (bool, optional): Choose a random subset of proteins to search that are spread across the length of the input BGC. Defaults to False. | ||
- bypass (List[str], optional): List of locus tags that will bypass filtering. This is the ID found in a GenBank file "/locus_tag=" field. Defaults to None. | ||
- bypass_eid (List[str], optional): Less preferred than `bypass`. List of external protein IDs that will bypass filtering. This is the ID found in a GenBank file "/protein_id=" field. Defaults to None. | ||
7. Search the database for all proteins that have the same HMM model annotations as the input BGC proteins | ||
- Output from database is a data frame with columns: ['assembly_uid', 'nucleotide_uid', 'target', 'n_start', 'n_end', 'query'] | ||
8. The initial hits output is filtered based on the following criteria: | ||
- assemblies_must_have_x_matches (float): Minimum number of distinct query protein matches for a genome to be considered. <1 is a fraction of the number of query proteins, >1 is the number of query proteins. | ||
- nucleotide_sequences_must_have_x_matches (float): Minimum number of distinct query protein matches required for a nucleotide sequence to be considered. <1 is a fraction of the number of query proteins, >1 is the number of query proteins. | ||
9. All remaining loci are then assigned a cluster, simply moving from start to end and "breaking" into a new cluster if the distance between two loci is > `break_bgc_on_gap_of` | ||
10. Resulting clusters are then filtered to those that contain >= `gene_clusters_must_have_x_matches` | ||
11. Remaining clusters are pulled into the SocialGene object | ||
- All loci within each clister and +/- `target_bgc_padding` base pairs are pulled from the database | ||
- All proteins within each cluster are pulled from the database | ||
- All HMM models annotating proteins within each cluster are pulled from the database | ||
13. Each target BGC is compared again to the input BGC (via domain similarity) | ||
14. Links are created between the input BGC and the target BGCs | ||
15. "Groups" are created by assigning a target protein to a single input BGC protein based on best-hit | ||
- This is for assigning groups within clustermap.js for visualization purposes only | ||
16. Results are serialized to clustermap.js JSON format and written to disk |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,47 +1,88 @@ | ||
import time | ||
from socialgene.search.hmmer import SearchDomains | ||
from socialgene.base.socialgene import SocialGene | ||
|
||
now = time.time() | ||
# The code is performing a series of operations using the SocialGene package and the SearchDomains | ||
# class from the socialgene.search.hmmer module. Here is a breakdown of what the code is doing: | ||
|
||
from socialgene.base.socialgene import SocialGene | ||
from socialgene.clustermap.serialize import SerializeToClustermap | ||
from socialgene.search.hmmer import SearchDomains | ||
|
||
input_gbk_path = "/home/chase/Documents/data/mibig/3_1/mibig_gbk_3.1/BGC0001850.gbk" | ||
input_gbk_path = "/home/chase/Documents/data/mibig/3_1/mibig_gbk_3.1/BGC0001646.gbk" | ||
# input_gbk_path = "/home/chase/Documents/data/mibig/3_1/mibig_gbk_3.1/BGC0001848.gbk" | ||
hmm_dir = None | ||
hmm_dir = ( | ||
"/home/chase/Documents/socialgene_data/streptomyces/socialgene_per_run/hmm_cache" | ||
) | ||
|
||
# hmm_dir = "/home/chase/Downloads/meh/socialgene_per_run/hmm_cache" | ||
hmm_dir = "/home/chase/Downloads/para/socialgene_per_run/hmm_cache" | ||
|
||
a = SocialGene() | ||
a.parse(input_gbk_path) | ||
len(a.proteins) | ||
|
||
search_object = SearchDomains( | ||
gbk_path=input_gbk_path, | ||
hmm_dir=hmm_dir, | ||
assemblies_must_have_x_matches=10, | ||
nucleotide_sequences_must_have_x_matches=10, | ||
gene_clusters_must_have_x_matches=10, | ||
use_neo4j_precalc=True, | ||
assemblies_must_have_x_matches=0.4, | ||
nucleotide_sequences_must_have_x_matches=0.4, | ||
gene_clusters_must_have_x_matches=0.2, | ||
break_bgc_on_gap_of=10000, | ||
target_bgc_padding=15000, | ||
) | ||
self = search_object | ||
|
||
search_object.outdegree_table | ||
|
||
search_object.outdegree_table() | ||
|
||
search_object.prioritize_input_proteins( | ||
max_domains_per_protein=1, | ||
max_domains_per_protein=None, | ||
max_outdegree=None, | ||
max_query_proteins=20, | ||
max_query_proteins=None, | ||
scatter=False, | ||
# loci=["MicB006_2899"] | ||
# bypass_eid=["AXA20096.1"], | ||
) | ||
search_object.outdegree_table() | ||
|
||
search_object.outdegree_table | ||
|
||
# TODO frac is redundant with the outdegree table | ||
search_object.search(only_culture_collection=False, frac=0.75) | ||
|
||
|
||
search_object.filter() | ||
search_object.label_clusters() | ||
|
||
|
||
search_object._bgc_regions_to_sg_object(self._primary_bgc_regions()) | ||
|
||
_ = self.sg_object.annotate_proteins_with_neo4j( | ||
protein_uids=None, annotate_all=True, progress=False | ||
) | ||
|
||
|
||
df = self._primary_bgc_regions() | ||
|
||
|
||
for i, row in df.iterrows(): | ||
self.sg_object.assemblies[row["assembly_uid"]].get_locus_by_uid( | ||
row["nucleotide_uid"] | ||
).add_bgcs_by_start_end( | ||
start=row["n_start"], | ||
end=row["n_end"], | ||
uid=row["cluster"], | ||
) | ||
|
||
|
||
temp = self.input_assembly.loci[self.input_bgc_id] | ||
|
||
temp.add_bgcs_by_feature(features=temp.features) | ||
|
||
search_object.search_db() | ||
|
||
search_object.process_results() | ||
self._create_links() | ||
self._choose_group() | ||
|
||
|
||
search_object.write_clustermap( | ||
"/home/chase/Downloads/clinker-master(2)/clinker-master/clinker/plot/data.json" | ||
z = SerializeToClustermap( | ||
sg_object=self.sg_object, | ||
bgc_order=list(self.sg_object.assemblies.keys()), | ||
link_df=self.link_df, | ||
group_df=self.group_df, | ||
) | ||
|
||
later = time.time() | ||
int(later - now) | ||
z.write("/home/chase/Downloads/clinker-master(2)/clinker-master/clinker/plot/data.json") |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
import time | ||
|
||
from socialgene.clustermap.serialize import SerializeToClustermap | ||
from socialgene.search.hmmer import SearchDomains | ||
|
||
start_time = time.time() | ||
# input_gbk_path = "/home/chase/Downloads/para/a/GCF_002362315.1_ASM236231v1_genomic.gbff.gz" | ||
input_gbk_path = "/home/chase/Documents/data/mibig/3_1/mibig_gbk_3.1/BGC0001646.gbk" | ||
hmm_dir = None | ||
# hmm_dir = "/home/chase/Downloads/meh/socialgene_per_run/hmm_cache" | ||
hmm_dir = "/home/chase/Downloads/para/socialgene_per_run/hmm_cache" | ||
|
||
|
||
search_object = SearchDomains( | ||
gbk_path=input_gbk_path, | ||
hmm_dir=hmm_dir, | ||
use_neo4j_precalc=True, | ||
assemblies_must_have_x_matches=0.6, | ||
nucleotide_sequences_must_have_x_matches=0.6, | ||
gene_clusters_must_have_x_matches=0.6, | ||
break_bgc_on_gap_of=10000, | ||
target_bgc_padding=15000, | ||
) | ||
|
||
|
||
# search_object.outdegree_table | ||
|
||
|
||
search_object.prioritize_input_proteins( | ||
max_domains_per_protein=2, | ||
max_outdegree=100000, | ||
max_query_proteins=5, | ||
scatter=False, | ||
# bypass_locus=["MicB006_2899"] | ||
# bypass_pid=["AXA20096.1"], | ||
) | ||
|
||
# search_object.outdegree_table | ||
|
||
# TODO frac is redundant with the outdegree table | ||
search_object.search(only_culture_collection=False, frac=0.75) | ||
|
||
|
||
search_object.filter() | ||
search_object.label_clusters() | ||
|
||
|
||
search_object._bgc_regions_to_sg_object(search_object._primary_bgc_regions()) | ||
|
||
_ = search_object.sg_object.annotate_proteins_with_neo4j( | ||
protein_uids=None, annotate_all=True, progress=False | ||
) | ||
|
||
|
||
df = search_object._primary_bgc_regions() | ||
|
||
|
||
for i, row in df.iterrows(): | ||
search_object.sg_object.assemblies[row["assembly_uid"]].get_locus_by_uid( | ||
row["nucleotide_uid"] | ||
).add_bgcs_by_start_end( | ||
start=row["n_start"], | ||
end=row["n_end"], | ||
uid=row["cluster"], | ||
) | ||
|
||
|
||
temp = search_object.input_assembly.loci[search_object.input_bgc_id] | ||
temp.add_bgcs_by_feature(features=temp.features) | ||
|
||
search_object._create_links() | ||
search_object._choose_group() | ||
|
||
|
||
z = SerializeToClustermap( | ||
sg_object=search_object.sg_object, | ||
bgc_order=list(search_object.sg_object.assemblies.keys()), | ||
link_df=search_object.link_df, | ||
group_df=search_object.group_df, | ||
) | ||
|
||
z.write("/home/chase/Downloads/clinker-master(2)/clinker-master/clinker/plot/data.json") | ||
print("--- %s seconds ---" % (time.time() - start_time)) | ||
|
||
# search_object.rich_table(search_object.user_friendly_hit_df()) | ||
|
||
|
||
# self = search_object |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.