3‐Matlab‐version
result = Find_NRPS_motif_module_pfam_HRL(seq,Pfam_database_path,reference_motif_path,new_motif,loop_S_code_judge,length_threshold,length_threshold_TE)
seq is the result of the MATLAB function fastaread. It is a struct with two fields: Sequence and Header.
Pfam_database_path is the path of the Pfam database constructed by hmmpress. A ready-built database is provided here. (Users can set a default value of Pfam_database_path in the function.)
reference_motif_path is the path of the reference motif folder. You can download it here. (Users can set a default value of reference_motif_path in the function.)
new_motif: To get all new motifs, set new_motif to {'Aalpha','G','Talpha'}. To skip the new motifs, set new_motif to {}. Any subset of the three is also accepted. (default: {}) The new motifs were proposed in our paper: 'Aalpha' and 'G' are in the A domain, and 'Talpha' is in the T domain.
loop_S_code_judge: If loop_S_code_judge = 1, the function calculates loop length and loop group, and returns the loop sequences and S codes. Set loop_S_code_judge = 0 to skip this calculation. (default: 0)
length_threshold: used in the HMMER scan for C/A/T domains. The value ranges from 0 to 1. (default: 0.6) Lower the threshold if some known domains are missed.
length_threshold_TE: used in the HMMER scan for the TE domain. The value ranges from 0 to 1. (default: 0.5) Lower the threshold if some known domains are missed.
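A call with every argument spelled out might look like the sketch below. The two paths are placeholders (assumptions, not real locations), and the guard skips the call when the function, fastaread (Bioinformatics Toolbox), test.fasta, or the reference motif folder cannot be found, so the snippet can be pasted as-is.

```matlab
% Placeholder locations (assumptions): replace with your own paths, without blanks.
Pfam_database_path   = '/path/to/Pfam_database';
reference_motif_path = '/path/to/reference_motif';

new_motif           = {'Aalpha','G','Talpha'};  % request all three new motifs
loop_S_code_judge   = 1;                        % also compute loops and S codes
length_threshold    = 0.6;                      % documented default for C/A/T
length_threshold_TE = 0.5;                      % documented default for TE

% Only attempt the call when everything it needs can be located.
have_all = exist('Find_NRPS_motif_module_pfam_HRL', 'file') ~= 0 && ...
           exist('fastaread', 'file') ~= 0 && ...
           exist('test.fasta', 'file') == 2 && ...
           exist(reference_motif_path, 'dir') == 7;

if have_all
    seq = fastaread('test.fasta');   % struct array with Header and Sequence
    result = Find_NRPS_motif_module_pfam_HRL(seq, Pfam_database_path, ...
        reference_motif_path, new_motif, loop_S_code_judge, ...
        length_threshold, length_threshold_TE);
    fprintf('%d sequence(s) processed\n', numel(result));
else
    disp('Function, fastaread, test.fasta or motif folder not found; call skipped.');
end
```

Setting loop_S_code_judge back to 0 and new_motif to {} reproduces the documented defaults.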
See also 'Output' in example.mlx.
result is an n*1 struct with the following fields (n is the number of input sequences):
Header: the Header in the input seq.
domain_list: shows the domain composition of the input NRPS sequence. Domain ids: C:1 A:2 T:3 E:4 Te:5
motifid_mat: the motif type of each entry in seq_list. Number of motifs per domain: C:7 A:10 (+0~2 new motifs) T:1 (+0~1 new motif) E:7 Te:1. For details see reference_motif/reference_motif.xlsx.
seq_list: the sequences corresponding to motifid_mat.
index_mat: the sequence index for each row in motifid_mat.
C_subtype_list: the most likely subtype for each C domain in the sequence.
C_score: HMM profile alignment score for the C subtype. If C_score >= 200, the prediction is credible. If C_score < 200, treat the result with caution.
Raw_C_score_list: HMM profile alignment score for each C subtype.
C_subtype_str: the C subtype in C_subtype_list as a string.
loop_length: lengths of the 5 loops: [A3,A4), [A4,S4), [S4,S6), [S6,A5), [A5,G). '[' means inclusion and ')' means exclusion; for example, [A3,A4) means the length includes A3 but not A4.
loop_seq: sequences of the 5 loops.
S_code: Stachelhaus code proposed by Torsten Stachelhaus in 1999.
loop_group: There are 5 loop groups. Each row is one A domain.
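As a rough illustration of reading these fields, the sketch below decodes domain_list into a readable architecture string and applies the C_score cutoff of 200. The record r is made up for the example; it assumes domain_list and C_score are numeric vectors and C_subtype_str is a cell array of strings, which may differ from the actual field layout.

```matlab
% Hypothetical record mimicking one element of result (all values are made up).
r.domain_list   = [1 2 3 1 2 3 5];
r.C_score       = [350.2 120.7];
r.C_subtype_str = {'subtype_a', 'subtype_b'};   % placeholder names

% Map domain ids to labels using the documented coding C:1 A:2 T:3 E:4 Te:5.
labels = {'C','A','T','E','Te'};
architecture = strjoin(labels(r.domain_list), '-');
disp(architecture);

% Flag C subtype calls below the documented credibility cutoff of 200.
credible = r.C_score >= 200;
for k = 1:numel(r.C_score)
    if credible(k)
        note = 'credible';
    else
        note = 'check manually';
    end
    fprintf('%s: score %.1f (%s)\n', r.C_subtype_str{k}, r.C_score(k), note);
end
```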
Loop length and loop group were proposed in our paper, where we found that loop group is related to A domain substrate specificity.
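The half-open convention used for loop_length can be sketched as follows. The start positions are invented, and the example assumes each boundary (A3, A4, S4, S6, A5, G) is represented by a single start position in the sequence; the function itself may locate the boundaries differently.

```matlab
% Hypothetical start positions of the six boundaries A3, A4, S4, S6, A5, G.
boundary_names = {'A3','A4','S4','S6','A5','G'};
start_pos      = [100 130 150 185 200 260];

% A half-open interval [a, b) holds b - a residues: a is counted, b is not.
loop_length = diff(start_pos);

for k = 1:numel(loop_length)
    fprintf('[%s,%s): %d residues\n', ...
        boundary_names{k}, boundary_names{k+1}, loop_length(k));
end
```

Because each interval stops just before the next boundary, the five lengths add up to the distance from A3 to G with no residue counted twice.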
No blanks are allowed in your MATLAB script path, Pfam_database_path, or reference_motif_path.
The header of an input sequence can't start with a blank, that is, ' '. 'X' in your sequence will be removed.
This function only works for C/A/T/E/Te domains. Other domains will appear as intermotif regions between C/A/T/E/Te domains.
-/*/X in the sequence will be removed, so the indices in index_mat will be affected (they refer to the shortened sequence, not the original). If a domain is incomplete but its length is more than 0.6 * the full length, we put NaN in index_mat and '' in seq_list. If an index is out of range, we put '1' in index_mat.
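The notes above can be checked before calling the function. The following is a minimal sketch with made-up inputs: it tests a path for blanks, strips leading blanks from a header, and removes '-', '*' and 'X' while keeping a map back to the original positions. It imitates the described behaviour and is not the function's own code.

```matlab
% Hypothetical inputs used only for illustration.
Pfam_database_path = '/path/to/Pfam_database';
header   = '  my_protein';
sequence = 'MKT-LLXAG*';

% Paths must not contain blanks.
path_ok = ~any(isspace(Pfam_database_path));

% Headers must not start with a blank, so drop leading whitespace.
header = regexprep(header, '^\s+', '');

% Remove '-', '*' and 'X', remembering where each kept residue came from.
keep       = ~ismember(sequence, '-*X');
cleaned    = sequence(keep);
orig_index = find(keep);   % orig_index(i) is the original position of cleaned(i)

fprintf('path ok: %d, header: %s, cleaned: %s\n', path_ok, header, cleaned);
```

With orig_index, a position i reported for the shortened sequence can be translated back to the original sequence as orig_index(i).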
An example showing how to use the function: example.mlx
Test data: test.fasta