3‐Matlab‐version

Ruolin He edited this page Aug 28, 2024 · 4 revisions

Core function

result = Find_NRPS_motif_module_pfam_HRL(seq,Pfam_database_path,reference_motif_path,new_motif,loop_S_code_judge,length_threshold,length_threshold_TE)

Input:

seq is the result of the MATLAB function fastaread. It is a struct with two fields: Sequence and Header.

Warning: The headers in the input FASTA file must all be different!

Pfam_database_path is the path of the Pfam database constructed by hmmpress. A constructed database is available here. (Users can set a default value for Pfam_database_path in the function.)

reference_motif_path is the path of the reference motif file folder. You can download it here. (Users can set a default value for reference_motif_path in the function.)

new_motif: To get all new motifs, set new_motif to {'Aalpha','G','Talpha'}. To get no new motifs, set new_motif to {}. You can also request only some of the new motifs. (default: {}) The new motifs were proposed in our paper. The 'Aalpha' and 'G' motifs are in the A domain. The 'Talpha' motif is in the T domain.

loop_S_code_judge: If loop_S_code_judge=1, the function calculates loop length and loop group, and returns the loop sequences and S codes. Set loop_S_code_judge=0 to skip this calculation. (default: 0)

length_threshold: used in the HMMER scan for C/A/T domains. The value is in the range 0~1. (default: 0.6) Users can lower the threshold if some known domains are missed.

length_threshold_TE: used in the HMMER scan for the TE domain. The value is in the range 0~1. (default: 0.5) Users can lower the threshold if some known domains are missed.
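As a rough sketch of how the inputs above fit together, a call might look like the following. The two paths are hypothetical placeholders, and the call is guarded so that the snippet does nothing when the function or test.fasta is not available on your machine. fastaread requires the Bioinformatics Toolbox.

```matlab
% Hypothetical placeholder paths; replace them with your own locations.
% Per the notes on this page, the paths must not contain blanks.
Pfam_database_path   = '/path/to/Pfam_database';
reference_motif_path = '/path/to/reference_motif';

new_motif           = {'Aalpha', 'G', 'Talpha'};  % request all three new motifs
loop_S_code_judge   = 1;     % also compute loop length, loop group and S codes
length_threshold    = 0.6;   % documented default for C/A/T domains
length_threshold_TE = 0.5;   % documented default for the TE domain

% Only run the scan when the function and the test data can be found.
if exist('Find_NRPS_motif_module_pfam_HRL', 'file') && exist('test.fasta', 'file')
    seq = fastaread('test.fasta');   % struct array with Header and Sequence
    result = Find_NRPS_motif_module_pfam_HRL(seq, Pfam_database_path, ...
        reference_motif_path, new_motif, loop_S_code_judge, ...
        length_threshold, length_threshold_TE);
end
```

Passing {} for new_motif and 0 for loop_S_code_judge would correspond to the documented defaults.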

Output

Also see 'Output' in example.mlx.

result is an n*1 struct with the following fields (n is the number of input sequences):

Header: the Header in the input seq.

domain_list: shows the domain composition of the input NRPS sequence. Domain id: C:1 A:2 T:3 E:4 Te:5

motifid_mat: displays the motif type in seq_list. Motif number: C:7 A:10(+ 0~2) T:1(+ 0~1) E:7 Te:1. For details, see reference_motif/reference_motif.xlsx

seq_list: the sequences corresponding to motifid_mat.

index_mat: the sequence index for each row in motifid_mat.

C_subtype_list: the most likely domain subtype for each C domain in the sequence.

C_score: HMM profile alignment score for the C subtype. If C_score >= 200, the prediction is credible. If C_score < 200, the result should be treated with caution.

Raw_C_score_list: HMM profile alignment score for each C subtype.

C_subtype_str: string of the C subtype in C_subtype_list. For details on C domain subtypes, see here.

loop_length: lengths of the 5 loops: [A3,A4), [A4,S4), [S4,S6), [S6,A5), [A5,G). [ means inclusion and ) means exclusion. For example, [A3,A4) means the length includes A3 but not A4.

loop_seq: sequences of the 5 loops.

S_code: Stachelhaus code proposed by Torsten Stachelhaus in 1999.

loop_group: There are 5 loop groups. Each row is one A domain.

Loop length and loop group were proposed in our paper, where we found that loop group is related to A domain substrate specificity.
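The field descriptions above can be illustrated with a small post-processing sketch. All values here are hypothetical stand-ins for one element of result, and it is an assumption that domain_list is a numeric row vector, that C_score is a numeric vector, and that loop boundaries are measured from motif start positions; check the real shapes in example.mlx.

```matlab
% Hypothetical values standing in for one element of result.
r.domain_list = [1 2 3 1 2 3 5];
r.C_score     = [412.3; 187.9];

% Decode domain ids (C:1 A:2 T:3 E:4 Te:5) into a readable string.
domain_names = {'C', 'A', 'T', 'E', 'Te'};   % position in the cell = domain id
domain_str   = strjoin(domain_names(r.domain_list), '-');
disp(domain_str)   % C-A-T-C-A-T-Te

% Flag C domains whose subtype score is below the documented value of 200.
low_confidence = find(r.C_score < 200);

% Half-open loop intervals: [A3,A4) runs from the start of A3 up to, but not
% including, the start of A4, so its length is a difference of start positions.
% Hypothetical start positions for A3, A4, S4, S6, A5 and G:
motif_start = [120 165 190 215 240 262];
loop_length = diff(motif_start);   % one length per loop, 5 in total
```

With these made-up positions the five loop lengths are 45, 25, 25, 25 and 22 residues.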

Note

Path

No blanks are allowed in your MATLAB script path, Pfam_database_path, or reference_motif_path.
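A quick way to catch this before running the function is to test each path for whitespace. This is only a sketch; both paths are made-up examples.

```matlab
% Hypothetical example paths; the first one contains a blank.
paths = {'/home/user/NRPS tools/Pfam', '/home/user/reference_motif'};

% true for every path that contains at least one whitespace character
has_blank = cellfun(@(p) any(isspace(p)), paths);

for k = find(has_blank)
    fprintf('Path contains a blank: %s\n', paths{k});
end
```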

Sequence

The header of an input sequence can't start with a blank, that is, ' '. 'X' in your sequence will be removed.
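Both header requirements on this page (no leading blank, and all headers different) can be checked up front. The sketch below uses a hand-built struct array as a stand-in for the output of fastaread; the headers and sequences are invented for illustration.

```matlab
% Hypothetical stand-in for the struct array returned by fastaread.
seq = struct('Header',   {'NRPS_1', ' NRPS_2', 'NRPS_1'}, ...
             'Sequence', {'MKT', 'MSA', 'MLV'});

headers = {seq.Header};

% true for headers whose first character is a blank
starts_with_blank = cellfun(@(h) ~isempty(h) && h(1) == ' ', headers);

% true when at least two sequences share the same header
has_duplicates = numel(unique(headers)) < numel(headers);

if any(starts_with_blank) || has_duplicates
    disp('Fix the FASTA headers before running the function.');
end
```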

Domain

This function only works for C/A/T/E/Te domains. Other domains are treated as intermotif regions between C/A/T/E/Te domains.

Result

The characters -, * and X in the sequence will be removed, so the indices in index_mat are affected (they are smaller than in the original sequence). If a domain is incomplete but its length is more than 0.6 * the full length, NaN is put in index_mat and '' in seq_list. If an index is out of range, '1' is put in index_mat.
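To relate indices in index_mat back to your original sequence, you can rebuild the mapping yourself. The sketch below is not the function's own code. It assumes the removal is a plain deletion of the uppercase characters -, * and X, and the sequence is invented.

```matlab
raw_seq = 'MK-TXAL*GS';                  % hypothetical sequence with -, X and *

keep      = ~ismember(raw_seq, '-*X');   % true for residues that are kept
clean_seq = raw_seq(keep);               % 'MKTALGS'

% clean_to_raw(k) is the position in raw_seq of the k-th residue of clean_seq
clean_to_raw = find(keep);               % [1 2 4 6 7 9 10]
```

For example, residue 4 of the cleaned sequence ('A') sits at position 6 of the original sequence.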

Example

An example showing how to use the function: example.mlx

The test data: test.fasta
