Peptide Genomic Coordinate issue while applying GalaxyP proteogenomics Tutorial3 #524

luisfdez94 · 2020-11-26T10:04:38Z

Galaxy server : usegalaxy.eu
History link: https://usegalaxy.eu/u/_luisfr/h/peptidegenomiccoordinateissue
Tool version : Galaxy Version 0.1.1

While running Peptide Genomic Coordinate on our dataset (human), following the Galaxy-P hands-on tutorial [Tutorial3 : Novel peptides](https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/proteogenomics-novel-peptide-analysis/tutorial.html) (see also Tutorial 1: Database creation and Tutorial 2: Database search), the "Peptide Genomic Coordinate" tool returned an empty file.

I went through the source code. Running it locally in debug mode, I noticed that it never enters the "if" condition on line 47, i.e. the "coordinates" variable is empty at every iteration. However, if I change line 41 from "acc = each[1]" to "acc = each[1].strip()" (trimming the whitespace), it works. The protein accessions (Ensembl ENSP) in the mz_to_sqlite input file sometimes end with a space character, e.g. 'ENSP00000267884_A82P,P124A '. The query on line 44, which fills the "coordinates" variable, matches these accessions against the tool's other input, "Peptide_Genomic_Coordinate.sqlite". The match fails because the protein accessions in that file do not contain the trailing space, e.g. 'ENSP00000267884_A82P,P124A'.
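
To illustrate the mismatch, here is a minimal sketch and not the tool's actual code. The table name "mapping", its columns, and the coordinates are hypothetical placeholders; only the accession string and the ".strip()" call come from the description above.

```python
import sqlite3

# Hypothetical stand-in for the genomic mapping database.
# Table name, column names and coordinates are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mapping (name TEXT, chrom TEXT, cds_start INTEGER, cds_end INTEGER)"
)
conn.execute(
    "INSERT INTO mapping VALUES (?, ?, ?, ?)",
    ("ENSP00000267884_A82P,P124A", "chr15", 100, 200),
)


def lookup(connection, acc):
    """Return all mapping rows whose name equals acc exactly."""
    query = "SELECT chrom, cds_start, cds_end FROM mapping WHERE name = ?"
    return connection.execute(query, (acc,)).fetchall()


# Accession as it appears in the mz_to_sqlite input: note the trailing space.
raw_acc = "ENSP00000267884_A82P,P124A "

no_strip = lookup(conn, raw_acc)            # exact match fails on the space
with_strip = lookup(conn, raw_acc.strip())  # trimmed accession matches

print(no_strip)
print(with_strip)
```

SQLite compares text values exactly by default, so a single trailing space is enough to make the lookup return no rows.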

To help, I uploaded to the history the input files (data #1, #5 and #7) needed to run Peptide Genomic Coordinate, and the empty file the tool produced on Galaxy (data #8). I also uploaded the customized database (data #3) and the output file produced when I ran the tool locally with the modification described above (data #9).
Thank you for your help.

jj-umn · 2020-11-26T15:56:33Z

@luisfdez94 @subinamehta Luis, can you share the Database creation history? I want to figure out why there is a SPACE character after the protein ID from the CustomProDB workflow output.

generic|ENSP00000355265 |5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]

Should have been:

generic|ENSP00000355265|5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]
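
A quick way to check a FASTA file for this problem is to scan the header fields for surrounding whitespace. This is only a sketch: the function name "padded_fields" is made up, and it assumes that headers start with ">" and use "|" as the field separator, as in the example above.

```python
def padded_fields(fasta_lines, sep="|"):
    """Return (line_number, field) pairs for header fields that carry
    leading or trailing whitespace. Sequence lines are ignored."""
    hits = []
    for line_number, line in enumerate(fasta_lines, start=1):
        if not line.startswith(">"):
            continue
        header = line[1:].rstrip("\n")
        for field in header.split(sep):
            if field != field.strip():
                hits.append((line_number, field))
    return hits


bad = (">generic|ENSP00000355265 |5239.2894|ENST00000361851|ENSG00000228253|"
       "MT-ATP8|mitochondrially encoded ATP synthase 8 "
       "[Source:HGNC Symbol;Acc:HGNC:7415]")
good = bad.replace("ENSP00000355265 |", "ENSP00000355265|")

print(padded_fields([bad]))
print(padded_fields([good]))
```

Spaces inside a field, such as in the description, are not reported; only padding at the start or end of a field is.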

luisfdez94 · 2020-11-26T17:56:07Z

Thank you for your fast answer.

> Luis, can you share the Database creation history? I want to figure out why there is a SPACE character after the protein ID from the CustomProDB workflow output.

Here it is: https://usegalaxy.eu/u/_luisfr/h/peptidegenomiccoordinateissuedbcreation
As you can see at item #30, I also use the Regex Find And Replace tool to add generic| before each sequence ID in the customized DB (item #28), e.g. >ENSP00000355265 becomes >generic|ENSP00000355265. I do this so that SearchGUI works.

jj-umn · 2020-11-30T15:32:06Z

@luisfdez94 @subinamehta The protein IDs from customProDB have a SPACE character at the end of the ID. I'm asking @chambm if that seems correct. Most steps in these workflows do not handle an ID ending with a SPACE. We could add a step after customProDB that uses the regex tools to remove the SPACE.

subinamehta · 2020-11-30T15:48:13Z

@JJ: I think the workflow already takes care of that.


luisfdez94 · 2020-12-16T14:12:02Z

Thanks to the help of @jj-umn and the Galaxy-P team, I was able to solve this issue. I had to modify the header of every sequence in the customized DB (FASTA) obtained at the end of Galaxy-P Tutorial 1: Database creation. For that I used the Regex Find And Replace v1.0.0 tool with the parameters shown at the end of this message.

- Delete the space at the end of each protein accession ID (for the Ensembl-PRO and STRG databases) [see checks 3 and 5 at the end of this message].
- Reformat the indel and SNV IDs coming from CustomProDB [checks 1 and 2]. This is also done in Galaxy-P Tutorial 1: Database creation, section "Genomic mapping database", so the genomic mapping stays consistent with the protein database.
- Add the generic| prefix at the beginning of each header coming from the Ensembl-PRO and STRG databases, because their headers are not in a standard format for SearchGUI. See http://compomics.github.io/projects/searchgui/wiki/DatabaseHelp [checks 1, 3, 4 and 5].

Another important point if you follow the Galaxy-P hands-on tutorials: use this modified custom DB as the input to the mz_to_sqlite tool in Tutorial 2: Database search. I had been using the original custom DB.

Regex Find And Replace v1.0.0

check 1: from ">ENSP00000360709_1123:GAT>CAAT " to ">generic|ENSP00000360709_1123:GAT_CAAT"

"Find Regex": >(ENS.*_\d+:)([ACGTacgt]+)>([ACGTacgt]+)\s*
"Replacement": >generic|\1\2_\3

check 2: from ">ENSP00000457107_D77*,L258P,L264P,S457G " to ">ENSP00000457107_D77*.L258P.L264P.S457G "

"Find Regex": ([A-Z,*][0-9]+[A-Z,*]),
"Replacement": \1.

check 3: delete the space at the end of a protein accession ID (Ensembl DB) and add the prefix "generic|". From ">ENSP00000360709 |" to ">generic|ENSP00000360709|"

"Find Regex": >ENS[A-Z]*(.*)\s\|
"Replacement": >generic|ENSP\1|

check 4: from ">STRG00000058243|" to ">generic|STRG00000058243|"

"Find Regex": >STRG(\S*)\|
"Replacement": >generic|STRG\1|

check 5: from ">STRG00000058243 |" to ">generic|STRG00000058243|"

"Find Regex": >STRG(.*)\s\|
"Replacement": >generic|STRG\1|

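For anyone who wants to try these checks outside Galaxy, the same find/replace pairs can be run with Python's re module. This is a sketch under assumptions: the patterns and replacements are copied from the checks above, but the helper name "apply_rules" and the choice to apply all rules in order to one header line are mine, not how the Galaxy tool is necessarily configured.

```python
import re

# (find regex, replacement) pairs copied from the checks above, in order.
RULES = [
    (r">(ENS.*_\d+:)([ACGTacgt]+)>([ACGTacgt]+)\s*", r">generic|\1\2_\3"),
    (r"([A-Z,*][0-9]+[A-Z,*]),", r"\1."),
    (r">ENS[A-Z]*(.*)\s\|", r">generic|ENSP\1|"),
    (r">STRG(\S*)\|", r">generic|STRG\1|"),
    (r">STRG(.*)\s\|", r">generic|STRG\1|"),
]


def apply_rules(header, rules=RULES):
    """Apply each find/replace rule in turn to a single FASTA header line."""
    for pattern, replacement in rules:
        header = re.sub(pattern, replacement, header)
    return header


examples = [
    ">ENSP00000360709_1123:GAT>CAAT ",
    ">ENSP00000457107_D77*,L258P,L264P,S457G ",
    ">ENSP00000360709 |",
    ">STRG00000058243|",
    ">STRG00000058243 |",
]
for example in examples:
    print(repr(example), "->", repr(apply_rules(example)))
```

Note that the replacement in check 3 always writes "ENSP", so it assumes every matching header is a protein accession.
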
luisfdez94 assigned jj-umn, PratikDJagtap and yvandenb Dec 9, 2020

luisfdez94 closed this as completed Dec 16, 2020
