Peptide Genomic Coordinate issue while applying GalaxyP proteogenomics Tutorial3 #524

Closed
luisfdez94 opened this issue Nov 26, 2020 · 5 comments

@luisfdez94

luisfdez94 commented Nov 26, 2020

Galaxy server : usegalaxy.eu
History link: https://usegalaxy.eu/u/_luisfr/h/peptidegenomiccoordinateissue
Tool version : Galaxy Version 0.1.1

While running Peptide Genomic Coordinate on our own dataset (human), following the Galaxy-P hands-on tutorial [Tutorial3 : Novel peptides](https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/proteogenomics-novel-peptide-analysis/tutorial.html) (see also Tutorial 1 : Database creation and Tutorial 2 : Database search), the “Peptide Genomic Coordinate” tool returned an empty file.

I went through the source code. Running it locally in debug mode, I noticed that execution never enters the « if » condition on line 47, i.e. the « coordinates » variable is empty at every iteration. However, if I change line 41 from « acc = each[1] » to « acc = each[1].strip() » (trimming the whitespace), it works. I noticed that the protein accession numbers (Ensembl ENSP) in the mz_to_sqlite input file sometimes come with a trailing space, e.g. 'ENSP00000267884_A82P,P124A '. On line 44 (the query that fills the « coordinates » variable), the accession is matched against another of the tool's inputs, « Peptide_Genomic_Coordinate.sqlite ». The match fails because in that file the protein accessions do not contain this space, e.g. 'ENSP00000267884_A82P,P124A'.

To help, I have uploaded to the history the input files (data #1, #5 and #7) needed to run Peptide Genomic Coordinate, along with the empty file produced when the tool was run on Galaxy (data #8). I have also uploaded the customized database (data #3) and the output file produced when I ran the tool locally with the modification described above (data #9).
Thank you for your help.
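
For readers who want to reproduce the effect described above, here is a minimal sketch using an in-memory SQLite database. The table name `mapping`, its columns, and the coordinate values are hypothetical placeholders, not the real schema of Peptide_Genomic_Coordinate.sqlite or the tool's actual query. It only illustrates that an exact-match lookup fails when the accession carries a trailing space and succeeds after `.strip()`.

```python
import sqlite3


def find_coordinates(conn, accession):
    """Exact-match lookup of an accession in the (hypothetical) mapping table."""
    cur = conn.execute(
        "SELECT chrom, start_pos, end_pos FROM mapping WHERE name = ?",
        (accession,),
    )
    return cur.fetchall()


# Hypothetical schema and placeholder coordinates, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mapping (name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
)
conn.execute(
    "INSERT INTO mapping VALUES (?, ?, ?, ?)",
    ("ENSP00000267884_A82P,P124A", "chr1", 100, 200),
)

# Accession as it appears in the mz_to_sqlite input, with a trailing space.
raw_acc = "ENSP00000267884_A82P,P124A "

without_strip = find_coordinates(conn, raw_acc)       # no rows: the space breaks the match
with_strip = find_coordinates(conn, raw_acc.strip())  # one row: the accession now matches

print(without_strip)
print(with_strip)
```

SQLite compares text with its default BINARY collation here, so a trailing space makes the two strings different.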

@jj-umn
Member

jj-umn commented Nov 26, 2020

@luisfdez94 @subinamehta Luis, can you share the Database creation history? I want to figure out why there is a SPACE character after the protein ID from the CustomProDB workflow output.

generic|ENSP00000355265 |5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]

Should have been:

generic|ENSP00000355265|5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]
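
A quick way to spot affected entries is to scan the header lines for accession fields that are padded with whitespace. The sketch below assumes the accession is the second `|`-separated field (right after `generic`), as in the two lines above; the function name is made up for illustration.

```python
def padded_accessions(header_lines):
    """Return accession fields that carry leading or trailing whitespace.

    Assumes '|'-separated headers with the accession in the second field.
    """
    bad = []
    for line in header_lines:
        fields = line.lstrip(">").split("|")
        if len(fields) < 2:
            continue  # not a 'generic|ACCESSION|...' style header
        accession = fields[1]
        if accession != accession.strip():
            bad.append(accession)
    return bad


observed = (
    "generic|ENSP00000355265 |5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|"
    "mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]"
)
expected = observed.replace("ENSP00000355265 |", "ENSP00000355265|")

print(padded_accessions([observed, expected]))
```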

@luisfdez94
Author

luisfdez94 commented Nov 26, 2020

Thank you for your fast answer.

> Luis, can you share the Database creation history? I want to figure out why there is a SPACE character after the protein ID from the CustomProDB workflow output.

There it goes: https://usegalaxy.eu/u/_luisfr/h/peptidegenomiccoordinateissuedbcreation
As you can see at item #30, I also use the Regex Find And Replace tool to add generic| before each sequence header in the customized DB (item #28), e.g. >ENSP00000355265 becomes >generic|ENSP00000355265. I do this so that SearchGUI works.

@jj-umn
Member

jj-umn commented Nov 30, 2020

@luisfdez94 @subinamehta The protein IDs from customProDB have a SPACE character at the end of the ID. I'm asking @chambm whether that is expected. Most steps in these workflows do not handle an ID ending with a SPACE. We could add a step after customProDB that uses regex tools to remove the SPACE.

@subinamehta
Contributor

subinamehta commented Nov 30, 2020 via email

@luisfdez94
Author

Thanks to the help of @jj-umn and the Galaxy-P team, I have been able to solve this issue. I had to modify the header of every sequence in the customized DB (FASTA) obtained at the end of Galaxy-P Tutorial 1 : Database creation. For that purpose I used the Regex Find And Replace v1.0.0 tool with the parameters shown at the end of this message.

  • Delete the space at the end of each protein accession ID (for the Ensembl-PRO and STRG databases) [see checks 3 and 5 at the end of this message].
  • Reformat indels and SNVs (coming from CustomProDB) [checks 1 and 2]. This is also done in the Genomic mapping database section of Galaxy-P Tutorial 1 : Database creation. This way our genomic mapping is consistent with our protein database.
  • Add the generic| prefix at the beginning of each header coming from the Ensembl-PRO and STRG databases (their headers are not in a standard format for SearchGUI). See http://compomics.github.io/projects/searchgui/wiki/DatabaseHelp [checks 1, 3, 4 and 5].

Another important point if you follow the Galaxy-P hands-on tutorials: input this modified custom DB to the mz_to_sqlite tool in Tutorial 2: DB search! I had been inputting the original custom DB.

Regex Find And Replace v1.0.0

  1. check: from ">ENSP00000360709_1123:GAT>CAAT " to ">generic|ENSP00000360709_1123:GAT_CAAT"
  • "Find Regex": >(ENS.*_\d+:)([ACGTacgt]+)>([ACGTacgt]+)\s*
  • "Replacement": >generic|\1\2_\3
  2. check: from ">ENSP00000457107_D77*,L258P,L264P,S457G " to ">ENSP00000457107_D77*.L258P.L264P.S457G "
  • "Find Regex": ([A-Z,*][0-9]+[A-Z,*]),
  • "Replacement": \1.
  3. check: delete the space at the end of a protein accession ID (Ensembl DB) and add the prefix "generic|". From ">ENSP00000360709 |" to ">generic|ENSP00000360709|"
  • "Find Regex": >ENS[A-Z]*(.*)\s\|
  • "Replacement": >generic|ENSP\1|
  4. check: from ">STRG00000058243|" to ">generic|STRG00000058243|"
  • "Find Regex": >STRG(\S*)\|
  • "Replacement": >generic|STRG\1|
  5. check: from ">STRG00000058243 |" to ">generic|STRG00000058243|"
  • "Find Regex": >STRG(.*)\s\|
  • "Replacement": >generic|STRG\1|
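
To try the five checks outside Galaxy, the sketch below applies the same find/replace pairs in order with Python's `re.sub`. It assumes the Regex Find And Replace tool applies its checks sequentially with Python-style regex semantics. The `RULES` list and the `clean_header` name are only illustrative. Note that check 3 always writes `ENSP` in the replacement, so it is meant for protein accessions only.

```python
import re

# (find regex, replacement) pairs, in the same order as the checks above.
RULES = [
    (r">(ENS.*_\d+:)([ACGTacgt]+)>([ACGTacgt]+)\s*", r">generic|\1\2_\3"),  # check 1
    (r"([A-Z,*][0-9]+[A-Z,*]),", r"\1."),                                   # check 2
    (r">ENS[A-Z]*(.*)\s\|", r">generic|ENSP\1|"),                           # check 3
    (r">STRG(\S*)\|", r">generic|STRG\1|"),                                 # check 4
    (r">STRG(.*)\s\|", r">generic|STRG\1|"),                                # check 5
]


def clean_header(header):
    """Apply every rule, in order, to a single FASTA header line."""
    for pattern, replacement in RULES:
        header = re.sub(pattern, replacement, header)
    return header


examples = [
    ">ENSP00000360709_1123:GAT>CAAT ",
    ">ENSP00000457107_D77*,L258P,L264P,S457G ",
    ">ENSP00000360709 |",
    ">STRG00000058243|",
    ">STRG00000058243 |",
]
for header in examples:
    print(repr(header), "->", repr(clean_header(header)))
```

In a real FASTA file only the lines starting with ">" should go through `clean_header`; sequence lines are left untouched.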
