Merge pull request #29 from hgb-bin-proteomics/develop

add xiFdrExporter
hgb-bin-proteomics · May 6, 2024 · 608247f
2 parents 93cc846 + 71b1cc7
commit 608247f

Showing 9 changed files with 191 additions and 12 deletions.
diff --git a/README.md b/README.md
@@ -20,38 +20,60 @@ FASTA headers need to follow the UniProtKB standard formatting (as described [*h
 
 All of the scripts use Microsoft Excel files as input; for that, MS Annika results need to be exported from Proteome Discoverer. It is recommended to first filter results according to your needs, e.g. filter for high-confidence crosslinks and filter out decoy crosslinks as depicted below.
 
-![PDFilter](filter.png)
+### Exporting Crosslinks
+
+![PDFilterCrosslinks](img/crosslinks_filtered.png)
+
+**Figure 1:** Crosslinks filtered for 1% estimated FDR and without decoys.
 
 Results can then be exported by selecting `File > Export > To Microsoft Excel… > Level 1: Crosslinks > Export` in Proteome Discoverer.
 
+### Exporting CSMs
+
+![PDFilterCSMsUnvalidated](img/csms_unfiltered.png)
+
+**Figure 2:** All (unvalidated) CSMs.
+
+![PDFilterCSMsValidated](img/csms_filtered.png)
+
+**Figure 3:** CSMs filtered for 1% estimated FDR and without decoys.
+
+Results can then be exported by selecting `File > Export > To Microsoft Excel… > Level 1: CSMs > Export` in Proteome Discoverer.
+
 ## Quick start
 
-- **Exporting to xiNET**  
+- **Exporting to [xiNET](https://crosslinkviewer.org/)**  
   Files needed:
-  - result.xlsx - MS Annika result file(s) exported to .xlsx
+  - result.xlsx - MS Annika crosslink result file(s) exported to .xlsx
   - seq.fasta - FASTA file containing sequences of the crosslinked proteins
   ```
   python xiNetExporter_msannika.py result.xlsx -fasta seq.fasta
   ```
-- **Exporting to xiVIEW**  
+- **Exporting to [xiVIEW](https://xiview.org/xiNET_website/index.php)**  
   Files needed:
-  - result.xlsx - MS Annika result file(s) exported to .xlsx
+  - result.xlsx - MS Annika crosslink result file(s) exported to .xlsx
   - seq.fasta - FASTA file containing sequences of the crosslinked proteins
   ```
   python xiViewExporter_msannika.py result.xlsx -fasta seq.fasta
   ```
-- **Exporting to pyXlinkViewer (pyMOL)**  
+- **Exporting to [xiFDR](https://github.com/Rappsilber-Laboratory/xiFDR)**  
+  Files needed:
+  - result.xlsx - MS Annika CSM result file (unvalidated) exported to .xlsx
+  ```
+  python xiFdrExporter_msannika.py result.xlsx
+  ```
+- **Exporting to [pyXlinkViewer (pyMOL)](https://github.com/BobSchiffrin/PyXlinkViewer)**  
   Files needed:
-  - result.xlsx - MS Annika result file(s) exported to .xlsx
+  - result.xlsx - MS Annika crosslink result file(s) exported to .xlsx
   - structure.pdb - 3D structure of the protein (complex) that crosslinks should be mapped to; alternatively, you can provide just the 4-letter code from the [PDB](https://www.rcsb.org/) and the script will fetch the structure from the internet
   ```
   python pyXlinkViewerExporter_msannika.py result.xlsx -pdb structure.pdb
   ```
-- **Exporting to XLMS-Tools**  
+- **Exporting to [XLMS-Tools](https://gitlab.com/topf-lab/xlms-tools)**  
   XLMS-Tools uses the same file format as pyXlinkViewer, therefore the same exporter can be used!
-- **Exporting to XMAS (ChimeraX)**  
+- **Exporting to [XMAS (ChimeraX)](https://github.com/ScheltemaLab/ChimeraX_bundle)**  
   Visualization of MS Annika results works out of the box with .xlsx files exported from Proteome Discoverer.
-- **Exporting to PAE Viewer**  
+- **Exporting to [PAE Viewer](http://www.subtiwiki.uni-goettingen.de/v4/paeViewerDemo)**  
   Files needed:
   - pyXlinkViewer_export.csv - Crosslinks exported from pyXlinkViewer as .csv
   ```
@@ -142,9 +164,45 @@ Or using the Windows binary:
 xiViewExporter_msannika.exe "202001216_nsp8_trypsin_XL_REP1.xlsx" "202001216_nsp8_trypsin_XL_REP2.xlsx" "202001216_nsp8_trypsin_XL_REP3.xlsx" --fasta SARS-COV-2.fasta -o test --ignore P0DTC1 P0DTD1 P0DTC2
 ```
 
+## Export to [xiFDR](https://github.com/Rappsilber-Laboratory/xiFDR)
+
+```
+EXPORTER DESCRIPTION:
+A script to export MS Annika CSM results (.xlsx) to a xiFDR input file (.csv).
+CSMs should be unfiltered, i.e. they should include decoys and not be validated
+for any FDR.
+Warning: This exporter currently only reports one/the first protein for
+         ambiguous peptides that are found in more than one protein!
+USAGE:
+xiFdrExporter_msannika.py f
+                            [-o OUTPUT]
+                            [-h]
+                            [--version]
+positional arguments:
+  f                     Crosslink-Spectrum-Matches (CSMs) exported from
+                        MS Annika in Microsoft Excel (.xlsx) format.
+optional arguments:
+  -o OUTPUT, --output OUTPUT
+                        Prefix of the output file.
+  -h, --help            show this help message and exit
+  --version             show program's version number and exit
+```
+
+Example usage:
+
+```
+python xiFdrExporter_msannika.py XLpeplib_Beveridge_QEx-HFX_DSS_R1.xlsx
+```
+
+Or using the Windows binary:
+
+```
+xiFdrExporter_msannika.exe XLpeplib_Beveridge_QEx-HFX_DSS_R1.xlsx
+```
+
 ## Export to [PyXlinkViewer for pyMOL](https://github.com/BobSchiffrin/PyXlinkViewer)
 
-A schematic workflow of the implementation can be seen in [*this figure*](workflow_pyMOLexporter.png).
+A schematic workflow of the implementation can be seen in [*this figure*](img/workflow_pyMOLexporter.png).
 
 ```
 EXPORTER DESCRIPTION:
@@ -217,7 +275,7 @@ Visualization of crosslinks with [XMAS](https://github.com/ScheltemaLab/ChimeraX
 
 Evaluating predicted structures (e.g. structures created with AlphaFold2) using cross-linking data can easily be done using [PAE Viewer](http://www.subtiwiki.uni-goettingen.de/v4/paeViewerDemo). Exporting MS Annika results to the input format of PAE Viewer requires first exporting to pyXlinkViewer (pyMOL) and then exporting crosslinks from pyXlinkViewer to CSV, as shown in the pyMOL screenshot below:
 
-![pyMOLExportScreenshot](pyXlinkViewer_XL_export.png)
+![pyMOLExportScreenshot](img/pyXlinkViewer_XL_export.png)
 
 The exporter takes the following arguments:
 ```

diff --git a/binaries/windows/xiFdrExporter_msannika.exe b/binaries/windows/xiFdrExporter_msannika.exe
diff --git a/example_files/XLpeplib_Beveridge_QEx-HFX_DSS_R1.xlsx b/example_files/XLpeplib_Beveridge_QEx-HFX_DSS_R1.xlsx
diff --git a/filter.png → img/crosslinks_filtered.png b/filter.png → img/crosslinks_filtered.png
diff --git a/img/csms_filtered.png b/img/csms_filtered.png
diff --git a/img/csms_unfiltered.png b/img/csms_unfiltered.png
diff --git a/pyXlinkViewer_XL_export.png → img/pyXlinkViewer_XL_export.png b/pyXlinkViewer_XL_export.png → img/pyXlinkViewer_XL_export.png
diff --git a/workflow_pyMOLexporter.png → img/workflow_pyMOLexporter.png b/workflow_pyMOLexporter.png → img/workflow_pyMOLexporter.png
diff --git a/xiFdrExporter_msannika.py b/xiFdrExporter_msannika.py
@@ -0,0 +1,121 @@
+#!/usr/bin/env python3
+
+# Exporter of MS Annika CSM Results to xiFDR input format
+# 2024 (c) Micha Johannes Birklbauer
+# https://github.com/michabirklbauer/
+# micha.birklbauer@gmail.com
+
+import argparse
+import pandas as pd
+
+__version = "1.0.1"
+__date = "20240505"
+
+"""
+DESCRIPTION:
+A script to export MS Annika CSM results (.xlsx) to a xiFDR input file (.csv).
+CSMs should be unfiltered, i.e. they should include decoys and not be validated
+for any FDR.
+Warning: This exporter currently only reports one/the first protein for
+         ambiguous peptides that are found in more than one protein!
+USAGE:
+xiFdrExporter_msannika.py f
+                            [-o OUTPUT]
+                            [-h]
+                            [--version]
+positional arguments:
+  f                     Crosslink-Spectrum-Matches (CSMs) exported from
+                        MS Annika in Microsoft Excel (.xlsx) format.
+optional arguments:
+  -o OUTPUT, --output OUTPUT
+                        Prefix of the output file.
+  -h, --help            show this help message and exit
+  --version             show program's version number and exit
+"""
+
+# Exporter class with constructor that takes one MS Annika CSM result file as
+# input. CSMs should be unfiltered and exported from Proteome Discoverer to
+# Microsoft Excel .xlsx format.
+class MSAnnika_Exporter:
+
+    def __init__(self, input_file: str):
+        self.input_file = input_file
+
+    # static method to generate pandas dataframe of xiFDR export without class
+    # instance. Takes the file name of the CSM file as input.
+    @staticmethod
+    def generate_df(input_file: str) -> pd.DataFrame:
+
+        print("Warning: This exporter currently only reports one/the first protein for ambiguous peptides that are found in more than one protein!")
+
+        df = pd.read_excel(input_file)
+        df.rename(columns = {"Spectrum File": "run",
+                             "First Scan": "scan",
+                             "Sequence A": "peptide1",
+                             "Sequence B": "peptide2",
+                             "Crosslinker Position A": "peptide link 1",
+                             "Crosslinker Position B": "peptide link 2",
+                             "Charge": "precursor charge",
+                             "Combined Score": "score",
+                             "Score Alpha": "peptide1 score",
+                             "Score Beta": "peptide2 score",
+                             "Accession A": "accession1",
+                             "Accession B": "accession2",
+                             "A in protein": "peptide position 1",
+                             "B in protein": "peptide position 2"},
+                  inplace = True,
+                  errors = "raise")
+        # remove the following two lines once it is clear how to denote ambiguous peptides in xiFDR (e.g. peptides that map to more than one protein)
+        df["accession1"] = df["accession1"].apply(lambda x: str(x).split(";")[0])
+        df["accession2"] = df["accession2"].apply(lambda x: str(x).split(";")[0])
+        df["is decoy 1"] = df["Alpha T/D"].apply(lambda x: "false" if "t" in str(x).lower() else "true")
+        df["is decoy 2"] = df["Beta T/D"].apply(lambda x: "false" if "t" in str(x).lower() else "true")
+        # same issue again - this would be used if xiFDR allows more than one protein per peptide
+        #df["peptide position 1"] = df["peptide position 1"].apply(lambda x: ";".join([str(int(y) + 1) for y in str(x).split(";")]))
+        #df["peptide position 2"] = df["peptide position 2"].apply(lambda x: ";".join([str(int(y) + 1) for y in str(x).split(";")]))
+        # remove the following two lines once the above is resolved; str() is needed because pandas reads purely numeric columns as int
+        df["peptide position 1"] = df["peptide position 1"].apply(lambda x: int(str(x).split(";")[0]) + 1)
+        df["peptide position 2"] = df["peptide position 2"].apply(lambda x: int(str(x).split(";")[0]) + 1)
+
+        return df
+
+    # instance-method wrapper around the static generate_df
+    def __generate_csv_df(self) -> pd.DataFrame:
+        return self.generate_df(self.input_file)
+
+    # export function, takes one argument "output_file" which sets the prefix
+    # of generated output file
+    def export(self, output_file: str = None) -> pd.DataFrame:
+        csv = self.__generate_csv_df()
+
+        if output_file is None:
+            output_file = ".".join(self.input_file.split(".")[:-1])
+
+        csv.to_csv(output_file + "_xiFDR.csv", index = False)
+
+        return csv
+
+# initialize exporter and export xiFDR csv file
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(metavar = "f",
+                        dest = "file",
+                        help = "Name/Path of the MS Annika CSM result file (in .xlsx format) to process.",
+                        type = str,
+                        nargs = 1)
+    parser.add_argument("-o", "--output",
+                        dest = "output",
+                        default = None,
+                        help = "Prefix of the output file.",
+                        type = str)
+    parser.add_argument("--version",
+                        action = "version",
+                        version = __version)
+    args = parser.parse_args()
+
+    exporter = MSAnnika_Exporter(args.file[0])
+
+    exporter.export(args.output)
+
+if __name__ == "__main__":
+    main()
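
The per-row rules that `generate_df` applies (keep only the first `;`-separated accession, add 1 to the first protein position, derive the decoy flags from the T/D columns) can be illustrated without pandas or an Excel file. The following is a minimal sketch, not the exporter's code: the helper names `first_entry`, `shifted_position`, `decoy_flag` and `normalize_row` are hypothetical, and the example row is made up apart from the column names and accessions, which are taken from the diff and README above.

```python
from typing import Any, Dict


def first_entry(cell: Any) -> str:
    """Return the first ';'-separated entry of a cell as a string."""
    # str() keeps this safe for cells that were read as numbers
    return str(cell).split(";")[0]


def shifted_position(cell: Any) -> int:
    """Take the first position in a cell and add 1, as the diff does."""
    return int(first_entry(cell)) + 1


def decoy_flag(cell: Any) -> str:
    """Mirror the diff's rule: a 't' anywhere in the flag means target."""
    return "false" if "t" in str(cell).lower() else "true"


def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the three rules to one CSM row given as a plain dict."""
    return {
        "accession1": first_entry(row["Accession A"]),
        "accession2": first_entry(row["Accession B"]),
        "peptide position 1": shifted_position(row["A in protein"]),
        "peptide position 2": shifted_position(row["B in protein"]),
        "is decoy 1": decoy_flag(row["Alpha T/D"]),
        "is decoy 2": decoy_flag(row["Beta T/D"]),
    }


# Hypothetical row: peptide A is ambiguous (two proteins), peptide B is not,
# so its position arrives as a plain integer rather than a string.
example_row = {
    "Accession A": "P0DTC1;P0DTD1",
    "Accession B": "P0DTD1",
    "A in protein": "10;25",
    "B in protein": 41,
    "Alpha T/D": "T",
    "Beta T/D": "D",
}

result = normalize_row(example_row)
print(result)
```

Reading each cell through `str()` matters here because a position column without any ambiguous peptides holds plain integers, which have no `split` method.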