Merge pull request #29 from hgb-bin-proteomics/develop
add xiFdrExporter
michabirklbauer authored May 6, 2024
2 parents 93cc846 + 71b1cc7 commit 608247f
Showing 9 changed files with 191 additions and 12 deletions.
82 changes: 70 additions & 12 deletions README.md
@@ -20,38 +20,60 @@ FASTA headers need to follow the UniProtKB standard formatting (as described [*h

All of the scripts use Microsoft Excel files as input, so MS Annika results need to be exported from Proteome Discoverer first. It is recommended to filter results according to your needs beforehand, e.g. filter for high-confidence crosslinks and filter out decoy crosslinks as depicted below.

![PDFilter](filter.png)
### Exporting Crosslinks

![PDFilterCrosslinks](img/crosslinks_filtered.png)

**Figure 1:** Crosslinks filtered for 1% estimated FDR and without decoys.

Results can then be exported by selecting `File > Export > To Microsoft Excel… > Level 1: Crosslinks > Export` in Proteome Discoverer.

### Exporting CSMs

![PDFilterCSMsUnvalidated](img/csms_unfiltered.png)

**Figure 2:** All (unvalidated) CSMs.

![PDFilterCSMsValidated](img/csms_filtered.png)

**Figure 3:** CSMs filtered for 1% estimated FDR and without decoys.

Results can then be exported by selecting `File > Export > To Microsoft Excel… > Level 1: CSMs > Export` in Proteome Discoverer.

## Quick start

- **Exporting to xiNET**
- **Exporting to [xiNET](https://crosslinkviewer.org/)**
Files needed:
- result.xlsx - MS Annika result file(s) exported to .xlsx
- result.xlsx - MS Annika crosslink result file(s) exported to .xlsx
- seq.fasta - FASTA file containing sequences of the crosslinked proteins
```
python xiNetExporter_msannika.py result.xlsx -fasta seq.fasta
```
- **Exporting to xiVIEW**
- **Exporting to [xiVIEW](https://xiview.org/xiNET_website/index.php)**
Files needed:
- result.xlsx - MS Annika result file(s) exported to .xlsx
- result.xlsx - MS Annika crosslink result file(s) exported to .xlsx
- seq.fasta - FASTA file containing sequences of the crosslinked proteins
```
python xiViewExporter_msannika.py result.xlsx -fasta seq.fasta
```
- **Exporting to pyXlinkViewer (pyMOL)**
- **Exporting to [xiFDR](https://github.com/Rappsilber-Laboratory/xiFDR)**
Files needed:
- result.xlsx - MS Annika CSM result file (unvalidated) exported to .xlsx
```
python xiFdrExporter_msannika.py result.xlsx
```
- **Exporting to [pyXlinkViewer (pyMOL)](https://github.com/BobSchiffrin/PyXlinkViewer)**
Files needed:
- result.xlsx - MS Annika result file(s) exported to .xlsx
- result.xlsx - MS Annika crosslink result file(s) exported to .xlsx
- structure.pdb - 3D structure of the protein (complex) that crosslinks should be mapped to; alternatively, you can provide just the 4-letter code from the [PDB](https://www.rcsb.org/) and the script will fetch the structure from the internet
```
python pyXlinkViewerExporter_msannika.py result.xlsx -pdb structure.pdb
```
- **Exporting to XLMS-Tools**
- **Exporting to [XLMS-Tools](https://gitlab.com/topf-lab/xlms-tools)**
XLMS-Tools uses the same file format as pyXlinkViewer, therefore the same exporter can be used!
- **Exporting to XMAS (ChimeraX)**
- **Exporting to [XMAS (ChimeraX)](https://github.com/ScheltemaLab/ChimeraX_bundle)**
Visualization of MS Annika results works out of the box with .xlsx files exported from Proteome Discoverer.
- **Exporting to PAE Viewer**
- **Exporting to [PAE Viewer](http://www.subtiwiki.uni-goettingen.de/v4/paeViewerDemo)**
Files needed:
- pyXlinkViewer_export.csv - Crosslinks exported from pyXlinkViewer as .csv
```
@@ -142,9 +164,45 @@ Or using the Windows binary:
xiViewExporter_msannika.exe "202001216_nsp8_trypsin_XL_REP1.xlsx" "202001216_nsp8_trypsin_XL_REP2.xlsx" "202001216_nsp8_trypsin_XL_REP3.xlsx" --fasta SARS-COV-2.fasta -o test --ignore P0DTC1 P0DTD1 P0DTC2
```

## Export to [xiFDR](https://github.com/Rappsilber-Laboratory/xiFDR)

```
EXPORTER DESCRIPTION:
A script to export MS Annika CSM results (.xlsx) to a xiFDR input file (.csv).
CSMs should be unfiltered, therefore include decoys and not be validated for any
FDR.
Warning: This exporter currently only reports one/the first protein for
ambiguous peptides that are found in more than one protein!
USAGE:
xiFdrExporter_msannika.py f [f]
[-o OUTPUT]
[-h]
[--version]
positional arguments:
f Crosslink-Spectrum-Matches (CSMs) exported from
MS Annika in Microsoft Excel (.xlsx) format.
optional arguments:
-o OUTPUT, --output OUTPUT
Prefix of the output file.
-h, --help show this help message and exit
--version show program's version number and exit
```
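
The description above asks for unfiltered CSMs that still include decoys. A quick sanity check on the exported table could look like the sketch below. It assumes the rows have already been loaded into dictionaries and that the `Alpha T/D` and `Beta T/D` columns hold labels such as `T` and `D`; the helper names `is_decoy` and `count_decoy_csms` are hypothetical and not part of the exporter.

```python
from typing import Iterable, Mapping


def is_decoy(label: object) -> str:
    """Mirror the exporter's rule: a label containing 't' is treated as target."""
    return "false" if "t" in str(label).lower() else "true"


def count_decoy_csms(rows: Iterable[Mapping[str, object]]) -> int:
    """Count CSMs in which at least one of the two peptides is a decoy."""
    return sum(
        1
        for row in rows
        if is_decoy(row["Alpha T/D"]) == "true" or is_decoy(row["Beta T/D"]) == "true"
    )


# Illustrative rows only; real data would come from the exported .xlsx file.
rows = [
    {"Alpha T/D": "T", "Beta T/D": "T"},
    {"Alpha T/D": "T", "Beta T/D": "D"},
    {"Alpha T/D": "D", "Beta T/D": "D"},
]

n_decoys = count_decoy_csms(rows)
if n_decoys == 0:
    print("No decoy CSMs found; the export may have been filtered.")
else:
    print(f"{n_decoys} of {len(rows)} CSMs contain at least one decoy peptide.")
```

If the count is zero, the file was most likely exported from a filtered view such as the one in Figure 3 rather than the unvalidated view in Figure 2.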

Example usage:

```
python xiFdrExporter_msannika.py XLpeplib_Beveridge_QEx-HFX_DSS_R1.xlsx
```

Or using the Windows binary:

```
xiFdrExporter_msannika.exe XLpeplib_Beveridge_QEx-HFX_DSS_R1.xlsx
```
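
The warning about ambiguous peptides means that only the first entry of a `;`-separated accession or position list ends up in the output, with the position shifted by one as the script does. The sketch below illustrates that behaviour on plain values. The function names are hypothetical, and the `str()`/`float()` conversion is an added assumption to cope with cells that a spreadsheet reader returns as numbers instead of text.

```python
def first_accession(accessions: str) -> str:
    """Keep only the first protein accession of a ';'-separated list."""
    return accessions.split(";")[0]


def first_position_shifted(positions: object) -> int:
    """Keep the first position of a ';'-separated list and add one.

    The value is converted to text first, so both "11;250" and a bare
    number such as 11 or 11.0 are accepted.
    """
    first = str(positions).split(";")[0]
    return int(float(first)) + 1


# Accessions taken from the xiVIEW example above; positions are made up.
print(first_accession("P0DTC1;P0DTD1"))
print(first_position_shifted("11;250"))
print(first_position_shifted(11))
```

Any additional proteins a peptide maps to are dropped, which is worth keeping in mind when interpreting protein-level results downstream.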

## Export to [PyXlinkViewer for pyMOL](https://github.com/BobSchiffrin/PyXlinkViewer)

A schematic workflow of the implementation can be seen in [*this figure*](workflow_pyMOLexporter.png).
A schematic workflow of the implementation can be seen in [*this figure*](img/workflow_pyMOLexporter.png).

```
EXPORTER DESCRIPTION:
@@ -217,7 +275,7 @@ Visualization of crosslinks with [XMAS](https://github.com/ScheltemaLab/ChimeraX

Evaluating predicted structures (e.g. structures created with AlphaFold2) using cross-linking data can easily be done using [PAE Viewer](http://www.subtiwiki.uni-goettingen.de/v4/paeViewerDemo). Exporting MS Annika results to the input format of PAE Viewer requires first exporting to pyXlinkViewer (pyMOL) and then exporting crosslinks from pyXlinkViewer to CSV, as shown in the pyMOL screenshot below:

![pyMOLExportScreenshot](pyXlinkViewer_XL_export.png)
![pyMOLExportScreenshot](img/pyXlinkViewer_XL_export.png)

The exporter takes the following arguments:
```
Binary file added binaries/windows/xiFdrExporter_msannika.exe
Binary file added img/csms_filtered.png
Binary file added img/csms_unfiltered.png
121 changes: 121 additions & 0 deletions xiFdrExporter_msannika.py
@@ -0,0 +1,121 @@
#!/usr/bin/env python3

# Exporter of MS Annika CSM Results to xiFDR input format
# 2024 (c) Micha Johannes Birklbauer
# https://github.com/michabirklbauer/
# micha.birklbauer@gmail.com

import argparse
import pandas as pd

__version = "1.0.1"
__date = "20240505"

"""
DESCRIPTION:
A script to export MS Annika CSM results (.xlsx) to a xiFDR input file (.csv).
CSMs should be unfiltered, therefore include decoys and not be validated for any
FDR.
Warning: This exporter currently only reports one/the first protein for
ambiguous peptides that are found in more than one protein!
USAGE:
xiFdrExporter_msannika.py f [f]
[-o OUTPUT]
[-h]
[--version]
positional arguments:
f Crosslink-Spectrum-Matches (CSMs) exported from
MS Annika in Microsoft Excel (.xlsx) format.
optional arguments:
-o OUTPUT, --output OUTPUT
Prefix of the output file.
-h, --help show this help message and exit
--version show program's version number and exit
"""

# Exporter class with constructor that takes one MS Annika CSM result file as
# input. CSMs should not be in any way filtered and exported to Microsoft Excel
# .xlsx format from Proteome Discoverer.
class MSAnnika_Exporter:

    def __init__(self, input_file: str):
        self.input_file = input_file

    # static method to generate pandas dataframe of xiFDR export without class
    # instance. Takes the file name of the CSM file as input.
    @staticmethod
    def generate_df(input_file: str) -> pd.DataFrame:

        print("Warning: This exporter currently only reports one/the first protein for ambiguous peptides that are found in more than one protein!")

        df = pd.read_excel(input_file)
        df.rename(columns = {"Spectrum File": "run",
                             "First Scan": "scan",
                             "Sequence A": "peptide1",
                             "Sequence B": "peptide2",
                             "Crosslinker Position A": "peptide link 1",
                             "Crosslinker Position B": "peptide link 2",
                             "Charge": "precursor charge",
                             "Combined Score": "score",
                             "Score Alpha": "peptide1 score",
                             "Score Beta": "peptide2 score",
                             "Accession A": "accession1",
                             "Accession B": "accession2",
                             "A in protein": "peptide position 1",
                             "B in protein": "peptide position 2"},
                  inplace = True,
                  errors = "raise")
        # remove the following two lines if I find out how to denote ambiguous peptides in xiFDR (e.g. peptides that link to more than one protein)
        df["accession1"] = df["accession1"].apply(lambda x: x.split(";")[0])
        df["accession2"] = df["accession2"].apply(lambda x: x.split(";")[0])
        df["is decoy 1"] = df["Alpha T/D"].apply(lambda x: "false" if "t" in str(x).lower() else "true")
        df["is decoy 2"] = df["Beta T/D"].apply(lambda x: "false" if "t" in str(x).lower() else "true")
        # same issue again - this would be used if xiFDR allows more than one protein per peptide
        #df["peptide position 1"] = df["peptide position 1"].apply(lambda x: ";".join([str(int(y) + 1) for y in str(x).split(";")]))
        #df["peptide position 2"] = df["peptide position 2"].apply(lambda x: ";".join([str(int(y) + 1) for y in str(x).split(";")]))
        # remove the following two lines if I figure above out
        df["peptide position 1"] = df["peptide position 1"].apply(lambda x: int(x.split(";")[0]) + 1)
        df["peptide position 2"] = df["peptide position 2"].apply(lambda x: int(x.split(";")[0]) + 1)

        return df

    # instance-method wrapper around the static generate_df
    def __generate_csv_df(self) -> pd.DataFrame:
        return self.generate_df(self.input_file)

    # export function, takes one argument "output_file" which sets the prefix
    # of generated output file
    def export(self, output_file: str = None) -> pd.DataFrame:
        csv = self.__generate_csv_df()

        if output_file is None:
            output_file = ".".join(self.input_file.split(".")[:-1])

        csv.to_csv(output_file + "_xiFDR.csv", index = False)

        return csv

# initialize exporter and export xiFDR csv file
def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(metavar = "f",
                        dest = "file",
                        help = "Name/Path of the MS Annika CSM result file (in .xlsx format) to process.",
                        type = str,
                        nargs = 1)
    parser.add_argument("-o", "--output",
                        dest = "output",
                        default = None,
                        help = "Prefix of the output file.",
                        type = str)
    parser.add_argument("--version",
                        action = "version",
                        version = __version)
    args = parser.parse_args()

    exporter = MSAnnika_Exporter(args.file[0])

    exporter.export(args.output)

if __name__ == "__main__":
    main()
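
The `export` method derives the output name from the input path when no `-o` prefix is given. The sketch below reproduces that naming rule in isolation and compares it with a variant based on `os.path.splitext`. Both function names are hypothetical, and the `splitext` variant is only a possible alternative, not what the script does.

```python
import os
from typing import Optional


def script_style_output(input_file: str, prefix: Optional[str] = None) -> str:
    """Same rule as the script: drop the text after the last '.' and add the suffix."""
    if prefix is None:
        prefix = ".".join(input_file.split(".")[:-1])
    return prefix + "_xiFDR.csv"


def splitext_style_output(input_file: str, prefix: Optional[str] = None) -> str:
    """Alternative rule: let os.path.splitext decide what counts as the extension."""
    if prefix is None:
        prefix = os.path.splitext(input_file)[0]
    return prefix + "_xiFDR.csv"


for name in ("result.xlsx", "run.rep1.xlsx", "result"):
    print(name, "->", script_style_output(name), "|", splitext_style_output(name))
```

The two rules agree for typical `.xlsx` inputs. They differ for a path without any dot, where the script-style rule yields an empty prefix and therefore the file name `_xiFDR.csv`.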
