This codebase is being developed at Mass Dynamics https://www.massdynamics.com/ to facilitate work on the Protein Inference Problem.
It has been developed using the output of crux comet/percolator (eg: percolator.target.psms.txt), but requires only that psm tables contain q-value and score columns, although more information about psms could theoretically be integrated into scoring methods.
This codebase could be used for the following purposes:
* to develop or benchmark protein inference algorithms using shared data processing and parallelization infrastructure
* to visualize the peptide-protein networks that are being reasoned about by a protein inference algorithm
* to opensource an algorithm that is implimented inside this package
The REPRISAL algorithm, currently unpublished, was the first novel algorithm implemented inside the protein_inference framework and is contained in the distinct subpackage "reprisal".
REPRISAL stands for "REcursive PRotein Inference Scoring ALgorithm" and works by iteratively scoring proteins by their associated peptides and grouping all associated and unallocated peptides with the highest scoring protein at each iteration. FDR's are then calculated as per the score distributions for proteins in the target and decoy tables.
pip install protein-inference
sphinx-build -b html docs/source/ docs/build
conda create -y --name protein-inference python=3.8 --file requirements.txt
Many of these are just used in the benchmarking suite for visualization.
If you have a psms list from percolator (or otherwise) ready to go, you can run the default pipeline easily:
import protein_inference as pi
target_path = "path/to/your/target.psms.txt"
decoy_path = "path/to/your/decoy.psms.txt"
output_directory = "path/to/your/output_directory"
pi.ProteinInferenceRunner().run(target_path,
decoy_path,
output_directory)
You can also run the package via the commandline if you can specify the path to the python directory where you have protein_inference installed.
python -m protein_inference.main \
--output-directory path/to/out/folder \
--target-path path/to/your/target.psms.txt \
--decoy-path path/to/your/decoy.psms.txt
If you'd like to play around with some processed data/visualization, install streamlit and run the protein inference playground app.
pip install streamlit
cd streamlit
streamlit run pi_playground.py -- --folder /path/to/your/pi/output/
You can also start the app without a path to an upload folder, and upload your output files after the fact.
A docker file has been created which creates an environment to run the protein_inference package via a command line interface.
You can build the image from the docker file as so:
docker build --pull --rm -f "Dockerfile" -t proteininference:latest "."
Then you can run the main file from the command line by using a /data volume and passing arguments with paths to files in that folder.
docker run -it --rm -v /data:/data \
proteininference conda run -n protein-inference -c python -m protein_inference.main
--output-directory /data/output \
--target-path /data/input/percolator.target.psms.txt \
--decoy-path /data/input/percolator.decoy.psms.txt
The benchmarking suite is currently undocumented as it is still being developed. I'll get around to this as I work towards publishing REPRISAL as a paper.
While not currently in development, possible additions to this codebase are:
- Writing file converters to accept mzIdentML
- Integrating Baysian methods such as Fido or EPIFANY.
- Integrating an optimisation method, maybe using a package like PULP.
- Further development of benchmarking suite to allow easy comparison of existing algorithms.
Feel free to download the png or pdf of the ASMS 2022 poster on protein inference in python.
Click to Download the poster pdf
NB: Table 1 was removed when trying to improve the app asthetics. I've provided it here for reference
https://share.streamlit.io/massdynamics/protein-inference/main/pi_playground.py