ProHap Graph

Pipeline to transform the ProHap files and PSM lists into a Neo4j-based graph database.

Requirements

The pipeline requires Snakemake and Conda to be installed. You can install both by following the Snakemake guide, Installation via Conda/Mamba.

You will need a running deployment of Neo4j, local or remote, with the APOC plugin installed. We have tested the current version of this pipeline with Neo4j v5.24.2. For a deployment of Neo4j on an Ubuntu server, this tutorial may be helpful.

Input Files and Usage

Required input:

1. The three output files produced by ProHap: concatenated FASTA, haplotype table, and haplotype FASTA. If using one of the published ProHap databases (e.g., the protein haplotypes of the 1000 Genomes panel), these are the F1, F2, and F3 files. For F1, use the full rather than the simplified format.
2. TSV files containing a list of PSMs from a proteomic search using these databases, annotated with ProHap Peptide Annotator. In addition to the standard output of the Peptide Annotator, the following columns should be included:
   - posterior_error_prob: Estimated posterior error probability of the PSM
   - q-value: Estimated q-value of the PSM
   - USI: The universal spectrum identifier for this PSM (ideally also containing the peptide sequence and precursor charge)
   - rt_Abs_error: The absolute difference between the observed and predicted retention time
   - SpectrumTitle: Identifier of the spectrum within the raw file (e.g., "scan=28285")
   - SpectrumFilename: Name of the raw file containing the spectrum, without the suffix (e.g., 240511_S123_plasma_R1 if the PSM comes from a search on the file 240511_S123_plasma_R1.raw). This has to match the file names in the metadata file described below.
3. Metadata file (SDRF or similar) containing information on the raw files used in the search.
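
Since the pipeline expects several columns beyond the standard Peptide Annotator output, it may be worth checking the header of each PSM file before building the configuration. The following is a minimal sketch using only the Python standard library; the function name missing_psm_columns is hypothetical and not part of ProHap Graph, and the check only inspects the header row, not the values.

```python
import csv
import io

# Extra columns named in the list above, expected on top of the
# standard Peptide Annotator output.
REQUIRED_COLUMNS = [
    "posterior_error_prob",
    "q-value",
    "USI",
    "rt_Abs_error",
    "SpectrumTitle",
    "SpectrumFilename",
]


def missing_psm_columns(handle, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the TSV header.

    `handle` is any open text file (or file-like object) holding a
    tab-separated PSM table whose first row is the header.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # empty list if the file has no rows
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # In-memory stand-in for a real file; for actual data use
    # open("your_psms.tsv", newline="") instead (path is a placeholder).
    example = io.StringIO("USI\tq-value\tSpectrumTitle\nsome_usi\t0.001\tscan=28285\n")
    print("Missing columns:", missing_psm_columns(example))
```

An empty list means all of the additional columns are present; any names returned would need to be added to the PSM table before running the pipeline.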

Usage:

1. Clone this repository: git clone https://github.com/ProGenNo/ProHap_Graph.git; cd ProHap_Graph/;
2. Create a configuration file called config.yaml, following the instructions in config_example.yaml.
3. Test Snakemake with a dry run: snakemake --cores <# provided cores> -n -q
4. Run the Snakemake pipeline to build the graph database: snakemake --cores <# provided cores> -p --use-conda
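
Because the SpectrumFilename values in the PSM files have to match the file names in the metadata file, a quick consistency check before launching the pipeline can catch mismatches early. The sketch below is an illustration under stated assumptions, not part of the pipeline: it assumes a tab-separated metadata file whose raw file names sit in a column called comment[data file] (the usual SDRF convention; pass a different name if your metadata differs), and it strips one file suffix such as .raw from the metadata names before comparing. The function name unmatched_spectrum_files is hypothetical.

```python
import csv
import io
from pathlib import PurePath


def unmatched_spectrum_files(psm_handle, metadata_handle,
                             metadata_column="comment[data file]"):
    """Return SpectrumFilename values that have no counterpart in the metadata.

    Both handles are open text files holding tab-separated tables with a
    header row. `metadata_column` is an assumption about the metadata layout.
    """
    psm_names = {
        row["SpectrumFilename"]
        for row in csv.DictReader(psm_handle, delimiter="\t")
    }
    # PurePath(...).stem drops the final suffix, e.g. "run1.raw" -> "run1".
    # Caveat: a suffix-less name containing a dot would also be shortened.
    metadata_names = {
        PurePath(row[metadata_column]).stem
        for row in csv.DictReader(metadata_handle, delimiter="\t")
    }
    return sorted(psm_names - metadata_names)


if __name__ == "__main__":
    # In-memory stand-ins for the real PSM and metadata files.
    psms = io.StringIO(
        "SpectrumFilename\tUSI\n"
        "240511_S123_plasma_R1\tplaceholder_usi\n"
    )
    metadata = io.StringIO(
        "source name\tcomment[data file]\n"
        "S123\t240511_S123_plasma_R1.raw\n"
    )
    print("Unmatched raw files:", unmatched_spectrum_files(psms, metadata))
```

Any names reported here point to PSMs whose raw file is not described in the metadata, which is worth resolving before the run.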

Results

The diagram below summarises the contents of the resulting graph database. Please see the project's Wiki page (currently under development) for details.
