Authors: Andrés Aldana, Michael Sebek, Gordana Ispirova, Rodrigo Dorantes-Gilardi, Giulia Menichetti (giulia.menichetti@channing.harvard.edu)
Network medicine is a post-genomic discipline that harnesses network science principles to analyze the intricate interactions within biological systems, viewing diseases as localized disruptions in networks of genes, proteins, and other molecular entities 1.
The structure of the biological network plays an essential role in the system’s ability to efficiently propagate signals and withstand random failures. Consequently, most analyses in Network Medicine focus on quantifying the efficiency of the communication between different regions of the interactome or protein-protein interaction network.
NetMedPy evaluates network localization (statistical analysis of the largest connected component/subgraph or LCC) 2, calculates proximity 3 and separation 2 between biological entities, and conducts screenings involving a large number of diseases and drug targets. NetMedPy extends the traditional Network Medicine analyses by providing four default network metrics (shortest paths, random walk, biased random walk, communicability) and four null models (perfect degree match, degree logarithmic binning, strength logarithmic binning, uniform). The user is allowed to introduce custom metrics and null models.
The pipeline workflow is depicted in the figure below.
This Python implementation uses precomputed distance matrices to optimize calculations. With precalculated distances between every node pair, the code can rapidly compute proximity and separation.
- Installing the necessary dependencies:
Working with Conda is recommended, but it is not essential. If you choose to work with Conda, these are the steps you need to take:
-
Ensure you have Conda installed.
-
Download the environment.yml and navigate to the directory of your local/remote machine where the file is located.
-
Create a new conda environment with the
environment.yml
file:conda env create -f environment.yml
-
Activate your new conda environment:
conda activate netmedpy_environment
-
Ensure the following dependencies are installed before proceeding:
pip install networkx seaborn matplotlib numpy pandas ray scipy
- Install the package:
pip install netmedpy
-
Download the directory
examples
. -
Navigate to the directory
examples
:
cd /user_path_to/examples
- Run the
Basic_example.py
script using Python 3 or higher:
python Basic_example.py
-
Ensure you have Python installed.
-
Copy the project to your local or remote machine:
git clone https://github.com/menicgiulia/NetMedPy.git
- Navigate to the project directory:
cd NetMedPy-main
- Installing the necessary dependencies:
Working with Conda is recommended, but it is not essential. If you choose to work with Conda, these are the steps you need to take:
-
Ensure you have Conda installed.
-
Create a new conda environment with the
environment.yml
file:conda env create -f environment.yml
-
Activate your new conda environment:
conda activate netmedpy_environment
-
Ensure the following dependencies are installed before proceeding:
pip install networkx seaborn matplotlib numpy pandas ray scipy
-
Set up your PYTHONPATH (Replace
/user_path_to/NetMedPy-main/netmedpy
with the appropriate path of the package in your local/remote machine.):On Linux/Mac:
export PYTHONPATH="/user_path_to/NetMedPy-main/netmedpy":$PYTHONPATH
On Windows shell:
set PYTHONPATH="C:\\user_path_to\\NetMedPy-main\\netmedpy";%PYTHONPATH%
On Powershell:
$env:PYTHONPATH = "C:\\user_path_to\\NetMedPy-main\\netmedpy;" + $env:PYTHONPATH
-
Navigate to the directory
examples
:
cd examples
- Run the
Basic_example.py
script using Python 3 or higher (up to 3.11.9, due to conflicts withRay
):
python Basic_example.py
To test the pipeline, we refer to the Vitamin D example, which can be found in the examples/VitaminD
directory. There are two files that you can use for testing:
- A Python script:
VitD_pipeline.py
- A Jupyter notebook:
VitD_pipeline.ipynb
-
Download the
examples
directory: If you haven't already done so, download theexamples
directory from the repository to your local or remote machine. This directory contains all the necessary files to run the examples. -
Prepare the Data: In the subdirectory
VitaminD/data
there are the files that contain the necessary data to execute the example, ensure the data files there. The output files will be stored in theVitaminD/output
subdirectory. -
Navigate to the
VitaminD
directory:
cd /user_path_to/examples/VitaminD
- Run the Example:
python VitD_pipeline.py
-
Make sure you have the
jupyter
package installed.pip install jupyter
-
Start the Jupyter Kernel
a) If you are working on a local machine:
jupyter notebook --browser="browser_of_choice"
Replace
browser_of_choice
with your preferred browser (e.g., chrome, firefox). The browser window should pop up automatically. If it doesn't, copy and paste the link provided in the terminal into your browser. The link should look something like this:b) If you are working on a remote machine:
jupyter notebook --no-browser
Then copy and paste the link provided in the terminal in your local browser of choice. It should look something like this:
-
Navigate to the
VitD_pipeline.ipynb
in the Jupyter Notebook interface and start executing the cells.
-
From a dictionary of diseases
disease_genes
the function lcc_significance will calculate the statistical significance of the size of the Largest Connected Component (LCC) of a subgraph induced by the node setgenes
in the networkppi
. This function generates a null model distribution for the LCC size by resampling nodes from the network while preserving their degrees (null_model="log_binning"
). The statistical significance of the observed LCC size is then determined by comparing it against this null model distribution. -
The parameter
null_model
can bedegree_match
,log_binning
,uniform
, orcustom
(defined by the user).
#Load disease genes dictonary from the pickle file in `examples/VitaminD/data/disease_genes.pkl`
with open("examples/VitaminD/data/disease_genes.pkl","rb") as file:
disease_genes = pickle.load(file)
lcc_size = pd.DataFrame(columns = ["disease","size","zscore","pval"])
for d,genes in disease_genes.items():
data = netmedpy.lcc_significance(ppi, genes,
null_model="log_binning",n_iter=10000)
new_line = [d,data["lcc_size"],data["z_score"],data["p_val"]]
lcc_size.loc[len(lcc_size.index)] = new_line
#Keep only diseases with an LCC larger than 10 and statistically significant
#Filtering the disease sets to the LCC is optional and not mandatory for the subsequent analyses
significant = lcc_size.query("size > 10 and zscore > 2 and pval<0.05")
disease_names = significant.disease
Evaluate Average Minimum Shortest Path Length (AMSPL) between Vitamin D and Inflammation and between Vitamin D and Factor IX Deficiency disease
- The function proximity calculates the proximity between two sets of nodes in a given graph based on the approach described by Guney et al., 2016. The method computes either the average minimum shortest path length (AMSPL) or its symmetrical version (SASPL) between two sets of nodes.
In this example, the function calculates the proximity between the Vitamin D targets stored in examples/VitaminD/data/vitd_targets.pkl
and the disease genes from the examples/VitaminD/data/disease_genes.pkl
file for the two diseases: Inflammation
and Factor IX Deficiency
. The null model of choice, in this case, is log_binning
.
- The function returns a dictionary containing various statistics related to proximity, including:
- 'd_mu': The average distance in the randomized samples.
- 'd_sigma': The standard deviation of distances in the randomized samples.
- 'z_score': The z-score of the actual distance in relation to the randomized samples.
- 'p_value_single_tail': One-tail P-value associated with the proximity z-score
- 'p_value_double_tail': Two-tail P-value associated with the proximity z-score
- 'p_val': P-value associated with the z-score.
- 'raw_amspl': The raw average minimum shortest path length between the two sets of interest.
- 'dist': A list containing distances from each randomization iteration.
#Load PPI network
with open("examples/VitaminD/data/ppi_network.pkl","rb") as file:
ppi = pickle.load(file)
#Load drug targets
with open("examples/VitaminD/data/vitd_targets.pkl","rb") as file:
targets = pickle.load(file)
#Load disease genes
with open("examples/VitaminD/data/disease_genes.pkl","rb") as file:
disease_genes = pickle.load(file)
inflammation = netmedpy.proximity(ppi, targets,
dgenes["Inflammation"], sp_distance,
null_model="log_binning",n_iter=10000,
symmetric=False)
factorix = netmedpy.proximity(ppi, targets,
dgenes["Factor IX Deficiency"], sp_distance,
null_model="log_binning",n_iter=10000,
symmetric=False)
plot_histograms(inflammation, factorix)
-
The function
all_pair_distances
calculates distances between every pair of nodes in a graph according to the specified method and returns a DistanceMatrix object. This function supports multiple distance calculation methods, including shortest path, various types of random walks, and user-defined methods. -
The function
screening
screens for relationships between sets of source and target nodes within a given network, evaluating proximity or separation. This function facilitates drug repurposing and other network medicine applications by allowing the assessment of network-based relationships. -
In this example using the
all_pair_distances
function the distance between every pair of nodes in the protein-protein interaction network stored in the fileexamples/VitaminD/data/ppi_network.pkl
are calculated, using different parameters for the method of calculation:random_walk
,biased_random_walk
, andcommunicability
. -
For each calculation of the distance matrix the AMSPL is calculated using the
screening
function evaluatingproximity
.
#Load PPI network
with open("examples/VitaminD/data/ppi_network.pkl","rb") as file:
ppi = pickle.load(file)
#Load drug targets
with open("examples/VitaminD/data/vitd_targets.pkl","rb") as file:
targets = pickle.load(file)
#Load disease genes
with open("examples/VitaminD/data/disease_genes.pkl","rb") as file:
disease_genes = pickle.load(file)
#Shortest Paths
amspl = {"Shortest Path":screen_data["raw_amspl"]}
#Random Walks
sp_distance = netmedpy.all_pair_distances(ppi,distance="random_walk")
screen_data = netmedpy.screening(vit_d, dgenes, ppi,
sp_distance,score="proximity",
properties=["raw_amspl"],
null_model="log_binning",
n_iter=10,n_procs=20)
amspl["Random Walks"] = screen_data["raw_amspl"]
#Biased Random Walks
sp_distance = netmedpy.all_pair_distances(ppi,distance="biased_random_walk")
screen_data = netmedpy.screening(vit_d, dgenes, ppi,
sp_distance,score="proximity",
properties=["raw_amspl"],
null_model="log_binning",
n_iter=10,n_procs=20)
amspl["Biased Random Walks"] = screen_data["raw_amspl"]
#Communicability
sp_distance = netmedpy.all_pair_distances(ppi,distance="communicability")
screen_data = netmedpy.screening(vit_d, dgenes, ppi,
sp_distance,score="proximity",
properties=["raw_amspl"],
null_model="log_binning",
n_iter=10,n_procs=20)
amspl["Communicability"] = screen_data["raw_amspl"]
Root folder organization (init.py files removed for simplicity):
│ .gitignore
│ environment.yml // yml file to create conda enviorement
│ README.md
│ setup.py
│
├───images // directory with figures from paper
│ └───OverviewPipeline.png // pipeline flowchart figure from paper
│
└───examples // directory with working examples using the NetMedPy pipeline
│ │
│ ├───VitamindD // directory with Vitamin D example using the NetMedPy pipeline
│ │ ├───Figures_v2.py // python script to recreate the figures from the paper
│ │ ├───VitD_pipeline.py // python script with Vitamin D example using the NetMedPy pipeline
│ │ ├───VitD_pipeline.ipynb // Jupyter notebook with Vitamin D example using the NetMedPy pipeline
│ │ ├───data // directory with pickle and csv files necessary to get the Vitamin D example working
│ │ │ ├───Alias.csv
│ │ │ ├───disease_genes.pkl
│ │ │ ├───ppi_network.pkl
│ │ │ └───vitd_targets.pkl
│ │ └───output // directory where the output files from the Vitamin D example are saved
│ │
│ └───Basic_example.py // python script with dummy data to test the pipeline
│
└───netmedpy // directory containing the python scripts that contain the functions of the NetMedPy pipeline
├───DistanceMatrix.py
└───NetMedPy.py
- Details about each function (what is it used for, what are the input parameters, the possible values of the input parameters, what is the output) from the pipeline are available in the
netmedpy/NetMedPy.py
script in the comments before each function. - An example on the use of the implemented functions is available in the file `examples/Basic_example.py', which can be executed fairly quickly in order to test the proper installation of the package and it's functionalities.
- A more elaborate example is available in the files
examples/VitaminD/VitD_pipeline.py
andexamples/VitaminD/VitD_pipeline.ipynb
, testing the functions with different parameters for evaluating the role of Vitamin D in the modulation of different diseases from a network medicine perspective. The data files (the protein-protein interation network, the disease genes, and the Vitamin D targets) needed for executing this example are available inexamples/VitaminD/data
.
This project is licensed under the terms of the MIT license.
1 Barabási, A. L., Gulbahce, N., & Loscalzo, J. (2011). Network medicine: a network-based approach to human disease. Nature reviews genetics, 12(1), 56-68.DOI 10.1038/nrg2918 ↩
2 Menche, Jörg, et al. "Uncovering disease-disease relationships through the incomplete interactome." Science 347.6224 (2015). DOI 10.1126/science.1257601 ↩
3 Guney, Emre, et al. "Network-based in silico drug efficacy screening." Nature Communications 7,1 (2015). DOI 10.1038/ncomms10331 ↩