A Python package for network analysis through diffusion label propagation algorithms

multipaths/DiffuPy

Introduction

DiffuPy is a generalizable Python implementation of numerous label propagation algorithms. It supports generic graph formats such as JSON, CSV, GraphML, and GML. Check out DiffuPy's documentation here.

Citation

If you use DiffuPy in your work, please consider citing:

Marín-Llaó, J., et al. (2020). MultiPaths: a python framework for analyzing multi-layer biological networks using diffusion algorithms. Bioinformatics, 37(1), 137-139.

Installation

The latest stable code can be installed from PyPI with:

$ python3 -m pip install diffupy

The most recent code can be installed from the source on GitHub with:

$ python3 -m pip install git+https://github.com/multipaths/DiffuPy.git

For developers, the repository can be cloned from GitHub and installed in editable mode with:

$ git clone https://github.com/multipaths/DiffuPy.git
$ cd DiffuPy
$ python3 -m pip install -e .

Basic Usage

The two required input elements to run diffusion using DiffuPy are:
  1. A network/graph. (see Network-Input Formatting below)
  2. A dataset of scores/ponderations. (see Scores-Input Formatting below)

You can run DiffuPy in either of two ways:

  • Use the Command Line Interface (see below).
  • Use the functions provided in diffupy.diffuse directly from Python:
from diffupy.diffuse import run_diffusion

# DATA INPUT and GRAPH as PATHs -> returned as *PandasDataFrame*
diffusion_scores = run_diffusion('~/data/input_scores.csv', '~/data/network.csv').as_pd_dataframe()

# DATA INPUT and GRAPH as Python OBJECTS -> exported *as_csv*
diffusion_scores = run_diffusion(input_scores, network).as_csv('~/output/diffusion_results.csv')

Methods

The default diffusion method is z, whose statistical normalization has previously been shown to give better performance. Further parameters can be provided to adapt the propagation procedure, such as choosing among the available diffusion methods or providing a custom method function. See diffusion Methods and/or Method modularity.

diffusion_scores_select_method = run_diffusion(input_scores, network, method = 'raw')

from networkx import pagerank # Custom method function

diffusion_scores_custom_method = run_diffusion(input_scores, network, method = pagerank)

You can also provide your own kernel function, or select one of those provided in kernels.py, and pass it as the kernel_method argument. By default, regularised_laplacian_kernel is used.

from diffupy.kernels import p_step_kernel # Custom kernel calculation function

diffusion_scores_custom_kernel_method = run_diffusion(input_scores, network, method = 'raw', kernel_method = p_step_kernel)

So method stands for the diffusion process method, and kernel_method for the kernel calculation method.
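
To make the split between the two arguments concrete, here is a minimal pure-Python sketch of one kernel and one diffusion method. It assumes the textbook definition of the regularised Laplacian kernel, K = (I + sigma2 * L)^-1 with L = D - A, and takes the raw score to be the product of K and the input vector. The function names, the `sigma2` parameter and the toy graph are hypothetical, and this is not DiffuPy's own implementation.

```python
def regularised_laplacian_kernel_sketch(nodes, edges, sigma2=1.0):
    """Return K = (I + sigma2 * L)^-1 as a list of rows, with L = D - A."""
    n = len(nodes)
    idx = {node: i for i, node in enumerate(nodes)}
    # Build M = I + sigma2 * L for an undirected, unweighted graph.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for source, target in edges:
        i, j = idx[source], idx[target]
        m[i][i] += sigma2
        m[j][j] += sigma2
        m[i][j] -= sigma2
        m[j][i] -= sigma2
    # Invert M by Gauss-Jordan elimination on the augmented matrix [M | I].
    aug = [row + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def raw_diffusion_sketch(kernel, scores):
    """Diffuse the input scores by multiplying them with the kernel."""
    return [sum(k * y for k, y in zip(row, scores)) for row in kernel]


nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("A", "D")]
kernel = regularised_laplacian_kernel_sketch(nodes, edges)

# Only node A carries an input score; diffusion spreads it to its neighbours.
diffused = dict(zip(nodes, raw_diffusion_sketch(kernel, [1.0, 0.0, 0.0, 0.0])))
print(diffused)
```

In this picture, swapping the kernel function changes how the graph is summarised, while swapping the diffusion function changes how the input scores are combined with that summary.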

Command Line Interface

The following commands can be used directly from your terminal:

1. Run a diffusion analysis

The following command will run a diffusion method on a given network with the given data. More information here.

$ python3 -m diffupy diffuse --network=<path-to-network-file> --data=<path-to-data-file> --method=<method>

2. Generate a kernel with one of the seven methods implemented

The following command generates the regularised Laplacian kernel of a given graph. More information in the documentation.

$ python3 -m diffupy kernel --network=<path-to-network-file>
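
When the CLI is driven from a script, it can help to assemble the argument list programmatically. The sketch below only builds the diffuse command from the three options shown above. The helper name `build_diffuse_command` and the file paths are hypothetical, and the actual call is left commented out because it requires DiffuPy to be installed.

```python
import subprocess
import sys


def build_diffuse_command(network_path, data_path, method="z"):
    """Assemble the argv list for `python -m diffupy diffuse` (hypothetical helper)."""
    return [
        sys.executable, "-m", "diffupy", "diffuse",
        f"--network={network_path}",
        f"--data={data_path}",
        f"--method={method}",
    ]


command = build_diffuse_command("network.csv", "input_scores.csv", method="raw")
print(" ".join(command[1:]))

# Uncomment to run the analysis (requires DiffuPy to be installed):
# subprocess.run(command, check=True)
```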

Formatting

Before running diffusion algorithms on your network using DiffuPy, take into account the graph and input data/scores formats. The sections below specify the supported formats and give samples of input scores and networks.

Input format

The input is preprocessed and mapped before the diffusion. See input mapping or the process_input docs for further details. The input formats covered by this preprocessing are described below.

Scores

You can submit your dataset in any of the following formats:

  • CSV (.csv)
  • TSV (.tsv)
  • pandas.DataFrame
  • List
  • Dictionary

(check Input dataset examples)

So you can either provide a path to a .csv or .tsv file:

from diffupy.diffuse import run_diffusion

diffusion_scores_from_file = run_diffusion('~/data/diffusion_scores.csv', network)

or pass a Python data structure as the input_scores parameter:

import pandas as pd

data = {'Node':  ['A', 'B',...],
      'NodeType': ['Metabolite', 'Gene',...],
       ....
      }
df = pd.DataFrame(data, columns = ['Node','NodeType',...])

diffusion_scores_from_dict = run_diffusion(df, network)

Please ensure that the dataset minimally has a column 'Node' containing node IDs. You can also optionally add the following columns to your dataset:

  • NodeType
  • LogFC [*]
  • p-value
[*] Log2 fold change

Networks

If you would like to submit your own networks, please ensure they are in one of the following formats:

  • BEL (.bel)
  • CSV (.csv)
  • Edge list (.lst)
  • GML (.gml or .xml)
  • GraphML (.graphml or .xml)
  • Pickle (.pickle). BELGraph object from PyBEL 0.13.2
  • TSV (.tsv)
  • TXT (.txt)

At a minimum, please ensure that each of the following columns is included in the network file you submit:

  • Source
  • Target

Optionally, you can choose to add a third column, "Relation" in your network (as in the example below). If the relation between the Source and Target nodes is omitted, and/or if the directionality is ambiguous, either node can be assigned as the Source or Target.

Kernel

If you have a precalculated kernel, you can provide it directly as the network parameter without needing to also provide a graph object.

Input dataset examples

DiffuPy accepts several input formats, which can be encoded in different ways. See the diffusion scores summary for more details on how the input labels are treated according to each available method.

1. You can provide a dataset with a column 'Node' containing node IDs.

Node
A
B
C
D

from diffupy.diffuse import run_diffusion

diffusion_scores = run_diffusion(dataframe_nodes, network)

Also as a list of nodes:

['A', 'B', 'C', 'D']
diffusion_scores = run_diffusion(['A', 'B', 'C', 'D'], network)

2. You can also provide a dataset with a column 'Node' containing node IDs as well as a column 'NodeType', indicating the entity type of the node to run diffusion by entity type.

Node NodeType
A Gene
B Gene
C Metabolite
D Gene

Also as a dictionary of type:list of nodes:

{'Gene': ['A', 'B', 'D'], 'Metabolite': ['C']}
diffusion_scores = run_diffusion({'Gene': ['A', 'B', 'D'], 'Metabolite': ['C']}, network)
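
The table and the dictionary above carry the same information. As a minimal sketch in plain Python (the helper name `rows_to_type_dict` is hypothetical and is not part of DiffuPy), the tabular form can be regrouped into the type-to-nodes dictionary like this:

```python
from collections import defaultdict


def rows_to_type_dict(rows):
    """Group (node, node_type) rows into {node_type: [node, ...]} (hypothetical helper)."""
    grouped = defaultdict(list)
    for node, node_type in rows:
        grouped[node_type].append(node)
    return dict(grouped)


# Rows of the example table: one (Node, NodeType) pair per line.
rows = [("A", "Gene"), ("B", "Gene"), ("C", "Metabolite"), ("D", "Gene")]
labels_by_type = rows_to_type_dict(rows)
print(labels_by_type)
```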

3. You can also choose to provide a dataset with a column 'Node' containing node IDs as well as a column 'LogFC' with their log2 fold changes. You may also add a 'NodeType' column to run diffusion by entity type.

Node LogFC
A 4
B -1
C 1.5
D 3

Also as a dictionary of node:score_value:

{'A': 4, 'B': -1, 'C': 1.5, 'D': 3}
diffusion_scores = run_diffusion({'A': 4, 'B': -1, 'C': 1.5, 'D': 3}, network)

Combining this with point 2, you can also indicate the node type:

Node LogFC NodeType
A 4 Gene
B -1 Gene
C 1.5 Metabolite
D 3 Gene

Also as a dictionary of type:node:score_value:

{'Gene': {'A': 4, 'B': -1, 'D': 3}, 'Metabolite': {'C': 1.5}}

diffusion_scores = run_diffusion({'Gene': {'A': 4, 'B': -1, 'D': 3}, 'Metabolite': {'C': 1.5}}, network)

4. Finally, you can provide a dataset with a column 'Node' containing node IDs, a column 'LogFC' with their log2 fold changes and a column 'p-value' with adjusted p-values. You may also add a 'NodeType' column to run diffusion by entity type.

Node LogFC p-value
A 4 0.03
B -1 0.05
C 1.5 0.001
D 3 0.07

From Python, this format is only accepted as a pandas DataFrame.

See the sample datasets directory for example files.

Custom-network example

Source Target Relation
A B Increase
B C Association
A D Association

You can also take a look at our sample networks folder for some examples.
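
To try the custom-network layout, the example table can be written out as a CSV file with the standard library. This is a minimal sketch: the file name `custom_network.csv` is hypothetical, and only the Source, Target and Relation columns described above are assumed.

```python
import csv
import os
import tempfile

edges = [
    ("A", "B", "Increase"),
    ("B", "C", "Association"),
    ("A", "D", "Association"),
]

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    network_path = os.path.join(tmp_dir, "custom_network.csv")
    with open(network_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target", "Relation"])
        writer.writerows(edges)

    # Read the file back to confirm the layout round-trips.
    with open(network_path, newline="") as handle:
        loaded = list(csv.DictReader(handle))

print(loaded[0])
```

For real use, write to a permanent path instead and pass that path as the network argument.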

Input Mapping/Coverage

Although it does not affect how you supply the input, the mapping of input entities onto the background network matters when assessing a diffusion run, because the coverage of the input determines which entity scores are actually diffused. In other words, only entities whose labels match a node in the network are processed further for diffusion.

A diffusion run reports the mapping as follows:

Mapping descriptive statistics

wikipathways:
gene_nodes  (474 mapped entities, 15.38% input coverage)
mirna_nodes  (2 mapped entities, 4.65% input coverage)
metabolite_nodes  (12 mapped entities, 75.0% input coverage)
bp_nodes  (1 mapped entities, 0.45% input coverage)
total  (489 mapped entities, 14.54% input coverage)

kegg:
gene_nodes  (1041 mapped entities, 33.80% input coverage)
mirna_nodes  (3 mapped entities, 6.98% input coverage)
metabolite_nodes  (6 mapped entities, 0.375% input coverage)
bp_nodes  (12 mapped entities, 5.36% input coverage)
total  (1062 mapped entities, 31.58% input coverage)

reactome:
gene_nodes  (709 mapped entities, 23.02% input coverage)
mirna_nodes  (1 mapped entities, 2.33% input coverage)
metabolite_nodes  (6 mapped entities, 37.5% input coverage)
total  (716 mapped entities, 22.8% input coverage)

total:
gene_nodes  (1461 mapped entities, 43.44% input coverage)
mirna_nodes  (4 mapped entities, 0.12% input coverage)
metabolite_nodes  (13 mapped entities, 0.38% input coverage)
bp_nodes  (13 mapped entities, 0.39% input coverage)
total  (1491 mapped entities, 44.34% input coverage)
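
The counts and percentages in a report of this kind can be read as set overlaps between input labels and network nodes. The sketch below assumes that coverage means mapped entities divided by the distinct input entities of that type. The helper name `mapping_coverage` and the toy labels are hypothetical, and this is not DiffuPy's own mapping code.

```python
def mapping_coverage(input_labels, network_nodes):
    """Return (mapped labels, percentage of the input found in the network)."""
    unique_labels = set(input_labels)
    mapped = unique_labels & set(network_nodes)
    coverage = 100.0 * len(mapped) / len(unique_labels) if unique_labels else 0.0
    return mapped, coverage


network_nodes = {"A", "B", "C", "D"}
input_by_type = {"Gene": ["A", "B", "X", "Y"], "Metabolite": ["C"]}

# Labels X and Y are absent from the network, so they would not be diffused.
for node_type, labels in input_by_type.items():
    mapped, coverage = mapping_coverage(labels, network_nodes)
    print(f"{node_type}  ({len(mapped)} mapped entities, {coverage:.2f}% input coverage)")
```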

To see the mapping coverage graphically, you can also plot a heatmap view of the mapping (see views). To see how the mapping is performed within an input preprocessing pipeline, take a look at this Jupyter Notebook or see the process_input docs in DiffuPy.

Output format

The returned format is a custom Matrix type, with node labels as rows and a column with the diffusion score, which can be exported into the following formats:

diffusion_scores.to_dict()
diffusion_scores.as_pd_dataframe()
diffusion_scores.as_csv()
diffusion_scores.to_nx_graph()

Disclaimer

DiffuPy is scientific software that has been developed in an academic capacity, and thus comes with no warranty or guarantee of maintenance, support, or back-up of data.