rewriting paper

clreda · Sep 28, 2023 · 10e0dec
1 parent 89f61e5
commit 10e0dec

Showing 4 changed files with 80 additions and 38 deletions.
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -1,6 +1,9 @@
 name: NORDic post-pushing testing
 
-on: [push]
+#on: [push]
+on:
+  release:
+    types: [published]
 
 jobs:
   build:

diff --git a/docs/index.rst b/docs/index.rst
@@ -13,6 +13,7 @@ Being able to build in an automated and reproducible way a model of gene interac
 .. toctree::
    :maxdepth: 4
 
+   paper
    install
    content
    modules

diff --git a/docs/paper.rst b/docs/paper.rst
@@ -0,0 +1,65 @@
+Introduction to NORDic
+----------------------
+
+Genes, proteins and messenger RNAs have been shown to interact with one another in order to modulate gene activity. Conversely, gene activity impacts protein production, and consequently triggers the chemical reactions needed for survival in healthy individuals. As such, perturbations of these gene regulatory interactions, through (epi)genetic and/or environmental factors, might cause diseases: e.g., the suppression of the activity of gene SCN1A is linked to a specific type of epilepsy called Dravet syndrome, both in mice and humans `[Kalume et al., 2013] <https://doi.org/10.1172/JCI66220>`_. Gene regulatory networks, which are graphs connecting biological entities according to their known regulatory interactions, are useful models that enable a better understanding of those regulatory mechanisms `[Karlebach and Shamir, 2008] <https://doi.org/10.1038/nrm2503>`_.
+
+In particular, one type of gene regulatory network, called a Boolean network, allows the definition of so-called regulatory functions `[Thomas, 1973] <https://doi.org/10.1016/0022-5193(74)90172-6>`_; `[Kauffman, 1969] <https://doi.org/10.1016/0022-5193(69)90015-0>`_. Those functions are specific to each node in the graph, and determine the activity of this node according to its regulators. They are defined on the Boolean domain (**True** or **False**), meaning that we only consider binary gene activities. Consequently, studying this type of network as a dynamical system (and determining its basins of attraction, for instance) remains rather tractable `[Moon, Lee and Pauleve, 2022] <https://doi.org/10.48550/arXiv.2212.12756>`_. The potential applications are numerous. Taking into account network dynamics should improve tools originally developed using non-Boolean networks, such as the Transcription Factor Influence score in CoRegNet `[Nicolle, Radvanyi and Elati, 2015] <https://doi.org/10.1093/bioinformatics/btv305>`_ for the identification of interesting biomarkers, or for drug repurposing.
+
+For the latter, a library of druggable molecules can be screened for good candidates based on the paradigm of "signature reversion": a good drug candidate should be able to "reverse" the gene activity profile
+associated with a diseased individual `[Duan et al., 2016] <https://doi.org/10.1038/npjsba.2016.15>`_; `[Delahaye et al., 2016] <https://doi.org/10.1186/s13059-016-1097-7>`_; `[Musa et al., 2018] <https://doi.org/10.1093/bib/bbw112>`_. That is, such a drug would stimulate genes that are abnormally weakly activated with respect to healthy individuals, and inhibit genes that are abnormally strongly activated. This screening approach may considerably speed up drug development, especially for rare or neglected tropical diseases `[Walker et al., 2021] <https://doi.org/10.1093/cid/ciab350>`_.
+
+However, the construction and analysis of Boolean networks become extremely tedious and time-consuming in the absence of experimental data or when considering the activity of a large number of genes at a time `[Collombet et al., 2017] <https://doi.org/10.1073/pnas.1610622114>`_.
+
+Moreover, the identification of interesting drug targets, via the detection of master regulators --genes at the top of the regulatory hierarchy-- suffers from two limitations. First, detection methods often do not exploit the full network topology, and are thus oblivious to transcriptional regulatory cascades, which might account for unexpected toxic side effects `[Bolouri et al., 2003] <https://doi.org/10.1073/pnas.1533293100>`_; `[Huang et al., 2019] <https://doi.org/10.1038/s41598-019-54180-4>`_. Second, those detection methods might not take into account the gene activity information relative to diseased patients.
+
+Finally, regulatory mechanisms at the (post-)transcriptomic level are inherently stochastic `[Raj and Van Oudenaarden, 2008] <https://doi.org/10.1016/j.cell.2008.09.050>`_. As a consequence, naive algorithms for Boolean network-based *in silico* drug repurposing rely on testing a given drug a large number of times, in order to get a good estimate of its effect on gene activity. Such methods might resort to simulating drug treatment on a Boolean network, either in a patient-specific approach `[Montagud et al., 2022] <https://doi.org/10.1101/2020.05.27.119016>`_, or by ignoring the stochastic part of gene regulation. In both cases, this might reduce the robustness of the recommendations. Indeed, those approaches do not provide clear guarantees on the probability of error in recommendation, and might not be sample-efficient. In addition, they do not take advantage of supplementary information on drugs which might help to test drugs more efficiently (e.g., leveraging similarities between drugs to infer their effects on gene activity).
+
+Statement of need
+::::::::::::::::::
+
+As a general rule, the development of **NORDic** relies on avoiding *ad hoc* solutions, by implementing approaches which are relevant to all kinds of diseases regardless of the level of knowledge present in the literature --contrary to approaches which rely on knowing the relation between membrane receptors and a set of genes whose activity characterizes the presence of the disease, for instance "Causal Reasoning Analytical Framework for Target discovery" (CRAFT) `[Srivastava et al., 2018] <https://doi.org/10.1038/s41467-018-06008-4>`_. Solutions proposed in this package emphasize, first, the modularity of the methods, by providing functions which can tackle different types of regulatory dynamics for instance; second, the transparency of the approaches, by allowing the fine-tuning of each method through parameters with a clearly defined impact on the result.
+
+Automated identification of disease-related Boolean networks (NORDic NI)
+:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Most prior works on building Boolean networks assume the existence of a Prior Knowledge Network (PKN) --that is, a preselected set of known regulatory interactions among genes of interest-- and/or a set of perturbation experiments, where the activity of a subset of genes is measured after a single gene perturbation. Those works then propose approaches to automatically infer a Boolean network from these data (e.g., PROFILE `[Beal et al., 2021] <https://doi.org/10.1371/journal.pcbi.1007900>`_, or BoolNet `[Mussel, Hopfensitz and Kestler, 2010] <https://doi.org/10.1093/bioinformatics/btq124>`_), by studying gene activity correlations (ARACNE `[Margolin et al., 2006] <https://doi.org/10.1186/1471-2105-7-S1-S7>`_, *parmigene* `[Sales and Romualdi, 2011] <https://doi.org/10.1093/bioinformatics/btr274>`_), or through answer-set programming, having converted experimental measures into a set of Boolean constraints (BoneSiS `[Chevalier et al., 2019] <https://doi.org/10.1109/ictai.2019.00014>`_, Re:IN `[Yordanov et al., 2016] <https://doi.org/10.1038/npjsba.2016.10>`_). However, for rare diseases in particular, pinpointing a subset of genes of interest is already a difficult task in itself.
+
+One approach, called CasQ `[Aghamiri et al., 2020] <https://doi.org/10.1093/bioinformatics/btaa484>`_, has specifically proposed a direct, automated conversion from regulatory maps in the MINERVA database `[Gawron et al., 2016] <https://doi.org/10.1038/npjsba.2016.20>`_ to Boolean networks. However, not only does this method need the definition of prior regulatory maps, but it also relies on automatically assigning gene regulatory functions based on the regulators of each gene according to the map. This automated procedure considers a given gene active if and only if all of its reported activating regulators are active, and all of its inhibitory regulators are inactive. However, since this choice does not take into account dynamical information from experiments, the resulting regulatory functions might degrade the quality of gene activity predictions.
+
+Moreover, there exist two hurdles to building Boolean networks which are specific to the Boolean framework. First, gene activity data must be binarized, meaning that one has to decide when a given gene is considered active or inactive in each sample. Such a process leads to an unavoidable loss of information. In order to avoid bias in the inference process, this step should be data-driven and user-controlled. For instance, when using PROFILE `[Beal et al., 2021] <https://doi.org/10.1371/journal.pcbi.1007900>`_, a majority of genes might end up with an undetermined status --meaning that they are considered neither significantly strongly active nor significantly weakly active-- which considerably undermines the input from experimental constraints.
+
+Second, the problem of identifying a Boolean network is usually underdetermined, as there are too few experiments and measurements in practice compared to the size of the considered gene set.
+
+Module **NORDic Network Identification (NI)** addresses these issues in an automated and user-controllable manner, by performing information extraction from large online sources of biological data, and data quality filtering according to user-selected parameters, which control every step of the process. As such, the hope is that **NORDic** makes the generation of disease-specific Boolean networks easier and reproducible, even in the absence of previously curated experiments, prior knowledge networks, or even a set of disease-associated genes. The pipeline implemented in **NORDic** was applied to epilepsy in a preliminary work `[Reda and Delahaye-Duriez, 2022] <https://doi.org/10.1007/978-3-031-15034-0_5>`_.
+
+Prioritization of master regulators in Boolean networks (NORDic PMR)
+:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+
+The identification of master regulators might shed light on the disease onset or on affected biological pathways of interest. Some methods targeted at their detection emphasize the centrality of the gene's location in the network, e.g., by computing a centrality-associated value for each gene in the network, and recommending the top genes. For instance, one might compute the outgoing degrees using the built-in application Network Analysis in Cytoscape `[Shannon et al., 2003] <https://doi.org/10.1101/gr.1239303>`_, or the Control Centrality value `[Liu et al., 2012] <https://doi.org/10.1371/journal.pone.0044459>`_ using the CytoCtrlAnalyser `[Wu et al., 2018] <https://doi.org/10.1093/bioinformatics/btx764>`_ application. Yet those functions only leverage topological information about the network, and do not take into account gene activity data related to the disease. That is, the gene activity context does not impact the genewise values computed on the network.
+
+A notable exception is the work by `[Zerrouck et al., 2020] <https://doi.org/10.1038/s41598-020-73147-4>`_, which considers gene activity data from patients afflicted with rheumatoid arthritis, and computes a gene activity-based influence score using the CoRegNet tool `[Nicolle, Radvanyi and Elati, 2015] <https://doi.org/10.1093/bioinformatics/btv305>`_. However, that computation does not take into account downstream transcriptional cascades `[Bolouri et al., 2003] <https://doi.org/10.1073/pnas.1533293100>`_, that is, regulatory effects which trickle down the network, beyond the targets directly regulated by the gene.
+
+Module **NORDic PMR** detects master regulators in a Boolean network, given examples of gene activity profiles from patients. In contrast to prior works, the score assigned to (groups of) master regulators takes into account the network topology as well as its dynamics with respect to the diseased profiles. The approach, based on a machine learning algorithm solving the influence maximization problem `[Kempe et al., 2003] <https://doi.org/10.1145/956750.956769>`_, is described in `[Reda and Delahaye-Duriez, 2022] <https://doi.org/10.1007/978-3-031-15034-0_5>`_.
+
+Novel approaches for scoring drug effects & repurposing drugs (NORDic DS & DR)
+:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+
+**NORDic** also proposes to tackle two problems related to drug repurposing: first, drug scoring, based on its ability to reverse the diseased gene activity profile (**NORDic DS**); second, the computation of an online sampling procedure which determines which drugs to test during drug screening for repurposing, in order to guarantee a bound on the error in recommendation, while remaining as sample-efficient as possible (**NORDic DR**).
+
+There exist other approaches performing signature reversion, as mentioned in the introduction. However, module **NORDic DS** (since version 2.0 of **NORDic**) is the first to implement drug scoring based on Boolean networks, which can apply to any disease --for instance, it does not need the definition of specific biological phenotypes that should
+be observed after exposure to treatment `[Montagud et al., 2022] <https://doi.org/10.1101/2020.05.27.119016>`_. The method implemented in **NORDic DS** is described in `[Reda, 2022] <https://hal.science/tel-03846072/file/REDA_PhD.pdf>`_.
+
+Similarly, module **NORDic DR** is the first approach that aims at solving the lack of guarantees on the recommendation error. **NORDic DR** relies on bandit algorithms, which are sequential reinforcement learning algorithms that enable the recommendation of the most efficient drugs. Based on Boolean network simulations performed on the fly, those algorithms can adaptively select the next drug to test in order to perform recommendations with as few samples as possible. Algorithms implemented in **NORDic DR** are described and theoretically analyzed in `[Réda, Kaufmann and Delahaye-Duriez, 2021] <https://doi.org/10.48550/arXiv.2103.10070>`_ (for the *m-LinGapE* algorithm), and in `[Réda, Tirinzoni and Degenne, 2021] <https://doi.org/10.48550/arXiv.2111.01479>`_ (for the *MisLid* algorithm).
+
+Extraction of information from large public data sets & simulation module (NORDic UTILS)
+:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Across the four modules present in **NORDic**, helper functions implemented in module **NORDic UTILS** extract and curate data in a transparent way from the LINCS L1000 `[Subramanian et al., 2017] <https://doi.org/10.1016/j.cell.2017.10.049>`_, OmniPath `[Turei et al., 2016] <https://doi.org/10.1038/nmeth.4077>`_, DisGeNet `[Pinero et al., 2016] <https://doi.org/10.1093/nar/gkw943>`_ and STRING `[Szklarczyk et al., 2021] <https://doi.org/10.1093/nar/gkaa1074>`_ databases. **NORDic** also proposes a simulation module, which allows testing the accuracy of the predictions made by the network against known measurements. This module also enables the study and the visualization of the behaviour of the network under various perturbations and types of regulatory dynamics.
+
+Summary
+::::::::::
+
+Building a representation of gene interactions and their influences on gene activity, in an automated and reproducible way, helps to model more complex diseases and biological phenomena on a larger set of genes. These models might speed up the understanding of the gene regulation hierarchy by bioinformaticians and biologists, and allow the prediction of novel drugs or gene targets which might be investigated later for healthcare purposes. In particular, the network-oriented approach might be able to predict off-targets. The **NORDic** Python package aims at tackling those problems, with a focus on reproducibility and modularity. It primarily relies on popular formats for network description files, such
+as the .bnet format. Moreover, **NORDic** enables further study of the network in Cytoscape, by providing a direct conversion to the .sif format, along with a dedicated style file. The different pipelines present in **NORDic** produce intermediary files, which might be checked by the user, and can be fed again to the pipeline in order to reproduce the results.
+
+To get started with the different modules proposed in **NORDic**, please check out the tutorials (Jupyter notebooks) on the GitHub repository `[Reda and Delahaye-Duriez, 2023] <https://doi.org/10.5281/zenodo.7239047>`_, which provide an application to a disease called Congenital Central Hypoventilation Syndrome (CCHS).