Back to before REMOVE

breimanntools · Sep 20, 2023 · 279db56 · 279db56
1 parent ad0253e
commit 279db56
Showing 72 changed files with 2,437 additions and 2 deletions.
diff --git a/aaanalysis/plotting/__init__.py b/aaanalysis/plotting/__init__.py
@@ -0,0 +1,4 @@
+from aaanalysis.plotting.plotting_functions import plot_get_cmap, plot_get_cdict, plot_gcfs, \
+    plot_settings, plot_set_legend
+
+__all__ = ["plot_get_cmap", "plot_get_cdict", "plot_settings", "plot_set_legend", "plot_gcfs"]
diff --git a/aaanalysis/plotting/plotting_functions.py b/aaanalysis/plotting/plotting_functions.py
diff --git a/docs/source/_index/tables.rst b/docs/source/_index/tables.rst
@@ -0,0 +1,246 @@
+..
+    Developer Notes:
+    This is the index file for all tables of the AAanalysis documentation. Each table should be saved in the /tables
+    directory. This file serves as the template for tables.rst, which is automatically created from the information
+    provided here and in the .csv tables from the /tables directory. Add each new table as .csv to the /tables directory,
+    to the overview table at the beginning of this document, and as a new section with a short description in this
+    document. Each column and important data types (e.g., categories) should be described. Each table should contain a
+    'Reference' column.
+    Ignore the 'tables_template.rst: WARNING: document isn't included in any toctree' warning.
+
+Tables
+======================
+
+.. contents::
+    :local:
+    :depth: 1
+
+Overview Table
+--------------
+All tables from the AAanalysis documentation are given here in chronological order of the project history.
+
+.. _0_mapper:
+.. list-table::
+   :header-rows: 1
+   :widths: 8 8 8
+
+   * - Table
+     - Description
+     - See also
+   * - 1_overview_benchmarks
+     - Protein benchmark datasets
+     - aa.load_dataset
+   * - 2_overview_scales
+     - Amino acid scale datasets
+     - aa.load_scales
+
+
+Protein benchmark datasets
+--------------------------
+Three types of benchmark datasets are provided:
+
+- Residue prediction (AA): Datasets used to predict residue-specific (amino acid) properties.
+- Domain prediction (DOM): Datasets used to predict domain-specific properties.
+- Sequence prediction (SEQ): Datasets used to predict sequence-specific properties.
+
+The type of each dataset is indicated by the first part of its name, followed by an abbreviation for the
+specific dataset (e.g., 'AA_LDR', 'DOM_GSEC', 'SEQ_AMYLO'). For some datasets, an additional version is provided
+for positive-unlabeled (PU) learning, containing only positive (1) and unlabeled (2) data samples, as indicated by
+*dataset_name_PU* (e.g., 'DOM_GSEC_PU').
+
+.. _1_overview_benchmarks:
+.. list-table::
+   :header-rows: 1
+   :widths: 8 8 8 8 8 8 8 8 8 8
+
+   * - Level
+     - Dataset
+     - # Sequences
+     - # Amino acids
+     - # Positives
+     - # Negatives
+     - Predictor
+     - Description
+     - Reference
+     - Label
+   * - Amino acid
+     - AA_CASPASE3
+     - 233
+     - 185605
+     - 705
+     - 184900
+     - PROSPERous
+     - Prediction of caspase-3 cleavage site
+     - :ref:`Song18 <Song18>`
+     - 1 (adjacent to cleavage site), 0 (not adjacent to cleavage site)
+   * - Amino acid
+     - AA_FURIN
+     - 71
+     - 59003
+     - 163
+     - 58840
+     - PROSPERous
+     - Prediction of furin cleavage site
+     - :ref:`Song18 <Song18>`
+     - 1 (adjacent to cleavage site), 0 (not adjacent to cleavage site)
+   * - Amino acid
+     - AA_LDR
+     - 342
+     - 118248
+     - 35469
+     - 82779
+     - IDP-Seq2Seq
+     - Prediction of long intrinsically disordered regions (LDR)
+     - :ref:`Tang20 <Tang20>`
+     - 1 (disordered), 0 (ordered)
+   * - Amino acid
+     - AA_MMP2
+     - 573
+     - 312976
+     - 2416
+     - 310560
+     - PROSPERous
+     - Prediction of Matrix metallopeptidase-2 (MMP2) cleavage site
+     - :ref:`Song18 <Song18>`
+     - 1 (adjacent to cleavage site), 0 (not adjacent to cleavage site)
+   * - Amino acid
+     - AA_RNABIND
+     - 221
+     - 55001
+     - 6492
+     - 48509
+     - GMKSVM-RU
+     - Prediction of RNA-binding protein residues (RBP60 dataset)
+     - :ref:`Yang21 <Yang21>`
+     - 1 (binding), 0 (non-binding)
+   * - Amino acid
+     - AA_SA
+     - 233
+     - 185605
+     - 101082
+     - 84523
+     - PROSPERous
+     - Prediction of solvent accessibility (SA) of residues (AA_CASPASE3 dataset)
+     - :ref:`Song18 <Song18>`
+     - 1 (exposed/accessible), 0 (buried/non-accessible)
+   * - Sequence
+     - SEQ_AMYLO
+     - 1414
+     - 8484
+     - 511
+     - 903
+     - ReRF-Pred
+     - Prediction of amyloidogenic regions
+     - :ref:`Teng21 <Teng21>`
+     - 1 (amyloidogenic), 0 (non-amyloidogenic)
+   * - Sequence
+     - SEQ_CAPSID
+     - 7935
+     - 3364680
+     - 3864
+     - 4071
+     - VIRALpro
+     - Prediction of capsid proteins
+     - :ref:`Galiez16 <Galiez16>`
+     - 1 (capsid protein), 0 (non-capsid protein)
+   * - Sequence
+     - SEQ_DISULFIDE
+     - 2547
+     - 614470
+     - 897
+     - 1650
+     - Dipro
+     - Prediction of disulfide bridges in sequences
+     - :ref:`Cheng06 <Cheng06>`
+     - 1 (sequence with SS bond), 0 (sequence without SS bond)
+   * - Sequence
+     - SEQ_LOCATION
+     - 1835
+     - 732398
+     - 1045
+     - 790
+     - n/a
+     - Prediction of subcellular location of protein (cytoplasm vs plasma membrane)
+     - :ref:`Shen19 <Shen19>`
+     - 1 (protein in cytoplasm), 0 (protein in plasma membrane)
+   * - Sequence
+     - SEQ_SOLUBLE
+     - 17408
+     - 4432269
+     - 8704
+     - 8704
+     - SOLpro
+     - Prediction of soluble and insoluble proteins
+     - :ref:`Magnan09 <Magnan09>`
+     - 1 (soluble), 0 (insoluble)
+   * - Sequence
+     - SEQ_TAIL
+     - 6668
+     - 2671690
+     - 2574
+     - 4094
+     - VIRALpro
+     - Prediction of tail proteins
+     - :ref:`Galiez16 <Galiez16>`
+     - 1 (tail protein), 0 (non-tail protein)
+   * - Domain
+     - DOM_GSEC
+     - 126
+     - 92964
+     - 63
+     - 63
+     - n/a
+     - Prediction of gamma-secretase substrates
+     - :ref:`Breimann23c <Breimann23c>`
+     - 1 (substrate), 0 (non-substrate)
+   * - Domain
+     - DOM_GSEC_PU
+     - 694
+     - 494524
+     - 63
+     - 0
+     - n/a
+     - Prediction of gamma-secretase substrates (PU dataset)
+     - :ref:`Breimann23c <Breimann23c>`
+     - 1 (substrate), 2 (unknown substrate status)
+
+
+Amino acid scale datasets
+-------------------------
+Several amino acid scale datasets are provided:
+
+.. _2_overview_scales:
+.. list-table::
+   :header-rows: 1
+   :widths: 8 8 8 8
+
+   * - Dataset
+     - Description
+     - # Scales
+     - Reference
+   * - scales
+     - Amino acid scales (min-max normalized)
+     - 586
+     - :ref:`Breimann23b <Breimann23b>`
+   * - scales_raw
+     - Amino acid scales (raw values)
+     - 586
+     - :ref:`Kawashima08 <Kawashima08>`
+   * - scales_classification
+     - Classification of scales (AAontology)
+     - 586
+     - :ref:`Breimann23b <Breimann23b>`
+   * - scales_pc
+     - Principal component (PC) compressed scales
+     - 20
+     - :ref:`Breimann23a <Breimann23a>`
+   * - top60
+     - Top 60 scale subsets
+     - 60
+     - :ref:`Breimann23a <Breimann23a>`
+   * - top60_eval
+     - Evaluation of top 60 scale subsets
+     - 60
+     - :ref:`Breimann23a <Breimann23a>`
+
+
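Note on the naming convention described in the "Protein benchmark datasets" section above (level prefix, dataset abbreviation, optional `_PU` suffix): the rule can be sketched in a few lines of plain Python. This is only an illustration of the convention as documented; `parse_dataset_name` and `DatasetName` are hypothetical names and are not part of the AAanalysis API.

```python
from typing import NamedTuple

# Level prefixes as listed in the "Protein benchmark datasets" section.
LEVELS = {"AA": "Amino acid", "DOM": "Domain", "SEQ": "Sequence"}
PU_SUFFIX = "_PU"


class DatasetName(NamedTuple):
    level: str         # e.g. "Domain"
    abbreviation: str  # e.g. "GSEC"
    is_pu: bool        # True for positive-unlabeled (PU) variants


def parse_dataset_name(name: str) -> DatasetName:
    """Split a name such as 'DOM_GSEC_PU' into its parts (hypothetical helper)."""
    prefix, sep, rest = name.partition("_")
    if not sep or prefix not in LEVELS:
        raise ValueError(f"Unrecognized dataset name: {name!r}")
    is_pu = rest.endswith(PU_SUFFIX)
    abbreviation = rest[: -len(PU_SUFFIX)] if is_pu else rest
    if not abbreviation:
        raise ValueError(f"Missing dataset abbreviation in: {name!r}")
    return DatasetName(LEVELS[prefix], abbreviation, is_pu)
```

For example, `parse_dataset_name("DOM_GSEC_PU")` yields the level "Domain", the abbreviation "GSEC", and the PU flag set, matching the DOM_GSEC_PU row of the benchmark table.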
diff --git a/docs/source/_index/tables/0_mapper.xlsx b/docs/source/_index/tables/0_mapper.xlsx
diff --git a/docs/source/_index/tables/1_overview_benchmarks.xlsx b/docs/source/_index/tables/1_overview_benchmarks.xlsx
diff --git a/docs/source/_index/tables/2_overview_scales.xlsx b/docs/source/_index/tables/2_overview_scales.xlsx
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -9,7 +9,7 @@
 
 sys.path.append(os.path.abspath('.'))
 
-#from create_tables_doc import generate_table_rst
+from create_tables_doc import generate_table_rst
 
 # -- Path and Platform setup --------------------------------------------------
 SEP = "\\" if platform.system() == "Windows" else "/"
@@ -172,7 +172,7 @@
 ]
 
 # Create table.rst
-#generate_table_rst()
+generate_table_rst()
 
 # -- Linkcode configuration ---------------------------------------------------
 _module_path = os.path.dirname(importlib.util.find_spec("aaanalysis").origin)  # type: ignore
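
Note on the re-enabled `generate_table_rst()` call: its implementation in `create_tables_doc` is not part of this diff. As a rough illustration of the idea only, a `list-table` block like the ones in tables.rst could be rendered from CSV text as sketched below. `csv_to_list_table` is a hypothetical helper, and the fixed column width of 8 simply mirrors the `:widths:` values seen in the generated file; neither is taken from the actual implementation.

```python
import csv
import io


def csv_to_list_table(csv_text: str, label: str, width: int = 8) -> str:
    """Render CSV text as an RST list-table with one header row (hypothetical helper)."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("CSV text contains no rows")
    n_cols = len(rows[0])
    lines = [
        f".. _{label}:",
        ".. list-table::",
        "   :header-rows: 1",
        "   :widths: " + " ".join([str(width)] * n_cols),
        "",
    ]
    for row in rows:
        if len(row) != n_cols:
            raise ValueError(f"Expected {n_cols} cells, got {len(row)}: {row}")
        # The first cell opens a table row ('* -'); the remaining cells continue it.
        lines.append(f"   * - {row[0]}")
        lines.extend(f"     - {cell}" for cell in row[1:])
    return "\n".join(lines) + "\n"
```

Calling it with the header and first row of the overview table and the label `0_mapper` produces the same layout as the `.. _0_mapper:` block in tables.rst.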

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -5,6 +5,7 @@
 
 Welcome to the AAanalysis documentation
 =======================================
+.. include:: index/badges.rst
 .. include:: index/overview.rst
 
 Install
@@ -24,12 +25,14 @@ Install
    :caption: OVERVIEW
 
    index/introduction.rst
+   index/usage_principles.rst
    index/CONTRIBUTING_COPY.rst
 
 .. toctree::
    :maxdepth: 1
    :caption: EXAMPLES
 
+   tutorials.rst
 
 .. toctree::
    :maxdepth: 2
@@ -40,6 +43,7 @@ Install
 .. toctree::
    :maxdepth: 1
 
+   _index/tables.rst
    index/references.rst
 
 Indices and tables

diff --git a/docs/source/index/tables_template.rst b/docs/source/index/tables_template.rst
@@ -0,0 +1,44 @@
+..
+    Developer Notes:
+    This is the index file for all tables of the AAanalysis documentation. Each table should be saved in the /tables
+    directory. This file serves as the template for tables.rst, which is automatically created from the information
+    provided here and in the .csv tables from the /tables directory. Add each new table as .csv to the /tables directory,
+    to the overview table at the beginning of this document, and as a new section with a short description in this
+    document. Each column and important data types (e.g., categories) should be described. Each table should contain a
+    'Reference' column.
+    Ignore the 'tables_template.rst: WARNING: document isn't included in any toctree' warning.
+
+Tables
+======================
+
+.. contents::
+    :local:
+    :depth: 1
+
+Overview Table
+--------------
+All tables from the AAanalysis documentation are given here in chronological order of the project history.
+
+.. _0_mapper:
+
+Protein benchmark datasets
+--------------------------
+Three types of benchmark datasets are provided:
+
+- Residue prediction (AA): Datasets used to predict residue-specific (amino acid) properties.
+- Domain prediction (DOM): Datasets used to predict domain-specific properties.
+- Sequence prediction (SEQ): Datasets used to predict sequence-specific properties.
+
+The type of each dataset is indicated by the first part of its name, followed by an abbreviation for the
+specific dataset (e.g., 'AA_LDR', 'DOM_GSEC', 'SEQ_AMYLO'). For some datasets, an additional version is provided
+for positive-unlabeled (PU) learning, containing only positive (1) and unlabeled (2) data samples, as indicated by
+*dataset_name_PU* (e.g., 'DOM_GSEC_PU').
+
+.. _1_overview_benchmarks:
+
+Amino acid scale datasets
+-------------------------
+Several amino acid scale datasets are provided:
+
+.. _2_overview_scales:
+
diff --git a/docs/source/index/usage_principles.rst b/docs/source/index/usage_principles.rst
@@ -0,0 +1,22 @@
+.. Developer Notes:
+    This is the index file for usage principles. Files for each part are saved in the /usage_principles directory,
+    and an overview of the AAanalysis package is given as a component diagram (internal dependencies) and a context
+    diagram (external dependencies). Always give concise code examples reflecting the usage examples. Instead of
+    including comprehensive tables here, add them to tables.rst and refer to them with a short explanation.
+
+Usage Principles
+================
+Import AAanalysis as:
+
+.. code-block:: python
+
+    import aaanalysis as aa
+
+.. toctree::
+   :maxdepth: 1
+
+   usage_principles/data_flow_entry_points
+   usage_principles/aaontology
+   usage_principles/feature_identification
+   usage_principles/pu_learning
+   usage_principles/xai
diff --git a/docs/source/index/usage_principles/aaontology.rst b/docs/source/index/usage_principles/aaontology.rst
@@ -0,0 +1,5 @@
+AAontology: Classification of amino acid scales
+===============================================
+
+AAontology is a two-level classification of amino acid scales, introduced in.
+
diff --git a/docs/source/index/usage_principles/data_flow_entry_points.rst b/docs/source/index/usage_principles/data_flow_entry_points.rst
@@ -0,0 +1,8 @@
+Data Flow and Entry Points
+==========================
+
+The AAanalysis toolkit passes data between its steps as DataFrames. The entry points are DataFrames containing amino
+acid scale information (df_scales, df_cat) or sequence information (df_seq); the latter can be modified to obtain
+specific sequence parts (df_parts). Amino acid scales and sequence parts, together with split settings, are the input
+for the CPP algorithm, which creates various physicochemical features (df_feat) by comparing two sets of protein sequences.
+
diff --git a/docs/source/index/usage_principles/feature_identification.rst b/docs/source/index/usage_principles/feature_identification.rst
@@ -0,0 +1,7 @@
+Identifying Physicochemical Signatures using CPP
+================================================
+
+The central algorithm of the AAanalysis platform is Comparative Physicochemical Profiling (CPP), a novel sequence-based
+feature engineering algorithm, designed to enable interpretable protein prediction.
+
+