Update conf.py and CONTRIBUTING.rst and test

breimanntools · Sep 19, 2023 · 18edaeb
1 parent c376adb
Showing 115 changed files with 1,306 additions and 401 deletions.
diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst
@@ -1,3 +1,9 @@
+.. Developer Notes:
+    This file is a summary of sources and conventions for software development in Python used in this project.
+    Aims of the project and our naming conventions are defined in 'Vision' and 'Documentation', respectively.
+    Please modify only CONTRIBUTING.rst, then update /docs/source/index/CONTRIBUTING_COPY.rst by copy-pasting;
+    the copy is used for the readthedocs documentation.
+
 ============
 Contributing
 ============
@@ -7,7 +13,7 @@ Contributing
   :depth: 1
 
 Introduction
-------------
+============
 
 Welcome and thank you for considering a contribution to AAanalysis! We are an open-source project focusing on
 interpretable protein prediction. Your involvement is invaluable to us. Contributions can be made in the following ways:
@@ -17,13 +23,13 @@ interpretable protein prediction. Your involvement is invaluable to us. Contribu
 - Participating in project discussions.
 
 Newcomers can start by tackling issues labeled `good first issue <https://github.com/breimanntools/aaanalysis/issues>`_.
-
+For further questions or suggestions, please email stephanbreimann@gmail.com.
 
 Vision
-------
+======
 
 Objectives
-^^^^^^^^^^
+----------
 
 - Establish a comprehensive toolkit for interpretable, sequence-based protein prediction.
 - Enable robust learning from small and unbalanced datasets, common in life sciences.
@@ -32,14 +38,14 @@ Objectives
 - Offer flexible interoperability with other Python packages like `biopython <https://biopython.org/>`_.
 
 Non-goals
-^^^^^^^^^
+---------
 
 - Reimplementation of existing solutions.
 - Ignoring the biological context.
 - Reliance on opaque, black-box models.
 
 Principles
-^^^^^^^^^^
+----------
 
 - Algorithms should be biologically inspired and combine empirical insights with cutting-edge computational methods.
 - We emphasize fair, accountable, and transparent machine learning, as detailed
@@ -49,7 +55,7 @@ Principles
 
 
 Bug Reports
------------
+===========
 
 For effective bug reports, please include a Minimal Reproducible Example (MRE):
 
@@ -61,10 +67,10 @@ Further guidelines can be found `here <https://matthewrocklin.com/minimal-bug-re
 
 
 Installation
-------------
+============
 
 Latest Version
-^^^^^^^^^^^^^^
+--------------
 
 To test the latest development version, you can use pip:
 
@@ -73,7 +79,7 @@ To test the latest development version, you can use pip:
   pip install git+https://github.com/breimanntools/aaanalysis.git@master
 
 Local Development Environment
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------------------
 
 Fork and Clone
 """"""""""""""
@@ -122,7 +128,7 @@ This will execute all the test cases in the tests/ directory.
 
 
 Pull Requests
--------------
+=============
 
 For substantial changes, start by opening an issue for discussion. For minor changes like typos, submit a pull request directly.
 
@@ -133,21 +139,21 @@ Ensure your pull request:
 - Is up-to-date with the master branch and passes all tests.
 
 Preview Changes
-^^^^^^^^^^^^^^^
+---------------
 
 To preview documentation changes in pull requests, follow the "docs/readthedocs.org" check link under "All checks have passed".
 
 
 Documentation
--------------
+=============
 
 Documentation is a crucial part of the project. If you make any modifications to the documentation,
 please ensure they render correctly.
 
 Naming Conventions
-^^^^^^^^^^^^^^^^^^^
+------------------
 
-We strive for interface consistency with well-established libraries like
+We strive for consistency of our public interfaces with well-established libraries like
 `scikit-learn <https://scikit-learn.org/stable/>`_, `pandas <https://pandas.pydata.org/>`_,
 `matplotlib <https://matplotlib.org/>`_, and `seaborn <https://seaborn.pydata.org/>`_.
 
@@ -162,7 +168,11 @@ We primarily use two class templates for organizing our codebase:
 - **Tool**: Standalone classes that focus on specialized tasks, such as feature engineering for protein prediction.
   They feature `.run` and `.eval` methods to carry out the complete processing pipeline and generate various evaluation metrics.
 
-Both `Wrapper` and `Tool` classes come with supplementary plotting classes for visualization.
+The remaining classes should fulfill two further purposes, without being directly implemented using class inheritance.
+
+- **Data visualization**: Supplementary plotting classes for `Wrapper` and `Tool` classes, named accordingly using a
+  `Plot` suffix (e.g., 'CPPPlot'). These classes implement an `.eval` method to visualize the key evaluation measures.
+- **Analysis support**: Supportive pre-processing classes for `Wrapper` and `Tool` classes.
 
 Function and Method Naming
 """"""""""""""""""""""""""
@@ -172,30 +182,38 @@ processing data values should correspond with the names specified in our primary
 `aaanalysis/_utils/_utils_constants.py`.
 
 Code Philosophy
-^^^^^^^^^^^^^^^
+---------------
 
 We aim for a modular, robust, and easily extendable codebase. Therefore, we adhere to flat class hierarchies
 (i.e., only inheriting from `Wrapper` or `Tool` is recommended) and functional programming principles, as outlined in
 `A Philosophy of Software Design <https://dl.acm.org/doi/10.5555/3288797>`_.
-We also prioritize user-friendly interfaces, complete with descriptive error messages and
-`Python type hints <https://docs.python.org/3/library/typing.html>`_, comprehensively described in
-`Robust Python <https://www.oreilly.com/library/view/robust-python/9781098100650/>`_.
+Our goal is to provide a user-friendly public interface using concise descriptions and
+`Python type hints <https://docs.python.org/3/library/typing.html>`_ (see also this Python Enhancement Proposal
+`PEP 484 <https://peps.python.org/pep-0484/>`_
+or the `Robust Python <https://www.oreilly.com/library/view/robust-python/9781098100650/>`_ book).
+For the validation of user inputs, we use comprehensive checking functions with descriptive error messages.
 
 Documentation Style
-^^^^^^^^^^^^^^^^^^^^
+-------------------
+
+- **Docstring Style**: We use the `Numpy Docstring style <https://numpydoc.readthedocs.io/en/latest/format.html>`_ and
+  adhere to the `PEP 257 <https://peps.python.org/pep-0257/>`_ docstring conventions.
 
-- **Docstring Style**: We use Numpy-style docstrings. Learn more in the
-  `Numpy Docstring Guide <https://numpydoc.readthedocs.io/en/latest/format.html>`_.
+- **Code Style**: Please follow the `PEP 8 <https://peps.python.org/pep-0008/>`_ and
+  `PEP 20 <https://peps.python.org/pep-0020/>`_ style guides for Python code.
 
 - **Markup Language**: Documentation is in reStructuredText (.rst). For an introduction, see this
   `reStructuredText Primer <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html>`_.
 
-- **Autodoc**: We use `sphinx.ext.autodoc` for automatic inclusion of docstrings in the documentation.
+- **Autodoc**: We use `Sphinx <https://www.sphinx-doc.org/en/master/index.html>`_
+  for automatic inclusion of docstrings in the documentation, including its
+  `autodoc <https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html>`_ and
+  `napoleon <https://sphinxcontrib-napoleon.readthedocs.io/en/latest/#>`_ extensions.
 
 - **Further Details**: See `docs/source/conf.py` for more.
 
 Building the Docs
-^^^^^^^^^^^^^^^^^
+-----------------
 
 To generate the documentation locally:
 

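The `Tool` template, type hints, Numpy-style docstrings, and input-checking conventions described in CONTRIBUTING.rst above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from AAanalysis: the `Tool` base class shown here is a stand-in, and `SeqLength` and `check_non_empty_str_list` are hypothetical names introduced only for this example.

```python
from typing import List, Optional


def check_non_empty_str_list(name: str, val: Optional[List[str]]) -> None:
    """Hypothetical checking function: validate user input with a descriptive error."""
    if not isinstance(val, list) or len(val) == 0:
        raise ValueError(f"'{name}' should be a non-empty list of strings, got: {val!r}")
    if not all(isinstance(x, str) for x in val):
        raise ValueError(f"'{name}' should contain only strings, got: {val!r}")


class Tool:
    """Stand-in for the project's `Tool` template (assumed interface: run and eval)."""

    def run(self, *args, **kwargs):
        raise NotImplementedError

    def eval(self, *args, **kwargs):
        raise NotImplementedError


class SeqLength(Tool):
    """
    Toy tool computing sequence lengths (illustration only).

    Parameters
    ----------
    min_len
        Minimum number of residues a sequence needs to count as valid in ``eval``.
    """

    def __init__(self, min_len: int = 1):
        self.min_len = min_len
        self.lengths_: Optional[List[int]] = None

    def run(self, seqs: List[str]) -> List[int]:
        """Return the length of each sequence."""
        check_non_empty_str_list(name="seqs", val=seqs)
        self.lengths_ = [len(s) for s in seqs]
        return self.lengths_

    def eval(self) -> float:
        """Return the fraction of sequences with at least ``min_len`` residues."""
        if self.lengths_ is None:
            raise ValueError("Call 'run' before 'eval'.")
        return sum(n >= self.min_len for n in self.lengths_) / len(self.lengths_)


tool = SeqLength(min_len=3)
lengths = tool.run(["ACD", "AC"])
fraction_valid = tool.eval()
```

The hierarchy stays flat (one level of inheritance from `Tool`), and validation lives in a separate checking function so that error messages remain consistent across classes.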
diff --git a/README.rst b/README.rst
@@ -1,5 +1,7 @@
 Welcome to the AAanalysis documentation
 =======================================
+.. Developer Notes:
+    Please update badges in README.rst and vice versa
 .. image:: https://github.com/breimanntools/aaanalysis/workflows/Build/badge.svg
    :target: https://github.com/breimanntools/aaanalysis/actions
    :alt: Build Status

diff --git a/aaanalysis/__pycache__/utils.cpython-39.pyc b/aaanalysis/__pycache__/utils.cpython-39.pyc
diff --git a/aaanalysis/_utils/utils_dpulearn.py b/aaanalysis/_utils/utils_dpulearn.py
@@ -17,15 +17,3 @@
 
 
 # III Test/Caller Functions
-
-
-# IV Main
-def main():
-    t0 = time.time()
-
-    t1 = time.time()
-    print("Time:", t1 - t0)
-
-
-if __name__ == "__main__":
-    main()
diff --git a/aaanalysis/data_loader/__pycache__/data_loader.cpython-39.pyc b/aaanalysis/data_loader/__pycache__/data_loader.cpython-39.pyc
diff --git a/aaanalysis/data_loader/data_loader.py b/aaanalysis/data_loader/data_loader.py
@@ -1,5 +1,6 @@
 """
-This is a script for loading protein sequence benchmarking datasets and amino acid scales including classification
+This is a script for loading protein sequence benchmarking datasets and amino acid scales and
+their two-level classification (AAontology).
 """
 import os
 import pandas as pd
@@ -12,8 +13,8 @@
 # I Helper Functions
 STR_AA_GAP = "-"
 LIST_CANONICAL_AA = ['N', 'A', 'I', 'V', 'K', 'Q', 'R', 'M', 'H', 'F', 'E', 'D', 'C', 'G', 'L', 'T', 'S', 'Y', 'W', 'P']
-LIST_SCALES = [ut.STR_SCALES, ut.STR_SCALES_RAW]
-LIST_DATASETS = LIST_SCALES + [ut.STR_SCALE_CAT, ut.STR_SCALES_PC, ut.STR_TOP60, ut.STR_TOP60_EVAL]
+NAME_SCALE_SETS_BASE = [ut.STR_SCALES, ut.STR_SCALES_RAW]
+NAMES_SCALE_SETS = NAME_SCALE_SETS_BASE + [ut.STR_SCALE_CAT, ut.STR_SCALES_PC, ut.STR_TOP60, ut.STR_TOP60_EVAL]
 
 
 # II Main Functions
@@ -36,37 +37,49 @@ def _adjust_non_canonical_aa(df=None, non_canonical_aa="remove"):
         df[ut.COL_SEQ] = [re.sub(f'[{"".join(list_non_canonical_aa)}]', STR_AA_GAP, x) for x in df[ut.COL_SEQ]]
     return df
 
+
+def check_name_of_dataset(name="INFO", folder_in=None):
+    """"""
+    if name == "INFO":
+        return
+    list_datasets = [x.split(".")[0] for x in os.listdir(folder_in) if "." in x]
+    if name not in list_datasets:
+        list_aa = [x for x in list_datasets if 'AA' in x]
+        list_seq = [x for x in list_datasets if 'SEQ' in x]
+        list_dom = [x for x in list_datasets if 'DOM' in x]
+        raise ValueError(f"'name' ({name}) is not valid."
+                         f"\n Amino acid datasets: {list_aa}"
+                         f"\n Sequence datasets: {list_seq}"
+                         f"\n Domain datasets: {list_dom}")
+
+
 # TODO write test and check in READTHEDOCS (end-to-end solution)
 def load_dataset(name: str = "INFO",
                  n: Optional[int] = None,
                  non_canonical_aa: Literal["remove", "keep", "gap"] = "remove",
                  min_len: Optional[int] = None,
                  max_len: Optional[int] = None) -> pd.DataFrame:
     """
-    Load protein benchmarking datasets or their general overview by setting 'name' to 'INFO'.
+    Load protein benchmarking datasets.
 
-    Three types of benchmark datasets are provided:
-        - Residue prediction: 6 datasets used to predict residue (amino acid) specific properties
-            ('AA_CASPASE3', 'AA_FURIN', 'AA_LDR', 'AA_MMP2', 'AA_RNABIND', 'AA_SA')
-        - Domain prediction: 1 dataset used to predict domain specific properties (_PU contains unlabeled data)
-            (DOM_GSEC, DOM_GSEC_PU)
-        - Sequence prediction: 6 datasets used to predict sequence specific properties
-            ('SEQ_AMYLO', 'SEQ_CAPSID', 'SEQ_DISULFIDE', 'SEQ_LOCATION', 'SEQ_SOLUBLE', 'SEQ_TAIL')
+    The benchmarks are distinguished into residue/amino acid ('AA'), domain ('DOM'), and sequence ('SEQ') level
+    datasets. An overview table can be retrieved by using the default setting (name='INFO'). A thorough analysis of
+    the residue and sequence datasets can be found in TODO[Breimann23a].
 
     Parameters
     ----------
-    name :
-        Name of the dataset. See 'Dataset' column in overview dataframe (name='INFO').
-    n :
+    name
+        Name of the dataset. See 'Dataset' column in overview table.
+    n
         Number of proteins per class. If None, the whole dataset will be returned.
-    non_canonical_aa :
+    non_canonical_aa
         Options for modifying non-canonical amino acids:
         - 'remove': Sequences containing non-canonical amino acids are removed.
         - 'keep': Sequences containing non-canonical amino acids are not removed.
         - 'gap': Sequences are kept and modified by replacing non-canonical amino acids by the gap symbol ('-').
-    min_len :
-        Minimum length of sequences for filtering. None to disable.
-    max_len :
+    min_len
+        Minimum length of sequences for filtering. None to disable.
+    max_len
         Maximum length of sequences for filtering. None to disable.
 
     Returns
@@ -76,23 +89,16 @@ def load_dataset(name: str = "INFO",
 
     Notes
     -----
-    For further information on the benchmark datasets, refer to the AAclust paper : TODO: add link to AAclust paper
+    See further information on the benchmark datasets in
 
     """
     ut.check_non_negative_number(name="n", val=n, accept_none=True)
     ut.check_non_negative_number(name="min_len", val=min_len, accept_none=True)
     folder_in = ut.FOLDER_DATA + "benchmarks" + ut.SEP
+    check_name_of_dataset(name=name, folder_in=folder_in)
+    # Load overview table
     if name == "INFO":
         return pd.read_excel(folder_in + "INFO_benchmarks.xlsx")
-    list_datasets = [x.split(".")[0] for x in os.listdir(folder_in) if "." in x]
-    if name not in list_datasets:
-        list_aa = [x for x in list_datasets if 'AA' in x]
-        list_seq = [x for x in list_datasets if 'SEQ' in x]
-        list_dom = [x for x in list_datasets if 'DOM' in x]
-        raise ValueError(f"'name' ({name}) is not valid."
-                         f"\n Amino acid datasets: {list_aa}"
-                         f"\n Sequence datasets: {list_seq}"
-                         f"\n Domain datasets: {list_dom}")
     df = pd.read_csv(folder_in + name + ".tsv", sep="\t")
     # Filter Rdata
     if min_len is not None:
@@ -128,12 +134,14 @@ def _filter_scales(df_cat=None, unclassified_in=False, just_aaindex=False):
 # Extend for AAclustTop60
 def load_scales(name="scales", just_aaindex=False, unclassified_in=True):
     """
-    Load amino acid scales or scale classification.
+    Load amino acid scales, scale classification (AAontology), or scale evaluation.
+
+    A thorough analysis of the residue and sequence datasets can be found in TODO[Breimann23a].
 
     Parameters
     ----------
     name : str, default = 'scales'
-        Name of the dataset to load. Options are 'scales', 'scales_raw', 'scale_classification',
+        Name of the dataset to load. Options are 'scales', 'scales_raw', 'scale_cat',
         'scales_pc', 'top60', and 'top60_eval'.
     unclassified_in : bool, optional
         Whether unclassified scales should be included. The 'Others' category counts as unclassified.
@@ -147,15 +155,15 @@ def load_scales(name="scales", just_aaindex=False, unclassified_in=True):
     df : :class:`pandas.DataFrame`
         Dataframe for the selected scale dataset.
     """
-    if name not in LIST_DATASETS:
-        raise ValueError(f"'name' ({name}) is not valid. Choose one of following: {LIST_DATASETS}")
-    # Load data
+    if name not in NAMES_SCALE_SETS:
+        raise ValueError(f"'name' ({name}) is not valid. Choose one of following: {NAMES_SCALE_SETS}")
+    # Load _data
     df_cat = pd.read_excel(ut.FOLDER_DATA + f"{ut.STR_SCALE_CAT}.xlsx")
     df_cat = _filter_scales(df_cat=df_cat, unclassified_in=unclassified_in, just_aaindex=just_aaindex)
     if name == ut.STR_SCALE_CAT:
         return df_cat
     df = pd.read_excel(ut.FOLDER_DATA + name + ".xlsx", index_col=0)
     # Filter scales
-    if name in LIST_SCALES:
+    if name in NAME_SCALE_SETS_BASE:
         df = df[[x for x in list(df) if x in list(df_cat[ut.COL_SCALE_ID])]]
     return df
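The behavior of the extracted `check_name_of_dataset` helper in the data_loader.py diff above can be tried in isolation. The sketch below copies the function body from the diff and runs it against a temporary folder; the folder layout with empty placeholder files is an assumption made for illustration, and `SEQ_UNKNOWN` is a deliberately invalid, made-up name.

```python
import os
import tempfile


def check_name_of_dataset(name="INFO", folder_in=None):
    """Raise ValueError if 'name' is neither 'INFO' nor a dataset file stem in 'folder_in'."""
    if name == "INFO":
        return
    # File stems (text before the first dot) are treated as dataset names
    list_datasets = [x.split(".")[0] for x in os.listdir(folder_in) if "." in x]
    if name not in list_datasets:
        list_aa = [x for x in list_datasets if 'AA' in x]
        list_seq = [x for x in list_datasets if 'SEQ' in x]
        list_dom = [x for x in list_datasets if 'DOM' in x]
        raise ValueError(f"'name' ({name}) is not valid."
                         f"\n Amino acid datasets: {list_aa}"
                         f"\n Sequence datasets: {list_seq}"
                         f"\n Domain datasets: {list_dom}")


# Assumed folder layout for illustration: empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as folder_in:
    for file_name in ["AA_FURIN.tsv", "SEQ_AMYLO.tsv", "DOM_GSEC.tsv", "INFO_benchmarks.xlsx"]:
        open(os.path.join(folder_in, file_name), "w").close()
    check_name_of_dataset(name="SEQ_AMYLO", folder_in=folder_in)  # valid name: no error
    try:
        check_name_of_dataset(name="SEQ_UNKNOWN", folder_in=folder_in)
        error_message = None
    except ValueError as e:
        error_message = str(e)

print(error_message)
```

Two properties of the check are visible in this sketch: any file stem in the folder passes (so `INFO_benchmarks` would be accepted as a name), and the grouping uses substring matching, so a stricter variant might use `str.startswith` on the `AA_`, `SEQ_`, and `DOM_` prefixes instead.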
diff --git a/aaanalysis/utils.py b/aaanalysis/utils.py
@@ -1,5 +1,5 @@
 """
-Config with folder structure
+Config with folder structure. Most imported modules contain checking functions for code validation.
 """
 import os
 import platform
@@ -28,3 +28,23 @@ def _folder_path(super_folder, folder_name):
 
 
 # II MAIN FUNCTIONS
+# Check key dataframes using constants and general checking functions
+def check_df_seq():
+    """"""
+
+
+def check_df_parts():
+    """"""
+
+
+def check_df_cat():
+    """"""
+
+
+def check_df_scales():
+    """"""
+
+
+def check_df_feat():
+    """"""
+
diff --git a/docs/build/doctrees/index/badges.doctree → docs/build/doctrees/_index/badges.doctree b/docs/build/doctrees/index/badges.doctree → docs/build/doctrees/_index/badges.doctree
diff --git a/...uild/doctrees/_resources/overview.doctree → docs/build/doctrees/_index/overview.doctree b/...uild/doctrees/_resources/overview.doctree → docs/build/doctrees/_index/overview.doctree
diff --git a/.../build/doctrees/_resources/tables.doctree → docs/build/doctrees/_index/tables.doctree b/.../build/doctrees/_resources/tables.doctree → docs/build/doctrees/_index/tables.doctree
diff --git a/docs/build/doctrees/_index/usage_principles/data_loading.doctree b/docs/build/doctrees/_index/usage_principles/data_loading.doctree
diff --git a/docs/build/doctrees/api.doctree b/docs/build/doctrees/api.doctree
diff --git a/docs/build/doctrees/environment.pickle b/docs/build/doctrees/environment.pickle
diff --git a/docs/build/doctrees/generated/aaanalysis.AAclust.doctree b/docs/build/doctrees/generated/aaanalysis.AAclust.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.CPP.doctree b/docs/build/doctrees/generated/aaanalysis.CPP.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.CPPPlot.doctree b/docs/build/doctrees/generated/aaanalysis.CPPPlot.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.SequenceFeature.doctree b/docs/build/doctrees/generated/aaanalysis.SequenceFeature.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.SplitRange.doctree b/docs/build/doctrees/generated/aaanalysis.SplitRange.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.dPULearn.doctree b/docs/build/doctrees/generated/aaanalysis.dPULearn.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.load_dataset.doctree b/docs/build/doctrees/generated/aaanalysis.load_dataset.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.load_scales.doctree b/docs/build/doctrees/generated/aaanalysis.load_scales.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.plot_get_cdict.doctree b/docs/build/doctrees/generated/aaanalysis.plot_get_cdict.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.plot_get_cmap.doctree b/docs/build/doctrees/generated/aaanalysis.plot_get_cmap.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.plot_set_legend.doctree b/docs/build/doctrees/generated/aaanalysis.plot_set_legend.doctree
diff --git a/docs/build/doctrees/generated/aaanalysis.plot_settings.doctree b/docs/build/doctrees/generated/aaanalysis.plot_settings.doctree
diff --git a/docs/build/doctrees/index.doctree b/docs/build/doctrees/index.doctree
diff --git a/docs/build/doctrees/index/CONTRIBUTING_COPY.doctree b/docs/build/doctrees/index/CONTRIBUTING_COPY.doctree
diff --git a/docs/build/doctrees/index/citations.doctree b/docs/build/doctrees/index/citations.doctree
diff --git a/docs/build/doctrees/index/tables_template.doctree b/docs/build/doctrees/index/tables_template.doctree
diff --git a/docs/build/doctrees/index/usage_principles.doctree b/docs/build/doctrees/index/usage_principles.doctree
diff --git a/docs/build/html/.buildinfo b/docs/build/html/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 65d00fa6b6e9e2b36fd1da4e8b74eb3d
+config: 1ea60009c1a7c0c8dbd40871d65195e6
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/...reviews/summary_index_badges_577f5a73.png → ...eviews/summary__index_badges_577f5a73.png b/...reviews/summary_index_badges_577f5a73.png → ...eviews/summary__index_badges_577f5a73.png
diff --git a/.../summary__resources_overview_2b433d77.png → ...iews/summary__index_overview_2b433d77.png b/.../summary__resources_overview_2b433d77.png → ...iews/summary__index_overview_2b433d77.png
diff --git a/...ws/summary__resources_tables_88a5b382.png → ...eviews/summary__index_tables_88a5b382.png b/...ws/summary__resources_tables_88a5b382.png → ...eviews/summary__index_tables_88a5b382.png
diff --git a/...mages/social_previews/summary__index_usage_principles_data_loading_cc2c81b2.png b/...mages/social_previews/summary__index_usage_principles_data_loading_cc2c81b2.png
diff --git a/docs/build/html/_images/social_previews/summary_index_tables_template_5b05b8f2.png b/docs/build/html/_images/social_previews/summary_index_tables_template_5b05b8f2.png