Update conf.py and CONTRIBUTING.rst and test
breimanntools committed Sep 19, 2023
1 parent c376adb commit 18edaeb
Showing 115 changed files with 1,306 additions and 401 deletions.
68 changes: 43 additions & 25 deletions CONTRIBUTING.rst
@@ -1,3 +1,9 @@
.. Developer Notes:
This file is a summary of sources and conventions for software development in Python used in this project.
Aims of the project and our naming conventions are defined in 'Vision' and 'Documentation', respectively.
Please modify only CONTRIBUTING.rst, then update /docs/source/index/CONTRIBUTING_COPY.rst
(used for the readthedocs documentation) by copy-pasting.
============
Contributing
============
@@ -7,7 +13,7 @@ Contributing
:depth: 1

Introduction
------------
============

Welcome and thank you for considering a contribution to AAanalysis! We are an open-source project focusing on
interpretable protein prediction. Your involvement is invaluable to us. Contributions can be made in the following ways:
@@ -17,13 +23,13 @@ interpretable protein prediction. Your involvement is invaluable to us. Contribu
- Participating in project discussions.

Newcomers can start by tackling issues labeled `good first issue <https://github.com/breimanntools/aaanalysis/issues>`_.

Please email stephanbreimann@gmail.com for further questions or suggestions.

Vision
------
======

Objectives
^^^^^^^^^^
----------

- Establish a comprehensive toolkit for interpretable, sequence-based protein prediction.
- Enable robust learning from small and unbalanced datasets, common in life sciences.
@@ -32,14 +38,14 @@ Objectives
- Offer flexible interoperability with other Python packages like `biopython <https://biopython.org/>`_.

Non-goals
^^^^^^^^^
---------

- Reimplementation of existing solutions.
- Ignoring the biological context.
- Reliance on opaque, black-box models.

Principles
^^^^^^^^^^
----------

- Algorithms should be biologically inspired and combine empirical insights with cutting-edge computational methods.
- We emphasize fair, accountable, and transparent machine learning, as detailed
@@ -49,7 +55,7 @@ Principles


Bug Reports
-----------
===========

For effective bug reports, please include a Minimal Reproducible Example (MRE):

@@ -61,10 +67,10 @@ Further guidelines can be found `here <https://matthewrocklin.com/minimal-bug-re


Installation
------------
============

Latest Version
^^^^^^^^^^^^^^
--------------

To test the latest development version, you can use pip:

@@ -73,7 +79,7 @@ To test the latest development version, you can use pip:
pip install git+https://github.com/breimanntools/aaanalysis.git@master
Local Development Environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-----------------------------

Fork and Clone
""""""""""""""
@@ -122,7 +128,7 @@ This will execute all the test cases in the tests/ directory.


Pull Requests
-------------
=============

For substantial changes, start by opening an issue for discussion. For minor changes like typos, submit a pull request directly.

@@ -133,21 +139,21 @@ Ensure your pull request:
- Is up-to-date with the master branch and passes all tests.

Preview Changes
^^^^^^^^^^^^^^^
---------------

To preview documentation changes in pull requests, follow the "docs/readthedocs.org" check link under "All checks have passed".


Documentation
-------------
=============

Documentation is a crucial part of the project. If you make any modifications to the documentation,
please ensure they render correctly.

Naming Conventions
^^^^^^^^^^^^^^^^^^^
------------------

We strive for interface consistency with well-established libraries like
We strive for consistency of our public interfaces with well-established libraries like
`scikit-learn <https://scikit-learn.org/stable/>`_, `pandas <https://pandas.pydata.org/>`_,
`matplotlib <https://matplotlib.org/>`_, and `seaborn <https://seaborn.pydata.org/>`_.

@@ -162,7 +168,11 @@ We primarily use two class templates for organizing our codebase:
- **Tool**: Standalone classes that focus on specialized tasks, such as feature engineering for protein prediction.
They feature `.run` and `.eval` methods to carry out the complete processing pipeline and generate various evaluation metrics.

Both `Wrapper` and `Tool` classes come with supplementary plotting classes for visualization.
The remaining classes should fulfill two further purposes, without being directly implemented using class inheritance.

- **Data visualization**: Supplementary plotting classes for `Wrapper` and `Tool` classes, named accordingly using a
`Plot` suffix (e.g., 'CPPPlot'). These classes implement an `.eval` method to visualize the key evaluation measures.
- **Analysis support**: Supportive pre-processing classes for `Wrapper` and `Tool` classes.
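
The class templates described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual base classes: the abstract ``Tool`` base shown here, the ``ExampleTool`` subclass, and ``ExampleToolPlot`` are hypothetical names. Only the ``.run``/``.eval`` methods, the ``Plot`` suffix, and the absence of inheritance for plotting classes are taken from the text.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Tool(ABC):
    """Hypothetical template: a standalone class with .run and .eval methods."""

    @abstractmethod
    def run(self, sequences: List[str]) -> List[float]:
        """Carry out the complete processing pipeline."""

    @abstractmethod
    def eval(self, values: List[float]) -> Dict[str, float]:
        """Compute evaluation measures for the results of .run."""


class ExampleTool(Tool):
    """Hypothetical tool: fraction of hydrophobic residues per sequence."""

    HYDROPHOBIC = set("AILMFVW")

    def run(self, sequences: List[str]) -> List[float]:
        return [sum(aa in self.HYDROPHOBIC for aa in seq) / len(seq) for seq in sequences]

    def eval(self, values: List[float]) -> Dict[str, float]:
        return {"mean": sum(values) / len(values), "max": max(values)}


class ExampleToolPlot:
    """Hypothetical plotting companion, named with the 'Plot' suffix.

    It does not inherit from Tool. A real implementation would draw a figure;
    here .eval only formats the evaluation measures as text.
    """

    def eval(self, measures: Dict[str, float]) -> str:
        return ", ".join(f"{name}={value:.2f}" for name, value in sorted(measures.items()))


tool = ExampleTool()
values = tool.run(["AAKK", "AILV"])
summary = ExampleToolPlot().eval(tool.eval(values))
print(summary)
```

Keeping the plotting class separate from the tool means the computation can be tested without any plotting backend.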

Function and Method Naming
""""""""""""""""""""""""""
@@ -172,30 +182,38 @@ processing data values should correspond with the names specified in our primary
`aaanalysis/_utils/_utils_constants.py`.

Code Philosophy
^^^^^^^^^^^^^^^
---------------

We aim for a modular, robust, and easily extendable codebase. Therefore, we adhere to flat class hierarchies
(i.e., only inheriting from `Wrapper` or `Tool` is recommended) and functional programming principles, as outlined in
`A Philosophy of Software Design <https://dl.acm.org/doi/10.5555/3288797>`_.
We also prioritize user-friendly interfaces, complete with descriptive error messages and
`Python type hints <https://docs.python.org/3/library/typing.html>`_, comprehensively described in
`Robust Python <https://www.oreilly.com/library/view/robust-python/9781098100650/>`_.
Our goal is to provide a user-friendly public interface using concise description and
`Python type hints <https://docs.python.org/3/library/typing.html>`_ (see also this Python Enhancement Proposal
`PEP 484 <https://peps.python.org/pep-0484/>`_
or the `Robust Python <https://www.oreilly.com/library/view/robust-python/9781098100650/>`_ book).
For the validation of user inputs, we use comprehensive checking functions with descriptive error messages.
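
A checking function with type hints and descriptive error messages might look like the sketch below. The signature mirrors the call ``ut.check_non_negative_number(name=..., val=..., accept_none=...)`` that appears in ``load_dataset`` in this commit. The body and the exact error messages are assumptions, since the project's implementation is not shown here.

```python
from typing import Optional, Union


def check_non_negative_number(name: str, val: Optional[Union[int, float]],
                              accept_none: bool = False) -> None:
    """Raise a descriptive ValueError if 'val' is not a non-negative number.

    Assumed behavior for illustration; the project's own function may differ.
    """
    if val is None:
        if accept_none:
            return
        raise ValueError(f"'{name}' should not be None.")
    # bool is a subclass of int, so it is excluded explicitly
    if isinstance(val, bool) or not isinstance(val, (int, float)):
        raise ValueError(f"'{name}' ({val!r}) should be a number, not {type(val).__name__}.")
    if val < 0:
        raise ValueError(f"'{name}' ({val}) should be non-negative.")


# Valid inputs pass silently
check_non_negative_number(name="n", val=10)
check_non_negative_number(name="min_len", val=None, accept_none=True)

# Invalid inputs name the parameter and the offending value
try:
    check_non_negative_number(name="n", val=-1)
except ValueError as error:
    message = str(error)
    print(message)
```

Including the parameter name and the received value in the message lets users see which argument to fix without reading a traceback.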

Documentation Style
^^^^^^^^^^^^^^^^^^^^
-------------------

- **Docstring Style**: We use the `Numpy Docstring style <https://numpydoc.readthedocs.io/en/latest/format.html>`_ and
adhere to the `PEP 257 <https://peps.python.org/pep-0257/>`_ docstring conventions.

- **Docstring Style**: We use Numpy-style docstrings. Learn more in the
`Numpy Docstring Guide <https://numpydoc.readthedocs.io/en/latest/format.html>`_.
- **Code Style**: Please follow the `PEP 8 <https://peps.python.org/pep-0008/>`_ and
`PEP 20 <https://peps.python.org/pep-0020/>`_ style guides for Python code.

- **Markup Language**: Documentation is in reStructuredText (.rst). For an introduction, see this
`reStructuredText Primer <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html>`_.

- **Autodoc**: We use `sphinx.ext.autodoc` for automatic inclusion of docstrings in the documentation.
- **Autodoc**: We use `Sphinx <https://www.sphinx-doc.org/en/master/index.html>`_
for automatic inclusion of docstrings in the documentation, including its
`autodoc <https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html>`_ and
`napoleon <https://sphinxcontrib-napoleon.readthedocs.io/en/latest/#>`_ extensions.

- **Further Details**: See `docs/source/conf.py` for more.
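
As a rough illustration of these conventions, the sketch below combines type hints in the signature with a Numpy-style docstring. The function ``filter_by_length`` is hypothetical and not part of AAanalysis. Types are left out of the ``Parameters`` section because the signature already carries them, as in the ``load_dataset`` docstring updated in this commit.

```python
from typing import List, Optional


def filter_by_length(sequences: List[str],
                     min_len: Optional[int] = None,
                     max_len: Optional[int] = None) -> List[str]:
    """
    Filter sequences by length.

    Parameters
    ----------
    sequences
        Protein sequences to filter.
    min_len
        Minimum length of sequences to keep. None to disable.
    max_len
        Maximum length of sequences to keep. None to disable.

    Returns
    -------
    list of str
        Sequences whose length lies within the given bounds.
    """
    kept = []
    for seq in sequences:
        if min_len is not None and len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        kept.append(seq)
    return kept


filtered = filter_by_length(["AK", "AKLV", "AKLVMM"], min_len=3, max_len=5)
print(filtered)
```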

Building the Docs
^^^^^^^^^^^^^^^^^
-----------------

To generate the documentation locally:

2 changes: 2 additions & 0 deletions README.rst
@@ -1,5 +1,7 @@
Welcome to the AAanalysis documentation
=======================================
.. Developer Notes:
Please update badges in README.rst and vice versa
.. image:: https://github.com/breimanntools/aaanalysis/workflows/Build/badge.svg
:target: https://github.com/breimanntools/aaanalysis/actions
:alt: Build Status
Binary file modified aaanalysis/__pycache__/utils.cpython-39.pyc
Binary file not shown.
12 changes: 0 additions & 12 deletions aaanalysis/_utils/utils_dpulearn.py
@@ -17,15 +17,3 @@


# III Test/Caller Functions


# IV Main
def main():
t0 = time.time()

t1 = time.time()
print("Time:", t1 - t0)


if __name__ == "__main__":
main()
Binary file modified aaanalysis/data_loader/__pycache__/data_loader.cpython-39.pyc
Binary file not shown.
76 changes: 42 additions & 34 deletions aaanalysis/data_loader/data_loader.py
@@ -1,5 +1,6 @@
"""
This is a script for loading protein sequence benchmarking datasets and amino acid scales including classification
This is a script for loading protein sequence benchmarking datasets and amino acid scales and
their two-level classification (AAontology).
"""
import os
import pandas as pd
@@ -12,8 +13,8 @@
# I Helper Functions
STR_AA_GAP = "-"
LIST_CANONICAL_AA = ['N', 'A', 'I', 'V', 'K', 'Q', 'R', 'M', 'H', 'F', 'E', 'D', 'C', 'G', 'L', 'T', 'S', 'Y', 'W', 'P']
LIST_SCALES = [ut.STR_SCALES, ut.STR_SCALES_RAW]
LIST_DATASETS = LIST_SCALES + [ut.STR_SCALE_CAT, ut.STR_SCALES_PC, ut.STR_TOP60, ut.STR_TOP60_EVAL]
NAME_SCALE_SETS_BASE = [ut.STR_SCALES, ut.STR_SCALES_RAW]
NAMES_SCALE_SETS = NAME_SCALE_SETS_BASE + [ut.STR_SCALE_CAT, ut.STR_SCALES_PC, ut.STR_TOP60, ut.STR_TOP60_EVAL]


# II Main Functions
@@ -36,37 +37,49 @@ def _adjust_non_canonical_aa(df=None, non_canonical_aa="remove"):
df[ut.COL_SEQ] = [re.sub(f'[{"".join(list_non_canonical_aa)}]', STR_AA_GAP, x) for x in df[ut.COL_SEQ]]
return df


def check_name_of_dataset(name="INFO", folder_in=None):
""""""
if name == "INFO":
return
list_datasets = [x.split(".")[0] for x in os.listdir(folder_in) if "." in x]
if name not in list_datasets:
list_aa = [x for x in list_datasets if 'AA' in x]
list_seq = [x for x in list_datasets if 'SEQ' in x]
list_dom = [x for x in list_datasets if 'DOM' in x]
raise ValueError(f"'name' ({name}) is not valid."
f"\n Amino acid datasets: {list_aa}"
f"\n Sequence datasets: {list_seq}"
f"\n Domain datasets: {list_dom}")


# TODO write test and check in Read the Docs (end-to-end solution)
def load_dataset(name: str = "INFO",
n: Optional[int] = None,
non_canonical_aa: Literal["remove", "keep", "gap"] = "remove",
min_len: Optional[int] = None,
max_len: Optional[int] = None) -> pd.DataFrame:
"""
Load protein benchmarking datasets or their general overview by setting 'name' to 'INFO'.
Load protein benchmarking datasets.
Three types of benchmark datasets are provided:
- Residue prediction: 6 datasets used to predict residue (amino acid) specific properties
('AA_CASPASE3', 'AA_FURIN', 'AA_LDR', 'AA_MMP2', 'AA_RNABIND', 'AA_SA')
- Domain prediction: 1 dataset used to predict domain specific properties (_PU contains unlabeled data)
(DOM_GSEC, DOM_GSEC_PU)
- Sequence prediction: 6 datasets used to predict sequence specific properties
('SEQ_AMYLO', 'SEQ_CAPSID', 'SEQ_DISULFIDE', 'SEQ_LOCATION', 'SEQ_SOLUBLE', 'SEQ_TAIL')
The benchmarks are distinguished into residue/amino acid ('AA'), domain ('DOM'), and sequence ('SEQ') level
datasets. An overview table can be retrieved by using the default setting (name='INFO'). A thorough analysis of
the residue and sequence datasets can be found in TODO[Breimann23a].
Parameters
----------
name :
Name of the dataset. See 'Dataset' column in overview dataframe (name='INFO').
n :
name
Name of the dataset. See 'Dataset' column in overview table.
n
Number of proteins per class. If None, the whole dataset will be returned.
non_canonical_aa :
non_canonical_aa
Options for modifying non-canonical amino acids:
- 'remove': Sequences containing non-canonical amino acids are removed.
- 'keep': Sequences containing non-canonical amino acids are not removed.
- 'gap': Sequences are kept and modified by replacing non-canonical amino acids by the gap symbol ('-').
min_len :
Minimum length of sequences for filtering. None to disable.
max_len :
min_len
Minimum length of sequences for filtering. None to disable
max_len
Maximum length of sequences for filtering. None to disable
Returns
@@ -76,23 +89,16 @@ def load_dataset(name: str = "INFO",
Notes
-----
For further information on the benchmark datasets, refer to the AAclust paper : TODO: add link to AAclust paper
See further information on the benchmark datasets in
"""
ut.check_non_negative_number(name="n", val=n, accept_none=True)
ut.check_non_negative_number(name="min_len", val=min_len, accept_none=True)
folder_in = ut.FOLDER_DATA + "benchmarks" + ut.SEP
check_name_of_dataset(name=name, folder_in=folder_in)
# Load overview table
if name == "INFO":
return pd.read_excel(folder_in + "INFO_benchmarks.xlsx")
list_datasets = [x.split(".")[0] for x in os.listdir(folder_in) if "." in x]
if name not in list_datasets:
list_aa = [x for x in list_datasets if 'AA' in x]
list_seq = [x for x in list_datasets if 'SEQ' in x]
list_dom = [x for x in list_datasets if 'DOM' in x]
raise ValueError(f"'name' ({name}) is not valid."
f"\n Amino acid datasets: {list_aa}"
f"\n Sequence datasets: {list_seq}"
f"\n Domain datasets: {list_dom}")
df = pd.read_csv(folder_in + name + ".tsv", sep="\t")
# Filter Rdata
if min_len is not None:
@@ -128,12 +134,14 @@ def _filter_scales(df_cat=None, unclassified_in=False, just_aaindex=False):
# Extend for AAclustTop60
def load_scales(name="scales", just_aaindex=False, unclassified_in=True):
"""
Load amino acid scales or scale classification.
Load amino acid scales, scale classification (AAontology), or scale evaluation.
A thorough analysis of the residue and sequence datasets can be found in TODO[Breimann23a].
Parameters
----------
name : str, default = 'scales'
Name of the dataset to load. Options are 'scales', 'scales_raw', 'scale_classification',
Name of the dataset to load. Options are 'scales', 'scales_raw', 'scale_cat',
'scales_pc', 'top60', and 'top60_eval'.
unclassified_in : bool, optional
Whether unclassified scales should be included. The 'Others' category counts as unclassified.
@@ -147,15 +155,15 @@ def load_scales(name="scales", just_aaindex=False, unclassified_in=True):
df : :class:`pandas.DataFrame`
Dataframe for the selected scale dataset.
"""
if name not in LIST_DATASETS:
raise ValueError(f"'name' ({name}) is not valid. Choose one of following: {LIST_DATASETS}")
# Load data
if name not in NAMES_SCALE_SETS:
raise ValueError(f"'name' ({name}) is not valid. Choose one of following: {NAMES_SCALE_SETS}")
# Load _data
df_cat = pd.read_excel(ut.FOLDER_DATA + f"{ut.STR_SCALE_CAT}.xlsx")
df_cat = _filter_scales(df_cat=df_cat, unclassified_in=unclassified_in, just_aaindex=just_aaindex)
if name == ut.STR_SCALE_CAT:
return df_cat
df = pd.read_excel(ut.FOLDER_DATA + name + ".xlsx", index_col=0)
# Filter scales
if name in LIST_SCALES:
if name in NAME_SCALE_SETS_BASE:
df = df[[x for x in list(df) if x in list(df_cat[ut.COL_SCALE_ID])]]
return df
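
The three ``non_canonical_aa`` options documented for ``load_dataset`` can be illustrated with a small list-based sketch. This is not the project's ``_adjust_non_canonical_aa``, which operates on a DataFrame and is only partly visible in this diff. The function name and the list interface below are assumptions. The constants ``STR_AA_GAP`` and ``LIST_CANONICAL_AA`` are copied from ``data_loader.py``.

```python
import re
from typing import List

STR_AA_GAP = "-"
LIST_CANONICAL_AA = ['N', 'A', 'I', 'V', 'K', 'Q', 'R', 'M', 'H', 'F',
                     'E', 'D', 'C', 'G', 'L', 'T', 'S', 'Y', 'W', 'P']


def adjust_non_canonical_aa(sequences: List[str], non_canonical_aa: str = "remove") -> List[str]:
    """Hypothetical list-based analogue of the DataFrame-based helper."""
    options = ["remove", "keep", "gap"]
    if non_canonical_aa not in options:
        raise ValueError(f"'non_canonical_aa' ({non_canonical_aa}) should be one of: {options}")
    if non_canonical_aa == "keep":
        return list(sequences)
    # Character class matching every symbol that is not a canonical amino acid
    pattern = re.compile(f"[^{''.join(LIST_CANONICAL_AA)}]")
    if non_canonical_aa == "remove":
        # Drop every sequence containing at least one non-canonical symbol
        return [seq for seq in sequences if not pattern.search(seq)]
    # 'gap': keep all sequences, replacing non-canonical symbols by the gap symbol
    return [pattern.sub(STR_AA_GAP, seq) for seq in sequences]


sequences = ["ACDX", "MKV", "AUB"]
removed = adjust_non_canonical_aa(sequences, non_canonical_aa="remove")
gapped = adjust_non_canonical_aa(sequences, non_canonical_aa="gap")
kept = adjust_non_canonical_aa(sequences, non_canonical_aa="keep")
print(removed, gapped, kept)
```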
22 changes: 21 additions & 1 deletion aaanalysis/utils.py
@@ -1,5 +1,5 @@
"""
Config with folder structure
Config with folder structure. Most imported modules contain checking functions for code validation
"""
import os
import platform
@@ -28,3 +28,23 @@ def _folder_path(super_folder, folder_name):


# II MAIN FUNCTIONS
# Check key dataframes using constants and general checking functions
def check_df_seq():
""""""


def check_df_parts():
""""""


def check_df_cat():
""""""


def check_df_scales():
""""""


def check_df_feat():
""""""

Binary file modified docs/build/doctrees/api.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.AAclust.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.CPP.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.CPPPlot.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.SequenceFeature.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.SplitRange.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.dPULearn.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.load_dataset.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.load_scales.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.plot_get_cdict.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.plot_get_cmap.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.plot_set_legend.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/generated/aaanalysis.plot_settings.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/index.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/index/CONTRIBUTING_COPY.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/index/citations.doctree
Binary file not shown.
Binary file added docs/build/doctrees/index/tables_template.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/index/usage_principles.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 65d00fa6b6e9e2b36fd1da4e8b74eb3d
config: 1ea60009c1a7c0c8dbd40871d65195e6
tags: 645f666f9bcd5a90fca523b33c5a78b7