Merge pull request #171 from uio-bmi/docs_updates

Documentation updates and bugfixes
uio-bmi · Apr 26, 2024 · c492ef7
2 parents ae3be16 + ba3604e
commit c492ef7
Showing 258 changed files with 6,929 additions and 6,011 deletions.
diff --git a/docs/source/_static/files/ml_methods_properties.csv b/docs/source/_static/files/ml_methods_properties.csv
@@ -1,10 +1,14 @@
 ML method,binary classification,multi-class classification,sequence dataset,receptor dataset,repertoire dataset,model selection CV
 AtchleyKmerMILClassifier,✓,✗,✗,✗,✓,✗
+BinaryFeatureClassifier,✓,✗,✓,✗,✗,✗
 DeepRC,✓,✗,✗,✗,✓,✗
 KNN,✓,✓,✓,✓,✓,✓
+KerasSequenceCnn,✓,✗,✓,✗,✗,✗
 LogisticRegression,✓,✓,✓,✓,✓,✓
+PrecomputedKNN,✓,✓,✓,✓,✓,✓
 ProbabalisticBinaryClassifier,✓,✗,✗,✗,✓,✗
 RandomForestClassifier,✓,✓,✓,✓,✓,✓
 ReceptorCNN,✓,✗,✗,✓,✗,✗
+SVC,✓,✓,✓,✓,✓,✓
 SVM,✓,✓,✓,✓,✓,✓
 TCRdistClassifier,✓,✓,✓,✓,✓,✗
diff --git a/docs/source/_static/images/ML_setting.png b/docs/source/_static/images/ML_setting.png
diff --git a/docs/source/_static/images/analysis_paths.png b/docs/source/_static/images/analysis_paths.png
diff --git a/docs/source/_static/images/analysis_paths_receptors.png b/docs/source/_static/images/analysis_paths_receptors.png
diff --git a/docs/source/_static/images/analysis_paths_repertoires.png b/docs/source/_static/images/analysis_paths_repertoires.png
diff --git a/docs/source/_static/images/analysis_paths_sequences.png b/docs/source/_static/images/analysis_paths_sequences.png
diff --git a/docs/source/_static/images/definitions_instructions_overview.png b/docs/source/_static/images/definitions_instructions_overview.png
diff --git a/docs/source/_static/images/dev_docs/run_python_test_in_test.png b/docs/source/_static/images/dev_docs/run_python_test_in_test.png
diff --git a/docs/source/_static/images/receptor_analysis_paths.png b/docs/source/_static/images/receptor_analysis_paths.png
diff --git a/docs/source/_static/images/yaml_structure.png b/docs/source/_static/images/yaml_structure.png
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -50,7 +50,8 @@
     'sphinx.ext.doctest',
     'sphinx.ext.mathjax',
     'sphinx.ext.viewcode',
-    'sphinx_rtd_theme',
+    'sphinx_toolbox.collapse',
+#    'sphinx_rtd_theme',
     'sphinx.ext.napoleon',
     'sphinx.ext.autosectionlabel',
     'sphinx_sitemap'

diff --git a/docs/source/developer_docs.rst b/docs/source/developer_docs.rst
@@ -13,27 +13,29 @@ Developer documentation
 To get started with adding features to immuneML, follow the steps described here. These steps assume you have some experience with object-oriented programming in Python
 and some basic knowledge of how Git works.
 
+  #. Read the :ref:`Information for new developers` and other relevant tutorials as needed.
   #. Follow the guide to :ref:`Set up immuneML for development`.
   #. Create a new branch with a descriptive name from the master branch.
-  #. Implement your changes.
+  #. Implement your changes and add documentation.
   #. Make a pull request to the master branch.
 
 .. toctree::
   :maxdepth: 1
   :caption: Developer tutorials
 
+  developer_docs/info_new_developers.rst
   developer_docs/install_for_development.rst
-  developer_docs/how_to_add_new_ML_method.rst
   developer_docs/how_to_add_new_encoding.rst
+  developer_docs/how_to_add_new_ML_method.rst
   developer_docs/how_to_add_new_report.rst
   developer_docs/how_to_add_new_preprocessing.rst
 
 .. toctree::
   :maxdepth: 1
   :caption: Platform overview
 
-  developer_docs/platform_overview.rst
   developer_docs/data_model.rst
+  developer_docs/execution_flow.rst
 
 .. toctree::
   :maxdepth: 4

diff --git a/docs/source/developer_docs/caching.rst b/docs/source/developer_docs/caching.rst
@@ -0,0 +1,62 @@
+
+
+To prevent recomputing the same result a second time, immuneML uses caching.
+Caching can be applied to methods which compute an (intermediate) result.
+The result is stored to a file, and when the same method call is made, the previously
+stored result is retrieved from the file and returned.
+
+We recommend applying caching to methods which are computationally expensive and may be called
+multiple times in the same way. For example, encoders are a good target for caching as they
+may take a long time to compute and can be called multiple times on the same data when combined
+with different ML methods. But ML methods typically do not require caching, as you would
+want to apply ML methods with different parameters or to differently encoded data.
+
+
+Any method call in immuneML can be cached as follows:
+
+.. code:: python
+
+    result = CacheHandler.memo_by_params(params=cache_params,
+                                         fn=lambda: my_method_for_caching(my_method_param1, my_method_param2, ...))
+
+
+The :code:`CacheHandler.memo_by_params` method does the following:
+
+- Using the caching parameters, a unique cache key is created (a string derived deterministically from the parameters, so the same parameters always give the same key).
+- CacheHandler checks if there already exists a previously computed result that is associated with this key.
+- If the result exists, the result is returned without (re)computing the method.
+- If the result does not exist, the method is computed, its result is stored using the cache key, and the result is returned.
+
+
+The :code:`lambda` function call simply calls the method to be cached, using any required parameters.
+The :code:`cache_params` argument represents the unique, immutable parameters used to compute the cache key.
+It should have the following properties:
+
+- It must be a nested tuple containing *only* immutable items such as strings, booleans and integers.
+  It cannot contain mutable items like lists, dictionaries, sets and objects (they all need to be converted to nested tuples of immutable items).
+- It should include *every* factor that can contribute to a difference in the results of the computed method.
+  For example, when caching the encode_data step, the following should be included:
+
+  - dataset descriptors (dataset id, example ids, dataset type),
+  - encoding name,
+  - labels,
+  - :code:`EncoderParams.learn_model` if used,
+  - all relevant input parameters to the encoder. Preferably, retrieve these automatically (such as by :code:`vars(self)`),
+    as this ensures that any new parameters added to the encoder are always included in the caching params.
+
+For example, :py:obj:`~immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder` computes its
+caching parameters as follows:
+
+  .. code:: python
+
+    def _prepare_caching_params(self, dataset, params: EncoderParams):
+        return (("dataset_identifier", dataset.identifier),
+                ("example_identifiers", tuple(dataset.get_example_ids())),
+                ("dataset_type", dataset.__class__.__name__),
+                ("encoding", OneHotEncoder.__name__),
+                ("labels", tuple(params.label_config.get_labels_by_name())),
+                ("encoding_params", tuple(vars(self).items())))
+
+The construction of caching parameters must be done carefully, as caching bugs are extremely difficult
+to discover. It is better to include 'too much' information than too little.
+A missing parameter will not lead to an error, but can result in silently copying over
+results from previous method calls.
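Reviewer note: the memoisation steps listed in this file can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the immuneML implementation: `memo_by_params_sketch`, `freeze` and `CACHE_DIR` are hypothetical names, and the choice of a SHA-256 hash over `repr(params)` with pickle files is an assumption made for the example.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable

# Hypothetical cache location used only by this sketch.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="cache_sketch_"))


def freeze(obj: Any) -> Any:
    """Recursively convert dicts, lists and sets into nested tuples of immutable items."""
    if isinstance(obj, dict):
        items = sorted(obj.items(), key=lambda kv: str(kv[0]))
        return tuple((str(key), freeze(value)) for key, value in items)
    if isinstance(obj, (list, tuple)):
        return tuple(freeze(item) for item in obj)
    if isinstance(obj, (set, frozenset)):
        return tuple(sorted((freeze(item) for item in obj), key=repr))
    return obj


def memo_by_params_sketch(params: tuple, fn: Callable[[], Any]) -> Any:
    # 1. Derive a deterministic key from the caching parameters (assumed: SHA-256).
    key = hashlib.sha256(repr(params).encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.pickle"

    # 2./3. If a result is stored under this key, return it without recomputing.
    if cache_file.is_file():
        with cache_file.open("rb") as file:
            return pickle.load(file)

    # 4. Otherwise compute the result, store it under the key, and return it.
    result = fn()
    with cache_file.open("wb") as file:
        pickle.dump(result, file)
    return result


calls = []


def expensive(x: int) -> int:
    calls.append(x)  # record every real computation
    return x * 2


cache_params = freeze({"encoding": "SketchEncoder", "k": [1, 2]})
first = memo_by_params_sketch(cache_params, lambda: expensive(21))
second = memo_by_params_sketch(cache_params, lambda: expensive(21))
```

In this sketch the second call is served from the cache file, so `expensive` runs only once. It also shows the failure mode described above: if a parameter that influences the result is left out of `cache_params`, the stale result is returned silently.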
diff --git a/docs/source/developer_docs/class_documentation_standards.rst b/docs/source/developer_docs/class_documentation_standards.rst
@@ -0,0 +1,33 @@
+Class documentation should be added as a docstring to all new Encoder, MLMethod, Report or Preprocessing classes.
+The class docstrings are used to automatically generate the documentation web pages, using Sphinx `reStructuredText <https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html>`_, and should adhere to a standard format:
+
+
+#. A short, general description of the functionality
+
+#. Optional extended description, including any references or specific cases that should be considered. For instance: if a class can only be used for a particular dataset type. Compatibility between Encoders, MLMethods and Reports should also be described.
+
+#. A list of arguments, when applicable. This should follow the format below:
+
+   .. code::
+
+     **Specification arguments:**
+
+     - parameter_name (type): a short description
+
+     - other_parameter_name (type): a short description
+
+#. A YAML snippet, to show an example of how the new component should be called. Make sure to test your YAML snippet in an immuneML run to ensure it is specified correctly. The following formatting should be used to ensure the YAML snippet is rendered correctly:
+
+   .. code::
+
+      **YAML specification:**
+
+      .. indent with spaces
+      .. code-block:: yaml
+
+          definitions:
+              yaml_keyword: # could be encodings/ml_methods/reports/etc...
+                  my_new_class:
+                      MyNewClass:
+                          parameter_name: 0
+                          other_parameter_name: 1
diff --git a/docs/source/developer_docs/coding_conventions_and_tips.rst b/docs/source/developer_docs/coding_conventions_and_tips.rst
@@ -0,0 +1,10 @@
+.. note::
+  **Coding conventions and tips**
+
+  #. Class names are written in CamelCase
+  #. Class methods are written in snake_case
+  #. Abstract base classes :code:`MLMethod`, :code:`DatasetEncoder`, and :code:`Report` define an interface for their inheriting subclasses. These classes contain abstract methods which should be overridden.
+  #. Class methods starting with an underscore are generally considered "private" methods, only to be called by the class itself. If a method is expected to be called from another class, the method name should not start with an underscore.
+  #. When familiarising yourself with existing code, we recommend focusing on public methods. Private methods are typically specific to a class (internal class-specific calculations), whereas the public methods contain more general functionalities (e.g., returning a main result).
+  #. If your class should have any default parameters, they should be defined in a default parameters file under :code:`config/default_params/`.
+  #. Some utility classes are available in the :code:`util` package to provide useful functionalities. For example, :py:obj:`~immuneML.util.ParameterValidator.ParameterValidator` can be used to check user input and generate error messages, or :py:obj:`~immuneML.util.PathBuilder.PathBuilder` can be used to add and remove folders.
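Reviewer note: the conventions about abstract base classes and "private" methods can be illustrated with a small, self-contained sketch. The classes `ReportSketch` and `HelloReport` below are hypothetical stand-ins written only for illustration; they are not the immuneML `Report` class and do not reproduce its interface.

```python
from abc import ABC, abstractmethod


class ReportSketch(ABC):
    """Hypothetical stand-in for an abstract base class (not the immuneML Report class)."""

    def generate_report(self) -> str:
        # Public method: part of the interface that other classes may call.
        if not self.check_prerequisites():
            return "skipped"
        return self._generate()

    def check_prerequisites(self) -> bool:
        # Public hook with a default implementation; subclasses may override it.
        return True

    @abstractmethod
    def _generate(self) -> str:
        """Private, class-specific step that every subclass must override."""


class HelloReport(ReportSketch):
    def _generate(self) -> str:
        # Only called from within the class hierarchy, hence the leading underscore.
        return "hello"


report_text = HelloReport().generate_report()
```

Instantiating `ReportSketch` directly raises a `TypeError`, because the abstract method `_generate` has no implementation.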
diff --git a/docs/source/developer_docs/data_model.rst b/docs/source/developer_docs/data_model.rst
@@ -9,21 +9,67 @@ immuneML data model
    :twitter:title: immuneML dev docs: data model
    :twitter:image: https://docs.immuneml.uio.no/_images/data_model_architecture.png
 
+
+immuneML works with adaptive immune receptor sequencing data.
+Internally, the classes and data structures used to represent this data adhere to the `AIRR Rearrangement Schema <https://docs.airr-community.org/en/stable/datarep/rearrangements.html>`_,
+although it is possible to import data from a wider variety of common formats.
+
+Most immuneML analyses are based on the amino acid CDR3 junction.
+Some analyses also use the V and J gene name ('call') information.
+While importing of full-length (V + CDR3 + J) sequences is supported,
+there are no functionalities in immuneML designed for analysing sequences
+at that level.
+
+An immuneML dataset consists of a set of 'examples'. These examples are the
+
 immuneML data model supports three types of datasets that can be used for analyses:
 
-#. Repertoire dataset (:py:obj:`~immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset`) - one example in the dataset is one repertoire typically coming from one subject
-#. Receptor dataset (:py:obj:`~immuneML.data_model.dataset.ReceptorDataset.ReceptorDataset`) - one example is one receptor with both chains set
-#. Sequence dataset (:py:obj:`~immuneML.data_model.dataset.SequenceDataset.SequenceDataset`) - one example is one receptor sequence with single chain information.
+#. Repertoire dataset (:py:obj:`~immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset`) - each example in the dataset is a large set of AIR sequences which are typically derived from one subject (individual).
+#. Receptor dataset (:py:obj:`~immuneML.data_model.dataset.ReceptorDataset.ReceptorDataset`) - each example is one paired-chain receptor consisting of two AIR sequences (e.g., TCR alpha-beta, or IG heavy-light).
+#. Sequence dataset (:py:obj:`~immuneML.data_model.dataset.SequenceDataset.SequenceDataset`) - each example is one single AIR sequence chain.
+
+
+
+A single AIR rearrangement is represented by a :py:obj:`~immuneML.data_model.receptor.receptor_sequence.ReceptorSequence.ReceptorSequence` class.
+A Sequence dataset contains a set of such ReceptorSequence objects. A Receptor dataset contains a set of
+:py:obj:`~immuneML.data_model.receptor.receptor.Receptor.Receptor` objects, which contain two ReceptorSequences each.
+Relevant shared code for Sequence- and ReceptorDatasets can be found in the :py:obj:`~immuneML.data_model.dataset.ElementDataset.ElementDataset` class.
+A Repertoire dataset contains a set of :py:obj:`~immuneML.data_model.repertoire.Repertoire.Repertoire` objects, which
+each contain a set of ReceptorSequence objects.
+
+
+
+..
+  <note: figure not up to date, variable names have changed>
+
+  .. figure:: ../_static/images/dev_docs/data_model_architecture.png
+     :width: 70%
+
+     *UML diagram showing the immuneML data model, where white classes are abstract and define the interface only, while green are concrete and used throughout the codebase.*
+
+
+  The examples in an immuneML dataset can contain one or more labels, represented by the :py:obj:`~immuneML.environment.Label.Label` class.
+  The classes of such labels are what an ML method aims to learn to predict.
+
+
+
+..
+  A :py:obj:`~immuneML.data_model.dataset.SequenceDataset.SequenceDataset` contains a collection of labelled :py:obj:`~immuneML.data_model.receptor.receptor_sequence.ReceptorSequence.ReceptorSequence` objects.
+  A
+
+  A :py:obj:`~immuneML.data_model.receptor.receptor_sequence.ReceptorSequence.ReceptorSequence`
+
+      A Repertoire dataset consists of many Sequences. A receptor dataset consists of pairs of ReceptorSequences.
+
+
+
 
-Useful function in the dataset classes include getting the metadata information from the :py:obj:`~immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset`,
-using :py:obj:`~immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset.get_metadata` function, obtaining the number of examples in the
-dataset, checking possible labels or making subsets.
+..
+    The UML diagram showing these classes and the underlying dependencies is shown below.
+    The ReceptorSequence
 
-The UML diagram showing these classes and the underlying dependencies is shown below.
 
-.. figure:: ../_static/images/dev_docs/data_model_architecture.png
-  :width: 70%
+    A Repertoire dataset consists of many Sequences. A receptor dataset consists of pairs of ReceptorSequences.
 
-  UML diagram showing the immuneML data model, where white classes are abstract and define the interface only, while green are concrete and used throughout the codebase.
 
-Implementation details for :code:`ReceptorDataset` and :code:`SequenceDataset` are available in :py:obj:`~immuneML.data_model.dataset.ElementDataset.ElementDataset`.
+    Implementation details for :code:`ReceptorDataset` and :code:`SequenceDataset` are available in .
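Reviewer note: the relationship between the three dataset types described in this file can be sketched with plain dataclasses. The classes below (`SequenceSketch`, `ReceptorSketch`, `RepertoireSketch`) are hypothetical, simplified stand-ins, not the immuneML classes; the field names `junction_aa`, `v_call` and `j_call` follow the AIRR Rearrangement Schema, and all sequence values are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SequenceSketch:
    """One AIR sequence chain (stand-in for a ReceptorSequence)."""
    junction_aa: str
    v_call: str = ""
    j_call: str = ""


@dataclass
class ReceptorSketch:
    """One paired-chain receptor: exactly two sequences (stand-in for a Receptor)."""
    chain_1: SequenceSketch
    chain_2: SequenceSketch
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class RepertoireSketch:
    """A set of sequences, typically from one subject (stand-in for a Repertoire)."""
    sequences: List[SequenceSketch]
    labels: Dict[str, str] = field(default_factory=dict)


# Made-up example values, for illustration only.
alpha = SequenceSketch(junction_aa="CAAAAAF", v_call="TRAV1", j_call="TRAJ1")
beta = SequenceSketch(junction_aa="CASSSSF", v_call="TRBV1", j_call="TRBJ1")

# In each dataset type, one list item corresponds to one 'example'.
sequence_dataset = [alpha, beta]                    # 2 examples, one chain each
receptor_dataset = [ReceptorSketch(alpha, beta)]    # 1 example, two chains
repertoire_dataset = [RepertoireSketch([alpha, beta], {"disease": "positive"})]  # 1 example
```

The same two chains form two examples in a sequence dataset but a single example in a receptor or repertoire dataset, which is why labels attach at a different level in each case.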
diff --git a/docs/source/developer_docs/dev_docs_util.rst b/docs/source/developer_docs/dev_docs_util.rst
diff --git a/docs/source/developer_docs/encoded_data_object.rst b/docs/source/developer_docs/encoded_data_object.rst
@@ -0,0 +1,9 @@
+  **EncodedData:**
+
+  - :code:`examples`: a design matrix where the rows represent Repertoires, Receptors or Sequences ('examples'), and the columns the encoding-specific features. This is typically a numpy matrix, but may also be another matrix type (e.g., scipy sparse matrix, pytorch tensor, pandas dataframe).
+  - :code:`encoding`: a string denoting the encoder base class that was used.
+  - :code:`labels`: a dictionary of labels, where each label is a key, and the values are the label values across the examples (for example: {disease1: [positive, positive, negative]} if there are 3 repertoires). This parameter should be set only if :code:`EncoderParams.encode_labels` is True, otherwise it should be set to None. This can be created by calling the utility function :code:`EncoderHelper.encode_dataset_labels()`.
+  - :code:`example_ids`: a list of identifiers for the examples (Repertoires, Receptors or Sequences). This can be retrieved using :code:`Dataset.get_example_ids()`.
+  - :code:`feature_names`: a list of feature names, i.e., the names given to the encoding-specific features. When included, the list must be as long as the number of features.
+  - :code:`feature_annotations`: an optional pandas dataframe with additional information about the features. When included, the number of rows in this dataframe must correspond to the number of features. This parameter is not typically used.
+  - :code:`info`: an optional dictionary that may be used to store any additional information that is relevant (for example paths to additional output files). This parameter is not typically used.
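Reviewer note: the shape constraints listed for the encoded data fields can be made concrete with a small sketch. `EncodedDataSketch` below is a hypothetical stand-in, not the immuneML `EncodedData` class; it uses nested lists instead of a numpy matrix, omits `feature_annotations`, and the validation in `__post_init__` is an assumption added for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EncodedDataSketch:
    examples: List[List[float]]          # design matrix: rows = examples, columns = features
    encoding: str                        # name of the encoder that produced the matrix
    example_ids: List[str]               # one identifier per row
    labels: Optional[Dict[str, list]] = None    # label name -> one value per row
    feature_names: Optional[List[str]] = None   # one name per column
    info: Optional[dict] = None          # any additional information

    def __post_init__(self):
        n_examples = len(self.examples)
        n_features = len(self.examples[0]) if n_examples else 0

        if len(self.example_ids) != n_examples:
            raise ValueError("example_ids must have one entry per example")
        if self.feature_names is not None and len(self.feature_names) != n_features:
            raise ValueError("feature_names must have one entry per feature")
        if self.labels is not None:
            for label_name, values in self.labels.items():
                if len(values) != n_examples:
                    raise ValueError(f"label '{label_name}' must have one value per example")


# Three repertoires encoded with two features each.
encoded = EncodedDataSketch(
    examples=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    encoding="SketchEncoder",
    example_ids=["rep1", "rep2", "rep3"],
    labels={"disease1": ["positive", "positive", "negative"]},
    feature_names=["feature_1", "feature_2"],
)
```

Passing, for example, three feature names for a two-column matrix raises a `ValueError` in this sketch, which mirrors the requirement that the feature name list be as long as the number of features.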
diff --git a/docs/source/developer_docs/example_code/RandomDataPlot.py b/docs/source/developer_docs/example_code/RandomDataPlot.py
@@ -0,0 +1,89 @@
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+import plotly.express as px
+
+from immuneML.data_model.dataset.Dataset import Dataset
+from immuneML.reports.ReportOutput import ReportOutput
+from immuneML.reports.ReportResult import ReportResult
+from immuneML.reports.data_reports.DataReport import DataReport
+from immuneML.util.ParameterValidator import ParameterValidator
+from immuneML.util.PathBuilder import PathBuilder
+
+
+class RandomDataPlot(DataReport):
+    """
+    This RandomDataPlot is a placeholder for a real Report.
+    It plots some random numbers.
+
+    **Specification arguments:**
+
+    - n_points_to_plot (int): The number of random points to plot.
+
+
+    **YAML specification:**
+
+    .. indent with spaces
+    .. code-block:: yaml
+
+        definitions:
+            reports:
+                my_report:
+                    RandomDataPlot:
+                        n_points_to_plot: 10
+
+    """
+
+    @classmethod
+    def build_object(cls, **kwargs):
+        # Here you may check the values of given user parameters
+        # This will ensure immuneML will crash early (upon parsing the specification) if incorrect parameters are specified
+        ParameterValidator.assert_type_and_value(kwargs['n_points_to_plot'], int, RandomDataPlot.__name__, 'n_points_to_plot', min_inclusive=1)
+
+        return RandomDataPlot(**kwargs)
+
+    def __init__(self, dataset: Dataset = None, result_path: Path = None, number_of_processes: int = 1, name: str = None,
+                 n_points_to_plot: int = None):
+        super().__init__(dataset=dataset, result_path=result_path, number_of_processes=number_of_processes, name=name)
+        self.n_points_to_plot = n_points_to_plot
+
+    def check_prerequisites(self):
+        # Here you may check properties of the dataset (e.g. dataset type), or parameter-dataset compatibility
+        # and return False if the prerequisites are incorrect.
+        # This will generate a user-friendly error message and ensure immuneML does not crash, but instead skips the report.
+        # Note: parameters should be checked in 'build_object'
+        return True
+
+    def _generate(self) -> ReportResult:
+        PathBuilder.build(self.result_path)
+        df = self._get_random_data()
+
+        # utility function for writing a dataframe to a csv file
+        # and creating a ReportOutput object containing the reference
+        report_output_table = self._write_output_table(df, self.result_path / 'random_data.csv', name="Random data file")
+
+        # Calling _safe_plot will internally call _plot, but ensure immuneML does not crash if errors occur
+        report_output_fig = self._safe_plot(df=df)
+
+        # Ensure output is either None or a non-empty list of items (not an empty list or a list containing None)
+        output_tables = None if report_output_table is None else [report_output_table]
+        output_figures = None if report_output_fig is None else [report_output_fig]
+
+        return ReportResult(name=self.name,
+                            info="Some random numbers",
+                            output_tables=output_tables,
+                            output_figures=output_figures)
+
+    def _get_random_data(self):
+        return pd.DataFrame({"random_data_dim1": np.random.rand(self.n_points_to_plot),
+                             "random_data_dim2": np.random.rand(self.n_points_to_plot)})
+
+    def _plot(self, df: pd.DataFrame) -> ReportOutput:
+        figure = px.scatter(df, x="random_data_dim1", y="random_data_dim2", template="plotly_white")
+        figure.update_layout(template="plotly_white")
+
+        file_path = self.result_path / "random_data.html"
+        figure.write_html(str(file_path))
+        return ReportOutput(path=file_path, name="Random data plot")
+
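Reviewer note: the comments in `_generate` describe two patterns, calling the plot function in a way that does not crash the run, and wrapping each output as either `None` or a one-item list. The sketch below illustrates both in isolation. `safe_call` and `as_output_list` are hypothetical helpers written for this note; they are not the immuneML `_safe_plot` implementation, whose details are not shown in this diff.

```python
import logging
from typing import Any, Callable, List, Optional

logger = logging.getLogger("report_sketch")


def safe_call(fn: Callable[..., Any], *args, **kwargs) -> Optional[Any]:
    """Call fn and return its result, or log a warning and return None if it fails."""
    try:
        return fn(*args, **kwargs)
    except Exception as error:
        logger.warning("Plotting failed and was skipped: %s", error)
        return None


def as_output_list(output: Optional[Any]) -> Optional[List[Any]]:
    """Return None for a missing output, otherwise a one-item list."""
    return None if output is None else [output]


def working_plot() -> str:
    return "random_data.html"


def failing_plot() -> str:
    raise ValueError("no data to plot")


figures_ok = as_output_list(safe_call(working_plot))
figures_failed = as_output_list(safe_call(failing_plot))
```

With this arrangement a failing plot leaves the table output intact and the figure output as `None`, rather than producing an empty list or a list containing `None`.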