Merge pull request #171 from uio-bmi/docs_updates
Documentation updates and bugfixes
LonnekeScheffer authored Apr 26, 2024
2 parents ae3be16 + ba3604e commit c492ef7
Showing 258 changed files with 6,929 additions and 6,011 deletions.
4 changes: 4 additions & 0 deletions docs/source/_static/files/ml_methods_properties.csv
Original file line number Diff line number Diff line change
@@ -1,10 +1,14 @@
ML method,binary classification,multi-class classification,sequence dataset,receptor dataset,repertoire dataset,model selection CV
AtchleyKmerMILClassifier,✓,✗,✗,✗,✓,✗
BinaryFeatureClassifier,✓,✗,✓,✗,✗,✗
DeepRC,✓,✗,✗,✗,✓,✗
KNN,✓,✓,✓,✓,✓,✓
KerasSequenceCnn,✓,✗,✓,✗,✗,✗
LogisticRegression,✓,✓,✓,✓,✓,✓
PrecomputedKNN,✓,✓,✓,✓,✓,✓
ProbabalisticBinaryClassifier,✓,✗,✗,✗,✓,✗
RandomForestClassifier,✓,✓,✓,✓,✓,✓
ReceptorCNN,✓,✗,✗,✓,✗,✗
SVC,✓,✓,✓,✓,✓,✓
SVM,✓,✓,✓,✓,✓,✓
TCRdistClassifier,✓,✓,✓,✓,✓,✗
Binary file added docs/source/_static/images/ML_setting.png
Binary file removed docs/source/_static/images/analysis_paths.png
Binary file not shown.
Binary file modified docs/source/_static/images/definitions_instructions_overview.png
Binary file not shown.
Binary file added docs/source/_static/images/yaml_structure.png
3 changes: 2 additions & 1 deletion docs/source/conf.py
@@ -50,7 +50,8 @@
'sphinx.ext.doctest',
'sphinx.ext.mathjax',
'sphinx.ext.viewcode',
'sphinx_rtd_theme',
'sphinx_toolbox.collapse',
# 'sphinx_rtd_theme',
'sphinx.ext.napoleon',
'sphinx.ext.autosectionlabel',
'sphinx_sitemap'
8 changes: 5 additions & 3 deletions docs/source/developer_docs.rst
@@ -13,27 +13,29 @@ Developer documentation
To get started with adding features to immuneML, follow the steps described here. These steps assume you have some experience with object-oriented programming in Python
and some basic knowledge of how Git works.

#. Read the :ref:`Information for new developers` and other relevant tutorials as needed.
#. Follow the guide to :ref:`Set up immuneML for development`.
#. Create a new branch with a descriptive name from the master branch.
#. Implement your changes.
#. Implement your changes and add documentation.
#. Make a pull request to the master branch.

.. toctree::
:maxdepth: 1
:caption: Developer tutorials

developer_docs/info_new_developers.rst
developer_docs/install_for_development.rst
developer_docs/how_to_add_new_ML_method.rst
developer_docs/how_to_add_new_encoding.rst
developer_docs/how_to_add_new_ML_method.rst
developer_docs/how_to_add_new_report.rst
developer_docs/how_to_add_new_preprocessing.rst

.. toctree::
:maxdepth: 1
:caption: Platform overview

developer_docs/platform_overview.rst
developer_docs/data_model.rst
developer_docs/execution_flow.rst

.. toctree::
:maxdepth: 4
62 changes: 62 additions & 0 deletions docs/source/developer_docs/caching.rst
@@ -0,0 +1,62 @@


To prevent recomputing the same result a second time, immuneML uses caching.
Caching can be applied to methods which compute an (intermediate) result.
The result is stored to a file, and when the same method call is made, the previously
stored result is retrieved from the file and returned.

We recommend applying caching to methods which are computationally expensive and may be called
multiple times in the same way. For example, encoders are a good target for caching as they
may take long to compute and can be called multiple times on the same data when combined
with different ML methods. But ML methods typically do not require caching, as you would
want to apply ML methods with different parameters or to differently encoded data.


Any method call in immuneML can be cached as follows:

.. code:: python

    result = CacheHandler.memo_by_params(params=cache_params, fn=lambda: my_method_for_caching(my_method_param1, my_method_param2, ...))

The :code:`CacheHandler.memo_by_params` method does the following:

- Using the caching parameters, a unique cache key (a string derived deterministically from those parameters) is created.
- CacheHandler checks if there already exists a previously computed result that is associated with this key.
- If the result exists, the result is returned without (re)computing the method.
- If the result does not exist, the method is computed, its result is stored using the cache key, and the result is returned.
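
The four steps above can be sketched in a few lines of Python. This is a minimal illustration and not immuneML's actual :code:`CacheHandler`: the function name :code:`memo_by_params_sketch`, the use of SHA-256 to build the key, the pickle files used for storage and the :code:`cache_dir` argument are all assumptions made for this example.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def memo_by_params_sketch(params: tuple, fn, cache_dir: Path):
    """Hypothetical stand-in for CacheHandler.memo_by_params (illustration only)."""
    # Step 1: derive a deterministic key from the caching parameters
    key = hashlib.sha256(repr(params).encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.pickle"

    # Steps 2-3: if a result was stored under this key, return it without calling fn
    if cache_file.is_file():
        with cache_file.open("rb") as file:
            return pickle.load(file)

    # Step 4: otherwise compute the result, store it under the key, and return it
    result = fn()
    with cache_file.open("wb") as file:
        pickle.dump(result, file)
    return result


calls = []  # records how often the expensive function actually runs


def expensive_square(x):
    calls.append(x)
    return x * x


cache_dir = Path(tempfile.mkdtemp())
params = (("method", "expensive_square"), ("x", 12))
first = memo_by_params_sketch(params, lambda: expensive_square(12), cache_dir)
second = memo_by_params_sketch(params, lambda: expensive_square(12), cache_dir)
```

In this sketch the second call returns the stored value, so :code:`expensive_square` runs only once.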


The :code:`lambda` function call simply calls the method to be cached, using any required parameters.
The :code:`cache_params` represent the unique, immutable parameters used to compute the cache key.
It should have the following properties:

- It must be a nested tuple containing *only* immutable items such as strings, booleans and integers.
It cannot contain mutable items like lists, dictionaries, sets and objects (they all need to be converted to nested tuples of immutable items).
- It should include *every* factor that can contribute to a difference in the results of the computed method.
For example, when caching the encode_data step, the following should be included:

- dataset descriptors (dataset id, example ids, dataset type),
- encoding name,
- labels,
- :code:`EncoderParams.learn_model` if used,
- all relevant input parameters to the encoder. These are preferably retrieved automatically (such as by :code:`vars(self)`),
as this ensures that if new parameters are added to the encoder, they are always added to the caching params.
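
One possible way to satisfy the 'nested tuple of immutable items' requirement is a small recursive converter. The helper :code:`to_immutable` below is hypothetical (it is not part of immuneML), and the encoder settings are made-up values used only to show the conversion.

```python
def to_immutable(value):
    """Recursively convert lists, sets and dicts to nested tuples (hypothetical helper)."""
    if isinstance(value, dict):
        # sort by key so that insertion order does not change the cache key
        items = sorted(value.items(), key=lambda item: str(item[0]))
        return tuple((str(key), to_immutable(val)) for key, val in items)
    if isinstance(value, (set, frozenset)):
        # sets are unordered, so sort the converted items for a stable result
        return tuple(sorted((to_immutable(item) for item in value), key=repr))
    if isinstance(value, (list, tuple)):
        return tuple(to_immutable(item) for item in value)
    if isinstance(value, (str, bool, int, float)) or value is None:
        return value
    raise TypeError(f"Cannot use {type(value).__name__} in caching parameters")


# made-up encoder settings containing mutable containers
encoder_settings = {"k": 3, "gap_sizes": [0, 1], "alphabet": {"C", "A"}}
cache_params = (("encoding", "MyEncoder"),
                ("encoding_params", to_immutable(encoder_settings)))
```

Raising an error for unsupported types is a deliberate choice in this sketch: silently stringifying arbitrary objects could hide differences that should change the cache key.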

For example, :py:obj:`~immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder` computes its
caching parameters as follows:

.. code:: python

    def _prepare_caching_params(self, dataset, params: EncoderParams):
        return (("dataset_identifier", dataset.identifier),
                ("example_identifiers", tuple(dataset.get_example_ids())),
                ("dataset_type", dataset.__class__.__name__),
                ("encoding", OneHotEncoder.__name__),
                ("labels", tuple(params.label_config.get_labels_by_name())),
                ("encoding_params", tuple(vars(self).items())))

The construction of caching parameters must be done carefully, as caching bugs are extremely difficult
to discover. It is better to add 'too much' information than too little.
A missing parameter will not lead to an error, but can result in silently copying over
results from previous method calls.
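
The effect of a missing parameter can be reproduced with a toy in-memory cache. Everything here is illustrative: :code:`memo` and :code:`count_kmers` are hypothetical helpers, not immuneML functions, and the sequence is an arbitrary example string.

```python
cache = {}


def memo(params, fn):
    """Tiny in-memory cache keyed by the params tuple (illustration only)."""
    if params not in cache:
        cache[params] = fn()
    return cache[params]


def count_kmers(sequence, k):
    # number of overlapping k-mers in a sequence
    return len(sequence) - k + 1


sequence = "CASSLGTDTQYF"

# Incomplete key: k influences the result but is missing from the caching parameters,
# so the second call silently returns the result computed for k=3
bad_first = memo((("sequence", sequence),), lambda: count_kmers(sequence, k=3))
bad_second = memo((("sequence", sequence),), lambda: count_kmers(sequence, k=5))

# Complete key: every factor that changes the result is part of the key
good_first = memo((("sequence", sequence), ("k", 3)), lambda: count_kmers(sequence, k=3))
good_second = memo((("sequence", sequence), ("k", 5)), lambda: count_kmers(sequence, k=5))
```

No exception is raised in the incomplete case; the only symptom is that :code:`bad_second` equals :code:`bad_first`.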
33 changes: 33 additions & 0 deletions docs/source/developer_docs/class_documentation_standards.rst
@@ -0,0 +1,33 @@
Class documentation should be added as a docstring to all new Encoder, MLMethod, Report or Preprocessing classes.
The class docstrings are used to automatically generate the documentation web pages, using Sphinx `reStructuredText <https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html>`_, and should adhere to a standard format:


#. A short, general description of the functionality

#. Optional extended description, including any references or specific cases that should be considered. For instance: if a class can only be used for a particular dataset type. Compatibility between Encoders, MLMethods and Reports should also be described.

#. A list of arguments, when applicable. This should follow the format below:

    .. code::

        **Specification arguments:**

        - parameter_name (type): a short description

        - other_parameter_name (type): a short description

#. A YAML snippet, to show an example of how the new component should be called. Make sure to test your YAML snippet in an immuneML run to ensure it is specified correctly. The following formatting should be used to ensure the YAML snippet is rendered correctly:

    .. code::

        **YAML specification:**

        .. indent with spaces
        .. code-block:: yaml

            definitions:
                yaml_keyword: # could be encodings/ml_methods/reports/etc...
                    my_new_class:
                        MyNewClass:
                            parameter_name: 0
                            other_parameter_name: 1
10 changes: 10 additions & 0 deletions docs/source/developer_docs/coding_conventions_and_tips.rst
@@ -0,0 +1,10 @@
.. note::
**Coding conventions and tips**

#. Class names are written in CamelCase
#. Class methods are written in snake_case
#. Abstract base classes :code:`MLMethod`, :code:`DatasetEncoder`, and :code:`Report`, define an interface for their inheriting subclasses. These classes contain abstract methods which should be overridden.
#. Class methods starting with _underscore are generally considered "private" methods, only to be called by the class itself. If a method is expected to be called from another class, the method name should not start with an underscore.
#. When familiarising yourself with existing code, we recommend focusing on public methods. Private methods are typically very unique to a class (internal class-specific calculations), whereas the public methods contain more general functionalities (e.g., returning a main result).
#. If your class should have any default parameters, they should be defined in a default parameters file under :code:`config/default_params/`.
#. Some utility classes are available in the :code:`util` package to provide useful functionalities. For example, :py:obj:`~immuneML.util.ParameterValidator.ParameterValidator` can be used to check user input and generate error messages, or :py:obj:`~immuneML.util.PathBuilder.PathBuilder` can be used to add and remove folders.
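
The naming and inheritance conventions above can be illustrated with a small, self-contained sketch. The classes :code:`ShapeReport` and :code:`SquareReport` are hypothetical and unrelated to immuneML's real :code:`Report` hierarchy; they only show an abstract base class, a public method and a "private" method that subclasses override.

```python
from abc import ABC, abstractmethod


class ShapeReport(ABC):
    """Hypothetical abstract base class defining the interface for subclasses."""

    def generate(self) -> str:
        # public method (no leading underscore): may be called from other classes
        return f"{self.__class__.__name__}: {self._compute_area():.1f}"

    @abstractmethod
    def _compute_area(self) -> float:
        # "private", class-specific calculation that each subclass must override
        pass


class SquareReport(ShapeReport):
    """Concrete subclass: CamelCase class name, snake_case method names."""

    def __init__(self, side_length: float):
        self.side_length = side_length

    def _compute_area(self) -> float:
        return self.side_length ** 2


report = SquareReport(side_length=2.0)
summary = report.generate()
```

Because :code:`_compute_area` is declared abstract, Python refuses to instantiate :code:`ShapeReport` directly, which catches a forgotten override early.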
68 changes: 57 additions & 11 deletions docs/source/developer_docs/data_model.rst
@@ -9,21 +9,67 @@ immuneML data model
:twitter:title: immuneML dev docs: data model
:twitter:image: https://docs.immuneml.uio.no/_images/data_model_architecture.png


immuneML works with adaptive immune receptor sequencing data.
Internally, the classes and data structures used to represent this data adhere to the `AIRR Rearrangement Schema <https://docs.airr-community.org/en/stable/datarep/rearrangements.html>`_,
although it is possible to import data from a wider variety of common formats.

Most immuneML analyses are based on the amino acid CDR3 junction.
Some analyses also use the V and J gene name ('call') information.
While importing of full-length (V + CDR3 + J) sequences is supported,
there are no functionalities in immuneML designed for analysing sequences
at that level.

An immuneML dataset consists of a set of 'examples'. These examples are the

immuneML data model supports three types of datasets that can be used for analyses:

#. Repertoire dataset (:py:obj:`~immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset`) - one example in the dataset is one repertoire typically coming from one subject
#. Receptor dataset (:py:obj:`~immuneML.data_model.dataset.ReceptorDataset.ReceptorDataset`) - one example is one receptor with both chains set
#. Sequence dataset (:py:obj:`~immuneML.data_model.dataset.SequenceDataset.SequenceDataset`) - one example is one receptor sequence with single chain information.
#. Repertoire dataset (:py:obj:`~immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset`) - each example in the dataset is a large set of AIR sequences which are typically derived from one subject (individual).
#. Receptor dataset (:py:obj:`~immuneML.data_model.dataset.ReceptorDataset.ReceptorDataset`) - each example is one paired-chain receptor consisting of two AIR sequences (e.g., TCR alpha-beta, or IG heavy-light).
#. Sequence dataset (:py:obj:`~immuneML.data_model.dataset.SequenceDataset.SequenceDataset`) - each example is one single AIR sequence chain.



A single AIR rearrangement is represented by a :py:obj:`~immuneML.data_model.receptor.receptor_sequence.ReceptorSequence.ReceptorSequence` class.
A Sequence dataset contains a set of such ReceptorSequence objects. A Receptor dataset contains a set of
:py:obj:`~immuneML.data_model.receptor.receptor.Receptor.Receptor` objects, which contain two ReceptorSequences each.
Relevant shared code for Sequence- and ReceptorDatasets can be found in the :py:obj:`~immuneML.data_model.dataset.ElementDataset.ElementDataset` class.
A Repertoire dataset contains a set of :py:obj:`~immuneML.data_model.repertoire.receptor.Repertoire.Repertoire` objects, which
each contain a set of ReceptorSequence objects.
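
The containment relationships described above can be pictured with a simplified sketch. The dataclasses below are hypothetical stand-ins, not the real immuneML classes: their names, their fields and the example gene and sequence values are assumptions chosen for illustration, and plain Python lists stand in for the dataset classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SequenceSketch:
    """One AIR rearrangement (simplified stand-in for ReceptorSequence)."""
    sequence_aa: str
    v_call: Optional[str] = None
    j_call: Optional[str] = None


@dataclass
class ReceptorSketch:
    """One paired-chain receptor: exactly two sequences (stand-in for Receptor)."""
    chain_1: SequenceSketch
    chain_2: SequenceSketch


@dataclass
class RepertoireSketch:
    """Many sequences, typically from one subject (stand-in for Repertoire)."""
    sequences: List[SequenceSketch]
    metadata: Dict[str, str] = field(default_factory=dict)


# made-up example values
alpha = SequenceSketch("CAVRDSNYQLIW", v_call="TRAV3", j_call="TRAJ33")
beta = SequenceSketch("CASSLGTDTQYF", v_call="TRBV7-9", j_call="TRBJ2-3")

sequence_dataset = [alpha, beta]                  # each example: one single chain
receptor_dataset = [ReceptorSketch(alpha, beta)]  # each example: a pair of chains
repertoire_dataset = [RepertoireSketch([alpha, beta], {"subject_id": "subject_1"})]
```

The same two sequences yield two examples in the sequence dataset, but only one example in the receptor dataset and one in the repertoire dataset.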



..
<note: figure not up to date, variable names have changed>
.. figure:: ../_static/images/dev_docs/data_model_architecture.png
:width: 70%

*UML diagram showing the immuneML data model, where white classes are abstract and define the interface only, while green are concrete and used throughout the codebase.*


The examples in an immuneML dataset can contain one or more labels, represented by the :py:obj:`~immuneML.environment.Label.Label` class.
The classes of such labels are what an ML method aims to learn to predict.



..
A :py:obj:`~immuneML.data_model.dataset.SequenceDataset.SequenceDataset` contains a collection of labelled :py:obj:`~immuneML.data_model.receptor.receptor_sequence.ReceptorSequence.ReceptorSequence` objects.
A
A :py:obj:`~immuneML.data_model.receptor.receptor_sequence.ReceptorSequence.ReceptorSequence`

A Repertoire dataset consists of many Sequences. A receptor dataset consists of pairs of ReceptorSequences.




Useful functions in the dataset classes include getting the metadata information from the :py:obj:`~immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset`
using the :py:obj:`~immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset.get_metadata` function, obtaining the number of examples in the
dataset, checking possible labels, or making subsets.
..
The UML diagram showing these classes and the underlying dependencies is shown below.
The ReceptorSequence
The UML diagram showing these classes and the underlying dependencies is shown below.

.. figure:: ../_static/images/dev_docs/data_model_architecture.png
:width: 70%
A Repertoire dataset consists of many Sequences. A receptor dataset consists of pairs of ReceptorSequences.

UML diagram showing the immuneML data model, where white classes are abstract and define the interface only, while green are concrete and used throughout the codebase.

Implementation details for :code:`ReceptorDataset` and :code:`SequenceDataset` are available in :py:obj:`~immuneML.data_model.dataset.ElementDataset.ElementDataset`.
Implementation details for :code:`ReceptorDataset` and :code:`SequenceDataset` are available in .
5 changes: 0 additions & 5 deletions docs/source/developer_docs/dev_docs_util.rst

This file was deleted.

9 changes: 9 additions & 0 deletions docs/source/developer_docs/encoded_data_object.rst
@@ -0,0 +1,9 @@
**EncodedData:**

- :code:`examples`: a design matrix where the rows represent Repertoires, Receptors or Sequences ('examples'), and the columns the encoding-specific features. This is typically a numpy matrix, but may also be another matrix type (e.g., scipy sparse matrix, pytorch tensor, pandas dataframe).
- :code:`encoding`: a string denoting the encoder base class that was used.
- :code:`labels`: a dictionary of labels, where each label is a key, and the values are the label values across the examples (for example: {disease1: [positive, positive, negative]} if there are 3 repertoires). This parameter should be set only if :code:`EncoderParams.encode_labels` is True, otherwise it should be set to None. This can be created by calling utility function :code:`EncoderHelper.encode_dataset_labels()`.
- :code:`example_ids`: a list of identifiers for the examples (Repertoires, Receptors or Sequences). This can be retrieved using :code:`Dataset.get_example_ids()`.
- :code:`feature_names`: a list of feature names, i.e., the names given to the encoding-specific features. When included, list must be as long as the number of features.
- :code:`feature_annotations`: an optional pandas dataframe with additional information about the features. When included, number of rows in this dataframe must correspond to the number of features. This parameter is not typically used.
- :code:`info`: an optional dictionary that may be used to store any additional information that is relevant (for example paths to additional output files). This parameter is not typically used.
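
The size constraints between these fields can be made concrete with a short sketch. :code:`EncodedDataSketch` is a hypothetical stand-in rather than immuneML's :code:`EncodedData` class; a plain nested list is assumed in place of the numpy matrix, and the feature names and label values are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EncodedDataSketch:
    """Hypothetical container mirroring the fields listed above (illustration only)."""
    examples: List[List[float]]          # design matrix: one row per example
    encoding: str
    example_ids: List[str]
    feature_names: Optional[List[str]] = None
    labels: Optional[Dict[str, list]] = None

    def __post_init__(self):
        n_examples = len(self.examples)
        n_features = len(self.examples[0]) if n_examples > 0 else 0
        # one identifier per row of the design matrix
        assert len(self.example_ids) == n_examples
        # when included, one name per feature column
        if self.feature_names is not None:
            assert len(self.feature_names) == n_features
        # when included, one label value per example for every label
        if self.labels is not None:
            for values in self.labels.values():
                assert len(values) == n_examples


encoded = EncodedDataSketch(
    examples=[[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]],
    encoding="MyEncoder",
    example_ids=["rep1", "rep2", "rep3"],
    feature_names=["AAA", "AAC"],
    labels={"disease1": ["positive", "positive", "negative"]},
)
```

Checking the dimensions at construction time is a design choice of this sketch; it surfaces mismatched identifiers or labels before they reach an ML method.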
89 changes: 89 additions & 0 deletions docs/source/developer_docs/example_code/RandomDataPlot.py
@@ -0,0 +1,89 @@
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px

from immuneML.data_model.dataset.Dataset import Dataset
from immuneML.reports.ReportOutput import ReportOutput
from immuneML.reports.ReportResult import ReportResult
from immuneML.reports.data_reports.DataReport import DataReport
from immuneML.util.ParameterValidator import ParameterValidator
from immuneML.util.PathBuilder import PathBuilder


class RandomDataPlot(DataReport):
    """
    This RandomDataPlot is a placeholder for a real Report.
    It plots some random numbers.

    **Specification arguments:**

    - n_points_to_plot (int): The number of random points to plot.

    **YAML specification:**

    .. indent with spaces
    .. code-block:: yaml

        definitions:
            reports:
                my_report:
                    RandomDataPlot:
                        n_points_to_plot: 10
    """

    @classmethod
    def build_object(cls, **kwargs):
        # Here you may check the values of given user parameters
        # This will ensure immuneML will crash early (upon parsing the specification) if incorrect parameters are specified
        ParameterValidator.assert_type_and_value(kwargs['n_points_to_plot'], int, RandomDataPlot.__name__, 'n_points_to_plot', min_inclusive=1)

        return RandomDataPlot(**kwargs)

    def __init__(self, dataset: Dataset = None, result_path: Path = None, number_of_processes: int = 1, name: str = None,
                 n_points_to_plot: int = None):
        super().__init__(dataset=dataset, result_path=result_path, number_of_processes=number_of_processes, name=name)
        self.n_points_to_plot = n_points_to_plot

    def check_prerequisites(self):
        # Here you may check properties of the dataset (e.g. dataset type), or parameter-dataset compatibility
        # and return False if the prerequisites are incorrect.
        # This will generate a user-friendly error message and ensure immuneML does not crash, but instead skips the report.
        # Note: parameters should be checked in 'build_object'
        return True

    def _generate(self) -> ReportResult:
        PathBuilder.build(self.result_path)
        df = self._get_random_data()

        # utility function for writing a dataframe to a csv file
        # and creating a ReportOutput object containing the reference
        report_output_table = self._write_output_table(df, self.result_path / 'random_data.csv', name="Random data file")

        # Calling _safe_plot will internally call _plot, but ensure immuneML does not crash if errors occur
        report_output_fig = self._safe_plot(df=df)

        # Ensure output is either None or a list with item (not an empty list or list containing None)
        output_tables = None if report_output_table is None else [report_output_table]
        output_figures = None if report_output_fig is None else [report_output_fig]

        return ReportResult(name=self.name,
                            info="Some random numbers",
                            output_tables=output_tables,
                            output_figures=output_figures)

    def _get_random_data(self):
        return pd.DataFrame({"random_data_dim1": np.random.rand(self.n_points_to_plot),
                             "random_data_dim2": np.random.rand(self.n_points_to_plot)})

    def _plot(self, df: pd.DataFrame) -> ReportOutput:
        figure = px.scatter(df, x="random_data_dim1", y="random_data_dim2", template="plotly_white")
        figure.update_layout(template="plotly_white")

        file_path = self.result_path / "random_data.html"
        figure.write_html(str(file_path))
        return ReportOutput(path=file_path, name="Random data plot")
