[GSoC] Parallelisation of AnalysisBase with multiprocessing and dask (#4162)

* add parallelisation to AnalysisBase
* fixes #4158

DETAILED COMMENTS FROM COMMITS

* Remove _scheduler attribute and make dask-based tests run properly
* Refactor scheduler usage
* Add multiple workers in dask for testing
* Refactor _setup_bslices and add processes to dask scheduler kwargs
* Create frame_indices and trajectory for each bslice during _setup_bslices
* Use explicit initialisation of timeseries with zeros
* Add non-trivial _parallel_conclude function
* Fix tests for new dask fixture
* Add type-matching _parallel_conclude
* Add fixtures to test combinations of dask and multiprocessing
* dask and multiprocessing work in test_atomicdistances.py
* Fix bug in results is np.ndarray codepath
* Add _setup_scheduler raising NotImplemented error in align.py::AverageStructure
* dask and multiprocessing schedulers to test_align.py
* dask scheduler for test_contacts.py and test for incompatibility with multiprocessing
* dask and multiprocessing scheduler for test_density.py
* Add _parallel_conclude implementation for dielectric
* dask and multiprocessing schedulers for test_dielectric.py
* dask and multiprocessing schedulers for test_diffusionmap.py
* Add NotImplementedError for parallel schedulers in dihedrals.py
* only current scheduler for test_dihedrals.py
* dask and multiprocessing tests for test_encore.py -- but some fail because of RMSF module
* Add NotImplementedError for _setup_scheduler in gnm.py
* Add NotImplementedError for _setup_scheduler in helix_analysis.py
* current process scheduler for test_helix_analysis.py
* dask and multiprocessing schedulers for test_hole2.py
* Add NotImplementedError for not-None schedulers
* current process scheduler and test for failing non-current ones in test_hydrogenbonds_analysis.py
* current process only scheduler and failing test for others in test_lineardensity.py
* Add NotImplementedError for non-current process schedulers
* current process scheduler only and failing tests for non-current ones in test_msd.py
* Add NotImplementedError for non-current process schedulers
* Fix scope of fixtures
* Add NotImplemented error for all non-current process schedulers
* only current process scheduler and failing tests for test_nucleicacids.py
* dask and multiprocessing schedulers for test_persistentlength.py
* Add _parallel_conclude implementation
* dask and multiprocessing schedulers for test_psa.py
* Add _parallel_conclude implementation for RDF and RDF_S
* dask and multiprocessing schedulers for test_rdf_s.py
* dask and multiprocessing schedulers for test_rdf.py
* Add NotImplementedError for RMSD and RMSF classes
* only local process scheduler and failing tests for others for test_rms.py
* current process scheduler only and failing test for others for test_wbridge.py
* Add NotImplementedError in _setup_scheduler
* Add more clear message during exception
* Add timeseries aggregation function
* dask and multiprocessing scheduler for most of the test_base.py testcases
* dask and multiprocessing schedulers for test_rms.py::TestRMSD
* Add NotImplementedError for pca and rms
* dask and multiprocessing schedulers for test_bat
* current process scheduler for test_gnm.py
* current process scheduler for test_pca.py
* Fix rmsf-related scheduler usage to only current process scheduler
* remove fixme marks
* Switch to enumerate in _compute main loop and fix code review comments
* Add dask to CI setup actions
* Remove local scheduler for progressbar test
* Add installation with dask as a setup option
* fix hole2 tests -- implement only current scheduler and add failing test
* fix progressbar test by changing order of ProgressBar and enumerate
* use only frame indices and frames in _setup_bslices after writing a blogpost
* Refactor _setup_bslices: move enumerate to numpy and fuse logic in defining type of input
* Add documentation to AnalysisBase._parallel_conclude()
* add functional-like interface draft
* Implement proper Client class, separating computations from AnalysisBase
* FINALLY implement working one-time dask cluster setup in kwargs of a client
* Correct tests accordingly
* Separately process case of only one remote worker
* Add available_schedulers to AverageStructure
* Use automatic fixture for AverageStructure
* Add fixture for AverageStructure
* Add fixture for AtomicDistances
* Change default available_backends to all implemented in Client
* Limit available backends for AverageStructure
* Add fixture for BAT
* Add fixture tests to Contacts
* Fix n_workers check and boolean frames handling
* Fix performance of backend="dask"
* Add available_backends for Contacts
* Remove _setup_scheduler
* Use client fixture for Contacts
* Use client fixture for RMSD/RMSF
* Revert files to their state in develop
* Delete files_for_undoing_changes.txt
* Delete conftest.py
* Delete parallel_analysis_demo.ipynb
* Clean up notebook
* Limit available schedulers in RMSF
* Split test in two due to failing with "expectation" parametrization
* Add fixture generator and fixtures for test_base and test_rms
* Add dask to pyproject.toml
* Return computation groups explicitly
* Fix dask position in setup-deps/action.yaml
* Add dask[distributed] to mdanalysis[parallel] installation
* Undo autoformatter
* Manually define available_backends for RMSD class
* Create separate "parallel" entry
* Add is_installed function to utils
* Add dict-based validation and computation logic for ParallelExecutor
* Add tests for ParallelExecutor
* Add documentation for "apply" method of ParallelExecutor
* Correct dask.distributed name
* Use chunksize=1 instead of explicit Pool in _compute_with_dask
* Remove unnecessary function in conftest
* Fix function to retrieve dask client if dask is not installed
* Fix base tests when dask is not installed
* Use new LocalCluster every time
* Fix client/backend logic
* Add documentation to a silly square function
* Switch to package-wise autouse fixture for dask.distributed.Client
* Add explicit result() when computing with cluster
* Fix codereview
* Replace list with tuple in available_backends for RMSD
* Remove unnecessary get_running_dask_client
* Implement fixture injection for subclasses testing
* Add warnings filters
* Fix backend check when client is present
* Return get_running_dask_client function
* Change dask fixture scope
* Close LocalCluster to avoid trillions of logs
* Implement ResultsGroup based aggregation instead of type matching
* Add non-default _get_aggregator() to RMS and Base classes
* Mark test_multiprocessing.py::test_creating_multiple_universe_without_offset as skipped
* Restore failing test
* Make aggregation functions static methods of ResultsGroup
* Remove test skip
* Move parallel part into a separate file
* Fix imports
* Proof of concept for duck-typed backends
* Remove unused code
* Replace ParallelExecutor with multiple backend classes and add duck-typing backend in AnalysisBase.run()
* Add all tests for analysis/parallel.py and fix bug in ResultsGroup.ndarray_mean
* Change typing to py3.9 compatible syntax
* Add _is_parallelizable to AnalysisFromFunction
* Remove dask[distributed] even as an optional dependency
* Update documentation
* Remove function to get running dask client
* Remove unused code from analysis/conftest.py
* Fix documentation and minor issues from codereview
* Update package/MDAnalysis/analysis/rms.py
  Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
* Add more backend validation tests and fix autoformatter issues
* Start implementing correct result sizes in separate computation groups
* Continue working: diffusionmap and PCA tests fail
* Fix bug in PCA trajectory iteration -- avoid explicit usage of self.start
* update changelog and tests for PCA fix
* Fix diffusionmap and pca
* Make sure not to reset self.{start,stop,step} during self._compute
* Change iteration pattern to sliced trajectory
* Change iteration pattern to sliced trajectory
* Update package/MDAnalysis/analysis/parallel.py
  Co-authored-by: Yuxuan Zhuang <yuzhuang@stanford.edu>
* Apply suggestions from code review
  Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
  Co-authored-by: Yuxuan Zhuang <yuzhuang@stanford.edu>
* Split _setup_frames into two separate functions
* Add docstrings for _prepare_sliced_trajectory and _define_run_frames
* Remove dask-distributed from dependencies
* Test only 2 processors with parallelizable backends
* Rename available_backends and safe
* Apply codereview changes
* Make tests for AnalysisBase subclasses explicit
* Exclude "multiprocessing" from analysis_class function available backends
* Split parallel.py into results.py and parallel.py
* Finalize separation of results and backends
* Rename parallel.py to backends.py
* Add results and backends to analysis/__init__.py
* Fix pep8 errors in docstrings and code
* Add versionadded to documentation
* Update sphinx documentation with backends and results
* Add parallelization reference to base.py
* Switch to relative imports
* Update documentation, adding introduced changes
* Update documentation adding parallelization support for rms
* Add module documentation to results and backends
* Fix BackendSerial validation and add its tests
* Fix calling of self._is_paralellizable()
* Add tests on is_parallelizable and get_supported_backends
* Fix bug with default progressbar_kwargs being dict
* Apply suggestions from code review
  Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
  Co-authored-by: Yuxuan Zhuang <yuzhuang@stanford.edu>
* Add docstrings to apply() in backends
* Add double n_worker check
* Apply suggestions from code review
  Co-authored-by: Paul Smith <paul.j.smith@ucl.ac.uk>
* Fix hasattr in double n_worker check
* Revert test `with expectation` in test_align
* Update testsuite/MDAnalysisTests/analysis/test_pca.py
  Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
* Update package/MDAnalysis/lib/util.py
  Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
* Update changelog
* Apply suggestions from code review
* Add parallelization section to the documentation
* Fix versionadded in new classes
* Finish parallelization section for documentation
* Fix typos
* Apply suggestions from code review
  Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
* Apply suggestions from code review
  Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
* Refactor ThreadsBackend example and add a warning
* Add n_workers instantiation from backend argument
* Update package/MDAnalysis/analysis/backends.py
  Co-authored-by: Yuxuan Zhuang <yuzhuang@stanford.edu>
* Update package/doc/sphinx/source/documentation_pages/analysis/parallelization.rst
  Co-authored-by: Yuxuan Zhuang <yuzhuang@stanford.edu>
* Add remark about RMSF parallelization
* Apply suggestions from codereview
* Apply suggestions from code review
* Fix documentation typo
* Update dask installation test after exception text changed
* edited documentation for parallelization
  - add reST/sphinx markup for methods and classes and ensure that (most of them) resolve; add intersphinx mapping to dask docs
  - added cross referencing between parallelization and backends docs
  - restructured analysis landing page with additional numbered headings for general use and parallelization
  - add citation for PMDA
  - fixed links
  - edited text for flow and readability
  - added SeeAlsos (eg for User Guide)
  - added notes/warnings
* analysis top level docs fixes
  - mark analysis docs as documenting MDAnalysis.analysis so that references resolve properly
  - link fixes
* Added comments regarding `_is_parallelizable` (and fixed documentation), fixed tests for `is_installed`
* Rename AnalysisBase.parallelizable and fix parallelizable transformations
* Remove explicit parallelizable=True in NoJump test call
* Apply suggestions from code review
* add explicit comment to AnalysisBase._analysis_algorithm_is_parallelizable
* Add client_RMSD explanation
* versioninformation markup fix in base.py
* Apply suggestions from code review
  Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
  Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
* Apply suggestions from code review
  Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
  Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
* Add comments explaining client_... fixtures
* Move class properties to the top of the class
* Undo accidental versionadded change
* Remove duplicating versionadded
* Add versionadded for backend
* Add link to github profile
* Update package/doc/sphinx/source/documentation_pages/analysis/parallelization.rst
  Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
* Update testsuite/MDAnalysisTests/analysis/test_backends.py
  Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
* minor text fixes
* Update package/MDAnalysis/analysis/base.py
  Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
* Update package/MDAnalysis/analysis/base.py
  Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
* Remove issubclass check

---------

Co-authored-by: Egor Marin <marin@phystech.edu>
Co-authored-by: Egor Marin <marin.phystech@gmail.com>
Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Co-authored-by: Yuxuan Zhuang <yuzhuang@stanford.edu>
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Co-authored-by: Paul Smith <paul.j.smith@ucl.ac.uk>
Co-authored-by: Yuxuan Zhuang <yuxuan.zhuang@dbb.su.se>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>