Index and other optimisations #922

shawnlaffan · 2024-02-24T05:28:37Z

A set of optimisations to several index and related calculations.

Summary of main changes:

- Hierarchical calculations are now supported. This allows cluster tree calculations to build on the child results rather than rebuilding everything for each node.
- Several endemism calculations now re-use results where the central and whole variants will be the same.
- Significance assessments are faster.

shawnlaffan added 24 commits February 23, 2024 11:28

Indices: optimise _calc_endemism_absolute

11a9bbe

Take advantage of the label hash global precalc, and use hash aliases instead of refs.

Minor optimisations in _calc_endemism_hier_part

501bd7b

TreeNode.pm: use a linear scan for get_hash_lists_below

f9bcf98

Might as well avoid any recursion overheads.
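As an illustration only, one way to collect list names from a subtree without recursion is to walk it with an explicit stack. The sketch below is in Python rather than the project's Perl, and the `Node` class, its fields and the list names are hypothetical stand-ins, not the TreeNode.pm API or the actual implementation in this commit.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a tree node; not Biodiverse's TreeNode API.
    name: str
    lists: dict = field(default_factory=dict)      # list name -> payload
    children: list = field(default_factory=list)   # child Node objects


def hash_list_names_below(root):
    """Collect the names of dict-valued lists on root and all its descendants."""
    seen = set()
    stack = [root]  # explicit stack replaces the call stack
    while stack:
        node = stack.pop()
        for list_name, payload in node.lists.items():
            if isinstance(payload, dict):  # keep only hash-like lists
                seen.add(list_name)
        stack.extend(node.children)
    return sorted(seen)


# Assumed example data: two tips and a root, each with some lists attached.
tip_a = Node("a", lists={"SPATIAL_RESULTS": {"PD": 1.0}})
tip_b = Node("b", lists={"SPATIAL_RESULTS": {"PD": 2.0}, "NOTES": ["not a hash"]})
root = Node("root", lists={"SIG_LIST": {"PD": 0.5}}, children=[tip_a, tip_b])
names = hash_list_names_below(root)
```

Each node is visited exactly once, so the cost is linear in the number of nodes and there is no per-call overhead from recursion.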

Add an array args version of set_basedata_ref

1e7deb8

And stop throwing errors when ref is undefined in get_basedata_ref.

Trees: clone_without_caches also clears parameters

4dec15f

This way we avoid cloning basedata refs, analysis args and the like.

Common::get_zscore_from_comp_results - avoid a lot of grepping

965c1ad

No need to find the index names when they are in the base_list_ref already. Also use refaliasing to avoid some derefs and declutter loop variables.

optimise Tree::convert_comparisons_to_significances

6ee1109

Passing in the base list allows fewer grep comparisons. This makes a large difference when there are many lists with many keys.

optimise Spatial::convert_comparisons_to_significances

80bc505

Passing in the base list allows fewer grep comparisons. This makes a large difference when there are many lists with many keys.

Indices: add a hierarchical mode flag

81f38b3

This allows future optimisations when calculating indices for cluster trees.

Indices: Support hierarchical calculations

1c6a509

This allows several indices to be optimised when calculated for cluster nodes, provided the calculations start from the tips. PE has been optimised in this commit.
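The general idea of a hierarchical calculation can be sketched as follows. This is a Python illustration under simplifying assumptions, not the PE code from this commit: the tree is a plain dict of node name to child names, the per-tip label counts are invented, and the measure is a simple additive count that merges cleanly across children. All function and variable names are hypothetical.

```python
from collections import Counter


def tips_below(tree, node):
    """Return every tip under node (the node itself if it has no children)."""
    stack, tips = [node], []
    while stack:
        current = stack.pop()
        children = tree.get(current, [])
        if children:
            stack.extend(children)
        else:
            tips.append(current)
    return tips


def counts_from_scratch(tree, tip_counts, node):
    """Rebuild a node's label counts by revisiting all tips below it."""
    total = Counter()
    for tip in tips_below(tree, node):
        total.update(tip_counts[tip])
    return total


def counts_hierarchical(tree, tip_counts, tips_first_order):
    """Process nodes tips first so each node only merges its direct children."""
    results = {}
    for node in tips_first_order:
        children = tree.get(node, [])
        if not children:
            results[node] = Counter(tip_counts[node])
            continue
        merged = Counter()
        for child in children:
            merged.update(results[child])  # reuse the child's stored result
        results[node] = merged
    return results


# Assumed example data: a three-tip cluster tree with label sample counts.
tree = {"root": ["n1", "c"], "n1": ["a", "b"]}
tip_counts = {"a": {"sp1": 2, "sp2": 1}, "b": {"sp2": 3}, "c": {"sp3": 4}}
order = ["a", "b", "c", "n1", "root"]  # children always precede parents
results = counts_hierarchical(tree, tip_counts, order)
```

The tips-first ordering is what makes the reuse safe: by the time a node is processed, the results for all of its children already exist.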

Indices: calc_labels_not_on_tree: return early if nothing to work with

5d6a8b4

Avoids a lot of hash creation and deletion with large datasets.

Indices: add a hierarchical variant of get_path_lengths_to_root_node

67f655e

Speeds up PD calcs for cluster trees.

Indices: _calc_endemism_hier_part: avoid some method calls

6d9c553

Use a treenode method that caches, rather than repeatedly calling methods to get the same answer.

Indices: _calc_endemism_hier_part: refactor some variables

8a69ac2

delete commented code

a37dbb3

Indices: refactor hierarchical node details

9b1846d

It is cleaner to pack the node and child names in their own structure. That also enables later additions without adding yet more top level arguments.

Squeeze a little more performance out of compare_lists_by_item

744c430

Maybe.

TreeNode::add_to_lists: optimise

7586528

Use direct assignment if starting with empty list.

Cluster spatial calcs: add lists by ref

0c34d12

Avoids a lot of copying.

compare_lists_by_item: lift a var outside the loop

4168546

This can be a _very_ hot loop so even small differences add up.
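A generic sketch of this kind of change, in Python for illustration: a lookup that does not change inside the inner loop is done once outside it. The function below is hypothetical and much simpler than compare_lists_by_item; it only counts how often a base value exceeds the matching comparison value.

```python
def count_exceedances(base, comparisons):
    """Count, per key, how many comparison lists the base value exceeds."""
    counts = dict.fromkeys(base, 0)
    for comp in comparisons:
        comp_get = comp.get  # looked up once per comparison list, not once per key
        for key, base_val in base.items():
            comp_val = comp_get(key)
            if comp_val is not None and base_val > comp_val:
                counts[key] += 1
    return counts


# Assumed example data: one base list and three comparison lists.
base = {"PD": 5.0, "PE": 1.0}
comparisons = [{"PD": 4.0, "PE": 2.0}, {"PD": 6.0}, {"PD": 1.0, "PE": 0.5}]
counts = count_exceedances(base, comparisons)
```

The saving per iteration is tiny, but it is multiplied by the number of keys and the number of comparison lists.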

Indices: cache the current results from each sub

b2ee3ed

These are cleared as we go to avoid leakage.

Indices: reuse whole and central endemism results when appropriate

e173a47

If the second neighbour set is empty then the whole and central variants return the same results, so short circuit in these cases.
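The short circuit can be illustrated with a simplified weighted endemism measure. This Python sketch assumes that the central variant scores only the labels in the first neighbour set, while the whole variant scores the labels in both sets, with local counts taken across both sets in each case. The function names and data are hypothetical and the formula is a simplification, not the Biodiverse implementation.

```python
def weighted_endemism(local_counts, global_ranges, labels):
    """Sum of local count divided by global range over the given labels."""
    return sum(local_counts[label] / global_ranges[label] for label in labels)


def endemism_central_and_whole(set1, set2, global_ranges):
    """Return (central, whole), reusing one result when set2 is empty."""
    if not set2:
        # Same labels and same counts in both variants, so calculate once.
        value = weighted_endemism(set1, global_ranges, set1)
        return value, value
    combined = dict(set1)
    for label, count in set2.items():
        combined[label] = combined.get(label, 0) + count
    central = weighted_endemism(combined, global_ranges, set1)
    whole = weighted_endemism(combined, global_ranges, combined)
    return central, whole


# Assumed example data: label counts per neighbour set and global ranges.
ranges = {"sp1": 2, "sp2": 4, "sp3": 1}
set1 = {"sp1": 1, "sp2": 1}
same = endemism_central_and_whole(set1, {}, ranges)
differ = endemism_central_and_whole(set1, {"sp2": 1, "sp3": 1}, ranges)
```

With an empty second set the merge step and the second calculation are skipped entirely, which is where the saving comes from.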

formatting

16a4f86

simplify by using List::Util::any

62fd8a1

shawnlaffan merged commit 3c590de into master Feb 24, 2024
8 checks passed

shawnlaffan deleted the indices_2024 branch February 24, 2024 05:53
