
Memoize assembly states to avoid redundant recursive search #95

Merged
jdaymude merged 22 commits into main from modular-dp on Jul 30, 2025
Conversation


@Garrett-Pz (Collaborator) commented Jul 21, 2025

Resolves #70.

Overview

  • Implements memoization of assembly states (i.e., sets of molecule fragments) for more efficient top-down recursive search. We explored several modes for memoization across two dimensions:
    • How are assembly states keyed? In Frags* modes, states are keyed by a sorted list of their fragments; in Canon* modes, states are keyed by the sorted list of their fragments' canonical labelings, allowing isomorphic assembly states to map to the same memoized values.
    • What assembly state values are stored? In *Index modes, a state's upper bound on the assembly index is stored (in the code, its state_index); in *Savings modes, the best possible savings from a given state is stored.
  • Implements the memoization cache as a dashmap::DashMap so it can be shared safely across parallel search threads.
  • Recursive calls now carry the order in which matches are removed to avoid a parallel correctness issue (see below).
  • Reworks integration test coverage to build up from independent modes to ones that interact, including memoization.
  • Adds benchmarks for memoization.
  • Exposes memoization functionality to the Python package and adds relevant Python test coverage.
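To make the first keying dimension concrete, here is a minimal sketch of the two key constructions described above. All names are illustrative, not the crate's actual API; the point is only that Frags* modes key a state by its sorted fragment list, while Canon* modes key it by the sorted list of canonical labelings, so isomorphic assembly states collide on one cache entry.

```rust
// Hypothetical sketch: fragments are represented here by u64 identifiers.

// Frags* modes: key a state by its sorted fragment list.
fn frags_key(fragments: &[u64]) -> Vec<u64> {
    let mut key = fragments.to_vec();
    key.sort_unstable();
    key
}

// Canon* modes: canonize each fragment first, so isomorphic fragments map
// to the same label and isomorphic states share a key.
fn canon_key(fragments: &[u64], canonize: impl Fn(u64) -> u64) -> Vec<u64> {
    let mut key: Vec<u64> = fragments.iter().map(|&f| canonize(f)).collect();
    key.sort_unstable();
    key
}
```
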

Design Decisions (Avoiding Bugs and Race Conditions)

*Savings modes were abandoned because they interact incorrectly with bounds. Suppose that the optimal assembly pathway requires visiting a state S with fragments F and index upper bound x. However, it is possible that earlier in the computation a state S' with fragments F is reached with index upper bound y > x. Suppose that the best possible savings obtainable starting from fragments F is s. Then it is possible that

  • y - s does not beat the best found assembly index
  • x - s does beat the best found assembly index.

Then when S' is reached, a bound check may report that the best index cannot be beaten from this state, so the state's children will not be evaluated. Thus, the savings stored in the cache will not be the true value s. Later, when state S is reached, the memoization cache returns this wrong result and may incorrectly halt computation along the optimal branch.
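Plugging in purely illustrative numbers makes the failure mode visible. Suppose the best index found so far is 10, the true savings from fragments F is s = 6, and the two states arrive with upper bounds y = 17 (S', first) and x = 15 (S, on the optimal pathway):

```rust
// Illustrative only: the bound check `upper_bound - savings < best`
// disagrees between S' (bound y) and S (bound x), even though both share
// fragments F and the same true savings s.
fn can_beat_best(upper_bound: u32, savings: u32, best: u32) -> bool {
    upper_bound.saturating_sub(savings) < best
}
```

Because S' is pruned before its children are explored, the savings it caches for F is stale; when S later hits that cache entry, the optimal branch is halted even though its own bound would have let it continue.
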

Parallelism can cause incorrect assembly index calculations when caching only states' index upper bounds. In a serial execution, the top-down recursive search explores the possible permutations of match removal in a fixed, non-redundant order. Parallelism can violate this order, which interacts incorrectly with memoization as follows.

Suppose that we have a molecular graph G with two subgraphs H1 and H2 such that:

  • H1 would be removed earlier in the serial match removal order than H2
  • G - H1 and G - H2 have the same cache key (again, w.r.t. the memoization mode in use)
  • The best assembly pathway removes both H1 and H2

Parallelism makes it possible that the search path that removes H2 first (and thus will never remove H1 by the removal ordering) is visited and memoized before the best assembly pathway removing H1 and later H2. Because G - H1 and G - H2 have the same cache key, the former's (non-optimal) value will be returned from the cache in place of computing the latter's optimal value, and the best assembly index will never be found.

We avoid this by tracking match removal orders across recursive calls and storing them in the cache alongside the index upper bounds, ensuring that assembly states reached via "earlier" match removal orders are allowed to continue regardless of what is cached.
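A minimal sketch of that guard, with assumed names rather than the crate's actual API: a cached bound is only trusted if it was computed under an equal-or-earlier removal order, and branches with earlier orders keep searching regardless of the cache. The real cache is a dashmap::DashMap for parallel access; a plain HashMap stands in here.

```rust
use std::collections::HashMap;

// Hypothetical cache entry: the memoized index upper bound plus the match
// removal order under which it was computed.
struct CacheEntry {
    index_bound: u32,
    removal_order: usize,
}

// Returns the cached bound only when it is safe to reuse; `None` means the
// current branch must recurse and (re)compute this state.
fn cached_bound(
    cache: &HashMap<Vec<u64>, CacheEntry>,
    key: &[u64],
    current_order: usize,
) -> Option<u32> {
    match cache.get(key) {
        Some(e) if e.removal_order <= current_order => Some(e.index_bound),
        _ => None, // cache miss, or an earlier-order branch: ignore the cache
    }
}
```
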

@Garrett-Pz Garrett-Pz marked this pull request as ready for review July 25, 2025 00:37
@jdaymude jdaymude changed the title Add Dynamic Programming Memoize assembly states to avoid redundant recursive search Jul 25, 2025
@DaymudeLab DaymudeLab deleted a comment from Garrett-Pz Jul 29, 2025
@jdaymude (Contributor) commented:

The results of cargo bench --bench benchmark -- bench_memoize, which benchmarks only the search phase using different memoization modes (including different canonization methods, in the case of MemoizeMode::CanonIndex) in combination with ParallelMode::DepthOne and bounds [Bound::Int, Bound::VecSimple, Bound::VecSmallFrags]:

bench_memoize/gdb13_1201/no-memoize                                                                           
                        time:   [21.862 ms 22.308 ms 22.929 ms]
bench_memoize/gdb13_1201/frags-index                                                                           
                        time:   [25.275 ms 25.486 ms 25.800 ms]
bench_memoize/gdb13_1201/nauty-index                                                                          
                        time:   [47.466 ms 47.854 ms 48.266 ms]
bench_memoize/gdb13_1201/tree-nauty-index                                                                           
                        time:   [44.675 ms 44.982 ms 45.314 ms]
bench_memoize/gdb17_200/no-memoize                                                                          
                        time:   [78.444 ms 79.589 ms 80.713 ms]
bench_memoize/gdb17_200/frags-index                                                                          
                        time:   [79.491 ms 81.278 ms 83.075 ms]
bench_memoize/gdb17_200/nauty-index                                                                          
                        time:   [114.87 ms 116.06 ms 117.28 ms]
bench_memoize/gdb17_200/tree-nauty-index                                                                          
                        time:   [105.92 ms 106.92 ms 107.94 ms]
bench_memoize/checks/no-memoize                                                                           
                        time:   [11.542 ms 11.712 ms 11.874 ms]
bench_memoize/checks/frags-index                                                                           
                        time:   [11.024 ms 11.234 ms 11.435 ms]
bench_memoize/checks/nauty-index                                                                          
                        time:   [21.752 ms 22.202 ms 22.674 ms]
bench_memoize/checks/tree-nauty-index                                                                          
                        time:   [15.739 ms 16.028 ms 16.348 ms]
bench_memoize/coconut_55/no-memoize                                                                          
                        time:   [306.34 ms 315.74 ms 323.90 ms]
bench_memoize/coconut_55/frags-index                                                                          
                        time:   [261.05 ms 268.63 ms 275.56 ms]
bench_memoize/coconut_55/nauty-index                                                                          
                        time:   [278.89 ms 282.33 ms 285.54 ms]
bench_memoize/coconut_55/tree-nauty-index                                                                          
                        time:   [238.27 ms 242.49 ms 246.43 ms]

Key takeaways:

  • For smaller molecules (gdb13_1201, gdb17_200, and checks), memoization of any kind makes things slower, though MemoizeMode::FragsIndex adds very little overhead.
  • For larger molecules (coconut_55), memoization of any kind speeds things up quite a bit, with a combination of MemoizeMode::CanonIndex and CanonizeMode::TreeNauty improving the most.
  • Faster canonization means faster MemoizeMode::CanonIndex; from the perspective of memoization, there is never a reason to use CanonizeMode::Nauty over CanonizeMode::TreeNauty.

@jdaymude jdaymude merged commit 7539cef into main Jul 30, 2025
11 checks passed
@jdaymude jdaymude deleted the modular-dp branch July 30, 2025 06:35
