Memoize assembly states to avoid redundant recursive search #95
Merged
The results of Key takeaways:
jdaymude
approved these changes
Jul 30, 2025
Resolves #70.
Overview
- In `Frags*` modes, states are keyed by a sorted list of their fragments; in `Canon*` modes, states are keyed by the sorted list of their fragments' canonical labelings, allowing isomorphic assembly states to map to the same memoized values.
- In `*Index` modes, a state's upper bound on the assembly index is stored (in the code, its `state_index`); in `*Savings` modes, the best possible savings from a given state is stored.
- The cache is a `dashmap::DashMap` for parallel-awareness.

Design Decisions (Avoiding Bugs and Race Conditions)
`*Savings` modes were abandoned because they interact incorrectly with bounds. Suppose that the optimal assembly pathway requires visiting a state `S` with fragments `F` and index upper bound `x`. However, it is possible that earlier in the computation a state `S'` with the same fragments `F` is reached with index upper bound `y > x`. Suppose that the best possible savings obtainable starting from fragments `F` is `s`. Then it is possible that `y - s` does not beat the best found assembly index while `x - s` does. When `S'` is reached, a bound may report that the best index cannot be beaten from this state, so the state's children are never evaluated. Thus, the savings stored in the cache for `F` will not be the true value. Later, when state `S` is reached, the memoization cache returns the wrong result and may cause computation along this branch to be incorrectly halted.

Parallelism can cause incorrect assembly index calculation when considering only states' index upper bounds. In a serial execution, the top-down recursive search explores the possible permutations of match removal in a fixed, non-redundant order. This order can be violated by parallelism, which can interact incorrectly with memoization as follows.
Suppose that we have a molecular graph `G` with two subgraphs `H1` and `H2` such that:

- `H1` would be removed earlier in the serial match removal order than `H2`
- `G - H1` and `G - H2` have the same cache key (again, w.r.t. the memoization mode in use)
- the best assembly pathway removes both `H1` and `H2`

Parallelism makes it possible that the search path that removes `H2` first (and thus will never remove `H1` by the removal ordering) is visited and memoized before the best assembly pathway removing `H1` and later `H2`. Because `G - H1` and `G - H2` have the same cache key, the (non-optimal) value memoized for `G - H2` will be returned from the cache in place of computing the optimal value for `G - H1`, and the best assembly index will never be found.

We avoid this by tracking match `removal_order`s across recursive calls and adding these to the cache alongside the index upper bounds, ensuring that assembly states with "earlier" match removal orders are allowed to continue regardless of what is cached.
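
The removal-order check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `Key`, `Entry`, `Cache`, `can_prune`, and `example` are hypothetical names, a `Mutex<HashMap>` stands in for the concurrent map so the sketch needs only the standard library, and "earlier" removal order is assumed to mean lexicographically smaller.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical cache key, e.g. sorted fragments or sorted canonical labels.
type Key = Vec<u64>;

/// Hypothetical cached value: an index upper bound together with the match
/// removal order of the state that recorded it.
struct Entry {
    state_index: usize,
    removal_order: Vec<usize>,
}

/// Stand-in for the concurrent cache (assumption: std types only).
struct Cache {
    map: Mutex<HashMap<Key, Entry>>,
}

impl Cache {
    fn new() -> Self {
        Cache { map: Mutex::new(HashMap::new()) }
    }

    /// Returns true if the search may stop at this state.
    /// A state is pruned only when the cached entry has an index upper bound
    /// that is no worse AND a removal order that is not later than the
    /// current state's (slices compare lexicographically).
    fn can_prune(&self, key: &Key, state_index: usize, removal_order: &[usize]) -> bool {
        let mut map = self.map.lock().unwrap();
        if let Some(e) = map.get(key) {
            let cached_no_worse = e.state_index <= state_index;
            let cached_not_later = e.removal_order.as_slice() <= removal_order;
            if cached_no_worse && cached_not_later {
                return true;
            }
            // Overwrite only when the current state is at least as good on
            // both criteria; otherwise keep the entry and keep searching.
            let dominates =
                state_index <= e.state_index && removal_order <= e.removal_order.as_slice();
            if !dominates {
                return false;
            }
        }
        map.insert(
            key.clone(),
            Entry { state_index, removal_order: removal_order.to_vec() },
        );
        false
    }
}

/// Replays the interleaving from the text: `G - H2` is memoized before
/// `G - H1`, and both share a cache key and an index upper bound.
fn example() -> [bool; 4] {
    let cache = Cache::new();
    let key: Key = vec![3, 5, 8];
    [
        cache.can_prune(&key, 10, &[1]), // G - H2 arrives first: continue
        cache.can_prune(&key, 10, &[0]), // G - H1, earlier order: continue
        cache.can_prune(&key, 10, &[0]), // same state revisited: prune
        cache.can_prune(&key, 11, &[2]), // later order, worse bound: prune
    ]
}
```

Calling `example()` yields `[false, false, true, true]`: the state with the earlier removal order is allowed to continue even though a state with the same key and index upper bound is already cached. With a cache keyed on the index upper bound alone, the second call would have been pruned, which is the failure described above.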