Owner

@bodowd bodowd commented Aug 17, 2025

This PR removes the count prefix at the front of the molecule representation (umbra_mol) used in duckdb_rdkit. The first 4 bytes of the prefix are inlined and currently hold counts of certain features of the molecule, which makes exact match faster. The remaining bytes of the prefix hold a fingerprint for substructure screens, called dalke_fp in the code.

The count prefix can be removed and the first 4 bytes of the dalke_fp inlined in its place. Those 4 bytes can then be compared without a pointer chase, which can make substructure searches faster. Currently, every substructure search requires a pointer chase.
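To illustrate the idea, here is a minimal sketch of a two-stage screen over an inlined fingerprint prefix. The names `MolHandle`, `MakeHandle`, `PassesScreen` and the 8-byte fingerprint width `kFpBytes` are hypothetical and chosen only for this example; they are not the actual duckdb_rdkit types or sizes. The sketch assumes the usual substructure-screen rule that every bit set in the query fingerprint must also be set in the target fingerprint.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed fingerprint width for this sketch; the real dalke_fp size may differ.
constexpr std::size_t kFpBytes = 8;

// Hypothetical handle: the first 4 fingerprint bytes are stored inline,
// while the full fingerprint (and the molecule data) live behind a pointer.
struct MolHandle {
    std::uint32_t fp_prefix;      // inlined copy of fingerprint bytes 0..3
    const std::uint8_t *payload;  // out-of-line: full fingerprint, then molecule bytes
};

// Build a handle by copying the first 4 fingerprint bytes into the inline prefix.
inline MolHandle MakeHandle(const std::uint8_t *payload) {
    assert(payload != nullptr);
    std::uint32_t prefix = 0;
    std::memcpy(&prefix, payload, sizeof(prefix));
    return MolHandle{prefix, payload};
}

// Returns false when the target definitely cannot contain the query.
// Returns true when the full (expensive) substructure match is still needed.
// touched_payload reports whether the pointer had to be followed.
inline bool PassesScreen(const MolHandle &query, const MolHandle &target,
                         bool *touched_payload = nullptr) {
    if (touched_payload) *touched_payload = false;

    // Stage 1: compare only the inlined prefixes, no pointer chase.
    if ((query.fp_prefix & target.fp_prefix) != query.fp_prefix) {
        return false;
    }

    // Stage 2: follow the pointers and screen the remaining fingerprint bytes.
    if (touched_payload) *touched_payload = true;
    for (std::size_t i = sizeof(std::uint32_t); i < kFpBytes; ++i) {
        if ((query.payload[i] & target.payload[i]) != query.payload[i]) {
            return false;
        }
    }
    return true;
}
```

In this sketch, a target rejected in stage 1 never has its payload dereferenced, which is where any speedup would come from. How much it helps depends on how often the first 4 bytes alone are enough to reject a candidate.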

The exact match function can still use the first 4 bytes of the dalke_fp as a quick reject: if those bits don't match, the two molecules cannot be the same. For exact match queries, however, it would be better to use the RDKit Registration hash. The Registration hash is more robust and flexible because it would allow stereo-insensitive and tautomer-insensitive searches.
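The quick-reject idea for exact match can be sketched as follows. `MolRecord` and `ExactMatch` are hypothetical names for this example, and the string comparison is only a stand-in for whatever full molecule comparison the extension actually performs. The only point shown is that differing prefixes let the function return before the expensive step.

```cpp
#include <cstdint>
#include <string>

// Hypothetical record: an inlined 4-byte fingerprint prefix plus some
// serialized form of the molecule (a stand-in for the real representation).
struct MolRecord {
    std::uint32_t fp_prefix;
    std::string serialized;
};

// Exact match with a prefix-based quick reject.
// ran_full_compare reports whether the expensive comparison was reached.
inline bool ExactMatch(const MolRecord &a, const MolRecord &b,
                       bool *ran_full_compare = nullptr) {
    if (ran_full_compare) *ran_full_compare = false;

    // Identical molecules produce identical fingerprints, so different
    // prefixes mean the molecules cannot be the same.
    if (a.fp_prefix != b.fp_prefix) {
        return false;
    }

    // Equal prefixes prove nothing on their own; the full comparison is
    // still required. String equality here is an assumption of the sketch.
    if (ran_full_compare) *ran_full_compare = true;
    return a.serialized == b.serialized;
}
```

Equal prefixes only mean "possibly equal", so the full comparison cannot be skipped in that case.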

Future directions could include investigating the RDKit Pattern fingerprint as a replacement for the dalke_fp in substructure screens, and implementing the Registration hash.

bodowd added 3 commits August 10, 2025 14:50
The count prefix is removed and the prefix to the umbra mol now contains
only the dalke fp for substructure screens. Previously, the dalke fp was
after the count prefix and could only be accessed after a pointer chase.
Now the first 4 bytes of the dalke fp are inlined and can be screened
without the pointer chase.
The bits that appear later in the dalke fp vector seem to be more
discriminating and could lead to more substructure filtering. Testing this out.
…bra mol"

This reverts commit 5ac1d3a.

Did some quick, simple tests, but so far this doesn't seem to make things faster.
@bodowd bodowd changed the title Remove count prefix Remove count prefix in the molecule representation (umbra_mol) Aug 17, 2025
@bodowd bodowd merged commit 953f427 into main Aug 17, 2025
4 of 5 checks passed
2 participants