Remove count prefix in the molecule representation (umbra_mol) #25
This PR removes the count prefix at the front of the molecule representation used in duckdb_rdkit. The first 4 bytes of the prefix are inlined and currently used to hold counts of certain features of the molecule to make exact match faster. The remaining bytes of the prefix hold a fingerprint for substructure screens, called the dalke_fp in the code.

The counts prefix can be removed and the first 4 bytes of the dalke_fp inlined instead. This can make substructure searches faster, because the first 4 bytes of the dalke_fp can then be compared without a pointer chase. Currently, every substructure search requires a pointer chase.
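The idea can be illustrated with a minimal Python sketch. It is not the duckdb_rdkit implementation: the names `PREFIX_LEN`, `bits_subset`, `may_contain`, and `load_target_fp` are hypothetical, fingerprints are modeled as plain `bytes`, and the callback stands in for the pointer chase to the out-of-line data. It relies only on the usual screening rule that a target can contain the query as a substructure only if every bit set in the query fingerprint is also set in the target fingerprint.

```python
from typing import Callable

# Assumption: the first 4 bytes of the fingerprint are stored inline.
PREFIX_LEN = 4


def bits_subset(query: bytes, target: bytes) -> bool:
    """Return True if every bit set in `query` is also set in `target`."""
    return all((q & t) == q for q, t in zip(query, target))


def may_contain(
    query_fp: bytes,
    target_prefix: bytes,
    load_target_fp: Callable[[], bytes],
) -> bool:
    """Screen a target molecule for a possible substructure match.

    `target_prefix` models the inlined bytes; `load_target_fp` models the
    pointer chase needed to read the full fingerprint.
    """
    # Step 1: test only the inlined prefix. No pointer chase is needed.
    if not bits_subset(query_fp[:PREFIX_LEN], target_prefix):
        return False
    # Step 2: the prefix passed, so fetch and test the full fingerprint.
    return bits_subset(query_fp, load_target_fp())
```

A target rejected in step 1 never triggers the load, which is where the expected speedup comes from. A target that passes both steps is only a candidate and still needs the full substructure match.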
The exact match function can still use the first 4 bytes of the dalke_fp: if those bits don't match, the two molecules cannot be the same. However, for the exact match query, it would be better to use the RDKit Registration hash. The Registration hash is more robust and powerful because it would allow stereo-insensitive and tautomer-insensitive searches.
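The early-reject argument for exact match can be sketched the same way. This is an illustration under assumptions, not the code in this PR: `exact_match` and `full_compare` are hypothetical names, and `full_compare` stands in for whatever complete comparison the extension performs once the prefixes agree.

```python
from typing import Callable


def exact_match(
    prefix_a: bytes,
    prefix_b: bytes,
    full_compare: Callable[[], bool],
) -> bool:
    """Exact-match check with an early reject on the inlined prefix bytes."""
    # Identical molecules produce identical fingerprints, so differing
    # prefixes are enough to rule out a match without reading more data.
    if prefix_a != prefix_b:
        return False
    # Equal prefixes are not proof of identity; run the full comparison.
    return full_compare()
```

Equal prefixes only mean the molecules might be identical, so the full comparison is still required in that case.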
Future directions could include investigating the RDKit Pattern fingerprint as a replacement for the dalke_fp in substructure screens, and implementing the Registration hash.