Add implementation for built-in jaccard similarity #70

jstammers · 2024-10-16T21:30:13Z

Adds the duckdb built-in jaccard similarity, which currently does not always return the same value as mismo.sets.jaccard.

NickCrews · 2024-10-17T06:16:39Z

I think this is worth considering, but watch out:

you thought the built-in duckdb function splits on whitespace and then does the set operations (i.e. each set element is a token). Equivalent in Python to len(set(a.split(" ")) & set(b.split(" "))) / len(set(a.split(" ")) | set(b.split(" ")))
but it actually treats each character as an element. Equivalent in Python to len(set(a) & set(b)) / len(set(a) | set(b))
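
To make the difference concrete, here is a small plain-Python sketch of the two interpretations. The helper names `jaccard_sets`, `by_word`, and `by_character` are made up for illustration and are not part of mismo or duckdb.

```python
def jaccard_sets(x: set, y: set) -> float:
    """Jaccard similarity of two non-empty sets: |x & y| / |x | y|."""
    return len(x & y) / len(x | y)


def by_word(a: str, b: str) -> float:
    # Each space-separated token is one set element.
    return jaccard_sets(set(a.split(" ")), set(b.split(" ")))


def by_character(a: str, b: str) -> float:
    # Each character (including the space) is one set element.
    return jaccard_sets(set(a), set(b))


print(by_word("foo bar", "foo"))       # 0.5
print(by_character("foo bar", "foo"))  # 0.3333333333333333
```

The same pair of strings scores 1/2 under one reading and 2/6 under the other, which is why the choice should not be left implicit.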

I think the fact that you and the writers of duckdb made two different assumptions shows that being implicit here can lead to mistakes. I could see users of mismo wanting either method in different situations. One option is to just not support it, and make users call sets.jaccard(<user splits strings as they desire>). But I think it's common enough that supporting the common modes might avoid user mistakes.

What about an API of something like

from typing import Literal

from ibis.expr import types as ir  # assumed: `ir` is ibis's expression types module

def jaccard(a: ir.StringValue, b: ir.StringValue, *, tokenize: Literal["by_character", "on_whitespace"]) -> ir.FloatingValue:
    ...

Then the user has to explicitly choose the method they want. It is required, so there is no hidden default, and keyword-only, so the choice is clear at the call site. I would love to workshop the names. Is there some other common method we're missing here? Of course, the user can still implement it themselves with StringValue.split() or StringValue.regex_split().
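
As a rough illustration of how a required keyword-only `tokenize` argument behaves at the call site, here is a plain-Python stand-in that works on `str` rather than ibis expressions. It is only a sketch of the proposed signature under that assumption, not mismo's implementation, and it does not handle empty or NULL inputs.

```python
from typing import Literal


def jaccard(
    a: str, b: str, *, tokenize: Literal["by_character", "on_whitespace"]
) -> float:
    """Plain-Python stand-in for the proposed API (str inputs, not ibis values)."""
    if tokenize == "by_character":
        x, y = set(a), set(b)
    elif tokenize == "on_whitespace":
        # str.split() with no argument splits on any run of whitespace.
        x, y = set(a.split()), set(b.split())
    else:
        raise ValueError(f"unknown tokenize mode: {tokenize!r}")
    return len(x & y) / len(x | y)


print(jaccard("foo bar", "foo", tokenize="on_whitespace"))  # 0.5
print(jaccard("foo bar", "foo", tokenize="by_character"))   # 0.3333333333333333

# Omitting the mode is an error rather than a silent default.
try:
    jaccard("foo bar", "foo")
except TypeError as exc:
    print("rejected:", exc)
```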

NickCrews · 2024-10-17T06:19:02Z

mismo/text/tests/test_similarity.py

+    [
+        ("foo", "foo", 1),
+        ("foo bar", "foo", 0.3333),  # this is currently failing
+        ("foo bar", "bar foo", 1),


Let's figure out the semantics first in the main comment thread, but eventually I will want to see tests for:

empty case

NULL case

case with repeated elements in one set, eg jaccard("foo foo bar", "foo baz") -> 1/3
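
A sketch of what those cases might look like, written against a hypothetical plain-Python reference `word_jaccard`. The NULL and empty conventions used here (return `None` for a NULL input, and `None` when both token sets are empty) are assumptions for illustration only; the thread has not settled them.

```python
from typing import Optional


def word_jaccard(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Hypothetical reference: whitespace-tokenized Jaccard on plain strings."""
    if a is None or b is None:
        return None  # assumed convention: NULL in, NULL out
    x, y = set(a.split()), set(b.split())
    if not x and not y:
        return None  # assumed convention: 0/0 is left undefined
    return len(x & y) / len(x | y)


assert word_jaccard("", "") is None        # empty case (both sides)
assert word_jaccard("", "foo") == 0.0      # empty case (one side)
assert word_jaccard(None, "foo") is None   # NULL case
# Repeated elements collapse: {"foo", "bar"} vs {"foo", "baz"} gives 1/3.
assert abs(word_jaccard("foo foo bar", "foo baz") - 1 / 3) < 1e-9
```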

jstammers · 2024-10-22T14:16:06Z

Thanks for spotting that @NickCrews. I hadn't considered the fact that duckdb considers each character as a separate element. From a quick search online, I've found multiple examples for both tokenizing by word and by character, so I think it makes sense to be explicit about which method to use when calculating the jaccard similarity of two strings of text.

jstammers · 2024-10-22T21:28:28Z

I've added some functionality to tokenize either by word or by character, along with some further unit tests. Perhaps something like similarity: Literal['word', 'character'] would be clearer. From what I've found, the standard definition of the Jaccard similarity is based on unique elements (i.e. a set as opposed to an array). Perhaps this should be made explicit in mismo.sets.jaccard.
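
The set-versus-array distinction matters as soon as a token repeats. The sketch below compares the set-based definition with one multiset (bag) variant that counts duplicates using `collections.Counter`. The multiset form is shown only for contrast, and both function names are illustrative rather than anything in mismo.

```python
from collections import Counter


def jaccard_set(a: list, b: list) -> float:
    # Standard definition: duplicates are dropped before comparing.
    x, y = set(a), set(b)
    return len(x & y) / len(x | y)


def jaccard_multiset(a: list, b: list) -> float:
    # Bag variant: Counter & keeps minimum counts, Counter | keeps maximum counts.
    x, y = Counter(a), Counter(b)
    return sum((x & y).values()) / sum((x | y).values())


a = "foo foo bar".split()
b = "foo baz".split()
print(jaccard_set(a, b))       # 0.3333333333333333 (1 / 3)
print(jaccard_multiset(a, b))  # 0.25 (1 / 4)
```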

add builtin jaccard string similarity

f0fc4c2

jstammers mentioned this pull request Oct 16, 2024

Add splink-like string similarity comparisons #71

Open

NickCrews reviewed Oct 17, 2024

jstammers added 3 commits October 22, 2024 15:17

Merge branch 'main' into update/jaccard-string

f7b90e1

feat: add word/character jaccard similarity

189b51d

feat: add test cases

d6a1b39

jstammers requested a review from NickCrews October 22, 2024 21:23

jstammers added 2 commits November 20, 2024 12:27

Merge branch 'main' into update/jaccard-string

52bcbbc

fix: correct failing unit tests

33dc844
