Conversation
dimas-b
left a comment
There was a problem hiding this comment.
Nice new tools, @MonkeyCanCode 👍
Just one comment about matching logic 😅
| # Subsequence match: enabled for length > 2 | ||
| if query_len > 2: | ||
| iterator = iter(t) | ||
| if all(char in iterator for char in q): |
There was a problem hiding this comment.
This will match q: max, t: mixed bag of exceptions, right? Is that intended?
There was a problem hiding this comment.
Yes. Similar to fuzzy search, we don't know the total length. So users can reduce the search result by providing more characters. I put 3 characters minimal before fuzzy search to avoid user typed 'a' then it returns everything contains letter 'a'.
There was a problem hiding this comment.
I'm fine with this logic if it works for you :) just wanted to make sure the behaviour was intentional :)
There was a problem hiding this comment.
I am more than happy to take feedback on how to better handle this and if min of 4 characters is too verbose to trigger a fuzzy search. This requirement is from @flyrain , any thoughts on this route?
There was a problem hiding this comment.
TBH, I'm not sure if there are any (realistic) cases that will get a match by this rule, but not get a match by the SequenceMatcher (below) 🤔 Do you have any examples like that?
There was a problem hiding this comment.
This was added to avoid FP noise. For example, if we allow SequenceMatcher on any character lengths, a single letter a will match anything contains letter a.
Thus, what I thought was following:
- len 1: only exact or prefix match
- len 2: add substring match (q in t)
- len 3: add subsequence match
- len 4+: similarity ratio check via SequenceMatcher
When I was testing this earlier with setup setup, allow similarity search on len 3 is too noise. Thus, I added subsequence match here instead. But it is not really necessary if a bit noise output is acceptable.
There was a problem hiding this comment.
In my personal opinion, matching max to mixed bag of exceptions (the subsequence rule) is noise too 😅 TBH, I do not see "logic" behind this rule 😅
I'd use SequenceMatcher immediately if exact substring matches do not yield True, but use different thresholds depending on the query string size to reduce noise.
However, like I said, I do not feel strongly about this.
This is phase two of CLI: Add summarize subcommand, with great feedback from @flyrain and community from ML, this PR added the following support:
findcommand to locate identifier via fuzzy searchtablescommand to handle some basic Iceberg table operation (get/list/summarize/non-purge delete)Also, a newline is added per section for
summarizesub-commands introduced from phase one for easier readability.While working on this, I noticed our test suits for CLI is a bit messy. I will create a follow up PR to clean up those and add missing one (currently the missing test cases are been tracked via #4017)
Here are couple sample output:
Find command
Tables command
Setup instructions used for above
Checklist
CHANGELOG.md(if needed)site/content/in-dev/unreleased(if needed)