Hamming dist join #110
Looks cool! I just added a few comments, mostly about typos or organization of docs
#' @param n_bands The number of LSH bands used in hashing.
#'
#' @param band_width The number of hashes in each band.
If those have the same definition as in other functions, you can use `@inheritParams jaccard_left_join`, for example.
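A minimal sketch of that pattern, assuming a function whose arguments share names with those of `jaccard_left_join`. The name `hamming_join_doc_sketch` and the `threshold` argument are hypothetical and only serve the illustration:

```r
#' Fuzzy left join using Hamming distance (illustrative stub)
#'
#' Parameters that are not documented here but are documented in
#' `jaccard_left_join` (e.g. `n_bands`, `band_width`) are inherited from it.
#'
#' @inheritParams jaccard_left_join
#' @param threshold Hypothetical extra argument, documented locally because
#'   it is assumed not to exist in `jaccard_left_join`.
hamming_join_doc_sketch <- function(a, b, n_bands, band_width, threshold) {
  NULL # body omitted; only the documentation pattern matters here
}
```

roxygen2 only copies documentation for parameters that are missing locally, so any argument that behaves differently can still get its own `@param` entry.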
clean=clean)
}

#' Fuzzy anti-join using minihashing
You can use the same `@rdname` tag as in the other join functions to simplify the docs and gather those functions on a single page.
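As a rough sketch of what this could look like, with hypothetical function and topic names (`hamming_inner_join_sketch`, `hamming_anti_join_sketch`, `hamming_joins_sketch`) that are not taken from the PR:

```r
#' Fuzzy joins using Hamming distance (illustrative stubs)
#'
#' @param a,b Data frames to join (hypothetical arguments).
#' @rdname hamming_joins_sketch
hamming_inner_join_sketch <- function(a, b) NULL

# Sharing the same @rdname sends this function to the same help page,
# so the title and parameter docs above do not need to be repeated.
#' @rdname hamming_joins_sketch
hamming_anti_join_sketch <- function(a, b) NULL
```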
a_col <- gsub("[[:punct:] ]", "", dplyr::pull(a, by_a))
b_col <- gsub("[[:punct:] ]", "", dplyr::pull(b, by_b))
- a_col <- gsub("[[:punct:] ]", "", dplyr::pull(a, by_a))
- b_col <- gsub("[[:punct:] ]", "", dplyr::pull(b, by_b))
+ a_col <- tolower(gsub("[[:punct:] ]", "", dplyr::pull(a, by_a)))
+ b_col <- tolower(gsub("[[:punct:] ]", "", dplyr::pull(b, by_b)))
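To illustrate the effect of the suggested change, here is a small base-R sketch. The helper name `normalize_key_sketch` is hypothetical, and plain character vectors stand in for the columns pulled with `dplyr::pull`:

```r
# Strip punctuation and spaces, then lower-case, as in the suggestion above.
normalize_key_sketch <- function(x) {
  tolower(gsub("[[:punct:] ]", "", x))
}

a_col <- normalize_key_sketch(c("Hello, World!", "R-Package"))
b_col <- normalize_key_sketch(c("hello world", "r package"))

# Without tolower() these keys would differ only by case;
# with it, each pair normalizes to the same string.
identical(a_col, b_col)
```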
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
Note that you will need to run |
This PR adds a joining method based on the Hamming distance. Specifically, it adds the following functions:
It also adds the `hamming_probability` and `hamming_distance` functions to help users set the right parameters when tuning the LSH joining methods. I have tried my best to integrate the new changes (using `collapse::%!iin%` and early returns in the `hamming_join_core` function).
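For readers unfamiliar with the tuning problem, the sketch below shows the kind of quantities such helpers deal with. It is not the PR's implementation: the function names and signatures are hypothetical, and it assumes a bit-sampling LSH scheme in which each hash samples one character position, so that two strings of length `len` at Hamming distance `distance` agree on one hash with probability `1 - distance / len`.

```r
# Illustrative sketch only; these names are hypothetical, not the package API.

# Number of positions at which two equal-length strings differ.
hamming_dist_sketch <- function(x, y) {
  stopifnot(nchar(x) == nchar(y))
  sum(strsplit(x, "")[[1]] != strsplit(y, "")[[1]])
}

# Probability that a pair collides in at least one band, under the assumed
# model: a band matches only if all `band_width` hashes in it agree.
lsh_match_prob_sketch <- function(distance, len, n_bands, band_width) {
  p_hash <- 1 - distance / len      # one sampled position agrees
  p_band <- p_hash^band_width       # every hash in a band agrees
  1 - (1 - p_band)^n_bands          # at least one band agrees
}

d <- hamming_dist_sketch("karolin", "kathrin")
lsh_match_prob_sketch(d, len = 7, n_bands = 10, band_width = 2)
```

Under this model, increasing `n_bands` raises the chance that near matches are compared, while increasing `band_width` lowers the chance that distant pairs are compared.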