Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hamming dist join #110

Merged
merged 19 commits into from
Feb 14, 2024
Merged

Hamming dist join #110

merged 19 commits into from
Feb 14, 2024

Conversation

beniaminogreen
Copy link
Owner

This PR adds joining method for the hamming distance. Specifically, it adds the following functions:

  • hamming_inner_join
  • hamming_full_join
  • hamming_left_join
  • hamming_right_join
  • hamming_anti_join

It also adds the hamming_probability and hamming_distance functions to help users set the right parameters when tuning the LSH joining methods. I have tried my best to integrate the new changes (using collapse::%!iin% and early returns in the hamming_join_core function).

Copy link
Collaborator

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks cool! I just added a few comments, mostly about typos or organization of docs

Comment on lines +177 to +179
#' @param n_bands The number of LSH bands used in hashing.
#'
#' @param band_width The number of hashes in each band.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If those have the same definition as in other functions you can use @inheritParams jaccard_left_join for example

clean=clean)
}

#' Fuzzy anti-join using minihashing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the same @rdname tag as in the other join functions to simplify the docs and gather those functions them in a single page

Comment on lines +27 to +28
a_col <- gsub("[[:punct:] ]", "", dplyr::pull(a, by_a))
b_col <- gsub("[[:punct:] ]", "", dplyr::pull(b, by_b))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
a_col <- gsub("[[:punct:] ]", "", dplyr::pull(a, by_a))
b_col <- gsub("[[:punct:] ]", "", dplyr::pull(b, by_b))
a_col <- tolower(gsub("[[:punct:] ]", "", dplyr::pull(a, by_a)))
b_col <- tolower(gsub("[[:punct:] ]", "", dplyr::pull(b, by_b)))

beniaminogreen and others added 8 commits February 14, 2024 09:18
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
@etiennebacher
Copy link
Collaborator

Note that you will need to run devtools::document() locally to update all Rd files with those documentation changes

@beniaminogreen beniaminogreen merged commit aeb3eac into main Feb 14, 2024
7 checks passed
@etiennebacher etiennebacher deleted the hamming_dist_join branch February 14, 2024 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants