-
Notifications
You must be signed in to change notification settings - Fork 1
Add 'recognized_source_url' data and 'has_recognized_source' metric #30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #30 +/- ##
==========================================
+ Coverage 18.54% 20.58% +2.04%
==========================================
Files 31 33 +2
Lines 1332 1404 +72
==========================================
+ Hits 247 289 +42
- Misses 1085 1115 +30 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
dgkf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great. I left a lot a few suggestions, but don't take that as a sign that this isn't phenomenal work.
The most significant suggested change is regarding the decision to return NA when no url is found. Instead, I think it would be more appropriate to return a 0-length vector. This will just help to avoid the need for any special NA-handling for anyone that wants to operate on this data.
Really appreciate the thorough descriptors, test cases and clean code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this to R/data_recognized_source.R to match the new name?
R/data_source_control.R
Outdated
| "The URL of the package's source control repository on a recognized", | ||
| "hosting platform, determined by matching against an allow-list of", | ||
| "known domains. Extracted from the URL and BugReports fields in the", | ||
| "DESCRIPTION file. The allow-list can be customized; see ?options for", | ||
| "details." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description field allows for Rd-style infotex syntax so that we can get rich formatting rendered to html/text. Most importantly, we can wrap things in \\code{...} when applicable:
| "The URL of the package's source control repository on a recognized", | |
| "hosting platform, determined by matching against an allow-list of", | |
| "known domains. Extracted from the URL and BugReports fields in the", | |
| "DESCRIPTION file. The allow-list can be customized; see ?options for", | |
| "details." | |
| "The \\acronym{URL} of the package's source control repository on a", | |
| "recognized hosting platform, determined by matching against an", | |
| "allow-list of known domains. Extracted from the \\acronym{URL} and ", | |
| "\\code{BugReports} fields in the \\code{DESCRIPTION} file. The ", | |
| "allow-list can be customized; see \\code{?options} for details." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was unaware of this feature, it's great that it exists! I updated the description in recent commits.
R/data_source_control.R
Outdated
| # Try to find a URL matching known source control domains | ||
| # We use case-insensitive matching for domains | ||
| for (url in all_urls) { | ||
| for (domain in source_control_domains) { | ||
| if (grepl(domain, url, ignore.case = TRUE)) { | ||
| return(url) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| # No known source control URL found | ||
| NA_character_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely a unique case, but it could be possible to have multiple source code urls (for example, if a package migrated from GitHub to codeberg, and is mirroring changes across both like I do for the options package).
In those situations, I think we might want to return all recognized domains.
Also, since our domains contain regex special characters (most importantly, the .), we need to either escape these characters or match on the fixed string (treating it as a raw string instead of a regex pattern).
| # Try to find a URL matching known source control domains | |
| # We use case-insensitive matching for domains | |
| for (url in all_urls) { | |
| for (domain in source_control_domains) { | |
| if (grepl(domain, url, ignore.case = TRUE)) { | |
| return(url) | |
| } | |
| } | |
| } | |
| # No known source control URL found | |
| NA_character_ | |
| is_src_url <- vapply( | |
| tolower(source_control_domains), | |
| grepl, | |
| logical(length(all_urls)), | |
| x = tolower(all_urls), | |
| fixed = TRUE | |
| ) | |
| source_control_domains[colSums(is_src_url) > 0] |
This uses a vectorized approach, but it's really not performance-critical. There are a couple quirks in here.
- Generally we'd use
grepl(fixed = TRUE), but this is incompatible withignore.case. Instead, we just convert both strings to lowercase during comparison. - I also opted to return a vector of length 0 instead of
NA_character_when no recognized url is found. This helps keep the function more type-stable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about it but in my original code I just decided to go with what seemed easier to implement. I agree, returning all URLs is reasonable, I implemented your suggestion in recent commit.
And sorry for missing the regexp wildcard match -- my first draft was done with stringi package but having seen that base R functions are used for strings manipulation throughout the package, I updated the code to use grepl. Quite sloppy, apparently.
R/data_source_control.R
Outdated
| "Indicates whether the package has a source code repository on a", | ||
| "recognized hosting platform from an allow-list of known domains.", | ||
| "Inferred from the URL and BugReports fields in the DESCRIPTION file.", | ||
| "See ?options for customizing the allow-list." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, here we can use Rd syntax
| "Indicates whether the package has a source code repository on a", | |
| "recognized hosting platform from an allow-list of known domains.", | |
| "Inferred from the URL and BugReports fields in the DESCRIPTION file.", | |
| "See ?options for customizing the allow-list." | |
| "Indicates whether the package has a source code repository on a", | |
| "recognized hosting platform from an allow-list of known domains.", | |
| "Inferred from the \\acronym{URL} and \\code{BugReports} fields in the", | |
| "\\code{DESCRIPTION} file. See \\code{?options} for customizing the", | |
| "allow-list." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
R/data_source_control.R
Outdated
| "See ?options for customizing the allow-list." | ||
| ), | ||
| function(pkg, resource, field, ...) { | ||
| !is.na(pkg$recognized_source_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you agree with the above changes, instead returning a 0-length vector when no source url is discovered, then this would change to
| !is.na(pkg$recognized_source_url) | |
| length(pkg$recognized_source_url) > 0L |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to reflect the other changes.
The data documentation uses Rd syntax now
Instead of returning a length 1 char vector which possibly can be NA, we are returning a character vector that can be of any length which are not NA. If the lenght of vector is 0, this indicates that no recognized source URL was found.
|
Thanks for your review! Your points are valid and I implemented changes you suggested. |
dgkf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making all the fixes. Looks great!
This implements metric described in #13 . As discussed, it uses allow-list approach. The list of repositories is, of course, up for the debate.
I decided to name it "has recognized source" to underline the fact that it is based on the known list, not universal.
I tried to follow other standards from the package, including using
{options}for getting and setting the allow list.