Include RemoteSha in the PACKAGES field of each universe? #377

Closed
wlandau opened this issue Mar 5, 2024 · 11 comments


wlandau commented Mar 5, 2024

Motivation

@gmbecker mentioned how important it is for users to be able to trust the version numbers of packages. For R-releases, we will not impose any pre-release gatekeeping, but @shikokuchuo and I are working on a service that checks all the versions and hashes and reports which packages are not in compliance. We are having trouble building this service given what we currently know about R-universe. Cf. r-multiverse/help#21.

Implementation in R-releases

In r-multiverse/multiverse.internals#9 and r-multiverse/community#6, I wrote a service that runs once a day and gets the version and hash of every package in the universe. Every time the service runs, it keeps track of the highest version number ever released, as well as the hash of that release. We want it to flag a package for non-compliance if:

  1. The current version number is less than the highest version ever released, or
  2. The current and highest-ever versions agree, but their hashes disagree (i.e., the latest release has the highest version, but it was redeployed without changing the version number).

These non-compliant packages are written to a small file version_issues.json, which either Gabe's "safe" repo or "install_safe()" could leverage for choosing which packages are safe to install.
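
The two rules above can be sketched in base R as follows. This is only an illustration, not the code from the linked pull requests: the function name `flag_version_issues()`, the column names, and the toy data are all assumptions made for the example.

```r
# Hypothetical helper (not from multiverse.internals): flag packages that
# break either rule. `current` and `highest` are data frames with the
# columns `package`, `version`, and `hash`.
flag_version_issues <- function(current, highest) {
  merged <- merge(
    current,
    highest,
    by = "package",
    suffixes = c("_current", "_highest")
  )
  # package_version() gives proper version ordering (e.g. 1.10.0 > 1.9.0).
  version_current <- package_version(merged$version_current)
  version_highest <- package_version(merged$version_highest)
  # Rule 1: the current version is lower than the highest ever released.
  decreased <- version_current < version_highest
  # Rule 2: same version as the recorded highest, but a different hash.
  rehashed <- version_current == version_highest &
    merged$hash_current != merged$hash_highest
  merged$issue <- ifelse(
    decreased,
    "version decreased",
    ifelse(rehashed, "hash changed without version bump", NA_character_)
  )
  merged[!is.na(merged$issue), c("package", "issue")]
}

# Toy data, invented purely for illustration.
current <- data.frame(
  package = c("a", "b", "c"),
  version = c("1.0.0", "2.0.0", "0.9.0"),
  hash = c("aaa", "bbb", "ccc"),
  stringsAsFactors = FALSE
)
highest <- data.frame(
  package = c("a", "b", "c"),
  version = c("1.0.0", "2.0.0", "1.0.0"),
  hash = c("aaa", "zzz", "ddd"),
  stringsAsFactors = FALSE
)
issues <- flag_version_issues(current, highest)
print(issues)
```

In this toy data, package `b` is flagged because its hash changed at the same version, and package `c` is flagged because its version decreased. A result like `issues` could then be serialized to something like version_issues.json, for example with `jsonlite::write_json()`.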

Challenge

We are having trouble getting reliable hashes. utils::available.packages(repos = "https://r-releases.r-universe.dev", fields = "RemoteSha") is fast, but it returns NAs for RemoteSha. And as @shikokuchuo mentioned, MD5s are brittle because R-universe rebuilds the current version periodically with potentially different metadata.

The API for https://r-releases.r-universe.dev/api/packages/ returns information for multiple packages, but the payload is large, and not all packages may be returned. (https://cran.r-universe.dev/api/packages/ shows only a few hundred.) Hitting the API for each package individually is slow, and I am concerned it may overburden R-universe.

Proposal

Would it be possible to include the GitHub SHA as a RemoteSha field in the PACKAGES file of each universe, such as https://r-releases.r-universe.dev/src/contrib/PACKAGES? That way, unless I am missing something, available.packages() should work with r-multiverse/help#21, and it may even make the end product of #149 more trustworthy.

(I'm not sure whether https://r-releases.r-universe.dev/src/contrib would have to include that field too.)

Member

jeroen commented Mar 5, 2024 via email

@wlandau wlandau changed the title Include RemoteSha in the DESCRIPTION field of built packages? Include RemoteSha in the PACKAGES field of each universe? Mar 5, 2024
Member

jeroen commented Mar 5, 2024

So these are fields that are included in the individual DESCRIPTION: https://jeroen.r-universe.dev/jsonlite/DESCRIPTION

But only a few of them are included in the index (to save space): https://jeroen.r-universe.dev/src/contrib/PACKAGES

Author

wlandau commented Mar 5, 2024

I see. Would it be feasible to add RemoteSha to PACKAGES to support r-multiverse/help#21 and #149, or is a light PACKAGES file more of a priority for R-universe? In the latter case, what would you recommend for r-multiverse/help#21?

Member

jeroen commented Mar 5, 2024

Perhaps we can make it opt-in via a parameter. Does it really need to work with base R available.packages() or are you more flexible? We could also include it just for the JSONLD index only e.g.: https://jeroen.r-universe.dev/src/contrib/

So then instead of base available.packages() you would need to use e.g.

df <- jsonlite::stream_in(url("https://jeroen.r-universe.dev/src/contrib/"), verbose = FALSE)

Author

wlandau commented Mar 5, 2024

Does it really need to work with base R available.packages() or are you more flexible?

I am flexible. I am good with anything that pulls the package names, version numbers, and RemoteShas of all packages quickly.

We could also include it just for the JSONLD index only e.g.: https://jeroen.r-universe.dev/src/contrib/. So then instead of base available.packages() you would need to use e.g...

Perfect!

jeroen added a commit to r-universe-org/cranlike-server that referenced this issue Mar 5, 2024
Member

jeroen commented Mar 5, 2024

Not sure if it's a good idea to deploy from my flight but here is something you can test now:

https://jeroen.r-universe.dev/src/contrib/PACKAGES?fields=RemoteSha,RemoteUrl

https://jeroen.r-universe.dev/src/contrib/PACKAGES.json?fields=RemoteSha,RemoteUrl

So using this fields parameter you can request any additional fields (comma separated and case sensitive) from the DESCRIPTION files in the PACKAGES index.
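
As a rough sketch of how such URLs could be assembled programmatically, the helper below pastes the universe name and the requested fields together. The function name `packages_index_url()` and its arguments are made up for illustration and are not part of any R-universe client. Only the URL pattern comes from the two links above.

```r
# Hypothetical helper: build the URL of a universe's source PACKAGES index,
# optionally requesting extra DESCRIPTION fields through the `fields`
# query parameter (comma separated and case sensitive).
packages_index_url <- function(universe, fields = character(0), json = FALSE) {
  index <- if (json) "PACKAGES.json" else "PACKAGES"
  base <- sprintf("https://%s.r-universe.dev/src/contrib/%s", universe, index)
  if (length(fields) == 0L) {
    return(base)
  }
  paste0(base, "?fields=", paste(fields, collapse = ","))
}

url_dcf <- packages_index_url("jeroen", c("RemoteSha", "RemoteUrl"))
url_json <- packages_index_url("jeroen", c("RemoteSha", "RemoteUrl"), json = TRUE)
print(url_dcf)
print(url_json)
```

Because the fields are case sensitive, the names should be passed exactly as they appear in the DESCRIPTION file (`RemoteSha`, not `remotesha`).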

Author

wlandau commented Mar 5, 2024

Cool! Your query parameter idea looks like an elegant way to handle this, and it works for me in both cases:

system.time(
  packages_file <- utils::available.packages(
    contriburl = paste0(
      contrib.url("https://jeroen.r-universe.dev", type = "source"),
      "/PACKAGES?fields=RemoteSha,RemoteUrl"
    ),
    fields = "RemoteSha"
  )
)
#>    user  system elapsed 
#>   0.033   0.010   2.086
head(packages_file[, "RemoteSha"])
#>                                  RAppArmor 
#> "f437c1a926e7f5c225003738bca46584ee1a1f51" 
#>                                         V8 
#> "8adfc4c5ffc1f2da45206a53927d14046dfaa141" 
#>                                     badgen 
#> "57af6a1eab06369730a9ca520375ed6b78a0e5d6" 
#>                                     base64 
#> "0b8294d5d2ea1f1d1d069ef5ff681d90bdbc38ab" 
#>                                     bcrypt 
#> "49eb9da001cc6d3f118521d6e5221fb8909cfa6e" 
#>                                     brotli 
#> "00a9aa6a84cfcf2da6184a32a0ce7a7f1b9a8211"

system.time(
  json <- jsonlite::stream_in(
    url("https://jeroen.r-universe.dev/src/contrib/PACKAGES.json?fields=RemoteSha,RemoteUrl"),
    verbose = FALSE
  )
)
#>    user  system elapsed 
#>   0.036   0.003   0.858
head(json$RemoteSha)
#> [1] "f437c1a926e7f5c225003738bca46584ee1a1f51"
#> [2] "8adfc4c5ffc1f2da45206a53927d14046dfaa141"
#> [3] "57af6a1eab06369730a9ca520375ed6b78a0e5d6"
#> [4] "0b8294d5d2ea1f1d1d069ef5ff681d90bdbc38ab"
#> [5] "49eb9da001cc6d3f118521d6e5221fb8909cfa6e"
#> [6] "00a9aa6a84cfcf2da6184a32a0ce7a7f1b9a8211"

Created on 2024-03-05 with reprex v2.1.0

Author

wlandau commented Mar 5, 2024

I noticed the query also works in the R-releases universe: https://r-releases.r-universe.dev/src/contrib/PACKAGES?fields=RemoteSha,RemoteUrl. Okay if I use it in R-releases? Would you still rather I use the JSON route, or is PACKAGES/available.packages() okay too?

Member

jeroen commented Mar 5, 2024

Yes, go for it. I was only mentioning mine as an example. The API is the same for any universe, of course.

Member

jeroen commented Mar 6, 2024

Can I close this as solved?

Author

wlandau commented Mar 6, 2024

Certainly! Thank you for your help.
