An alternative to #6: Gabe Becker's proposed 2-repo solution #10

Closed
wlandau opened this issue Mar 1, 2024 · 16 comments

Comments

@wlandau
Member

wlandau commented Mar 1, 2024

Suppose r-releases.r-universe.dev is a repo with all the releases, and there is a downstream universe with just the ones that pass R CMD check and revdep checks, just as @gmbecker originally proposed in r-universe-org/help#363. It should be simple to scrape the check results from https://github.com/r-universe/r-releases/actions, select a subset of https://github.com/r-releases/r-releases.r-universe.dev/blob/main/packages.json with non-broken packages, and then create a different universe downstream.

As part of that selection process, maybe we could impose version number etiquette too. Suppose we get the version numbers and their commit hashes when we scrape https://github.com/r-universe/r-releases/actions. (@jeroen, this may rely on the nice titles you give the jobs, such as r.releases.utils 0.0.5 and sys 3.3.) If we detect that the commit hashes are different but the latest version is not strictly greater than the previous one, then we can omit the package from the production repo.
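
For illustration, the version etiquette rule above could be sketched roughly as follows. This is only a sketch under assumptions: the record layout (`version` and `hash` keys) and the function names are hypothetical, and the tuple comparison is a simplified stand-in for R's own version comparison rules.

```python
import re


def parse_version(version: str) -> tuple[int, ...]:
    """Split a version such as '0.0.5' or '1.2-3' into integer components."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def keep_in_production(previous: dict, latest: dict) -> bool:
    """Return False when the commit changed without a strictly greater version.

    Both arguments are hypothetical records with 'version' and 'hash' keys.
    """
    if previous["hash"] == latest["hash"]:
        return True  # same commit, so there is no new release to judge
    return parse_version(latest["version"]) > parse_version(previous["version"])


previous = {"version": "0.0.5", "hash": "abc123"}
same_version_new_commit = {"version": "0.0.5", "hash": "def456"}
print(keep_in_production(previous, same_version_new_commit))  # False
```

A package that changes its code without bumping its version would be left out of the production repo until the version number increases.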

The advantages over #6 are:

  1. The ability to use install.packages() normally.
  2. Faster installation and no risk of hitting rate limits because there would be no GitHub API calls at installation time.

To me (2) is more important than (1).

The challenges relative to #6 are:

  1. Learning how to scrape https://github.com/r-universe/r-releases/actions.
  2. Figuring out where to put that downstream universe.

I was hoping to have all repos part of https://github.com/r-releases, but I think the creation of a new universe would mean the creation of a new special repo, e.g. https://github.com/r-prd/r-prd.r-universe.dev. I would be open to a better name than this.

@wlandau wlandau changed the title from "An alternative approach: Gabe Becker's proposed 2-repo solution" to "An alternative to #6: Gabe Becker's proposed 2-repo solution" Mar 1, 2024
@wlandau
Member Author

wlandau commented Mar 1, 2024

What would be a good GitHub owner name for this new downstream production-level R universe? r-prd? r-releases-prod? r-valid?

@shikokuchuo
Member

shikokuchuo commented Mar 1, 2024

I might be missing something, but whether a package is 'broken' or not depends on the cohort of packages the user actually has installed, doesn't it? If only 2 repos, then a package can only be 'broken' or not. There may be many valid dependency chains, with only one broken.

A -> B -> C
....... B -> D

Where A is upstream. An update to A causes B's tests to fail. It is put in the broken repo along with C and D.

However, in actuality C's tests all pass, only D's fail. This is as C and D use different subsets of functions from B. That means that A -> B -> C is a valid dependency chain that would be broken by this 2 repo arrangement.

Then just using a 'normal' install.packages() won't find any of B, C or D any more.

@wlandau
Member Author

wlandau commented Mar 1, 2024

Yeah, the whole revdep chain would need to go down too. It’s a little extra work up front, but then we could skip scraping those revdeps altogether. Not impossible for this way of doing things.

@wlandau
Member Author

wlandau commented Mar 1, 2024

As you say, maybe that’s heavy-handed. However, I don’t see a generic way to find out which subset of a package is failing, just from information in logs.

@shikokuchuo
Member

If the test suite is adequate, then a package only needs to pass its own tests, right? It doesn't need to know whether an upstream dependency passes all of its tests, or, even further removed, whether that package's 100 revdeps pass theirs.

So I'm quite in favour of the checks dashboard type thing, or a function that returns this. You only need to know for the package you are installing. Then on an ongoing basis, the checker function can come in handy.

It's the power of decentralisation. Let each individual community decide what it wants to use.

@shikokuchuo
Member

shikokuchuo commented Mar 1, 2024

Sorry, maybe this belongs in #6. Discussion continued at #6 (comment)

@wlandau
Member Author

wlandau commented Mar 2, 2024

From #6 (comment)

  1. After "Efficiently get the check results of a small list of packages" (r-universe-org/help#370), implementation can begin.
  2. After "Rerun a package's checks whenever a strong dependency updates" (r-universe-org/help#369), user-side package correctness/compatibility guarantees will exceed those of CRAN.

These points also support #10. With r-universe-org/help#369, it will only be necessary to scrape the existing check results (no need for revdep checks).

@wlandau
Member Author

wlandau commented Mar 8, 2024

For a downstream production-level repo, it would be ideal to leverage R-universe as much as possible. My only concern is that we may get a duplicated (and possibly conflicting) set of health checks.

@wlandau
Member Author

wlandau commented Mar 8, 2024

Actually, it could be important to pass health checks in both production and QA. So we would want to pull from both https://r-releases.r-universe.dev and "https://r-production.r-universe.dev" to decide whether to keep a package on "https://r-production.r-universe.dev".

@wlandau
Member Author

wlandau commented Mar 8, 2024

On second thought: to have the right user-side guarantees, I think we would need to remove reverse dependencies from "https://r-production.r-universe.dev" if something goes wrong with a package. If that is the case, then https://r-releases.r-universe.dev/ and "https://r-production.r-universe.dev" will have the exact same dependency graphs for every hosted package, which means that any test failure specific to "https://r-production.r-universe.dev" is random and probably a false positive.

So my current preference is to:

  1. If a package's checks fail in https://r-releases.r-universe.dev, remove both the package and all its strong reverse dependencies from "https://r-production.r-universe.dev".
  2. Ignore checks from "https://r-production.r-universe.dev" when deciding (1).
  3. In fact, consider suppressing R CMD check in "https://r-production.r-universe.dev" to avoid confusion and duplication.
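
As a rough sketch of step (1), removing a failing package together with all of its strong reverse dependencies amounts to a graph traversal. Everything here is an assumption for illustration: the function name, the shape of the `depends_on` mapping (package name to the set of packages it strongly depends on), and the idea that the failing set comes from the QA universe's check results.

```python
from collections import deque


def packages_to_remove(failing: set[str], depends_on: dict[str, set[str]]) -> set[str]:
    """Failing packages plus everything that strongly depends on them, transitively."""
    # Invert the graph: for each package, which packages depend on it directly?
    reverse: dict[str, set[str]] = {}
    for package, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(package)

    removed = set(failing)
    queue = deque(failing)
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in removed:
                removed.add(dependent)
                queue.append(dependent)
    return removed


# The A/B/C/D example from earlier in the thread: B depends on A, and C and D depend on B.
depends_on = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}}
print(sorted(packages_to_remove({"B"}, depends_on)))  # ['B', 'C', 'D']
```

With B failing, C and D leave production as well, even if their own checks pass, which is the conservative behavior described in (1).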

@shikokuchuo
Member

Yes, I think 3 is the logical conclusion, you'd be able to rely on the checks from R-releases.

@wlandau
Member Author

wlandau commented Mar 11, 2024

To recap recent discussions: we decided to put #6 on hold as we pursue #10. If the dual-repo option works well, then we will close #6 as "not planned".

@gmbecker

> I might be missing something, but whether a package is 'broken' or not depends on the cohort of packages the user actually has installed, doesn't it? If only 2 repos, then a package can only be 'broken' or not. There may be many valid dependency chains, with only one broken.
>
> A -> B -> C
> ....... B -> D
>
> Where A is upstream. An update to A causes B's tests to fail. It is put in the broken repo along with C and D.
>
> However, in actuality C's tests all pass, only D's fail. This is as C and D use different subsets of functions from B. That means that A -> B -> C is a valid dependency chain that would be broken by this 2 repo arrangement.
>
> Then just using a 'normal' install.packages() won't find any of B, C or D any more.

If B isn't passing its own tests, then B is broken, meaning it should only be offered in a "use at your own risk" capacity. That risk may sometimes be quite small, e.g., the notorious "1 test breaks on M1 Macs" case, but without an evolution of how tests are treated in R packages, similar to what @HenrikBengtsson brought up in the latest working group call, install.packages doesn't have the ability to differentiate or quantify risk.

Given then that there is some risk, my argument is that that risk should be opt-in rather than opt-out. Users can opt into that risk by adding the unsafe repo (or whatever we end up calling it if that is too pejorative) to their repos, either via option or via the argument to install.packages. If they did that, they would be able to get all of {A, B, C, D}.

I think making risk like this opt-out would be detrimental to end users, particularly novice ones, since the tooling is insufficient to even tell them that the risk exists, much less to help them assess it. Furthermore, it would be antithetical to the concept of production: while you might sometimes need to do this, in my experience it would need to be a manual intervention by the admin, and it may (reasonably) not be allowed at all in a validated context, regardless of how unbroken we might expect C's functionality to be.

The other thing to keep in mind is that just because someone does install.packages("C") does not mean that they won't also sometimes directly use functionality from B in their scripts, including parts of B that aren't the bits that C uses. B could still be broken for some of their intended purposes, even if C itself "works fine", which would mean that the repo is still serving the user a package that is broken for its intended purpose.

@shikokuchuo
Member

Thank you @gmbecker, we are taking all of these considerations into account. For these and other reasons, i.e. prior expectations for novice users using install.packages(), we are actually looking at your 2-repo proposal as a priority. The 'production' repo could then be the default as you describe above, with the choice of opting out to the wider 'community' or 'QA' repo or whatever you want to call it.

@wlandau
Member Author

wlandau commented May 21, 2024

We now have space to host the two repos:

| repo | QA | production |
|---|---|---|
| install.packages(repos = "...") | https://multiverse.r-multiverse.org | https://production.r-multiverse.org |
| packages.json | https://github.com/r-multiverse/multiverse | https://github.com/r-multiverse/production |
| R-universe | https://github.com/r-universe/r-multiverse | https://github.com/r-universe/r-production |

I am about to start working on:

  1. Migrating existing infrastructure to the new location for the QA universe.
  2. Building the production packages.json based on the results of automated checks.
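
A minimal sketch of step (2), assuming each packages.json entry is an object with `package` and `url` fields and that check results have already been collected into a mapping from package name to a status string. The status values, the example package names and URLs, and the function name are hypothetical, not the actual infrastructure.

```python
import json


def build_production(qa_entries: list[dict], check_status: dict[str, str]) -> list[dict]:
    """Keep only the QA entries whose checks are recorded as passing.

    Packages with no recorded status are excluded, to stay conservative.
    """
    return [
        entry for entry in qa_entries
        if check_status.get(entry["package"]) == "success"
    ]


# Hypothetical QA listing and check results.
qa_entries = [
    {"package": "pkgA", "url": "https://github.com/example/pkgA"},
    {"package": "pkgB", "url": "https://github.com/example/pkgB"},
    {"package": "pkgC", "url": "https://github.com/example/pkgC"},
]
check_status = {"pkgA": "success", "pkgB": "failure"}

production = build_production(qa_entries, check_status)
print(json.dumps(production, indent=2))
```

Here only pkgA would reach the production listing: pkgB failed its checks and pkgC has no recorded result.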

@wlandau
Member Author

wlandau commented Jun 21, 2024

The two-repo strategy is well underway, and given #57, I think we can close the thread above.

@wlandau wlandau closed this as completed Jun 21, 2024

3 participants