Incorporating reproducibility metadata (package & dependency checksums) #34

@dgkf

Motivation

Given a repository of quality metrics and a package repository, how do we guarantee that the metrics would be reproducible given packages from that repository?

User Experience

From an end-user perspective (administrator, analyst, etc.), a user may want to include assertions about checksum consistency as part of their filter. Assuming CRAN is a moving target, packages that were permitted a week ago may be filtered out today because their checksums, or those of their dependencies, no longer match the ones recorded with the metrics. To ensure that metrics continue to reflect a repository, a snapshot of the repository would be needed.

A user may provide a filter such as:

options(available_packages_filters = risk_filter(...))

Here, assertions about checksum matches (for the package source code, hard dependencies, or soft dependencies) can be enforced and used as part of the filtering criteria.
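As a rough illustration of what such a filter could look like, here is a minimal sketch of a filter function that keeps only packages whose MD5sum matches a recorded value. It assumes the filter receives the character matrix that available.packages() passes to custom filters. The name checksum_filter, the expected vector, and the pbapply row are made up for the example; this is not the risk_filter() proposed above.

```r
# Hypothetical filter factory (not an existing API): returns a function that
# keeps only rows whose MD5sum equals the checksum recorded for that package.
checksum_filter <- function(expected) {
  # `expected`: named character vector, package name -> recorded MD5 checksum
  function(db) {
    pkgs <- db[, "Package"]
    keep <- !is.na(db[, "MD5sum"]) &
      pkgs %in% names(expected) &
      db[, "MD5sum"] == unname(expected[pkgs])
    db[keep, , drop = FALSE]
  }
}

# Illustrative stand-in for the matrix returned by available.packages();
# the pbapply version and checksum are placeholders, not real values.
db <- matrix(
  c("A3",      "1.0.0", "027ebdd8affce8f0effaecfcd5f5ade2",
    "pbapply", "0.0.0", "00000000000000000000000000000000"),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("A3", "pbapply"), c("Package", "Version", "MD5sum"))
)

# Checksums recorded when the metrics were derived (placeholders)
expected <- c(
  A3      = "027ebdd8affce8f0effaecfcd5f5ade2",
  pbapply = "0a1b2c3d4f5g6h"
)

filtered <- checksum_filter(expected)(db)
print(filtered[, "Package"])

# To apply it on top of R's default filters, something like:
# options(available_packages_filters = list(add = TRUE, checksum_filter(expected)))
```

Only A3 survives here, because the pbapply checksum in the repository no longer matches the recorded one.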

Repository structure

This will mean updating the PACKAGES format to include this metadata:

Package: A3
Version: 1.0.0
Depends: R (>= 2.15.0), xtable, pbapply
Suggests: randomForest, e1071
# ... additional stats ...
MD5sum: 027ebdd8affce8f0effaecfcd5f5ade2
MD5sumReqDeps: 0a1b2c3d4f5g6h
MD5sumAllDeps: 0a1b2c3d4f5g6h
# ... metrics ...

Alternatively, we can store a separate hash for each dependency used during evaluation:

Package: A3
Version: 1.0.0
Depends: R (>= 2.15.0), xtable, pbapply
Suggests: randomForest, e1071
MD5sum: 027ebdd8affce8f0effaecfcd5f5ade2
MD5sum/xtable: 0a1b2c3d4f5g6h
MD5sum/pbapply: 0a1b2c3d4f5g6h
MD5sum/randomForest: 0a1b2c3d4f5g6h
MD5sum/e1071: 0a1b2c3d4f5g6h

This would have the benefit of allowing us to ignore situations where Suggests dependencies are not available to an end user or were not available during evaluation. It is also a bit more interpretable, at the expense of file size.
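To make the per-dependency variant concrete, the sketch below reads such a record with read.dcf() and compares the recorded hashes against those observed on a user's system, skipping dependencies that are absent on either side. The MD5sum/<pkg> field naming comes from the example above. The helper names recorded_dep_checksums and checksum_mismatches, and the observed values, are assumptions made for illustration.

```r
# The record from the example above (checksums are placeholders)
packages_text <- c(
  "Package: A3",
  "Version: 1.0.0",
  "Depends: R (>= 2.15.0), xtable, pbapply",
  "Suggests: randomForest, e1071",
  "MD5sum: 027ebdd8affce8f0effaecfcd5f5ade2",
  "MD5sum/xtable: 0a1b2c3d4f5g6h",
  "MD5sum/pbapply: 0a1b2c3d4f5g6h",
  "MD5sum/randomForest: 0a1b2c3d4f5g6h",
  "MD5sum/e1071: 0a1b2c3d4f5g6h"
)
con <- textConnection(packages_text)
db <- read.dcf(con)
close(con)

# Hypothetical helper: pull the recorded dependency checksums for one package
recorded_dep_checksums <- function(db, pkg) {
  row <- db[match(pkg, db[, "Package"]), ]
  dep_cols <- grep("^MD5sum/", names(row), value = TRUE)
  dep_cols <- dep_cols[!is.na(row[dep_cols])]
  stats::setNames(row[dep_cols], sub("^MD5sum/", "", dep_cols))
}

# Hypothetical helper: names of dependencies present on both sides whose
# checksums differ; dependencies missing on either side are ignored
checksum_mismatches <- function(recorded, observed) {
  shared <- intersect(names(recorded), names(observed))
  shared[recorded[shared] != observed[shared]]
}

recorded <- recorded_dep_checksums(db, "A3")

# Checksums found on the user's system; Suggests packages are not installed
observed <- c(xtable = "0a1b2c3d4f5g6h", pbapply = "some-other-placeholder")

mismatched <- checksum_mismatches(recorded, observed)
print(mismatched)
```

With a single aggregate hash such as MD5sumAllDeps, the missing Suggests packages could not be told apart from a genuine mismatch, which is the trade-off described above.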

Implementation

I think the tools for deriving these checksums should live with the filtering tools, because the checksums will need to be re-derived for the package repository in order to apply a filter. That same function should then be used in these pipelines, so that identical checksums are derived during metric derivation.
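One possible shape for that shared function, sketched under assumptions: the aggregation scheme below (sort by package name, join as name=checksum lines, hash the result) is only one way to produce an aggregate value such as MD5sumReqDeps, and the helper names md5_of_string and combined_checksum are hypothetical. It relies only on tools::md5sum(), which hashes files, so the string is written to a temporary file first.

```r
# Hypothetical helper: MD5 of a string, via a temporary file because
# tools::md5sum() operates on file paths
md5_of_string <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(x), tmp)
  unname(tools::md5sum(tmp))
}

# Hypothetical helper: one aggregate checksum over a named vector of
# per-dependency checksums. Sorting by name (radix method, so the order does
# not depend on the locale) makes the result independent of input order.
combined_checksum <- function(checksums) {
  ord <- order(names(checksums), method = "radix")
  md5_of_string(
    paste0(names(checksums)[ord], "=", checksums[ord], collapse = "\n")
  )
}

# Placeholder per-dependency checksums, as in the examples above
req_deps <- c(xtable = "0a1b2c3d4f5g6h", pbapply = "0a1b2c3d4f5g6h")

agg <- combined_checksum(req_deps)
print(agg)
```

Because both the metric pipeline and the filter would call the same function, any change to the aggregation scheme would apply to both sides at once.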
