refactor: accept provenance data in artifact pipeline check #872

Open · wants to merge 8 commits into base: staging
Conversation

@behnazh-w (Member) commented Sep 27, 2024

Refactoring the artifact pipeline detection check

  • Renames mcn_infer_artifact_pipeline_1 to mcn_find_artifact_pipeline_1.
  • This check can support all the package registries now.
  • Modifies the check fact table schema by adding new columns and allowing some existing columns to be nullable. This change enables us to store the reasons for check failures, such as when a GitHub workflow run is deleted, which may result in some previous columns lacking values.
  • Improves the heuristics, e.g., if an artifact is published before the corresponding code is committed, there cannot be a CI pipeline that triggered the publishing.
  • This check depends on the deploy command identified by the mcn_build_as_code_1 check. If a deploy command is detected, this check will attempt to locate a successful CI pipeline that triggered the step containing the deploy command.
  • When a verifiable provenance is found for an artifact, we use it to obtain the pipeline trigger. Otherwise, we use heuristics to find the triggering pipeline.
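The publish-before-commit heuristic above can be sketched as follows (a minimal illustration with a hypothetical helper name, not the actual Macaron code):

```python
from datetime import datetime, timezone

def pipeline_publish_possible(commit_time: datetime, publish_time: datetime) -> bool:
    """No CI pipeline run for a commit can have triggered a publish that
    happened before the commit itself existed."""
    return publish_time >= commit_time

# An artifact published a day before its commit cannot have come from CI.
commit = datetime(2024, 9, 1, tzinfo=timezone.utc)
```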

Improvements to mcn_build_as_code_1

  • If a provenance is found, we obtain the workflow that has triggered the artifact release.
  • Add support for Reusable GitHub Actions that perform automatic deployment. Since we do not analyze the external Reusable GitHub Actions, we use an allow list of approved Actions.
  • A new function, infer_confidence_deploy_workflow, is added to BaseBuildTool to infer the confidence for such Reusable workflows.

The store_inferred_build_info_results function

  • Renamed store_inferred_provenance to store_inferred_build_info_results.
  • To avoid confusion, we avoid using the term inferred provenance here and instead simply store build-related information in the context object provided to checks.
  • CIInfo["provenances"] is also renamed to CIInfo["build_info_results"].

Provenance Extractor

  • New abstractions added to the provenance extractor to reuse the logic for extracting information such as ProvenanceBuildDefinition and ProvenancePredicate. With these new abstractions, we don't need to hardcode the expected buildType value while processing a provenance.

find_publish_timestamp

  • Added an API that can obtain the artifact timestamp for all the supported package registries.
  • By default we use deps.dev to obtain the timestamp except for Maven artifacts because we have observed that Maven Central has more accurate results.
  • Decoupled the Maven Central search API from the repository, making the hostname fully configurable to enable offline testing with a localhost server.

Tutorial and integration tests

  • Changed the Detecting a malicious Java dependency uploaded manually to Maven Central tutorial to Detecting Java dependencies manually uploaded to Maven Central.
  • Used the log4j-core artifact instead of guava, since guava has an automated deployment workflow.
  • Fixed the integration tests and added a new one for log4j-core.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Sep 27, 2024
@behnazh-w behnazh-w force-pushed the behnazh/refactor-infer-publish-check branch 5 times, most recently from ac6cbcd to 7eac146 Compare October 2, 2024 09:47
@behnazh-w behnazh-w force-pushed the behnazh/refactor-infer-publish-check branch from 1586789 to e592c5d Compare October 29, 2024 05:29
@behnazh-w behnazh-w marked this pull request as ready for review October 29, 2024 05:29
@behnazh-w behnazh-w requested a review from benmss October 29, 2024 05:30
Signed-off-by: behnazh-w <behnaz.hassanshahi@oracle.com>
@behnazh-w behnazh-w force-pushed the behnazh/refactor-infer-publish-check branch from e592c5d to c9761dc Compare November 6, 2024 06:03
actions/setup-java
# Parent project used in Maven-based projects of the Apache Logging Services.
apache/logging-parent/.github/workflows/build-reusable.yaml
# This action can be used to deploy artifacts to a JFrog artifactory server.
Should this entry belong to builder.maven.ci.deploy instead?

@@ -494,7 +503,7 @@ artifact_extensions =
# Package registries.
[package_registry]
# The allowed time range (in seconds) from a deploy workflow run start time to publish time.
publish_time_range = 3600
publish_time_range = 7200
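For illustration, a window check like the one this setting configures might look as follows (hypothetical helper, assuming the window is measured from the deploy workflow run start to the publish time):

```python
from datetime import datetime, timedelta, timezone

PUBLISH_TIME_RANGE = 7200  # seconds, mirroring the [package_registry] setting

def within_publish_window(run_start: datetime, publish_time: datetime) -> bool:
    """Attribute a publish to a deploy workflow run only if it happens
    within the configured window after the run started."""
    delta = publish_time - run_start
    return timedelta(0) <= delta <= timedelta(seconds=PUBLISH_TIME_RANGE)
```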
@tromai tromai Nov 10, 2024
I wonder why we decided to increase the publish time range? (e.g., in what case is a time range of 3600 not enough?)

Should we rename this file to reflect the new name of the check?

request to fetch metadata about the package, and extracts the publication timestamp
from the response.

Note: The method expects the response to include a ``version`` field with a ``publishedAt``
Does deps.dev have any reference about the format of this version field? It would be good to include that here if that page is available.

# implemented at the beginning of the analyze command to ensure that the data
# is available for subsequent processing.

base_url_parsed = urllib.parse.urlparse(registry_url or "https://api.deps.dev")
If the default value of registry_url is https://api.deps.dev, could we make the type of this parameter registry_url: str = "https://api.deps.dev"?

# is available for subsequent processing.

base_url_parsed = urllib.parse.urlparse(registry_url or "https://api.deps.dev")
path_params = "/".join(["v3alpha", "purl", encode(purl).replace("/", "%2F")])
The encode function here allows you to specify the set of safe characters that will not be encoded (see here). If we replace all / anyway, should we leverage this safe parameter?
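To illustrate the reviewer's point, urllib.parse.quote keeps "/" unescaped by default (it is in the default safe set), but encodes it when safe="" is passed:

```python
from urllib.parse import quote

purl = "pkg:maven/org.apache.logging.log4j/log4j-core@2.24.1"
# Default: "/" is in quote()'s safe set, so it survives encoding and a
# separate .replace("/", "%2F") step is needed afterwards.
default_encoded = quote(purl)
# With safe="", "/" is percent-encoded up front, removing the extra step.
strict_encoded = quote(purl, safe="")
```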

logger.debug("Found timestamp: %s.", timestamp)

try:
return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
If the timestamp value is already in ISO 8601 format, why do we need to perform extra modification on it before providing it to datetime.fromisoformat?


try:
return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
except (OverflowError, OSError) as error:
I wonder in what scenarios OverflowError and OSError happen.
As far as I know, datetime.fromisoformat raises:

  • TypeError if the input is not a string.
  • ValueError if the date is incorrect.
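The reviewer's point can be illustrated: the replace("Z", "+00:00") call matters on Python versions before 3.11, which reject a trailing "Z" in fromisoformat, and malformed input raises ValueError (not OverflowError or OSError):

```python
from datetime import datetime, timedelta

ts = "2024-11-06T06:03:00Z"
# Rewriting the "Z" suffix to an explicit UTC offset keeps parsing
# portable across Python versions (3.11+ accepts "Z" directly).
parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))

# Malformed input raises ValueError.
try:
    datetime.fromisoformat("not-a-timestamp")
    raised = False
except ValueError:
    raised = True
```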

InvalidHTTPResponseError
If the URL construction fails, the HTTP response is invalid, or if the response
cannot be parsed correctly, or if the expected timestamp is missing or invalid.
NotImplementedError
nit: Should we document this exception if it's not raised in this implementation? 🤔

@tromai tromai Nov 12, 2024

I have some comments on this file that we don't need to fix in this PR:

purl_object = PackageURL.from_string(purl)
except ValueError as error:
logger.debug("Could not parse PURL: %s", error)
query_params = [f"q=g:{purl_object.namespace}", f"a:{purl_object.name}", f"v:{purl_object.version}"]
purl_object.version has type str | None. If it's None, the last query param would be 'v:None'. I don't think this has the same behavior as the old implementation: there, if version is None, no extra query param is added. I wonder if this new behavior is intended?

Comment on lines 38 to +42
provenances: Sequence[DownloadedProvenanceData]
"""The provenances data."""

build_info_results: InTotoV01Payload
"""The build information results computed for a build step. We use the in-toto 0.1 as the spec."""
@tromai tromai Nov 12, 2024

It's mentioned in the PR description that CIInfo["provenances"] is also renamed to CIInfo["build_info_results"]. Are we planning to remove this provenances attribute here too?

Comment on lines +172 to +175
# or Reusable GitHub Action to be be a GitHubJobNode.
if not isinstance(job, GitHubJobNode):
continue

Suggested change
# or Reusable GitHub Action to be be a GitHubJobNode.
if not isinstance(job, GitHubJobNode):
continue
# or Reusable GitHub Action to be a GitHubJobNode.
if not isinstance(job, GitHubJobNode):
continue

Comment on lines +669 to +672
if build_type := json_extract(statement["predicate"], ["buildType"], str):
return build_type

return json_extract(statement["predicate"], ["buildDefinition", "buildType"], str)
It might be good to document why we have two different ways to extract the build type:

  • json_extract(statement["predicate"], ["buildType"], str)
  • json_extract(statement["predicate"], ["buildDefinition", "buildType"], str)

Comment on lines +697 to +711
build_type = ProvenancePredicate.get_build_type(statement)
build_defs: list[ProvenanceBuildDefinition] = [
SLSAGithubGenericBuildDefinitionV01(),
SLSAGithubActionsBuildDefinitionV1(),
SLSANPMCLIBuildDefinitionV2(),
SLSAGCBBuildDefinitionV1(),
SLSAOCIBuildDefinitionV1(),
WitnessGitLabBuildDefinitionV01(),
]

for build_def in build_defs:
if build_def.expected_build_type == build_type:
return build_def

raise ProvenanceError("Unable to find build definition in the provenance statement.")
A minor improvement could be to differentiate between the 2 cases in which we raise an exception:

  • There is no build definition, which happens when ProvenancePredicate.get_build_type(statement) returns None.
  • There is a build definition, but we don't support its value (the condition build_def.expected_build_type == build_type in the for loop is never met).

Right now we raise the same exception message for both cases.

@@ -355,3 +358,354 @@ def check_if_repository_purl_and_url_match(url: str, repo_purl: PackageURL) -> b
purl_path = f"{repo_purl.namespace}/{purl_path}"
# Note that the urllib method includes the "/" before path while the PURL method does not.
return f"{parsed_url.hostname}{parsed_url.path}".lower() == f"{expanded_purl_type or repo_purl.type}/{purl_path}"


class ProvenanceBuildDefinition(ABC):
@tromai tromai Nov 12, 2024
I wonder if these or some of these abstractions should be put within the package src/macaron/slsa_analyzer/provenance as they are closely related to the provenance format 🤔 ?
I think only find_build_def and get_build_type should remain here as a static function.

Comment on lines +132 to +133
except NotImplementedError:
continue
@tromai tromai Nov 13, 2024
I think catching NotImplementedError here strongly suggests that find_publish_timestamp in subtypes of PackageRegistry is expected to raise NotImplementedError. I believe this is not the case, because only JfrogMavenRegistry will raise it. A better way to communicate this would be to explicitly skip running find_publish_timestamp if registry_info is of type JfrogMavenRegistry.

        for registry_info in ctx.dynamic_data["package_registries"]:
            if isinstance(registry_info.package_registry, JfrogMavenRegistry):
                # We currently don't support this.
                continue 
            if registry_info.build_tool.purl_type == ctx.component.type:
                try:
                    artifact_published_date = registry_info.package_registry.find_publish_timestamp(ctx.component.purl)
                    break
                except InvalidHTTPResponseError as error:
                    logger.debug(error)

What do you think?

This is because I think NotImplementedError should only be raised when we mistakenly call a method that is not implemented (similar to an unexpected critical error).

Comment on lines +415 to +416
This method is intended to be implemented by subclasses to extract
specific invocation details from a provenance statement.
Should we remove these 2 lines, as this is a concrete method? 🤔
The same comment applies to SLSAGithubActionsBuildDefinitionV1.get_build_invocation.

tuple[str | None, str | None]
A tuple containing two elements:
- The first element is the build invocation entry point (e.g., workflow name), or None if not found.
- The second element is the invocation URL or identifier (e.g., job URL), or None if not found.
@tromai tromai Nov 13, 2024
I'm not clear about the difference between the invocation URL and the identifier. Does it mean that if the second element of the returned tuple is not None, it could be a URL or an arbitrary string serving as the "identifier" of the workflow run?

# TODO: change this check if this issue is resolved:
# https://github.com/orgs/community/discussions/138249
if datetime.now(timezone.utc) - timedelta(days=400) > timestamp:
logger.debug("Artifact published at %s is older than 410 days.", timestamp)
Suggested change
logger.debug("Artifact published at %s is older than 410 days.", timestamp)
logger.debug("Artifact published at %s is older than 400 days.", timestamp)
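The age guard around this log line can be sketched as follows (hypothetical helper name; 400 days per the condition in the diff, pending the linked GitHub discussion):

```python
from datetime import datetime, timedelta, timezone

def run_data_likely_gone(publish_time: datetime, max_age_days: int = 400) -> bool:
    """Skip artifacts published more than max_age_days ago, since the
    corresponding GitHub workflow run data may no longer be retained."""
    return datetime.now(timezone.utc) - timedelta(days=max_age_days) > publish_time
```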

@tromai tromai left a comment
I have finished my review. Thanks.
Overall, there aren't any major changes needed. Most of my comments are minor improvements/nitpicks.
