refactor: accept provenance data in artifact pipeline check #872
base: staging
Conversation
ac6cbcd to 7eac146
src/macaron/slsa_analyzer/checks/infer_artifact_pipeline_check.py
1586789 to e592c5d
src/macaron/slsa_analyzer/ci_service/github_actions/github_actions_ci.py
src/macaron/slsa_analyzer/ci_service/github_actions/github_actions_ci.py
tests/integration/cases/micronaut-projects_micronaut-core/config.ini
Signed-off-by: behnazh-w <behnaz.hassanshahi@oracle.com>
e592c5d to c9761dc
Signed-off-by: behnazh-w <behnaz.hassanshahi@oracle.com>
actions/setup-java
# Parent project used in Maven-based projects of the Apache Logging Services.
apache/logging-parent/.github/workflows/build-reusable.yaml
# This action can be used to deploy artifacts to a JFrog artifactory server.
Should this entry belong to `builder.maven.ci.deploy` instead?
@@ -494,7 +503,7 @@ artifact_extensions =
 # Package registries.
 [package_registry]
 # The allowed time range (in seconds) from a deploy workflow run start time to publish time.
-publish_time_range = 3600
+publish_time_range = 7200
I wonder why we decided to increase the publish time range? (e.g., in what case is a range of 3600 seconds not enough?)
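For readers following the thread, the window amounts to a simple timestamp comparison. A hypothetical illustration (the names and datetimes are made up for the example; only the `publish_time_range` value mirrors the updated `defaults.ini`):

```python
from datetime import datetime, timedelta, timezone

# Illustrative values only: publish_time_range mirrors the new default.
publish_time_range = 7200  # seconds

run_start = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
published_at = datetime(2024, 5, 1, 13, 30, tzinfo=timezone.utc)

# The publish is attributed to this deploy run only if it happened
# within the configured window after the run started.
within_window = timedelta(0) <= (published_at - run_start) <= timedelta(seconds=publish_time_range)
```

Here a 5400-second gap passes under 7200 seconds but would fail under the old 3600-second default, which may be the kind of case that motivated the increase.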
Should we rename this file to reflect the new name of the check?
request to fetch metadata about the package, and extracts the publication timestamp
from the response.

Note: The method expects the response to include a ``version`` field with a ``publishedAt``
Does deps.dev have any reference documentation for the format of this `version` field? It would be good to include it here if that page is available.
# implemented at the beginning of the analyze command to ensure that the data
# is available for subsequent processing.

base_url_parsed = urllib.parse.urlparse(registry_url or "https://api.deps.dev")
If the default value of `registry_url` is `https://api.deps.dev`, could we make the type of this parameter `registry_url: str = "https://api.deps.dev"`?
# is available for subsequent processing.

base_url_parsed = urllib.parse.urlparse(registry_url or "https://api.deps.dev")
path_params = "/".join(["v3alpha", "purl", encode(purl).replace("/", "%2F")])
The `encode` function here allows you to specify the set of safe characters that will not be encoded (see here). If we replace all `/` anyway, should we leverage this `safe` parameter?
logger.debug("Found timestamp: %s.", timestamp)

try:
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
If the timestamp value is already in ISO 8601 format, why do we need to perform extra modification on it before providing it to `datetime.fromisoformat`?
try:
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
except (OverflowError, OSError) as error:
I wonder in what scenarios `OverflowError` and `OSError` happen. As far as I know, `datetime.fromisoformat` raises:
- `TypeError` if the input is not a string.
- `ValueError` if the date is incorrect.
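For context on both questions above: `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward, so on older versions the `replace` shim is required, and malformed input raises `ValueError` (not `OverflowError`/`OSError`):

```python
from datetime import datetime, timezone

ts = "2024-05-01T12:30:00Z"

# Python < 3.11 rejects the trailing "Z" with ValueError, which is why
# the code normalizes it to the "+00:00" offset form first.
parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
assert parsed == datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc)

# A malformed string raises ValueError, not OverflowError or OSError.
try:
    datetime.fromisoformat("not-a-timestamp")
except ValueError:
    pass
```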
InvalidHTTPResponseError
    If the URL construction fails, the HTTP response is invalid, or if the response
    cannot be parsed correctly, or if the expected timestamp is missing or invalid.
NotImplementedError
nit: Should we document this exception if it's not raised in this implementation? 🤔
I have some comments on this file that we don't need to fix in this PR:
- On line 249, we don't use the utility json extract method. We might need to consider replacing it.
- On line 233, we can consider updating the if conditions similar to this discussion: refactor: accept provenance data in artifact pipeline check #872 (comment)
    purl_object = PackageURL.from_string(purl)
except ValueError as error:
    logger.debug("Could not parse PURL: %s", error)
query_params = [f"q=g:{purl_object.namespace}", f"a:{purl_object.name}", f"v:{purl_object.version}"]
`purl_object.version` has type `str | None`. If it's None, the last query param would be `'v:None'`. I don't think this has the same behavior as the old implementation, where if the version is None no extra query param is added. I wonder if this new behavior is intended?
provenances: Sequence[DownloadedProvenanceData]
"""The provenances data."""

build_info_results: InTotoV01Payload
"""The build information results computed for a build step. We use the in-toto 0.1 as the spec."""
It's mentioned in the PR description that `CIInfo["provenances"]` is also renamed to `CIInfo["build_info_results"]`. Are we planning to remove this `provenances` attribute here too?
# or Reusable GitHub Action to be be a GitHubJobNode.
if not isinstance(job, GitHubJobNode):
    continue
Suggested change:
-# or Reusable GitHub Action to be be a GitHubJobNode.
+# or Reusable GitHub Action to be a GitHubJobNode.
 if not isinstance(job, GitHubJobNode):
     continue
if build_type := json_extract(statement["predicate"], ["buildType"], str):
    return build_type

return json_extract(statement["predicate"], ["buildDefinition", "buildType"], str)
It might be good to document why we have two different ways to extract the build type:
- `json_extract(statement["predicate"], ["buildType"], str)`
- `json_extract(statement["predicate"], ["buildDefinition", "buildType"], str)`
build_type = ProvenancePredicate.get_build_type(statement)
build_defs: list[ProvenanceBuildDefinition] = [
    SLSAGithubGenericBuildDefinitionV01(),
    SLSAGithubActionsBuildDefinitionV1(),
    SLSANPMCLIBuildDefinitionV2(),
    SLSAGCBBuildDefinitionV1(),
    SLSAOCIBuildDefinitionV1(),
    WitnessGitLabBuildDefinitionV01(),
]

for build_def in build_defs:
    if build_def.expected_build_type == build_type:
        return build_def

raise ProvenanceError("Unable to find build definition in the provenance statement.")
A minor improvement could be to differentiate between the two cases in which we raise an exception:
- There is no build definition; this happens when `ProvenancePredicate.get_build_type(statement)` returns None.
- There is a build definition, but we don't support its value (the condition `build_def.expected_build_type == build_type` in the for loop is never met).

Right now we only raise one kind of exception message for both cases.
@@ -355,3 +358,354 @@ def check_if_repository_purl_and_url_match(url: str, repo_purl: PackageURL) -> b
    purl_path = f"{repo_purl.namespace}/{purl_path}"
    # Note that the urllib method includes the "/" before path while the PURL method does not.
    return f"{parsed_url.hostname}{parsed_url.path}".lower() == f"{expanded_purl_type or repo_purl.type}/{purl_path}"


class ProvenanceBuildDefinition(ABC):
I wonder if these abstractions (or some of them) should be put within the package `src/macaron/slsa_analyzer/provenance`, as they are closely related to the provenance format 🤔? I think only `find_build_def` and `get_build_type` should remain here as static functions.
except NotImplementedError:
    continue
I think catching `NotImplementedError` here strongly suggests that `find_publish_timestamp` on any subtype of `PackageRegistry` may raise `NotImplementedError`. I believe this is not the case, because only the JFrog Maven registry raises it. A better way to communicate this would be to explicitly skip running `find_publish_timestamp` when `registry_info` is of type `JFrogMavenRegistry`:

    for registry_info in ctx.dynamic_data["package_registries"]:
        if isinstance(registry_info.package_registry, JfrogMavenRegistry):
            # We currently don't support this.
            continue
        if registry_info.build_tool.purl_type == ctx.component.type:
            try:
                artifact_published_date = registry_info.package_registry.find_publish_timestamp(ctx.component.purl)
                break
            except InvalidHTTPResponseError as error:
                logger.debug(error)

What do you think? This is because I think `NotImplementedError` should only be raised when we mistakenly call a method that is not implemented (similar to an unexpected critical error).
This method is intended to be implemented by subclasses to extract
specific invocation details from a provenance statement.
Should we remove these 2 lines, as this is a concrete method 🤔? The same comment applies to `SLSAGithubActionsBuildDefinitionV1.get_build_invocation`.
tuple[str | None, str | None]
    A tuple containing two elements:
    - The first element is the build invocation entry point (e.g., workflow name), or None if not found.
    - The second element is the invocation URL or identifier (e.g., job URL), or None if not found.
I'm not clear about the difference between the invocation URL and the identifier. Does it mean that if the second element of the returned tuple is not None, it could be a URL or an arbitrary string serving as the "identifier" of the workflow run?
# TODO: change this check if this issue is resolved:
# https://github.com/orgs/community/discussions/138249
if datetime.now(timezone.utc) - timedelta(days=400) > timestamp:
    logger.debug("Artifact published at %s is older than 410 days.", timestamp)
Suggested change:
-logger.debug("Artifact published at %s is older than 410 days.", timestamp)
+logger.debug("Artifact published at %s is older than 400 days.", timestamp)
I have finished my review. Thanks.
Overall, there aren't any major changes needed. Most of my comments are minor improvements/nitpicks.
Refactoring the artifact pipeline detection check

- Renamed the `mcn_infer_artifact_pipeline_1` check to `mcn_find_artifact_pipeline_1`.
- Check fact columns are now `nullable`. This change enables us to store the reasons for check failures, such as when a GitHub workflow run is deleted, which may result in some previous columns lacking values.
- This check now depends on the `mcn_build_as_code_1` check. If a deploy command is detected, this check will attempt to locate a successful CI pipeline that triggered the step containing the deploy command.

Improvements to `mcn_build_as_code_1`

- `infer_confidence_deploy_workflow` is added to `BaseBuildTool` to infer the confidence for such Reusable workflows.

The `store_inferred_build_info_results` function

- Renamed `store_inferred_provenance` to `store_inferred_build_info_results`.
- `CIInfo["provenances"]` is also renamed to `CIInfo["build_info_results"]`.

Provenance Extractor

- Added `ProvenanceBuildDefinition` and `ProvenancePredicate`. With these new abstractions, we don't need to hardcode the expected `buildType` value while processing a provenance.

`find_publish_timestamp`

Tutorial and integration tests

- Renamed the `Detecting a malicious Java dependency uploaded manually to Maven Central` tutorial to `Detecting Java dependencies manually uploaded to Maven Central`.
- The tutorial now uses the `log4j-core` artifact instead of `guava`, which has an automated deployment workflow.
- … `log4j-core`.