Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Artifacts upload #917

Merged
merged 14 commits into from
Feb 23, 2024
Merged

Artifacts upload #917

merged 14 commits into from
Feb 23, 2024

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Feb 22, 2024

Rationale

Remark:

  • there are many changes, I suggest to review them commit by commit to gain a better understanding since commits are meant to provide meaning to the change made
  • I did not included any upgrade of other frontend or backend dependencies, I prefer to do it in a separate PR (but I did not omit this is needed as well)

Changes

  • Always add --keep to zimit offliner command
    • Needed if we want to archive these artifacts (e.g. warc.gz files)
    • We do not mind since the container is deleted almost right after anyway
    • Exposing the parameter in Zimfarm adds no value for the end user and there is significant risk of mistake (very frustrating to restart a recipe with artifacts globs configured ... only to realize after multiple hours you forgot to also activate the --keep setting)
  • Update QA tools and fix code accordingly
  • Add artifacts functionality
    • Add a artifacts_globs configuration which is a list of strings
    • Modify API and UI to manipulate this new configuration
    • Modify task worker to check if artifacts_globs is configured after scraper has completed, and if so:
      • tar the files matching any of the globs in a single tar file
        • in-tar hierarchy match the original filesystem hierarchy to avoid to have to deal with conflicting file names
      • as files are added to the tar archive, delete them from the filesystem
      • start an upload container to upload the tar file
      • configuration is done through two new environment variables (needed to be set at task request, values are saved in task afterwards to be pushed to the worker): ARTIFACTS_UPLOAD_URI and ARTIFACTS_EXPIRATION
  • Enhance checks at recipe creation to avoid inconsistencies between selected offliner and image name
  • Enhance recipe config patching to delete keys which can be unset (platform and new artifacts_globs)
  • Enhance secrets removal: always remove what looks like an S3 secret in what looks like a URL, in any key of the response, no matter where it is in the structure
  • Fix tests around recipe patching:
    • in good patch tests, the assertions made were not checking the configuration content ... which is the main part modified by the patch
    • in bad patch tests fixture, many data was not needed and confusing, patch request was reduced to the strict minimum to make the test fail (i.e. changing a single dict value in test fixture make the test fail because data becomes valid)
    • add lots of comment to ensure we know what we are testing (so that it is easier to fix when an issue in encountered)
  • Small cosmetic changes to recipe configuration UI:
    • Enhance descriptions of Docker image name and tag, and do not suggest "latest" as default value (this is not always a good idea, and not our usage anymore in none of our instances)
    • Add missing semicolons

Screenshots

Recipe configuration (view)

image

Recipe configuration (edit)

image

New button in task debug screen to download artifacts:

image

Result tar content:

image

For convenience, we always keep zimit temporary files, so that they can
be easily archived with the `artifacts_globs` configuration.
This situtaion is for now possible only for config keys, all "root"
schedule keys cannot be set to null.

When a schedule patch requires to set a config key to null, we want to:
- remove the key from the config dictionary if already present
- not set the key to null if the key was already absent from the config
@benoit74 benoit74 self-assigned this Feb 22, 2024
@benoit74 benoit74 marked this pull request as ready for review February 22, 2024 10:50
@benoit74 benoit74 requested a review from rgaudin February 22, 2024 10:50
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ; please look at inline suggestions ; doesn't require re-review

dispatcher/backend/src/common/constants.py Show resolved Hide resolved
workers/app/task/worker.py Outdated Show resolved Hide resolved
workers/app/task/worker.py Outdated Show resolved Hide resolved
workers/app/task/worker.py Outdated Show resolved Hide resolved
dispatcher/backend/src/routes/utils.py Outdated Show resolved Hide resolved
dispatcher/frontend-ui/src/components/ScheduleEditor.vue Outdated Show resolved Hide resolved
@benoit74 benoit74 force-pushed the artifacts_upload branch 2 times, most recently from be59b2e to 9970b9f Compare February 23, 2024 14:16
@benoit74 benoit74 merged commit 3e5b6e6 into main Feb 23, 2024
5 checks passed
@benoit74 benoit74 deleted the artifacts_upload branch February 23, 2024 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

On-demand archiving of temporary files produced by scrapers
2 participants