Fetcher DAG development #496

Merged: amywieliczka merged 39 commits into main from dag-development on Aug 23, 2023

@amywieliczka amywieliczka commented Aug 18, 2023

This PR represents the work @barbarahui and I did in writing a first fetcher DAG. Since writing to S3 is not fully supported by the codebase, this work is coordinated with some updates to ucldc/aws-mwaa-local-runner, described here: ucldc/aws-mwaa-local-runner#3. It is built on top of the DATA_DEST environment variable work here: #495

This PR, in contrast to the DATA_DEST environment variable work, is exclusively DAG-specific work.

  • dags/operate_sample_dag.py is a DAG written using the classic Airflow PythonOperator.
  • dags/taskflow_sample_dag.py is a DAG written using the Airflow TaskFlow model. This file also includes a variety of small tasks demonstrating specific, relevant Airflow features.
  • dags/fetcher_dag.py is a functional fetcher DAG.

In order to run the fetcher DAG, you must have:

  1. Pulled the ucldc/aws-mwaa-local-runner:dags-development branch
  2. Modified the aws-mwaa-local-runner/docker/.env file:
# aws-mwaa-local-runner/docker/.env
DAGS_HOME="<path to this repository>/rikolti"
PLUGINS_HOME="<path to this repository>/rikolti/plugins"
REQS_HOME="<path to this repository>/rikolti/dags"
STARTUP_HOME="<path to this repository>/rikolti/dags"
RIKOLTI_DATA_HOME="<path to where you would like the fetcher to write data>"
  3. Run ./mwaa-local-env build-image
  4. Write a startup.sh file in the dags folder:
# startup.sh
FETCHER_DATA_DEST=file:///usr/local/airflow/rikolti_data
MAPPER_DATA_DEST=file:///usr/local/airflow/rikolti_data
MAPPER_DATA_SRC=file:///usr/local/airflow/rikolti_data
  5. Run ./mwaa-local-env start
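
Before building the image, it can help to confirm that the .env file defines every variable listed in the steps above. The following is a minimal sketch of such a check, not part of this PR: the helper names are hypothetical, the parser handles only simple KEY="value" lines, and the paths in the sample text are placeholders.

```python
# Hypothetical pre-flight check for aws-mwaa-local-runner/docker/.env.
# The key names come from the PR description; everything else is illustrative.
REQUIRED_KEYS = (
    "DAGS_HOME",
    "PLUGINS_HOME",
    "REQS_HOME",
    "STARTUP_HOME",
    "RIKOLTI_DATA_HOME",
)


def parse_env_text(text):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding double or single quotes, if any.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the given text."""
    values = parse_env_text(text)
    return [key for key in required if not values.get(key)]


# Placeholder content: RIKOLTI_DATA_HOME is deliberately left out.
sample = """
# aws-mwaa-local-runner/docker/.env
DAGS_HOME="/home/me/rikolti"
PLUGINS_HOME="/home/me/rikolti/plugins"
REQS_HOME="/home/me/rikolti/dags"
STARTUP_HOME="/home/me/rikolti/dags"
"""

print(missing_keys(sample))  # ['RIKOLTI_DATA_HOME']
```

In practice the text would be read from the real .env file rather than from an inline string.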

@barbarahui

startup.sh needs to look like this:

# startup.sh
export FETCHER_DATA_DEST='file:///usr/local/airflow/rikolti_data'
export MAPPER_DATA_DEST='file:///usr/local/airflow/rikolti_data'
export MAPPER_DATA_SRC='file:///usr/local/airflow/rikolti_data'
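
To illustrate what the exported values carry, here is a minimal sketch of how Python code might turn one of these file:// values into a filesystem path. The function name is hypothetical and is not taken from the Rikolti codebase; the actual handling lives in the DATA_DEST work in #495 and may differ.

```python
import os
from urllib.parse import urlparse


def local_path_from_data_dest(env_name, environ=os.environ):
    """Return the filesystem path encoded in a file:// environment variable.

    Hypothetical helper for illustration only.
    """
    value = environ.get(env_name)
    if value is None:
        # A variable assigned in startup.sh without `export` would not be
        # visible to child processes, so it would show up here as unset.
        raise KeyError(f"{env_name} is not set; was it exported in startup.sh?")
    parsed = urlparse(value)
    if parsed.scheme != "file":
        raise ValueError(f"{env_name} uses unsupported scheme {parsed.scheme!r}")
    return parsed.path


# Use an explicit mapping here so the example does not depend on the
# real process environment.
example_env = {"FETCHER_DATA_DEST": "file:///usr/local/airflow/rikolti_data"}
print(local_path_from_data_dest("FETCHER_DATA_DEST", example_env))
# /usr/local/airflow/rikolti_data
```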


@barbarahui barbarahui left a comment

As noted in Slack, we have to change lines 11-12 of metadata_fetcher/lambda_function.py to be:

    fetcher_module = importlib.import_module(
        f".fetchers.{harvest_type}_fetcher", package="rikolti.metadata_fetcher")

in order for this to run in aws-mwaa-local-runner.

However, this causes a command line run such as the following to fail:

python3 -m metadata_fetcher.fetch_registry_collections "https://registry.cdlib.org/api/v1/rikoltifetcher/?format=json&mapper_type=tv_academy_oai_dc&ready_for_publication=true"

The same applies to the importlib.import_module call in metadata_mapper/lambda_function.py.
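
One possible way to make a relative import work in both environments is to try more than one parent package. The sketch below is an assumption for illustration, not necessarily the approach this PR adopted; the helper name is hypothetical, and the demo uses the standard library's json package so that it runs anywhere.

```python
import importlib


def import_first_available(relative_name, candidate_packages):
    """Import relative_name against each candidate package in order.

    Returns the first module that imports successfully. Hypothetical helper.
    Caveat: catching ModuleNotFoundError can also hide a missing dependency
    inside the target module, so keep the candidate list short.
    """
    if not candidate_packages:
        raise ValueError("candidate_packages must not be empty")
    last_error = None
    for package in candidate_packages:
        try:
            return importlib.import_module(relative_name, package=package)
        except ModuleNotFoundError as error:
            last_error = error
    raise last_error


# Demo with a standard-library package: the first candidate does not exist,
# so the helper falls back to "json" and imports json.decoder.
module = import_first_available(".decoder", ["not_a_real_package_xyz", "json"])
print(module.__name__)  # json.decoder
```

For the fetcher, the equivalent call would pass f".fetchers.{harvest_type}_fetcher" with candidates such as "rikolti.metadata_fetcher" (inside aws-mwaa-local-runner) and "metadata_fetcher" (command-line runs); that candidate list is an assumption about how the two environments lay out the package.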

@barbarahui barbarahui left a comment

Nice solution to the importlib.import_module package name problem!


Argh sorry, my bad -- please disregard everything below!

One problem I ran across: when running the fetcher_dag in aws-mwaa-local-runner, the Docker container needs to write to a location inside the container rather than to the mounted location on the host disk. So in metadata_fetcher/Fetcher.py, lines 48-52 need to be:

    def get_local_path(self):
        local_path = os.sep.join([
            '/usr/local/airflow/rikolti_data/vernacular_metadata',
            str(self.collection_id),
        ])
This of course doesn't work when you're running the code locally.

barbarahui previously approved these changes Aug 23, 2023
@amywieliczka amywieliczka dismissed barbarahui’s stale review August 23, 2023 21:11

The merge-base changed after approval.

barbarahui previously approved these changes Aug 23, 2023
@amywieliczka amywieliczka dismissed barbarahui’s stale review August 23, 2023 22:02

The merge-base changed after approval.

@amywieliczka amywieliczka merged commit e08fbcc into main Aug 23, 2023
1 check passed
@amywieliczka amywieliczka deleted the dag-development branch August 23, 2023 22:17