Barbara/airflow mapper #502

Merged
merged 13 commits into from
Sep 15, 2023

barbarahui
Collaborator

@amywieliczka @bibliotechy Can you take a look at this draft of the mapper DAG and give feedback on the approach? Any and all feedback is very welcome!

Some specific questions/comments:

  1. It seems like parameters can only be accessed inside a task run, which means params need to be unpacked in both fetch_pages() and map_page() (or passed from one task to the other). That seems clunky. Chad, are you aware of a different approach?
  2. Amy, can you verify that this is the format of the payload we're expecting to get from Registry? (lines 57-61)
  3. Do we have any collections with more than 1024 pages of vernacular metadata? I'm guessing that some of the Nuxeo complex object collections exceed that. If so, I'll need to figure out how to handle this case (see my comments on lines 69-72). We could of course raise the limit above 1024, but that seems like a recipe for overloading our MWAA instance.
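
One way to stay under the limit in question 3 without raising it is to map over batches of pages rather than individual pages. The following is a minimal plain-Python sketch of that idea, not the code in this PR: `batch_pages` and the sample page names are hypothetical, and `MAX_MAP_LENGTH` assumes the limit of 1024 mentioned above is unchanged. In the DAG, each batch would presumably be handed to a single mapped task that loops over its pages.

```python
from typing import List

# Assumed cap on the number of mapped task instances (the 1024 discussed above).
MAX_MAP_LENGTH = 1024


def batch_pages(pages: List[str], max_batches: int = MAX_MAP_LENGTH) -> List[List[str]]:
    """Group page names into at most `max_batches` batches.

    Every batch has the same size except possibly the last one.
    """
    if max_batches < 1:
        raise ValueError("max_batches must be at least 1")
    if not pages:
        return []
    # Ceiling division: the smallest batch size that keeps the batch count within the cap.
    batch_size = -(-len(pages) // max_batches)
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]


# Hypothetical collection with 2,500 pages of vernacular metadata.
pages = [f"page-{n}" for n in range(2500)]
batches = batch_pages(pages)
print(len(batches), len(batches[0]))
```

Collections that already fit under the cap get one page per batch, so the behavior for small collections would be the same as mapping over pages directly.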

@bibliotechy
Contributor

@barbarahui @amywieliczka My biggest outstanding question is how we structure the full workflow, from fetching the vernacular metadata to publishing the updated collection to stage Calisphere.

If each "part" of the workflow (fetch, map, content fetch, etc) is its own dag, how will we build a single workflow where an operator can trigger a dag with a single collection ID and have the entire workflow execute?

@amywieliczka
Collaborator

@bibliotechy agreed. I think a harvest_collection DAG that comprises the full pipeline makes sense. That said, I don't think what's done here poses a major problem, and for development iteration purposes it might be nice to have an isolated component until it is considered done?

My understanding of the TaskFlow API is that you can treat tasks like functions for use in multiple DAGs, and I think you can even import them between different DAG definition Python modules. I don't think that's particularly ideal, though.

@barbarahui
Collaborator Author

barbarahui commented Aug 25, 2023

@bibliotechy If we put all of the tasks in a single DAG so that it's possible to run a harvest from start to finish, how do we allow users to run just a subset of tasks, e.g. if somebody wants to just rerun the mapper? I guess we add a parameter that allows the user to specify which task(s) to run, and then exit the task immediately if not included in the list of tasks to run? Is that the right model?
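
The "list of tasks to run" model described above can be sketched in plain Python. This is only an illustration under stated assumptions: the stage names, the `stages` parameter key, and the `stages_to_run` and `should_run` helpers are all hypothetical and not part of this PR. In an Airflow task, a `False` result from `should_run` could be turned into a skip, for example by raising `AirflowSkipException`, though how skips propagate to downstream tasks would need checking against the trigger rules in use.

```python
from typing import Any, Dict, List

# Hypothetical stage names, in pipeline order.
PIPELINE_STAGES = ["fetch", "map", "content_fetch", "publish"]


def stages_to_run(params: Dict[str, Any]) -> List[str]:
    """Return the requested stages in pipeline order; all stages if none are requested."""
    requested = params.get("stages")
    if not requested:
        return list(PIPELINE_STAGES)
    unknown = [s for s in requested if s not in PIPELINE_STAGES]
    if unknown:
        raise ValueError(f"Unknown stage(s): {unknown}")
    # Filter the canonical list so the result keeps pipeline order.
    return [s for s in PIPELINE_STAGES if s in requested]


def should_run(stage: str, params: Dict[str, Any]) -> bool:
    """True if `stage` is selected by the run parameters."""
    return stage in stages_to_run(params)


# Rerun only the mapper for a hypothetical collection.
params = {"collection_id": 123, "stages": ["map"]}
for stage in PIPELINE_STAGES:
    print(stage, "run" if should_run(stage, params) else "skip")
```

Defaulting to every stage when `stages` is absent keeps the single-collection-ID trigger working for a full harvest, while a rerun of just the mapper only needs the extra parameter.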

christinklez linked an issue on Sep 13, 2023 that may be closed by this pull request
barbarahui marked this pull request as ready for review on September 15, 2023 17:47
amywieliczka merged commit 79bb2f3 into main on Sep 15, 2023
2 checks passed
amywieliczka deleted the barbara/airflow_mapper branch on September 15, 2023 18:22
Development

Successfully merging this pull request may close these issues.

Write the mapper Dag
3 participants