Barbara/airflow mapper #502
Conversation
@barbarahui @amywieliczka My biggest outstanding question is how we structure the full workflow, from fetching the vernacular to publishing the updated collection to stage Calisphere. If each "part" of the workflow (fetch, map, content fetch, etc.) is its own DAG, how do we build a single workflow where an operator can trigger it with a single collection ID and have the entire workflow execute?
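One possible answer (an assumption on my part, not something decided in this thread) is a small "controller" DAG that triggers each stage DAG in order for one collection; in Airflow that would be `TriggerDagRunOperator` with `wait_for_completion=True`. A minimal pure-Python model of the chaining, with hypothetical stage names:

```python
# Sketch only: fetch / map_collection / content_fetch are placeholders for the
# real stage DAGs, and the string return values stand in for their outputs.

def fetch(collection_id):
    """Placeholder for the fetcher stage."""
    return f"vernacular/{collection_id}"

def map_collection(collection_id, vernacular_path):
    """Placeholder for the mapper stage."""
    return f"mapped/{collection_id}"

def content_fetch(collection_id, mapped_path):
    """Placeholder for the content-fetch stage."""
    return f"content/{collection_id}"

def run_harvest(collection_id):
    """Controller: run each stage in order for a single collection ID."""
    vernacular = fetch(collection_id)
    mapped = map_collection(collection_id, vernacular)
    return content_fetch(collection_id, mapped)
```

In real Airflow each call above would become a `TriggerDagRunOperator` task, so the operator still kicks off one DAG run with one collection ID.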
@bibliotechy agreed - My understanding of the TaskFlow API is that you can treat tasks like functions for use in multiple DAGs, and I think you can even import them between different DAG definition Python modules. I don't think that's particularly ideal, though.
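To make that reuse idea concrete, here is a pure-Python model (the DAG builders and task names are hypothetical; in Airflow, `map_page` would be an `@task`-decorated function imported into both DAG definition modules):

```python
# Sketch: one shared "task" function registered in two different DAGs.

def map_page(page):
    """Shared task: maps a single page of records (illustrative)."""
    return {"id": page["id"], "mapped": True}

def build_harvest_dag():
    # Stand-in for a full harvest DAG module that imports map_page.
    return ["fetch_pages", map_page]

def build_mapper_dag():
    # Stand-in for a mapper-only DAG module reusing the same function.
    return [map_page]
```

The drawback noted above is visible even here: both modules now depend on one shared task definition, so a change to `map_page` ripples into every DAG that imports it.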
@bibliotechy If we put all of the tasks in a single DAG so that it's possible to run a harvest from start to finish, how do we allow users to run just a subset of tasks, e.g. if somebody wants to rerun just the mapper? I guess we add a parameter that lets the user specify which task(s) to run, and have each task exit immediately if it isn't in that list? Is that the right model?
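The "exit immediately" model could look like this (a sketch, assuming a list-of-task-IDs run parameter; in Airflow the skip would be raised as `airflow.exceptions.AirflowSkipException` so the task shows as skipped rather than failed):

```python
class SkipTask(Exception):
    """Stand-in for airflow.exceptions.AirflowSkipException."""

def gate(task_id, requested):
    """Skip this task unless it was requested; an empty list means run all."""
    if requested and task_id not in requested:
        raise SkipTask(task_id)

def run_task(task_id, requested, work):
    """Run `work` only if this task passes the gate."""
    try:
        gate(task_id, requested)
    except SkipTask:
        return "skipped"
    return work()
```

An alternative with the same effect is a `BranchPythonOperator` at the top of the DAG that routes straight to the requested task, but the per-task gate keeps the dependency graph simpler.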
… from lambda_shepherd.map_collection()
@amywieliczka @bibliotechy Can you take a look at this draft of this mapper dag and give feedback on the approach? Any and all feedback very welcome!
Some specific questions/comments:
- Data is loaded in fetch_pages() and then again in map_page() (or passed from one task to another), which seems clunky. Chad, are you aware of a different approach to this?
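One common pattern for this (an option, not a recommendation from the thread) is to pass lightweight references between tasks instead of re-reading or shipping the data itself: in TaskFlow, a task's return value is pushed to XCom and pulled by the next task, so returning page *paths* keeps the XCom payload small. A sketch with hypothetical path names:

```python
# Sketch: fetch_pages writes pages somewhere (e.g. S3) and returns only their
# paths; map_page receives a path, so the page data is never reloaded twice
# inside the same task nor stuffed wholesale into XCom.

def fetch_pages(collection_id):
    """Fetch the vernacular and return the list of page paths written."""
    # ...actual write to storage elided; three pages for illustration
    return [f"{collection_id}/page_{i}.jsonl" for i in range(3)]

def map_page(page_path):
    """Map one page, reading it from its path and writing a mapped copy."""
    return page_path.replace("page_", "mapped_")

def map_collection(collection_id):
    """Only the small list of paths travels between the two tasks."""
    pages = fetch_pages(collection_id)
    return [map_page(p) for p in pages]
```

With `@task`-decorated functions, `map_page` could also be dynamically mapped over the returned path list (`map_page.expand(page_path=pages)`), giving one task instance per page.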