Do not produce reingestion workflow documentation if it matches regular workflow documentation #3207
Labels
📄 aspect: text
Concerns the textual material in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
good first issue
New-contributor friendly
help wanted
Open to participation from the community
🟩 priority: low
Low priority and doesn't need to be rushed
🧱 stack: documentation
Related to Sphinx documentation
Description
Presently, our DAG documentation generation script will generate documentation for all DAGs that define a `doc_md` on the DAG:
openverse/catalog/utilities/dag_doc_gen/dag_doc_generation.py
Line 122 in efbe123
For some DAGs (namely reingestion workflows, where we reuse the same DAG machinery over different windows), this produces the exact same documentation multiple times. This is needlessly redundant; the Flickr workflows are one example.
We should record the DAG doc text while iterating over the DAGs and skip producing documentation for any DAG whose doc text exactly matches one we have already seen.
The easiest way to do this might be to record the doc markdown in a mapping from `doc markdown: dag ID`, then check for the doc markdown's presence in that mapping before producing the string. This is complicated by the fact that we want the original workflow, not the reingestion one, to be the DAG that receives the documentation.
We could instead add some logic when coming across DAGs with `reingestion` in the DAG ID to find the original DAG and check whether the docstrings match. If they don't match, continue processing as normal; otherwise, skip.
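A minimal sketch of the mapping approach, assuming the DAG docs have already been collected into a plain `dag ID: doc markdown` dictionary. The function name `dedupe_dag_docs` and the input shape are hypothetical and not taken from `dag_doc_generation.py`. To handle the ordering complication, it visits DAGs without `reingestion` in their ID first, so the original workflow claims the documentation before its reingestion counterpart is checked:

```python
from typing import Dict


def dedupe_dag_docs(dag_docs: Dict[str, str]) -> Dict[str, str]:
    """Return only the DAGs whose doc markdown has not been seen before.

    `dag_docs` is a hypothetical mapping of DAG ID -> doc markdown.
    """
    seen: Dict[str, str] = {}  # doc markdown -> DAG ID that owns it
    kept: Dict[str, str] = {}

    # Sort so that IDs without "reingestion" come first; the original
    # workflow then records its doc before any reingestion DAG is compared.
    ordered = sorted(dag_docs, key=lambda dag_id: ("reingestion" in dag_id, dag_id))

    for dag_id in ordered:
        doc = dag_docs[dag_id]
        if doc in seen:
            # Exact match with an earlier DAG's doc: skip this one.
            continue
        seen[doc] = dag_id
        kept[dag_id] = doc

    return kept
```

A reingestion DAG whose docs differ from every earlier DAG is still kept, so workflows with their own documentation are unaffected. The DAG IDs used to exercise this, such as `flickr_workflow` and `flickr_reingestion_workflow`, are illustrative.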