Create recipes for TED by topics #930

Merged
merged 1 commit into from
Mar 25, 2024

Conversation

benoit74
Collaborator

Rationale

  • Automatically create one TED recipe per TED topic

This is a maintenance script. It is not expected to be used on a regular basis, but it is still useful to keep and share.
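
As an illustration only, the "one recipe per topic" idea could be sketched as below. The `Recipe` class, the `ted_topic_` name prefix, and the `slugify` helper are all hypothetical and are not taken from the actual script or from the Zimfarm schema; `enabled` defaults to `False` here purely as an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Recipe:
    """Hypothetical stand-in for a recipe; NOT the real Zimfarm schema."""

    name: str
    topic: str
    languages: list[str] = field(default_factory=list)
    enabled: bool = False


def slugify(topic: str) -> str:
    """Lowercase the topic and join its words with hyphens."""
    return "-".join(topic.lower().split())


def build_recipes(topics, languages, enabled=False):
    """Return one Recipe per distinct topic, skipping blanks and duplicates."""
    seen = set()
    recipes = []
    for topic in topics:
        slug = slugify(topic)
        if not slug or slug in seen:
            continue
        seen.add(slug)
        recipes.append(
            Recipe(
                name=f"ted_topic_{slug}",  # hypothetical naming scheme
                topic=topic,
                languages=list(languages),  # copy so recipes do not share a list
                enabled=enabled,
            )
        )
    return recipes


if __name__ == "__main__":
    for recipe in build_recipes(["Climate change", "AI"], ["en", "fr"]):
        print(recipe.name, recipe.enabled)
```

Building the recipes in memory first makes it possible to review them (count, names, metadata) before anything is submitted anywhere.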

@benoit74 benoit74 self-assigned this Feb 27, 2024

codecov bot commented Feb 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.84%. Comparing base (818cc1a) to head (9fa9d71).

❗ Current head 9fa9d71 differs from pull request most recent head 557922d. Consider uploading reports for the commit 557922d to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #930      +/-   ##
==========================================
- Coverage   87.98%   87.84%   -0.14%     
==========================================
  Files          94       93       -1     
  Lines        5327     5307      -20     
==========================================
- Hits         4687     4662      -25     
- Misses        640      645       +5     


@benoit74 benoit74 force-pushed the create_ted_topics branch from 931a7e5 to 9fa9d71 Compare March 19, 2024 12:53
@benoit74 benoit74 marked this pull request as ready for review March 19, 2024 12:53
@benoit74 benoit74 requested a review from rgaudin March 19, 2024 12:53
Member

@rgaudin rgaudin left a comment

Can you explain what this is for, and what actions the Content Team will take after this script is run?

  • Will those be individually checked? What's the strategy regarding title/description metadata? Do you want to run and fail all those that will be above 30/80c?
  • What about the mul and static list of languages? Do all topics have all those languages? In that order of importance? If not, how will this be acknowledged and fixed by content team?
  • With enabled=True we are talking about hundreds of runs that will need to be checked. The alternative is to set enabled=False and let the Content Team run/review at their own pace.
  • How many recipes is this?

@benoit74
Collaborator Author

Can you explain what this is for, and what actions the Content Team will take after this script is run?

This is the script that was used to generate all TED recipes by topic. The goal is only to not lose it, so that it can be reused later if needed.

See openzim/zim-requests#789 as well for answers to some of your questions.

Will those be individually checked? What's the strategy regarding title/description metadata?

They will be individually checked before being moved to "prod" (library.kiwix.org).

Do you want to run and fail all those that will be above 30/80c?

I don't understand this question, sorry

What about the mul and static list of languages? Do all topics have all those languages? In that order of importance? If not, how will this be acknowledged and fixed by content team?

Good point, it reminded me I forgot to create some issues in TED scraper.

This is kind of a hack to retrieve videos in all languages (it is not possible to say "all" in languages, see openzim/ted#171).

Clearly not all topics have those languages, and order is not handled either (the scraper should filter and order them, see openzim/ted#172)

I didn't think this was going to be acknowledged and fixed by the content team; I consider this should be fixed by the dev team. Doing it manually while it is mostly straightforward to automate with code is a bit sad.

With enabled=True we are talking about hundreds of runs that will need to be checked. The alternative is to set enabled=False and let the Content Team run/review at their own pace.

This has already run, and it allowed us to proceed quickly ^^

How many recipes is this?

355 recipes, 355 ZIMs

@rgaudin
Member

rgaudin commented Mar 19, 2024

This is the script that was used to generate all TED recipes by topic. The goal is only to not lose it, so that it can be reused later if needed.

Well if it's to be kept for reference, don't request a review!

Do you want to run and fail all those that will be above 30/80c?

I was wondering if you wanted to check the length of all those title/desc with the script or just let it run and have the scraper fail if it didn't fit. I understand from the last answer that all recipes created their ZIM so all metadata did fit.
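
For reference, checking the lengths up front could look roughly like the sketch below. It assumes that "30/80c" means a 30-character title limit and an 80-character description limit; the function name and the input shape (recipe name mapped to a title/description pair) are hypothetical and not taken from the script.

```python
MAX_TITLE_LEN = 30  # assumed meaning of "30" in "30/80c"
MAX_DESCRIPTION_LEN = 80  # assumed meaning of "80" in "30/80c"


def find_oversized(metadata):
    """Map each offending recipe name to a list of length problems.

    `metadata` maps a recipe name to a (title, description) tuple.
    Lengths are counted with len(), i.e. in Unicode code points, which
    may differ from how the scraper itself counts characters.
    """
    problems = {}
    for name, (title, description) in metadata.items():
        issues = []
        if len(title) > MAX_TITLE_LEN:
            issues.append(f"title is {len(title)} chars (max {MAX_TITLE_LEN})")
        if len(description) > MAX_DESCRIPTION_LEN:
            issues.append(
                f"description is {len(description)} chars (max {MAX_DESCRIPTION_LEN})"
            )
        if issues:
            problems[name] = issues
    return problems


if __name__ == "__main__":
    sample = {"ted_topic_ai": ("TED: AI", "A collection of TED talks about AI")}
    print(find_oversized(sample))
```

Running such a check before creating the recipes would surface oversized metadata in one pass instead of through individual failed runs.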

the scraper should filter and order them, see openzim/ted#172

👍 I think this should be fixed before moving to prod if we are sending this to Languages metadata (readers trust this)

@benoit74 benoit74 force-pushed the create_ted_topics branch from 9fa9d71 to 557922d Compare March 25, 2024 09:22
@benoit74 benoit74 merged commit 48b9276 into main Mar 25, 2024
5 checks passed
@benoit74 benoit74 deleted the create_ted_topics branch March 25, 2024 09:23
2 participants