Update content categorisation process #1136
After discussion with Juliette and Mike:
|
4a28f08 to 38888da
|
@gvzdv I can't remember if you're blocked on me for this. Should I review now? If so, please assign it to me.
|
@jeffbl, no, you are not blocking me. I'll have time next week to resolve the merge conflicts and add an example.
7c2ed2e to 1a956d5
|
@jeffbl as discussed, I added the example of category verification to the [...] and added the dependency to [...]. The same procedure should be implemented in all the preprocessors that rely on the [...].
|
@shahdyousefak Can you give a rough estimate of the time you would need to update all the preprocessors and handlers that check for categories, including building and testing them, based on Mike's example above? I can also assign this to @jaydeepsingh25, but you have more experience doing these cross-component updates...
|
I think 8-10 hours is fair.
|
Resolves #1015.
Updated the `content-categoriser` and the corresponding schema to produce boolean tags for the following categories: `photo`, `diagram`, `flow_diagram`, `contains_text`, `people`, `animals`, `collage`, `chart_or_graph`, `illustration`.

The new schema is passed to the LLM alongside the updated prompt, which includes the instructions and the categories. This is redundant and means the prompt must be updated whenever the schema changes, but it improves accuracy.
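To illustrate the "one boolean per category" shape described above, here is a minimal sketch. The category names are taken from the example outputs in this PR; the schema layout and the helper names (`build_category_schema`, `conforms`) are assumptions for illustration only and are not the actual schema file or categoriser code.

```python
# Hypothetical sketch: category names come from the example outputs in this PR.
# The schema layout and helper names are assumptions, not the real schema.
CATEGORIES = [
    "photo", "diagram", "flow_diagram", "contains_text", "people",
    "animals", "collage", "chart_or_graph", "illustration",
]


def build_category_schema(categories):
    """Return a JSON-Schema-style dict requiring one boolean per category."""
    return {
        "type": "object",
        "properties": {name: {"type": "boolean"} for name in categories},
        "required": list(categories),
        "additionalProperties": False,
    }


def conforms(tags, schema):
    """Minimal stdlib-only check: exactly the required keys, all values bool."""
    if not isinstance(tags, dict):
        return False
    if set(tags) != set(schema["required"]):
        return False
    return all(isinstance(value, bool) for value in tags.values())


SCHEMA = build_category_schema(CATEGORIES)
```

Because the category list lives in one place here, the same list could in principle be used to generate the category section of the prompt, which would reduce the manual prompt updates mentioned above. That is only a possible design, not what this PR does.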
Examples:
Graphic category JSON: {'photo': True, 'diagram': False, 'flow_diagram': False, 'contains_text': False, 'people': False, 'animals': True, 'collage': False, 'chart_or_graph': False, 'illustration': False}
Graphic category JSON: {'photo': False, 'diagram': True, 'flow_diagram': True, 'contains_text': True, 'people': False, 'animals': True, 'collage': False, 'chart_or_graph': False, 'illustration': True}
Graphic category JSON: {'photo': False, 'diagram': True, 'flow_diagram': False, 'contains_text': True, 'people': False, 'animals': False, 'collage': False, 'chart_or_graph': True, 'illustration': False}
Graphic category JSON: {'photo': False, 'diagram': False, 'flow_diagram': False, 'contains_text': True, 'people': False, 'animals': False, 'collage': True, 'chart_or_graph': False, 'illustration': True}
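Note that the lines above are Python dict reprs (single quotes, `True`/`False`) rather than strict JSON, so `json.loads` would reject them. A minimal sketch of reading such a log line and checking tags, assuming the `Graphic category JSON: ` prefix shown above; the helper names (`parse_category_line`, `active_categories`, `has_all`) are hypothetical and not part of any component in this repo.

```python
import ast

# Prefix as it appears in the example log lines above.
LOG_PREFIX = "Graphic category JSON: "


def parse_category_line(line):
    """Parse one example log line into a dict of boolean tags."""
    if not line.startswith(LOG_PREFIX):
        raise ValueError("not a graphic category line")
    # literal_eval safely evaluates the Python-literal dict (no code execution).
    tags = ast.literal_eval(line[len(LOG_PREFIX):])
    if not isinstance(tags, dict):
        raise ValueError("expected a dict of category tags")
    return tags


def active_categories(tags):
    """Return the names of the categories tagged True, in original order."""
    return [name for name, value in tags.items() if value is True]


def has_all(tags, required):
    """True only if every required category is present and tagged True."""
    return all(tags.get(name) is True for name in required)


line = (
    "Graphic category JSON: {'photo': True, 'diagram': False, "
    "'flow_diagram': False, 'contains_text': False, 'people': False, "
    "'animals': True, 'collage': False, 'chart_or_graph': False, "
    "'illustration': False}"
)
tags = parse_category_line(line)
print(active_categories(tags))  # ['photo', 'animals']
```

A preprocessor that only makes sense for photos could then guard its work with something like `has_all(tags, ["photo"])`, treating a missing key the same as `False`.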
Note: this PR will require some cleanup after we discuss the following questions.
- [...] `{"category": graphic_category_output}`)?
- [...] `diagram` and `chart_or_graph` in the same way (both `True` or both `False`). I agree that sometimes it's hard to distinguish between these categories. Should we keep them separate and update the description to improve the accuracy, or combine them in one category (`data_visualization_or_diagram`)?
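For the second question, this is a rough sketch of what the combined option could look like on the consumer side, assuming the merged tag is simply the logical OR of the two existing ones. The function name is hypothetical, and in practice the change would more likely be made in the schema and prompt than after the fact.

```python
def combine_visual_categories(tags):
    """Return a copy of tags with diagram/chart_or_graph merged into one key.

    Assumption: the combined tag is True if either original tag is True.
    """
    merged = {
        name: value
        for name, value in tags.items()
        if name not in ("diagram", "chart_or_graph")
    }
    merged["data_visualization_or_diagram"] = (
        tags.get("diagram") is True or tags.get("chart_or_graph") is True
    )
    return merged
```

Keeping the two tags separate in the schema and merging them only where needed would preserve more information, at the cost of the ambiguity described above.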