
Update content categorisation process #1136

Merged
jaydeepsingh25 merged 7 commits into main from update-categories
Sep 26, 2025

Conversation

gvzdv (Contributor) commented Aug 19, 2025

Resolves #1015.

Updated the content-categoriser and the corresponding schema to produce boolean tags for the following categories:

    [
        "photo",
        "diagram",
        "flow_diagram",
        "contains_text",
        "people",
        "animals",
        "collage",
        "chart_or_graph",
        "illustration"
    ]

The new schema is passed to the LLM alongside the updated prompt, which includes the instructions and the categories. This is redundant, and the prompt must be updated whenever the schema changes, but it improves accuracy.
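One way to reduce the maintenance cost of that redundancy is to generate the category part of the prompt from the schema itself. The following is a minimal sketch, not the code in this PR: the schema fragment, the `description` strings, and the `build_prompt` helper are all hypothetical.

```python
# Hypothetical schema fragment; the real schema in the repository may differ,
# and the description strings here are invented for illustration.
CATEGORY_SCHEMA = {
    "type": "object",
    "properties": {
        "photo": {"type": "boolean", "description": "A photograph of a real scene"},
        "diagram": {"type": "boolean", "description": "A schematic or explanatory drawing"},
        "chart_or_graph": {"type": "boolean", "description": "A plot of data"},
    },
}


def build_prompt(schema: dict) -> str:
    """Build the category instructions from the schema's property names."""
    lines = [
        "For each category below, answer true or false for the image.",
        "Respond with a single JSON object using exactly these keys:",
    ]
    for name, spec in schema["properties"].items():
        # Fall back to the bare name when a property has no description.
        lines.append(f"- {name}: {spec.get('description', '')}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(CATEGORY_SCHEMA))
```

With this approach, adding a property to the schema would automatically add a line to the prompt, so the two cannot drift apart.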

Examples:

Example 1 (image not preserved)

Graphic category JSON: {'photo': True, 'diagram': False, 'flow_diagram': False, 'contains_text': False, 'people': False, 'animals': True, 'collage': False, 'chart_or_graph': False, 'illustration': False}


Example 2 (image not preserved)

Graphic category JSON: {'photo': False, 'diagram': True, 'flow_diagram': True, 'contains_text': True, 'people': False, 'animals': True, 'collage': False, 'chart_or_graph': False, 'illustration': True}


Example 3 (image not preserved)

Graphic category JSON: {'photo': False, 'diagram': True, 'flow_diagram': False, 'contains_text': True, 'people': False, 'animals': False, 'collage': False, 'chart_or_graph': True, 'illustration': False}


Example 4 (image not preserved)

Graphic category JSON: {'photo': False, 'diagram': False, 'flow_diagram': False, 'contains_text': True, 'people': False, 'animals': False, 'collage': True, 'chart_or_graph': False, 'illustration': True}


Note: this PR will require some cleanup after we discuss the following questions.

  1. Do we need to add/remove any categories?
  2. Do we need to structure the schema/output in a different way (e.g. construct it as it was constructed previously: {"category": graphic_category_output})?
  3. The model is generally good at categorization (e.g. it identified the collage in example 4 when our custom collage detector did not), but it usually tags diagram and chart_or_graph the same way (both True or both False). I agree that these categories are sometimes hard to distinguish. Should we keep them separate and update the descriptions to improve accuracy, or combine them into one category (data_visualization_or_diagram)?
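To put a number on question 3, the agreement between the two tags can be measured over a set of categoriser outputs. The sketch below uses only the `diagram` and `chart_or_graph` values from the four examples above; the `agreement_rate` helper is hypothetical, and four images are far too few to draw a conclusion.

```python
# diagram / chart_or_graph values copied from the four example outputs above.
EXAMPLES = [
    {"diagram": False, "chart_or_graph": False},
    {"diagram": True, "chart_or_graph": False},
    {"diagram": True, "chart_or_graph": True},
    {"diagram": False, "chart_or_graph": False},
]


def agreement_rate(outputs, tag_a, tag_b):
    """Fraction of outputs where both tags are present and have equal values.

    Returns None when no output contains both tags.
    """
    pairs = [(o[tag_a], o[tag_b]) for o in outputs if tag_a in o and tag_b in o]
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(agreement_rate(EXAMPLES, "diagram", "chart_or_graph"))  # 0.75
```

Running the same function over a larger labelled set would show whether the two tags carry independent information or could be merged.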

Required Information

  • I referenced the issue addressed in this PR.
  • I described the changes made and how these address the issue.
  • I described how I tested these changes.

Coding/Commit Requirements

  • I followed applicable coding standards where appropriate (e.g., PEP8)
  • I have not committed any models or other large files.

New Component Checklist (mandatory for new microservices)

  • I added an entry to docker-compose.yml and build.yml.
  • I created a CI workflow under .github/workflows.
  • I have created a README.md file that describes what the component does and what it depends on (other microservices, ML models, etc.).

OR

  • I have not added a new component in this PR.

gvzdv requested a review from jeffbl on August 19, 2025, 18:39

jeffbl (Member) commented Aug 25, 2025

After discussion with Juliette and Mike:

  • each category is boolean
  • schema changes to make all categories OPTIONAL (missing=unknown/not looked for, true=confident positive, false=confident negative)
  • further categories can be added to schema without failing validation, but any new properties are validated to be boolean
  • our categorizer will take the list in the schema as the list to be checked by LLM (only place the list is defined)
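The agreed rules can be sketched as a JSON Schema fragment plus a small validator. This is an illustration under stated assumptions, not the schema merged in this PR: `CATEGORY_SCHEMA`, `categories_to_check`, and `validate_categories` are hypothetical names, and the check is hand-written rather than taken from a schema library. Leaving out `required` makes every category optional, and giving `additionalProperties` a boolean sub-schema allows new categories while still requiring them to be boolean.

```python
CATEGORY_NAMES = [
    "photo", "diagram", "flow_diagram", "contains_text", "people",
    "animals", "collage", "chart_or_graph", "illustration",
]

# Hypothetical schema: no "required" key, so every category is optional.
CATEGORY_SCHEMA = {
    "type": "object",
    "properties": {name: {"type": "boolean"} for name in CATEGORY_NAMES},
    # Unknown categories are allowed, but must also be boolean.
    "additionalProperties": {"type": "boolean"},
}


def categories_to_check(schema: dict) -> list:
    """The schema is the only place the category list is defined."""
    return list(schema["properties"])


def validate_categories(output) -> list:
    """Return a list of error messages; an empty list means the output is valid."""
    if not isinstance(output, dict):
        return ["output must be a JSON object"]
    return [
        f"{key!r} must be boolean, got {type(value).__name__}"
        for key, value in output.items()
        if not isinstance(value, bool)
    ]


if __name__ == "__main__":
    print(categories_to_check(CATEGORY_SCHEMA))
    print(validate_categories({"photo": True, "new_tag": False}))  # []
    print(validate_categories({"photo": "yes"}))
```

A missing key is simply absent from the output, which consumers would read as "unknown / not looked for" rather than as a negative.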

jeffbl assigned gvzdv and unassigned jeffbl on Aug 26, 2025

gvzdv force-pushed the update-categories branch 3 times, most recently from 4a28f08 to 38888da, on August 26, 2025, 20:04

jeffbl (Member) commented Sep 5, 2025

@gvzdv can't remember if you're blocked on me for this? Should I review now? If so, pls assign to me.
And I need a pointer to the example so I can get an estimate from @shahdyousefak so Jeremy can decide who implements across the board.

gvzdv (Contributor, Author) commented Sep 7, 2025

@jeffbl, no, you are not blocking me. I'll have time next week to eliminate merge conflicts and add an example.

gvzdv (Contributor, Author) commented Sep 11, 2025

@jeffbl as discussed, I added the example of category verification to the multistage-diagram-segmentation preprocessor:

    preprocess_output = content["preprocessors"]
    categoriser = "ca.mcgill.a11y.image.preprocessor.contentCategoriser"
    if categoriser in preprocess_output:
        categoriser_output = preprocess_output[categoriser]
        categoriser_tags = categoriser_output["categories"]
        if not categoriser_tags["multistage_diagram"]:
            logging.info("Not a multistage diagram. Skipping...")
            return "", 204

and added the dependency to docker-compose:

labels:
      ca.mcgill.a11y.image.required_dependencies: "content-categoriser"

The same procedure should be implemented in all the preprocessors that rely on the content-categoriser output.
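Because every category is optional, a preprocessor that indexes `categories` directly would raise a `KeyError` when a tag is missing. A shared helper could make the three states explicit. This is a minimal sketch assuming the request layout shown in the example above; `category_state` and `should_skip` are hypothetical names, not functions from the repository.

```python
from typing import Optional

CATEGORISER = "ca.mcgill.a11y.image.preprocessor.contentCategoriser"


def category_state(content: dict, name: str) -> Optional[bool]:
    """Return True/False for a confident tag, or None when it is unknown.

    Unknown covers: categoriser output absent, tag absent, or non-boolean value.
    """
    tags = (
        content.get("preprocessors", {})
        .get(CATEGORISER, {})
        .get("categories", {})
    )
    value = tags.get(name)
    return value if isinstance(value, bool) else None


def should_skip(content: dict, name: str, skip_if_unknown: bool = False) -> bool:
    """Skip on a confident negative; the caller decides what unknown means."""
    state = category_state(content, name)
    if state is None:
        return skip_if_unknown
    return not state


if __name__ == "__main__":
    request = {"preprocessors": {CATEGORISER: {"categories": {"photo": True}}}}
    print(should_skip(request, "photo"))               # False
    print(should_skip(request, "multistage_diagram"))  # False (unknown)
```

The default `skip_if_unknown=False` mirrors the example above, which continues processing when the categoriser output is not present at all; a preprocessor that is expensive to run might prefer `True`.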

gvzdv assigned jeffbl and unassigned gvzdv on Sep 11, 2025

jeffbl (Member) commented Sep 13, 2025

@shahdyousefak Can you give a rough estimate of the time needed for you to update all the preprocessors and handlers that check for categories, including building and testing them, based on Mike's example above? Can also assign to @jaydeepsingh25, but you have more experience doing these cross-component updates...

jeffbl assigned shahdyousefak and unassigned jeffbl on Sep 13, 2025

shahdyousefak (Contributor) commented Sep 18, 2025

> @shahdyousefak Can you give a rough estimate of the time needed for you to update all the preprocessors and handlers that check for categories, including building and testing them, based on Mike's example above? Can also assign to @jaydeepsingh25, but you have more experience doing these cross-component updates...

I think 8-10 hours is fair.

jaydeepsingh25 merged commit 847bd99 into main on Sep 26, 2025
4 checks passed