Daniel Napierski edited this page Nov 4, 2022 · 3 revisions

Welcome to the unified-io-inference wiki!

TODO:

  • Develop an iterative approach to object detection and captioning.
    • Improve and extend the current results by issuing iterative prompts to Unified-IO.
    • Adapt follow-up prompts based on the results of the preliminary round of prompts.
  • Gather text results from a variety of VQA prompts, including captioning and categorization.
    • Captioning: "What does the image describe ?"
    • Categorization: "What is in this image ?"
    • Test others, including "What is happening in the image ?", "Describe the scene.", and "List the objects."
  • Parse text answers using spaCy.
    • Identify parts of speech
    • Collect noun phrases ("soccer player", "police officer", etc.)
    • Collect templates built around noun phrases, possibly including "[] sitting down", "[] holding a []"
    • Work to extract longer phrases containing multiple nouns: "a man in uniform talks to people"
  • Use Unified-IO's refexp(...) (referring-expression) task:
    • Find bounding boxes for noun phrases.
    • Add error handling for cases where localization fails.
    • Note that refexp can return both <extra_id_[N]> location tokens and plain-text tokens ("person", "fire", etc.)
  • Submit customized captioning prompts:
    • "Describe the scene, including the soccer player."
    • Explore additional prompts.
    • Review the literature for captioning and question answering.
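The prompt-gathering and customized-captioning items above could start from a small helper like the following. The prompt strings are the ones listed on this page; the names `BASE_PROMPTS` and `build_focused_prompt` are illustrative, not part of the Unified-IO API.

```python
# Prompts listed on this page, kept verbatim (Unified-IO prompts use a
# space before the question mark).
BASE_PROMPTS = [
    "What does the image describe ?",   # captioning
    "What is in this image ?",          # categorization
    "What is happening in the image ?",
    "Describe the scene.",
    "List the objects.",
]

def build_focused_prompt(noun_phrase: str) -> str:
    """Build a customized captioning prompt that steers the model toward a
    noun phrase found in an earlier round of answers."""
    return f"Describe the scene, including the {noun_phrase}."
```

Each noun phrase collected from the first round of answers can be fed back through `build_focused_prompt` to produce the second, adapted round of prompts.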
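The spaCy parsing step could be sketched as below. `make_template` is a hypothetical helper for collecting noun-phrase templates like "[] holding a []"; `extract_noun_chunks` shows the intended spaCy call and assumes the `en_core_web_sm` model is installed (`python -m spacy download en_core_web_sm`).

```python
import re
from typing import List

def make_template(text: str, chunks: List[str]) -> str:
    """Replace each noun phrase in `text` with "[]" to collect a reusable
    template such as "[] holding a []"."""
    template = text
    for chunk in chunks:
        # re.escape so phrases containing punctuation don't break the pattern
        template = re.sub(re.escape(chunk), "[]", template)
    return template

def extract_noun_chunks(text: str) -> List[str]:
    """Noun-phrase extraction via spaCy's dependency parse."""
    import spacy  # imported lazily so make_template works without spaCy installed
    nlp = spacy.load("en_core_web_sm")
    return [chunk.text for chunk in nlp(text).noun_chunks]
```

For example, running a captioning answer through `extract_noun_chunks` and then `make_template` would turn "a man in uniform talks to people" into "[] talks to []".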
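Since refexp(...) can emit a mix of `<extra_id_[N]>` location tokens and plain-text tokens, a first parsing pass might separate the two as below. This is a sketch: how the integer N maps to pixel coordinates depends on the model's location-token vocabulary and is not assumed here.

```python
import re
from typing import List, Tuple

# Unified-IO encodes image locations as special tokens of the form
# <extra_id_N>; this regex pulls out the N values without interpreting them.
LOCATION_TOKEN = re.compile(r"<extra_id_(\d+)>")

def split_refexp_output(answer: str) -> Tuple[List[int], List[str]]:
    """Separate <extra_id_N> location-token ids from plain-text tokens in a
    refexp answer string."""
    location_ids = [int(n) for n in LOCATION_TOKEN.findall(answer)]
    text_tokens = LOCATION_TOKEN.sub(" ", answer).split()
    return location_ids, text_tokens
```

An answer with no location tokens (or no text tokens) simply yields an empty list on that side, which gives a natural hook for the error handling mentioned above.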