Daniel Napierski edited this page Nov 4, 2022 · 3 revisions

Welcome to the unified-io-inference wiki!

TODO:

  • Develop an iterative approach to object detection and captioning.
    • Improve and extend the current results by issuing iterative prompts to Unified-IO.
    • Adapt follow-up prompts based on the results of the preliminary round of prompts.
  • Gather text results from a variety of VQA prompts, including captioning and categorization.
    • Captioning: "What does the image describe ?"
    • Categorization: "What is in this image ?"
    • Test others, including "What is happening in the image ?", "Describe the scene.", and "List the objects."
  • Parse text answers using spaCy.
    • Identify parts of speech
    • Collect noun phrases ("soccer player", "police officer", etc.)
    • Collect templates built around noun phrases, possibly including "[] sitting down", "[] holding a []"
    • Work to extract longer phrases containing multiple nouns: "a man in uniform talks to people"
  • Use Unified-IO's refexp(...) (referring-expression) task:
    • Find bounding boxes for noun phrases.
    • Add error handling for cases where localization fails.
    • Note that refexp can return both <extra_id_[N]> location tokens and plain-text tokens ("person", "fire", etc.)
  • Submit customized captioning prompts:
    • "Describe the scene, including the soccer player."
    • Explore additional prompts.
    • Review the literature for captioning and question answering.
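The prompt-gathering and customized-captioning items above could start from a small helper like the following. The prompt strings are the ones listed on this page; the names `BASE_PROMPTS` and `build_focused_prompt` are illustrative, not part of the Unified-IO API.

```python
# Prompts listed on this page, kept verbatim (Unified-IO prompts use a
# space before the question mark).
BASE_PROMPTS = [
    "What does the image describe ?",   # captioning
    "What is in this image ?",          # categorization
    "What is happening in the image ?",
    "Describe the scene.",
    "List the objects.",
]

def build_focused_prompt(noun_phrase: str) -> str:
    """Build a customized captioning prompt that steers the model toward a
    noun phrase found in an earlier round of answers."""
    return f"Describe the scene, including the {noun_phrase}."
```

Each noun phrase collected from the first round of answers can be fed back through `build_focused_prompt` to produce the second, adapted round of prompts.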
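The spaCy parsing step could be sketched as below. `make_template` is a hypothetical helper for collecting noun-phrase templates like "[] holding a []"; `extract_noun_chunks` shows the intended spaCy call and assumes the `en_core_web_sm` model is installed (`python -m spacy download en_core_web_sm`).

```python
import re
from typing import List

def make_template(text: str, chunks: List[str]) -> str:
    """Replace each noun phrase in `text` with "[]" to collect a reusable
    template such as "[] holding a []"."""
    template = text
    for chunk in chunks:
        # re.escape so phrases containing punctuation don't break the pattern
        template = re.sub(re.escape(chunk), "[]", template)
    return template

def extract_noun_chunks(text: str) -> List[str]:
    """Noun-phrase extraction via spaCy's dependency parse."""
    import spacy  # imported lazily so make_template works without spaCy installed
    nlp = spacy.load("en_core_web_sm")
    return [chunk.text for chunk in nlp(text).noun_chunks]
```

For example, running a captioning answer through `extract_noun_chunks` and then `make_template` would turn "a man in uniform talks to people" into "[] talks to []".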
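Since refexp(...) can emit a mix of `<extra_id_[N]>` location tokens and plain-text tokens, a first parsing pass might separate the two as below. This is a sketch: how the integer N maps to pixel coordinates depends on the model's location-token vocabulary and is not assumed here.

```python
import re
from typing import List, Tuple

# Unified-IO encodes image locations as special tokens of the form
# <extra_id_N>; this regex pulls out the N values without interpreting them.
LOCATION_TOKEN = re.compile(r"<extra_id_(\d+)>")

def split_refexp_output(answer: str) -> Tuple[List[int], List[str]]:
    """Separate <extra_id_N> location-token ids from plain-text tokens in a
    refexp answer string."""
    location_ids = [int(n) for n in LOCATION_TOKEN.findall(answer)]
    text_tokens = LOCATION_TOKEN.sub(" ", answer).split()
    return location_ids, text_tokens
```

An answer with no location tokens (or no text tokens) simply yields an empty list on that side, which gives a natural hook for the error handling mentioned above.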