First, thank you all for open-sourcing this fantastic work.
I want to ask whether object detection from captions is feasible with this model and, if so, how I can use it.
Thank you in advance!

Thanks for your interest!
I'm not sure I fully understand the question, but 4M can both take captions and bounding boxes as input and generate them as output, alongside other modalities. Note that the captions and bounding boxes are not aligned: the captions describe the image as a whole, while the bounding boxes are labeled with COCO classes.
Please see our Jupyter notebooks for examples.

Sorry, @roman-bachmann, for not being clear. My question is: given some labels as "captions", can 4M detect the bounding boxes based on the provided "captions"?
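To illustrate why captions and bounding boxes are "not aligned": a caption is one description of the whole image, while each box carries only a COCO class label, so nothing links a phrase in the caption to a specific box. The sketch below is a hypothetical, simplified record for one image (the names `Box`, `ImageAnnotations` and `boxes_mentioned_in_caption` are illustrative, not 4M's data format or API). It shows that the only connection you could draw afterwards is naive string matching between caption words and COCO class names.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Box:
    """One bounding box; `label` is a COCO class name such as "dog" or "traffic light"."""
    label: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


@dataclass
class ImageAnnotations:
    """Hypothetical per-image record: one image-level caption plus a list of boxes.

    Nothing in this structure ties a phrase in `caption` to a particular box.
    """
    caption: str
    boxes: list[Box] = field(default_factory=list)


def boxes_mentioned_in_caption(ann: ImageAnnotations) -> list[Box]:
    """Naive heuristic: keep boxes whose COCO label appears as whole word(s) in the caption."""
    # Lowercase, keep only runs of letters, and pad with spaces so " car " cannot match "cart".
    normalized = " " + " ".join(re.findall(r"[a-z]+", ann.caption.lower())) + " "
    return [b for b in ann.boxes if f" {b.label.lower()} " in normalized]


if __name__ == "__main__":
    ann = ImageAnnotations(
        caption="A dog sits next to a red car.",
        boxes=[
            Box("dog", 10, 20, 110, 140),
            Box("car", 120, 30, 300, 160),
            Box("person", 5, 5, 40, 90),
        ],
    )
    print([b.label for b in boxes_mentioned_in_caption(ann)])  # ['dog', 'car']
```

This matching is deliberately crude: it misses plurals ("two dogs" does not match "dog") and synonyms, and it cannot tell which of several "dog" boxes a caption means, which is exactly the gap left by image-level captions.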