Official implementation of "MTRAG: Multi-Target Referring and Grounding via Hybrid Semantic-Spatial Integration".
As the paper is currently under peer review, we are releasing only the full-capability model MTRAG-Full (without additional fine-tuning) and the evaluation script for MTR-Bench at this stage. The complete codebase and model checkpoints will be made publicly available upon acceptance.
See install for details.
MTRAG requires loading the vicuna-7b-v1.5 pre-trained weights.
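A minimal sketch of fetching these weights from the Hugging Face Hub is shown below; the local directory `./vicuna-7b-v1.5` is an assumption, so point it to wherever your setup expects the LLM weights.

```python
# Sketch: download the vicuna-7b-v1.5 weights from the Hugging Face Hub.
# The local directory "./vicuna-7b-v1.5" is an assumption; adjust it to match
# the path your MTRAG config expects for the LLM weights.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmsys/vicuna-7b-v1.5",
    local_dir="./vicuna-7b-v1.5",
)
```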
Our Global Image Encoder is initialized with the pre-trained weights of Alpha-CLIP-L/14@336px, which has been fine-tuned on the GRIT-20M dataset. Place the downloaded weights in the path ./alpha_clip.
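A minimal sketch of placing the Alpha-CLIP weights, assuming the checkpoint file has already been downloaded; the filename `clip_l14_336_grit_20m_4xe.pth` and the download location are assumptions, so substitute the actual file shipped with the Alpha-CLIP release.

```python
# Sketch: copy the downloaded Alpha-CLIP-L/14@336px (GRIT-20M fine-tuned)
# checkpoint into ./alpha_clip. The source path and filename below are
# assumptions; use the file provided by the Alpha-CLIP release.
from pathlib import Path
import shutil

alpha_clip_dir = Path("./alpha_clip")
alpha_clip_dir.mkdir(parents=True, exist_ok=True)

downloaded_ckpt = Path("~/Downloads/clip_l14_336_grit_20m_4xe.pth").expanduser()
shutil.copy(downloaded_ckpt, alpha_clip_dir / downloaded_ckpt.name)
```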
Our grounding branch, including both the perception encoder and decoder, is initialized from the ViT-H backbone of the Segment Anything Model (SAM). The encoder is kept frozen during training. Place the downloaded weights in the path ./checkpoints.
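A minimal sketch of downloading the SAM ViT-H checkpoint into `./checkpoints`; the URL is the one published with the official Segment Anything release, but verify it is still current before use.

```python
# Sketch: download the SAM ViT-H checkpoint into ./checkpoints.
# The URL comes from the official Segment Anything release; check that it is
# still the current download link.
from pathlib import Path
import urllib.request

ckpt_dir = Path("./checkpoints")
ckpt_dir.mkdir(parents=True, exist_ok=True)

url = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
urllib.request.urlretrieve(url, str(ckpt_dir / "sam_vit_h_4b8939.pth"))
```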
See datasets for details.
MTRAG-Full model🤗: MTRAG-Full
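A minimal sketch of pulling the released MTRAG-Full checkpoint from the Hugging Face Hub; the repo id below is a placeholder, so substitute the actual id behind the MTRAG-Full link above.

```python
# Sketch: download the released MTRAG-Full checkpoint from the Hugging Face Hub.
# "<org>/MTRAG-Full" is a placeholder repo id; replace it with the real one
# linked above before running.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/MTRAG-Full",  # placeholder repo id
    local_dir="./MTRAG-Full",
)
```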
See evaluation for details.
Thanks to the great work of GLaMM, LLaVA, and SAM; our code is built on top of them.