Hi, thanks for your work. The demo on Hugging-Face is charming.
I have been using the Qwen-2.5-VL family for object detection lately, and the 3B, 7B, and 72B models all demonstrate strong visual positioning capabilities. I ask for the bounding box coordinates of the target object and have the model return them in JSON format. After several rounds of prompt engineering, I get promising results from the model. This is not a typical VL task, but Qwen handles it really well.
So what I'm curious about is whether the SFT and R1 training strategies you mentioned essentially teach the model "how to automatically organize prompts to make object recognition tasks easier", that is, a form of automatic prompt engineering, with no new knowledge emerging for the bbox-style detection task?
Looking forward to hearing your answer.
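For context, the JSON-style bounding box request described above can be sketched roughly as follows. This is only an illustration: the `bbox_2d` and `label` keys, the `[x1, y1, x2, y2]` ordering, and the helper name `extract_boxes` are assumptions made for the example, not something specified in this thread.

```python
import json
import re


def extract_boxes(response: str):
    """Pull a list of boxes out of a free-form model reply (hypothetical helper)."""
    # Replies are often wrapped in a Markdown code fence; use the fenced body if present.
    match = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", response, re.DOTALL)
    payload = match.group(1) if match else response
    try:
        items = json.loads(payload)
    except json.JSONDecodeError:
        return []  # the reply did not contain parseable JSON
    if isinstance(items, dict):
        items = [items]  # tolerate a single object instead of a list
    if not isinstance(items, list):
        return []
    boxes = []
    for item in items:
        box = item.get("bbox_2d") if isinstance(item, dict) else None  # assumed key name
        if (
            isinstance(box, list)
            and len(box) == 4
            and all(isinstance(v, (int, float)) for v in box)
        ):
            x1, y1, x2, y2 = box
            # Keep only boxes with positive width and height.
            if x2 > x1 and y2 > y1:
                boxes.append({"label": item.get("label", ""), "bbox_2d": [x1, y1, x2, y2]})
    return boxes
```

Malformed replies simply yield an empty list here, which makes it easy to count how often a given prompt produces usable output.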
The results we currently show come from training the model for only a few hundred steps on the REC task, so we believe it is more like aligning the model better to the task with the GRPO loss (i.e., not much new knowledge is introduced to the model).
That said, the parameters are updated via GRPO training, and when we train the model longer on more tasks, we do believe new knowledge can be injected.
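To make the "aligning with GRPO loss" point a bit more concrete, here is a minimal sketch of how a group-relative advantage could be computed for a REC-style task. It assumes the reward is simply the IoU between a predicted box and the ground-truth box, and that advantages are normalized with the population standard deviation; both choices, as well as the function names, are assumptions for illustration and not necessarily what this repository does.

```python
from statistics import mean, pstdev


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled completions (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: four sampled predictions for one referring expression.
ground_truth = [0, 0, 10, 10]
predictions = [[0, 0, 10, 10], [5, 0, 15, 10], [20, 20, 30, 30], [0, 0, 10, 5]]
rewards = [iou(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Completions that localize the object better than their group average receive a positive advantage and are reinforced, which fits the description of sharpening an ability the model already has rather than adding new knowledge in a few hundred steps.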