Is the inference done using only one <image> token? #75

Open
gowreesh-mago opened this issue Jul 9, 2024 · 3 comments

Comments

@gowreesh-mago

I have run demo.sh, and while the model seems to digest all of the frames, the prompt contains only one <image> token. The results also become more ambiguous as the video gets longer. Is inference actually done on just the first frame it sees after segmenting?

@Stevetich

I also ran into this problem; have you fixed it? Is this a bug, or is it the intended design?

@dipta007

same issue

@ellemcfarlane commented Oct 25, 2024

I gave the demo with the 7B model a video (only 16 frames) that had a scene transition, e.g. a cat walking on the road, then someone biking, and it was able to caption both, so this would be a counterexample to what you're seeing, no?

It does look like they may only use 4 frames though, see here:
#12 (comment)
And here:
https://github.com/magic-research/PLLaVA/blob/main/tasks/eval/demo/pllava_demo.py#L122

Edit: ah, but in the bash script that actually runs the demo, the number of frames is set to 16:
https://github.com/magic-research/PLLaVA/blob/main/scripts/demo.sh#L3
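
For reference, my understanding (not based on PLLaVA's actual code; the function names below are hypothetical) is that a single <image> placeholder in the prompt does not by itself mean only one frame is used: frames are typically sampled uniformly across the video, and the placeholder is expanded into one block of visual features per sampled frame before the sequence reaches the language model. A minimal sketch of that pattern:

```python
import numpy as np

def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    # Spread num_frames indices evenly over the whole clip, first to last frame.
    return np.linspace(0, total_frames - 1, num=num_frames, dtype=int).tolist()

def expand_image_placeholder(prompt_tokens, frame_features):
    # Replace the single "<image>" entry with one feature block per sampled frame.
    out = []
    for tok in prompt_tokens:
        if tok == "<image>":
            out.extend(frame_features)
        else:
            out.append(tok)
    return out

if __name__ == "__main__":
    idx = uniform_frame_indices(total_frames=480, num_frames=16)
    print(idx)  # 16 indices covering the full video, not just the first frame

    fake_feats = [f"<frame_{i}_features>" for i in range(16)]
    prompt = ["USER:", "<image>", "What happens in this video?", "ASSISTANT:"]
    print(expand_image_placeholder(prompt, fake_feats))
```

So the single <image> token you see printed is just a placeholder; how many frames actually get encoded is controlled by the frame-count setting that the links above point at.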
