Is the inference done using only one <image> token? #75

Open
gowreesh-mago opened this issue Jul 9, 2024 · 3 comments

Comments

@gowreesh-mago

I have run demo.sh, and while the model seems to digest all of the frames, the prompt contains only one <image> token. The results also become more ambiguous as the video gets longer. Is inference actually done on just the first frame it sees after segmenting?

@Stevetich

I also ran into this problem; have you fixed it? Is this a bug, or is it the intended design?

@dipta007

same issue

@ellemcfarlane commented Oct 25, 2024

I gave the demo with the 7B model a video (only 16 frames) that had a scene transition, e.g. a cat walking on the road, then someone biking, and it was able to caption both, so this would be a counterexample to what you're seeing, no?

It does look like they may only use 4 frames though, see here:
#12 (comment)
And here:
https://github.com/magic-research/PLLaVA/blob/main/tasks/eval/demo/pllava_demo.py#L122

Edit: ah, but in the bash script that actually runs the demo, the number of frames is set to 16:
https://github.com/magic-research/PLLaVA/blob/main/scripts/demo.sh#L3
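
For reference, my understanding (not based on PLLaVA's actual code; the function names below are hypothetical) is that a single <image> placeholder in the prompt does not by itself mean only one frame is used: frames are typically sampled uniformly across the video, and the placeholder is expanded into one block of visual features per sampled frame before the sequence reaches the language model. A minimal sketch of that pattern:

```python
import numpy as np

def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    # Spread num_frames indices evenly over the whole clip, first to last frame.
    return np.linspace(0, total_frames - 1, num=num_frames, dtype=int).tolist()

def expand_image_placeholder(prompt_tokens, frame_features):
    # Replace the single "<image>" entry with one feature block per sampled frame.
    out = []
    for tok in prompt_tokens:
        if tok == "<image>":
            out.extend(frame_features)
        else:
            out.append(tok)
    return out

if __name__ == "__main__":
    idx = uniform_frame_indices(total_frames=480, num_frames=16)
    print(idx)  # 16 indices covering the full video, not just the first frame

    fake_feats = [f"<frame_{i}_features>" for i in range(16)]
    prompt = ["USER:", "<image>", "What happens in this video?", "ASSISTANT:"]
    print(expand_image_placeholder(prompt, fake_feats))
```

So the single <image> token you see printed is just a placeholder; how many frames actually get encoded is controlled by the frame-count setting that the links above point at.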
