I have run `demo.sh`, and the prompt seems to ingest all of the frames but contains only one `<image>` token. The results seem to become ambiguous as the video gets longer. Is it doing inference based only on the first frame it sees after segmenting?
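For reference, here is a minimal sketch (not this project's actual code) of the two things being asked about: counting how many `<image>` placeholders the prompt text actually contains, and uniformly sampling frames so a longer video is covered end to end rather than only its first segment. The prompt string and function names below are hypothetical.

```python
import re

def count_image_tokens(prompt: str) -> int:
    """Count how many <image> placeholders the prompt text contains."""
    return len(re.findall(r"<image>", prompt))

def sample_frame_indices(total_frames: int, num_frames: int = 16) -> list[int]:
    """Uniformly sample frame indices across the whole clip, so longer
    videos are covered end to end instead of only their first segment."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

# Hypothetical prompt: only one <image> placeholder in the text, even if
# multiple frames are fed to the vision encoder.
prompt = "USER: <image>\nDescribe the video. ASSISTANT:"
print(count_image_tokens(prompt))      # -> 1
print(sample_frame_indices(300, 16))   # indices spread over the full video
```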
I gave the demo with the 7B model a video (16 frames only) that had a scene transition, e.g. a cat walking on the road, then someone biking, and it was able to caption both, so this would be a counterexample to what you're seeing, no?