A hacky video captioning framework using a small vision language model (moondream v2) and a small text language model (Llama 3.2-1B).
Uses the vision language model to generate text captions from individual frames of a video (which is essentially a sequence of frames), then uses the text language model to merge those per-frame captions into a single coherent caption (a rough sketch follows the model list below). \
This isn't an 'SOTA' model, just an experiment.
- moondream v2 — a lightweight vision language model by vikhyat
- Meta AI's Llama 3.2-1B-Instruct — a small but capable instruction-tuned text language model
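
The pipeline roughly looks like the sketch below. It is only a sketch, not the exact implementation: the Hugging Face repo names (`vikhyatk/moondream2`, `meta-llama/Llama-3.2-1B-Instruct`), the OpenCV-based frame sampling, and the moondream `encode_image`/`answer_question` calls are assumptions and may differ from the actual code or the model revision you install.

```python
# Sketch: caption sampled frames with moondream v2, then merge with Llama 3.2-1B-Instruct.
# Repo names, frame sampling, and the moondream API shown here are assumptions.
import cv2
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def sample_frames(video_path: str, num_frames: int = 8) -> list[Image.Image]:
    """Grab `num_frames` evenly spaced frames from the video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames


def caption_frames(frames: list[Image.Image]) -> list[str]:
    """Caption each frame independently with moondream v2."""
    model_id = "vikhyatk/moondream2"  # assumed HF repo name
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    captions = []
    for frame in frames:
        enc = model.encode_image(frame)  # API of earlier moondream2 revisions
        captions.append(model.answer_question(enc, "Describe this image.", tokenizer))
    return captions


def merge_captions(captions: list[str]) -> str:
    """Ask Llama 3.2-1B-Instruct to fuse the per-frame captions into one caption."""
    llm = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
    messages = [
        {"role": "system", "content": "Merge these frame captions into one coherent video caption."},
        {"role": "user", "content": "Frame captions:\n" + "\n".join(f"- {c}" for c in captions)},
    ]
    out = llm(messages, max_new_tokens=128)
    return out[0]["generated_text"][-1]["content"]


if __name__ == "__main__":
    frames = sample_frames("input.mp4")
    print(merge_captions(caption_frames(frames)))
```

The number of sampled frames is the main knob: more frames give the merge step more detail to work with, at the cost of more VLM passes.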