Distilling Vision-Language Models on Millions of Videos

Submitted 2 years ago by Even_Adder@lemmy.dbzer0.com to stable_diffusion@lemmy.dbzer0.com

https://i.imgur.com/MpptzhT.png

Abstract

The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
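
For readers unfamiliar with the "contrastively trained dual-encoder" part, here is a minimal sketch of what that usually means in CLIP-style setups. The abstract does not say which loss the authors use, so the symmetric InfoNCE loss, the temperature of 0.07, and every function name below are assumptions for illustration, not the paper's code. Embeddings are plain Python lists standing in for the outputs of a video encoder and a text encoder.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (an all-zero vector is left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(video_embs, text_embs, temperature=0.07):
    # Entry [i][j] is the cosine similarity of video i and caption j,
    # divided by the temperature (0.07 is an assumed, CLIP-like value).
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
            for v in videos]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. video i is paired with caption i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    # Average of the video-to-text and text-to-video losses.
    logits = similarity_matrix(video_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

def text_to_video_recall_at_1(video_embs, text_embs):
    # Zero-shot retrieval: for each caption, pick the most similar video
    # and count how often it is the paired one.
    sims = similarity_matrix(video_embs, text_embs, temperature=1.0)
    hits = 0
    for j in range(len(text_embs)):
        scores = [sims[i][j] for i in range(len(video_embs))]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == j)
    return hits / len(text_embs)

# Toy data: caption i points in roughly the same direction as video i.
videos = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = symmetric_contrastive_loss(videos, captions)
swapped_loss = symmetric_contrastive_loss(videos, captions[::-1])
recall = text_to_video_recall_at_1(videos, captions)
```

On the toy data the aligned pairs give a much lower loss than the swapped pairs, which is the signal that pulls matching video and caption embeddings together during training. The retrieval function is the same similarity lookup that a zero-shot text-to-video benchmark would score.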

Paper: arxiv.org/abs/2401.06129

Code: (coming soon)

Data: (coming soon)

Technical Report: …github.io/video-instruction-tuning/

Comments

msgraves@lemmy.dbzer0.com 2 years ago
Data: Coming soon

ohh boy it’s gonna be one of these

tagginator@utter.online [bot] 2 years ago
New Lemmy Post: Distilling Vision-Language Models on Millions of Videos (https://lemmy.dbzer0.com/post/12261121)
Tagging: #StableDiffusion
(Replying in the OP of this thread (NOT THIS BOT!) will appear as a comment in the lemmy discussion.)
I am a FOSS bot. Check my README: https://github.com/db0/lemmy-tagginator/blob/main/README.md