We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model. The matching score is used to steer the language model toward generating a sentence that has a high average matching score to a subset of the video frames. Unlike zero-shot image captioning methods, our work considers the entire sentence at once. This is achieved by optimizing, during the generation process, part of the prompt from scratch, by modifying the representation of all other tokens in the prompt, and by repeating the process iteratively, gradually improving the specificity and comprehensiveness of the generated sentence. Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge. Our code is available at: https://github.com/YoadTew/zero-shot-video-to-text
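To make the guiding signal concrete, below is a minimal sketch (not the paper's actual optimization) of CLIP-steered GPT-2 decoding: candidate next tokens proposed by a frozen GPT-2 are re-ranked by the average CLIP matching score between the extended sentence and a handful of sampled video frames. The paper instead optimizes prompt pseudo-tokens and the representations of the other prompt tokens iteratively; here the frame list `frames` (PIL images), the prompt "Video of", the blending weight `alpha`, and the top-k re-ranking are all illustrative assumptions.

```python
# Minimal sketch of CLIP-guided caption decoding over sampled video frames.
# Assumes `frames` is a list of PIL images already extracted from the video.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def caption(frames, prompt="Video of", steps=12, top_k=50, alpha=0.7):
    # Encode the sampled frames once; similarities are later averaged over them.
    pixels = clip_proc(images=frames, return_tensors="pt").pixel_values.to(device)
    img_feat = clip.get_image_features(pixel_values=pixels)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    ids = gpt2_tok.encode(prompt, return_tensors="pt").to(device)
    for _ in range(steps):
        # Frozen GPT-2 proposes the top-k most fluent continuations.
        logits = gpt2(ids).logits[0, -1]
        lm_logprobs = torch.log_softmax(logits, dim=-1)
        cand = torch.topk(lm_logprobs, top_k).indices

        # Score each candidate sentence by its mean CLIP similarity to the frames.
        texts = [gpt2_tok.decode(torch.cat([ids[0], c.view(1)])) for c in cand]
        t_in = clip_proc(text=texts, return_tensors="pt",
                         padding=True, truncation=True).to(device)
        txt_feat = clip.get_text_features(**t_in)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        clip_score = (txt_feat @ img_feat.T).mean(dim=-1)  # average over frames

        # Blend visual grounding (CLIP) with fluency (GPT-2); keep the best token.
        score = alpha * clip_score + (1 - alpha) * lm_logprobs[cand]
        next_id = cand[score.argmax()].view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
    return gpt2_tok.decode(ids[0])
```

Greedy re-ranking like this considers one token at a time; the method described above differs in that it shapes the entire sentence by gradient updates to the prompt representation and repeats the generation loop to refine specificity and coverage.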