Video captioning aims to convey dynamic scenes from videos using natural language, facilitating the understanding of spatiotemporal information within our environment. Although there have been recent advances, generating detailed and enriched video descriptions continues to be a substantial challenge. In this work, we introduce Video ChatCaptioner, an innovative approach for creating more comprehensive spatiotemporal video descriptions. Our method employs a ChatGPT model as a controller, specifically designed to select frames for posing video content-driven questions. Subsequently, a robust algorithm is utilized to answer these visual queries. This question-answer framework effectively uncovers intricate video details and shows promise as a method for enhancing video content. Following multiple conversational rounds, ChatGPT can summarize enriched video content based on previous conversations. We qualitatively demonstrate that our Video ChatCaptioner can generate captions containing more visual details about the videos. The code is publicly available at https://github.com/Vision-CAIR/ChatCaptioner
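To make the question-answer framework concrete, below is a minimal sketch of the conversational loop in Python. The helpers `ask_controller` and `answer_visual_question` are hypothetical stand-ins for a ChatGPT call (the question-asking controller) and a visual question-answering model; the prompt wording, frame-naming scheme, and round count are illustrative assumptions, not the released implementation.

```python
from typing import List, Tuple


def ask_controller(prompt: str) -> str:
    """Hypothetical stand-in for a ChatGPT call acting as the controller.

    Returns either the next frame-specific question or, when asked to
    summarize, a caption for the whole video.
    """
    if "Summarize" in prompt:
        return "A person rides a bicycle along a park path on a sunny day."
    return "Frame_1: What is the main subject doing in this frame?"


def answer_visual_question(frame_id: int, question: str) -> str:
    """Hypothetical stand-in for a visual question-answering model
    applied to a single sampled frame."""
    return "A person is riding a bicycle."


def caption_video(num_frames: int, num_rounds: int = 5) -> str:
    """Run several question-answer rounds, then summarize the dialogue."""
    chat_log: List[Tuple[str, str]] = []

    for _ in range(num_rounds):
        # The controller sees the conversation so far and decides which
        # frame to query next and what to ask about it.
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in chat_log)
        question_prompt = (
            f"The video has frames Frame_1 ... Frame_{num_frames}.\n"
            f"Previous conversation:\n{history}\n"
            "Ask one new question about a specific frame, "
            "formatted as 'Frame_<i>: <question>'."
        )
        question = ask_controller(question_prompt)

        # The visual answerer responds for the selected frame only.
        frame_id = int(question.split(":", 1)[0].split("_")[1])
        answer = answer_visual_question(frame_id, question)
        chat_log.append((question, answer))

    # After multiple rounds, the controller condenses the dialogue into
    # an enriched caption for the whole video.
    history = "\n".join(f"Q: {q}\nA: {a}" for q, a in chat_log)
    summary_prompt = (
        f"Conversation about the video:\n{history}\n"
        "Summarize the video content in one detailed paragraph."
    )
    return ask_controller(summary_prompt)


if __name__ == "__main__":
    print(caption_video(num_frames=8, num_rounds=3))
```

Replacing the two placeholders with real ChatGPT and visual question-answering calls yields the behavior described above: the controller steers which frames are inspected, and the final summary aggregates the per-frame answers into a detailed caption.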