Despite recent advances in AI, story understanding remains an open and under-investigated problem. We collect, preprocess, and publicly release a video-language story dataset, Synopses of Movie Narratives (SYMON), containing 5,193 video summaries of popular movies and TV series. SYMON captures naturalistic storytelling videos made by human creators for human audiences. As a prototypical and naturalistic story dataset, SYMON features high coverage of multimodal story events, abundant mental-state descriptions, and large semantic gaps between the visual and textual modalities. We establish benchmarks on video-text retrieval and zero-shot alignment on movie summary videos, which showcase the importance of in-domain data in story understanding. With SYMON, we hope to lay the groundwork for progress in multimodal story understanding.