Despite recent advances in AI, story understanding remains an open and under-investigated problem. We collect, preprocess, and publicly release a video-language story dataset, Synopses of Movie Narratives (SyMoN), containing 5,193 video summaries of popular movies and TV series with a total length of 869 hours. SyMoN captures naturalistic storytelling videos made by human creators and intended for a human audience. As a prototypical and naturalistic story dataset, SyMoN features high coverage of multimodal story events and abundant mental-state descriptions. Its use of storytelling techniques causes cross-domain semantic gaps that provide appropriate challenges to existing models. We establish benchmarks on video-text retrieval and zero-shot alignment on movie summary videos, which showcase the importance of in-domain data and long-term memory in story understanding. With SyMoN, we hope to lay the groundwork for progress in multimodal story understanding.