Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich. The narrowness allows bite-sized challenges that the research community can make progress on, while the richness ensures that this progress addresses the core challenges. To this end, we present MUGEN, a large-scale video-audio-text dataset collected using the open-source platform game CoinRun [11]. We made substantial modifications to enrich the game by introducing audio and enabling new interactions. We trained RL agents with different objectives to navigate the game and interact with 13 objects and characters, which allows us to automatically extract a large collection of diverse videos with associated audio. We sample 375K video clips (3.2s each) and collect text descriptions from human annotators. Each video also carries annotations extracted automatically from the game engine, such as accurate per-frame semantic maps and templated textual descriptions. Altogether, MUGEN can help advance research on many tasks in multimodal understanding and generation. We benchmark representative approaches on tasks involving video-audio-text retrieval and generation. Our dataset and code are released at: https://mugen-org.github.io/.