People capture photos and videos to relive and share memories of personal significance. Recently, media montages (stories) have become a popular mode of sharing these memories due to their intuitive and powerful storytelling capabilities. However, creating such montages usually involves many manual searches, clicks, and selections that are time-consuming and cumbersome, adversely affecting the user experience. To alleviate this, we propose task-oriented dialogs for montage creation as a novel interactive tool to seamlessly search, compile, and edit montages from a media collection. To the best of our knowledge, our work is the first to leverage multi-turn conversations for such a challenging application, extending the previous literature that studies simple media retrieval tasks. We collect a new dataset, C3 (Conversational Content Creation), comprising 10k dialogs conditioned on media montages simulated from a large media collection. We adopt a simulate-and-paraphrase approach to collect these dialogs, which is both cost- and time-efficient while still drawing from a natural language distribution. Our analysis and benchmarking of state-of-the-art language models showcase the multimodal challenges present in the dataset. Lastly, we present a mobile demo application that demonstrates the feasibility of the proposed work in real-world settings. Our code and data will be made publicly available.