Movies provide us with a mass of visual content as well as attractive stories. Existing methods have shown that understanding movie stories through visual content alone remains a hard problem. In this paper, to answer questions about movies, we propose a Layered Memory Network (LMN) that represents frame-level and clip-level movie content with a Static Word Memory module and a Dynamic Subtitle Memory module, respectively. Specifically, we first extract words and sentences from the training movie subtitles. The hierarchical movie representations learned by LMN then encode not only the correspondence between words and the visual content within frames, but also the temporal alignment between sentences and frames within movie clips. We also extend the LMN model into three variant frameworks to illustrate its extensibility. We conduct extensive experiments on the MovieQA dataset. With only visual content as input, LMN with the frame-level representation obtains a large performance improvement. When subtitles are incorporated into LMN to form the clip-level representation, we achieve state-of-the-art performance on the online evaluation task of 'Video+Subtitles'. This strong performance demonstrates that the proposed LMN framework is effective and that the hierarchically formed movie representations hold good potential for movie question answering applications.
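The abstract does not give the attention equations, so the following is only a minimal, hypothetical sketch of the two-layer read it describes: a frame-level attention over a word memory, followed by a clip-level attention over subtitle-fused frame representations. All function and variable names (e.g. `attend`, `layered_memory_read`) are our own illustrative choices, not the paper's API, and standard dot-product soft attention is assumed in place of whatever scoring the paper actually uses.

```python
import torch
import torch.nn.functional as F

def attend(query, memory):
    """Soft attention: weight memory slots by similarity to the query.

    query:  (d,)   query embedding
    memory: (n, d) n memory slots of dimension d
    returns (d,)   attention-weighted read of the memory
    """
    scores = memory @ query              # (n,) dot-product similarities
    weights = F.softmax(scores, dim=0)   # attention distribution over slots
    return weights @ memory              # weighted sum of memory slots

def layered_memory_read(frame_feats, word_mem, sent_mem, question):
    """Hypothetical two-layer read, loosely following the LMN idea:
    1) frame level: each frame feature attends over the word memory
       (Static Word Memory) to obtain a word-grounded frame vector;
    2) clip level: the question attends over per-frame vectors fused
       with temporally aligned subtitle sentences (Dynamic Subtitle
       Memory).

    frame_feats: (T, d) visual features of T frames in a clip
    word_mem:    (V, d) embeddings of V subtitle words
    sent_mem:    (T, d) sentence embeddings aligned to the T frames
    question:    (d,)   question embedding
    """
    # Frame level: ground each frame in the subtitle-word memory.
    grounded = torch.stack([attend(f, word_mem) for f in frame_feats])  # (T, d)
    # Clip level: fuse grounded frames with aligned sentences
    # (additive fusion is an assumption made for this sketch).
    clip_mem = grounded + sent_mem                                      # (T, d)
    # Read the clip-level memory with the question.
    return attend(question, clip_mem)                                   # (d,)
```

The hierarchy is the point of the sketch: word-level grounding happens inside each frame before any temporal reasoning, so the clip-level read operates on representations that already tie visual content to subtitle words.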