Video content is now ubiquitous on the Internet, and understanding it precisely and in depth is a difficult but valuable problem for both platforms and researchers. Existing video understanding models perform well on object recognition tasks but still cannot capture abstract, contextual features, such as highlighting humorous frames in comedy videos. Current industrial work likewise focuses mainly on basic category classification based on object appearance, and feature detection methods for abstract categories remain largely unexplored. A data structure that combines video frames, audio spectra, and text offers a new direction to explore, and multimodal models make this kind of in-depth video understanding possible. In this paper, we analyze the difficulties of abstract video understanding and propose a multimodal architecture that achieves state-of-the-art performance in this field. We then select several benchmarks for multimodal video understanding and apply the most suitable model to each to find the best performance. Finally, we evaluate the overall strengths and drawbacks of the models and methods in this paper and point out possible directions for further improvement.
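To make the multimodal data structure concrete, the sketch below shows one plausible way a single clip could be packaged across the three modalities named above (frames, audio spectrum, text). The class name, field names, tensor shapes, and the `make_dummy_sample` helper are illustrative assumptions for this sketch, not the structure actually used in this paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalSample:
    """One video clip represented across three modalities.
    All shapes here are illustrative assumptions, not the paper's spec."""
    frames: np.ndarray       # (T, H, W, 3) RGB frames sampled from the clip
    spectrogram: np.ndarray  # (T', F) audio spectrogram, e.g. log-mel bins
    transcript: str          # subtitles / ASR text aligned with the clip
    label: int               # abstract category, e.g. 1 = "humorous highlight"

def make_dummy_sample() -> MultimodalSample:
    """Build a toy sample with random tensors, just to show the layout."""
    return MultimodalSample(
        frames=np.random.rand(16, 224, 224, 3).astype(np.float32),
        spectrogram=np.random.rand(128, 64).astype(np.float32),
        transcript="[audience laughter] That was not the plan...",
        label=1,
    )

if __name__ == "__main__":
    sample = make_dummy_sample()
    print(sample.frames.shape, sample.spectrogram.shape, sample.label)
```

Bundling all three modalities per clip, rather than keeping separate per-modality datasets, keeps the frames, spectrogram, and transcript temporally aligned, which is what allows a multimodal model to fuse them when detecting abstract features like humor.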