Multi-modal learning, which focuses on utilizing various modalities to improve the performance of a model, is widely used in video recognition. While traditional multi-modal learning offers excellent recognition results, its computational expense limits its impact in many real-world applications. In this paper, we propose an adaptive multi-modal learning framework, called AdaMML, that selects on-the-fly the optimal modalities for each segment, conditioned on the input, for efficient video recognition. Specifically, given a video segment, a multi-modal policy network decides which modalities should be processed by the recognition model, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. Extensive experiments on four challenging, diverse datasets demonstrate that our proposed adaptive approach yields a 35%-55% reduction in computation compared to the traditional baseline that simply uses all the modalities irrespective of the input, while also achieving consistent improvements in accuracy over state-of-the-art methods.
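To make the described design concrete, the sketch below illustrates the general idea of a per-segment policy network gating per-modality sub-networks so that the whole system remains trainable with standard back-propagation. This is not the authors' implementation: the class names, feature dimensions, and the use of a Gumbel-Softmax relaxation for the discrete keep/skip decisions are illustrative assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNetwork(nn.Module):
    """Predicts a binary keep/skip decision per modality for one segment."""

    def __init__(self, feat_dim: int, num_modalities: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_modalities * 2)  # logits for {skip, keep}

    def forward(self, segment_feat: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.head(segment_feat).view(-1, 2)           # (B*M, 2)
        # Gumbel-Softmax (an assumed relaxation) makes the discrete choice
        # differentiable, so the policy trains jointly with the recognizer.
        decisions = F.gumbel_softmax(logits, tau=tau, hard=True)[:, 1]
        return decisions.view(segment_feat.size(0), -1)        # (B, M) in {0, 1}


class AdaptiveMultiModalRecognizer(nn.Module):
    """Per-modality sub-networks gated by the policy's per-segment decisions."""

    def __init__(self, feat_dim: int, num_classes: int, num_modalities: int):
        super().__init__()
        self.policy = PolicyNetwork(feat_dim, num_modalities)
        # Stand-in linear heads; real backbones would be per-modality CNNs.
        self.backbones = nn.ModuleList(
            nn.Linear(feat_dim, num_classes) for _ in range(num_modalities)
        )

    def forward(self, modality_feats: list, policy_feat: torch.Tensor) -> torch.Tensor:
        decisions = self.policy(policy_feat)                    # (B, M)
        logits = 0.0
        for m, (net, feat) in enumerate(zip(self.backbones, modality_feats)):
            # Skipped modalities contribute nothing; at inference time they
            # can simply not be computed, which is where the savings come from.
            logits = logits + decisions[:, m:m + 1] * net(feat)
        return logits


# Usage on dummy data: 2 modalities (e.g., RGB and audio) for a batch of 4 segments.
model = AdaptiveMultiModalRecognizer(feat_dim=128, num_classes=10, num_modalities=2)
feats = [torch.randn(4, 128), torch.randn(4, 128)]
out = model(feats, policy_feat=torch.randn(4, 128))
print(out.shape)  # torch.Size([4, 10])
```

In a full system, the training loss would typically combine the recognition objective with a penalty on the selected modalities' computational cost, steering the policy toward the accuracy-efficiency trade-off described above; that cost term is omitted here for brevity.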