The goal of video summarization is to automatically shorten a video such that it conveys the overall story without losing relevant information. In many application scenarios, improper video summarization can have a large impact: in forensics, the quality of the generated video summary affects an investigator's judgment, while in journalism it might introduce undesired bias. Modeling explainability is therefore a key concern. One of the best ways to address the explainability challenge is to uncover the causal relations that steer the process and lead to the result. Current machine learning-based video summarization algorithms learn optimal parameters but do not uncover causal relationships, and hence suffer from a relative lack of explainability. In this work, a Causal Explainer, dubbed Causalainer, is proposed to address this issue. Multiple meaningful random variables and their joint distributions are introduced to characterize the behaviors of key components in video summarization. In addition, helper distributions are introduced to enhance the effectiveness of model training. In visual-textual input scenarios, the extra textual input can degrade model performance; a causal semantics extractor is designed to tackle this issue by effectively distilling the mutual information from the visual and textual inputs. Experimental results on commonly used benchmarks demonstrate that the proposed method achieves state-of-the-art performance while being more explainable.
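As a rough illustration of how a causal semantics extractor could distill shared information from visual and textual inputs, the following PyTorch-style sketch fuses the two modalities with cross-modal attention. The module name, feature dimensions, and the use of multi-head attention are assumptions made for illustration only and do not reflect the authors' exact architecture.

```python
import torch
import torch.nn as nn


class CausalSemanticsExtractorSketch(nn.Module):
    """Illustrative sketch (not the paper's actual design): let each modality
    attend to the other, then fuse the resulting contexts so that the shared
    (mutual) information dominates the per-frame representation."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention in both directions; dimensions are assumed.
        self.vis_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
        # visual: (batch, n_frames, dim); textual: (batch, n_tokens, dim)
        vis_ctx, _ = self.vis_to_txt(query=visual, key=textual, value=textual)
        txt_ctx, _ = self.txt_to_vis(query=textual, key=visual, value=visual)
        # Pool the text-side context and broadcast it over frames before fusing.
        pooled_txt = txt_ctx.mean(dim=1, keepdim=True).expand_as(vis_ctx)
        return self.fuse(torch.cat([vis_ctx, pooled_txt], dim=-1))


# Usage sketch: the fused per-frame features could feed a scoring head that
# predicts frame-level importance for building the summary.
if __name__ == "__main__":
    frames = torch.randn(2, 120, 512)  # e.g. 120 frames of visual features
    tokens = torch.randn(2, 20, 512)   # e.g. 20 tokens of textual features
    fused = CausalSemanticsExtractorSketch()(frames, tokens)
    print(fused.shape)  # torch.Size([2, 120, 512])
```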