Video summarization has attracted attention as an efficient means of video representation, retrieval, and browsing that eases volume and traffic surge problems. Although video summarization mostly relies on the visual channel for compaction, recent literature has demonstrated the benefits of audio-visual modeling. The information carried by the audio channel can result from audio-visual correlation in the video content. In this study, we propose a new audio-visual video summarization framework that integrates four schemes for audio-visual information fusion with GRU-based and attention-based networks. Furthermore, we investigate a new explainability methodology based on audio-visual canonical correlation analysis (CCA) to better understand and explain the role of audio in the video summarization task. Experimental evaluations on the TVSum dataset show F1-score and Kendall-tau improvements for audio-visual video summarization. Furthermore, splitting the TVSum and COGNIMUSE datasets into positively and negatively correlated videos based on audio-visual CCA yields a strong performance improvement on the positively correlated videos for both audio-only and audio-visual video summarization.