Video highlight detection is a crucial yet challenging problem that aims to identify the interesting moments in untrimmed videos. The key to this task lies in effective video representations that jointly pursue two goals, \textit{i.e.}, cross-modal representation learning and fine-grained feature discrimination. In this paper, we tackle these two challenges by enriching intra-modality and cross-modality relations for representation modeling and by shaping the features in a discriminative manner. Our proposed method mainly leverages intra-modality encoding and cross-modality co-occurrence encoding for full representation modeling. Specifically, intra-modality encoding augments the modality-wise features and dampens the irrelevant modality via within-modality relation learning on both audio and visual signals. Meanwhile, cross-modality co-occurrence encoding focuses on co-occurring inter-modality relations and selectively captures effective information across modalities. The multi-modal representation is further enhanced by global information abstracted from the local context. In addition, we strengthen the discriminative power of the feature embedding with a hard-pairs guided contrastive learning (HPCL) scheme, in which a hard-pairs sampling strategy mines hard samples to improve feature discrimination. Extensive experiments on two benchmarks demonstrate the effectiveness of our proposed method and its superiority over other state-of-the-art methods.
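To make the idea of hard-pairs guided contrastive learning more concrete, the following is a minimal sketch and not the authors' HPCL formulation, which the abstract does not specify. It assumes an InfoNCE-style loss over cosine similarities and treats the negatives most similar to the anchor as the "hard" ones. The function names, the `num_hard` and `temperature` parameters, and the toy vectors in the usage note are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def hard_pair_contrastive_loss(anchor, positive, negatives,
                               num_hard=2, temperature=0.1):
    """InfoNCE-style loss that keeps only the hardest negatives.

    Assumption: a negative is "hard" when its cosine similarity to the
    anchor is high. Only the `num_hard` most similar negatives enter
    the denominator of the softmax.
    """
    pos_logit = cosine(anchor, positive) / temperature

    # Rank negatives from most to least similar and keep the top ones.
    neg_sims = sorted((cosine(anchor, n) for n in negatives), reverse=True)
    hard_logits = [s / temperature for s in neg_sims[:num_hard]]

    # Numerically stable log-sum-exp over the positive and hard negatives.
    logits = [pos_logit] + hard_logits
    max_logit = max(logits)
    log_denom = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )

    # Negative log-probability of picking the positive.
    return log_denom - pos_logit
```

As a usage illustration with toy two-dimensional features, `hard_pair_contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.8, 0.2], [-1.0, 0.0]], num_hard=1)` is dominated by the negative `[0.8, 0.2]`, which lies close to the anchor, while the easy negative `[-1.0, 0.0]` is ignored. Restricting the denominator to hard negatives concentrates the training signal on the pairs the embedding currently confuses, which is the general motivation behind hard-sample mining.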