Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize the video segments that contain an AVE and identify its category. Learning discriminative features for each video segment is pivotal to this task. Unlike existing work, which focuses on audio-visual feature fusion, this paper proposes a new contrastive positive sample propagation (CPSP) method for better deep feature representation learning. The contribution of CPSP is to introduce the available full or weak labels as a prior for constructing exact positive-negative samples for contrastive learning. Specifically, CPSP involves comprehensive contrastive constraints: pair-level positive sample propagation (PSP), and segment-level and video-level positive sample activation (PSA$_S$ and PSA$_V$). Three new contrastive objectives are proposed (\emph{i.e.}, $\mathcal{L}_{\text{avpsp}}$, $\mathcal{L}_{\text{spsa}}$, and $\mathcal{L}_{\text{vpsa}}$) and introduced into both fully and weakly supervised AVE localization. To draw a complete picture of contrastive learning in AVE localization, we also study self-supervised positive sample propagation (SSPSP). As a result, CPSP helps obtain refined audio-visual features that are distinguishable from the negatives, thus benefiting the classifier prediction. Extensive experiments on the AVE and the newly collected VGGSound-AVEL100k datasets verify the effectiveness and generalization ability of our method.
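To make the idea of using labels as a prior for positive-negative sample construction more concrete, the following is a minimal sketch of a generic label-guided contrastive loss over audio and visual segment features. It is not the paper's $\mathcal{L}_{\text{avpsp}}$, $\mathcal{L}_{\text{spsa}}$, or $\mathcal{L}_{\text{vpsa}}$, whose exact forms are not given in this abstract. The function names (`positive_mask`, `label_guided_contrastive_loss`), the InfoNCE-style formulation, the use of `None` for background segments, and the `temperature` parameter are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def positive_mask(labels):
    """Build the positive-pair mask from segment labels (the label prior).

    mask[i][j] is True when segments i and j carry the same event label.
    Background segments (label None) never form positive pairs (assumption).
    """
    n = len(labels)
    return [
        [labels[i] is not None and labels[i] == labels[j] for j in range(n)]
        for i in range(n)
    ]


def label_guided_contrastive_loss(audio, visual, labels, temperature=0.1):
    """InfoNCE-style loss (illustrative, not the paper's objective).

    For each audio segment, visual segments sharing its event label are
    positives and all remaining visual segments are negatives.
    """
    mask = positive_mask(labels)
    losses = []
    for i, a in enumerate(audio):
        logits = [cosine(a, v) / temperature for v in visual]
        positives = [j for j in range(len(visual)) if mask[i][j]]
        if not positives:
            continue  # background segment: no positives to pull closer
        # Numerically stable log-sum-exp over all visual segments.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        # Average negative log-likelihood of the positive pairs.
        losses.append(-sum(logits[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch, a video whose event segments have similar audio and visual features yields a lower loss than one whose event audio resembles the visual background, which is the behavior a label-derived positive-negative split is meant to encourage.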