Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize the video segments containing an AVE and identify its category. To learn discriminative features for a classifier, it is pivotal to identify the helpful (or positive) audio-visual segment pairs while filtering out the irrelevant ones, regardless of whether they are synchronized. To this end, we propose a new positive sample propagation (PSP) module that discovers and exploits closely related audio-visual pairs by evaluating the relationship within every possible pair. This is done by constructing an all-pair similarity map between each audio and visual segment, and aggregating features only from the pairs with high similarity scores. To encourage the network to extract highly correlated features for positive samples, a new audio-visual pair similarity loss is proposed. We also propose a new weighting branch to better exploit the temporal correlations in the weakly supervised setting. We perform extensive experiments on the public AVE dataset and achieve new state-of-the-art accuracy in both the fully and weakly supervised settings, verifying the effectiveness of our method.
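The core of PSP can be illustrated with a minimal NumPy sketch: build an all-pair similarity map between audio and visual segment features, zero out low-scoring (negative) pairs, and let each segment aggregate features only from its high-similarity partners. The function name, the threshold `tau`, and the row-normalization scheme here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def positive_sample_propagation(audio, visual, tau=0.1):
    """Sketch of PSP: audio and visual are (T, d) segment features.

    Returns enhanced audio and visual features where each segment
    aggregates only from cross-modal partners with similarity >= tau.
    (tau and the normalization are illustrative choices.)
    """
    # All-pair similarity map between audio and visual segments: (T, T)
    sim = audio @ visual.T

    # Keep only "positive" pairs: scores at or above tau; zero out the rest
    pruned = np.where(sim >= tau, sim, 0.0)

    # Row-normalize surviving scores so each audio segment takes a
    # weighted combination of its positive visual partners
    weights = pruned / np.maximum(pruned.sum(axis=1, keepdims=True), 1e-12)
    audio_enh = weights @ visual

    # Symmetric direction: visual segments aggregate from positive audio partners
    pruned_t = np.where(sim.T >= tau, sim.T, 0.0)
    weights_t = pruned_t / np.maximum(pruned_t.sum(axis=1, keepdims=True), 1e-12)
    visual_enh = weights_t @ audio

    return audio_enh, visual_enh
```

In the full model, the enhanced features would feed the event classifier, and the proposed pair similarity loss would push positive pairs toward higher scores on the similarity map; both are omitted from this sketch.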