This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events in each modality and localize their temporal boundaries. This task is challenging because only overall video-level labels indicating the events are provided for training. Moreover, an event may be labeled but not appear in one of the modalities, which results in a modality-specific noisy label problem. Motivated by two observations, that networks tend to learn clean samples first and that a labeled event appears in at least one modality, we propose a training strategy to dynamically identify and remove modality-specific noisy labels. Specifically, we sort the losses of all instances within a mini-batch individually in each modality, then select noisy samples according to the relationships between intra-modal and inter-modal losses. In addition, we propose a simple yet effective noise ratio estimation method that computes the proportion of instances whose confidence falls below a preset threshold. Our method yields large improvements over the previous state of the art (e.g., from 60.0% to 63.8% on the segment-level visual metric), which demonstrates the effectiveness of our approach.
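The two components above, threshold-based noise ratio estimation and loss-ranking-based selection of modality-specific noisy labels, can be sketched as follows. This is a minimal illustration under assumed inputs (per-instance losses and confidences for one event label within a mini-batch); the threshold value, function names, and the exact selection rule are hypothetical, not taken from the paper.

```python
def estimate_noise_ratio(confidences, threshold=0.5):
    """Estimate the noise ratio as the fraction of instances whose
    predicted confidence is below a preset threshold (threshold value
    here is an assumed placeholder)."""
    below = sum(1 for c in confidences if c < threshold)
    return below / len(confidences)


def select_noisy_labels(audio_losses, visual_losses, noise_ratio):
    """Flag modality-specific noisy labels within a mini-batch.

    Per modality, the top-k highest-loss instances are candidates
    (networks learn clean samples first, so noisy labels keep high
    loss).  A candidate's label is discarded only when the *other*
    modality's loss is smaller, since a labeled event should appear
    in at least one modality.
    """
    n = len(audio_losses)
    k = int(noise_ratio * n)  # number of candidates per modality

    # Sort instance indices by intra-modal loss, largest first.
    by_audio = sorted(range(n), key=lambda i: audio_losses[i], reverse=True)
    by_visual = sorted(range(n), key=lambda i: visual_losses[i], reverse=True)

    noisy_audio, noisy_visual = set(), set()
    for i in by_audio[:k]:
        if audio_losses[i] > visual_losses[i]:  # inter-modal comparison
            noisy_audio.add(i)
    for i in by_visual[:k]:
        if visual_losses[i] > audio_losses[i]:
            noisy_visual.add(i)
    return noisy_audio, noisy_visual
```

In this sketch, an instance flagged in one modality keeps its label in the other, so the video-level supervision for that event is never removed entirely, consistent with the assumption that a labeled event occurs in at least one modality.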