Noise-Tolerant Learning for Audio-Visual Action Recognition

From arXiv. This work is to be submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Video recognition has recently advanced with the help of multi-modal learning, which integrates multiple modalities to improve the performance or robustness of a model. Although various multi-modal learning methods have been proposed and achieve remarkable recognition results, almost all of them rely on high-quality manual annotations and assume that the modalities in multi-modal data provide relevant semantic information. Unfortunately, most widely used video datasets are collected from the Internet and inevitably contain noisy labels and noisy correspondence. To address this problem, we use the audio-visual action recognition task as a proxy and propose a noise-tolerant learning framework that finds model parameters robust to both noisy labels and noisy correspondence. Our method consists of two phases and aims to rectify noise through the inherent correlation between modalities. A noise-tolerant contrastive training phase is performed first to learn robust model parameters that are unaffected by noisy labels. To reduce the influence of noisy correspondence, we propose a cross-modal noise estimation component that adjusts the consistency between different modalities. Since noisy correspondence exists at the instance level, a category-level contrastive loss is proposed to further alleviate its interference. Then, in the hybrid supervised training phase, we compute a distance metric among features to obtain corrected labels, which are used as complementary supervision. In addition, we investigate noisy correspondence in real-world datasets and conduct comprehensive experiments on data with synthetic and real noise. The results verify that our method outperforms state-of-the-art methods.
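
The label-correction idea in the hybrid supervised training phase can be illustrated with a minimal sketch. The abstract does not say which distance metric is used or how corrected labels are derived from it, so the code below assumes cosine distance to per-class mean features (centroids) computed from the noisy labels. The function names `cosine_distance`, `class_centroids`, and `correct_labels` are hypothetical, and this is not the authors' implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def class_centroids(features, labels):
    """Average the feature vectors that share the same (possibly noisy) label."""
    sums, counts = {}, {}
    for feat, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(feat)
            counts[lab] = 0
        sums[lab] = [s + x for s, x in zip(sums[lab], feat)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def correct_labels(features, labels):
    """Assumed correction rule: relabel each sample with its nearest class centroid."""
    centroids = class_centroids(features, labels)
    return [
        min(centroids, key=lambda lab: cosine_distance(feat, centroids[lab]))
        for feat in features
    ]


# Toy example: the third sample lies in the class-0 cluster but is labeled 1.
features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
noisy_labels = [0, 0, 1, 1, 1, 1]
corrected = correct_labels(features, noisy_labels)
print(corrected)
```

In this toy setting the mislabeled third sample is reassigned to class 0, because its feature is much closer to the class-0 centroid than to the class-1 centroid. In a framework like the one described, such corrected labels would serve as complementary supervision alongside the original labels rather than replace them outright.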
