Self-supervised pre-training has recently demonstrated success on large-scale multimodal data, and state-of-the-art contrastive learning methods often enforce feature consistency between cross-modality inputs, such as video/audio or video/text pairs. Despite being convenient to formulate and leverage in practice, such cross-modality alignment (CMA) is only a weak and noisy form of supervision, since two modalities can be semantically misaligned even when they are temporally aligned. For example, even in the commonly adopted instructional videos, a speaker can sometimes refer to something that is not visually present in the current frame; and the semantic misalignment is only more unpredictable for raw videos from the internet. We conjecture that such misalignment might cause conflicts and biases among modalities, and may hence prevent CMA from scaling up to training with larger and more heterogeneous data. This paper first verifies our conjecture by observing that, even in the latest VATT pre-training using only instructional videos, there exist strong gradient conflicts between different CMA losses within the same (video, audio, text) triplet, indicating them as a noisy source of supervision. We then propose to harmonize such gradients via two techniques: (i) cross-modality gradient realignment: modifying different CMA loss gradients for each sample triplet so that their gradient directions are better aligned; and (ii) gradient-based curriculum learning: leveraging the gradient conflict information as an indicator of sample noisiness, and developing a curriculum learning strategy that prioritizes training on less noisy sample triplets. Applying these techniques to pre-training VATT on the HowTo100M dataset, we consistently improve its performance on different downstream tasks. Moreover, we are able to scale VATT pre-training to the more complex, non-narrative YouTube8M dataset, further improving the state of the art.
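The abstract does not spell out the two techniques, but the general idea can be illustrated. The following is a minimal sketch only, not the paper's exact method: it assumes a PCGrad-style projection for realigning two conflicting per-triplet loss gradients, and uses negative cosine similarity as a hypothetical noisiness score for curriculum ordering. The function names `harmonize` and `noisiness` are illustrative, not from the paper.

```python
import numpy as np

def harmonize(g_a, g_b):
    """If two CMA loss gradients conflict (negative inner product),
    project each onto the normal plane of the other so that the
    resulting directions no longer oppose each other (PCGrad-style)."""
    dot = float(np.dot(g_a, g_b))
    if dot < 0:
        g_a_new = g_a - dot / float(np.dot(g_b, g_b)) * g_b
        g_b_new = g_b - dot / float(np.dot(g_a, g_a)) * g_a
        return g_a_new, g_b_new
    return g_a, g_b

def noisiness(g_a, g_b):
    """Gradient-conflict score as a curriculum signal: the more
    negative the cosine similarity of the two CMA loss gradients,
    the noisier the sample triplet is presumed to be."""
    cos = np.dot(g_a, g_b) / (np.linalg.norm(g_a) * np.linalg.norm(g_b))
    return max(0.0, -float(cos))
```

In a curriculum, triplets could be sorted by `noisiness` so that low-conflict samples are trained on first; `harmonize` would then be applied to the gradients of the remaining conflicting pairs before the optimizer step.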