Self-supervised learning has attracted considerable recent research interest. However, most work on self-supervision in speech is unimodal, and there has been limited study of the interaction between audio and visual modalities for cross-modal self-supervision. This work (1) investigates visual self-supervision via face reconstruction to guide the learning of audio representations; (2) proposes an audio-only self-supervision approach for speech representation learning; (3) shows that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features that are more robust in noisy conditions; (4) shows that self-supervised pretraining can outperform fully supervised training and is especially useful for preventing overfitting on smaller datasets. We evaluate our learned audio representations on discrete emotion recognition, continuous affect recognition, and automatic speech recognition. We outperform existing self-supervised methods on all tested downstream tasks. Our results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative audio representations for speech and emotion recognition.