Audio Representation Learning by Distilling Video as Privileged Information

Deep audio representation learning using multi-modal audio-visual data often outperforms uni-modal approaches. However, in real-world scenarios the two modalities are not always both available at inference time, which degrades the performance of models trained for multi-modal inference. In this work, we propose a novel approach for deep audio representation learning that uses audio-visual data for training when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). While previous LUPI methods use soft labels generated by the teacher, our method uses the embeddings learned by the teacher to train the student network. We integrate our method in two different settings: sequential data, where the features are divided into multiple segments over time, and non-sequential data, where the entire feature set is treated as a single segment. In the non-sequential setting, both the teacher and student networks consist of an encoder component and a task header. We use the embeddings produced by the teacher's encoder to train the student's encoder, while the student's task header is trained using ground-truth labels. In the sequential setting, the networks have an additional aggregation component placed between the encoder and the task header. We use two sets of embeddings, produced by the teacher's encoder and aggregation component respectively, to train the student. As in the non-sequential setting, the task header of the student network is trained using ground-truth labels. We test our framework on two different audio-visual tasks, namely speaker recognition and speech emotion recognition, and show considerable improvements over audio-only recognition as well as over prior works that use LUPI.
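
The training signal described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the distance used to match embeddings or how the terms are weighted, so the mean squared error, the weights `alpha` and `beta`, the summation into a single scalar, and all function names here are assumptions made for illustration. The sketch computes, for one example, a task loss on ground-truth labels plus embedding-matching terms at the encoder (per segment) and, in the sequential setting, at the aggregation component.

```python
import math
from typing import Optional, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def student_loss(
    student_segments: Sequence[Sequence[float]],
    teacher_segments: Sequence[Sequence[float]],
    logits: Sequence[float],
    label: int,
    student_agg: Optional[Sequence[float]] = None,
    teacher_agg: Optional[Sequence[float]] = None,
    alpha: float = 1.0,  # hypothetical weight for the encoder-level term
    beta: float = 1.0,   # hypothetical weight for the aggregation-level term
) -> float:
    """Combine the task loss with embedding-matching terms (assumed form).

    Non-sequential setting: pass a single segment and leave the *_agg
    arguments as None. Sequential setting: pass one embedding per segment
    plus the aggregated embeddings of student and teacher.
    """
    if len(student_segments) != len(teacher_segments):
        raise ValueError("student and teacher must have the same segment count")

    # Encoder-level term: average distance between per-segment embeddings.
    encoder_term = sum(
        mse(s, t) for s, t in zip(student_segments, teacher_segments)
    ) / len(student_segments)

    # Aggregation-level term: only present in the sequential setting.
    agg_term = 0.0
    if student_agg is not None and teacher_agg is not None:
        agg_term = mse(student_agg, teacher_agg)

    # Task term: the student's task header sees only ground-truth labels.
    task_term = cross_entropy(logits, label)

    return task_term + alpha * encoder_term + beta * agg_term
```

In an actual training loop the teacher embeddings would be computed from audio-visual input and held fixed, while the student sees audio only. Whether the task loss should also update the student's encoder, or only the task header, is a design choice that the abstract leaves open; this sketch simply adds the terms together.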
