In recent years, associations between the faces and voices of celebrities have been established by leveraging large-scale audio-visual information from YouTube. The availability of such large-scale audio-visual datasets has been instrumental in developing speaker recognition methods based on standard Convolutional Neural Networks. The aim of this paper is therefore to leverage large-scale audio-visual information to improve speaker recognition. To this end, we propose a two-branch network that learns joint representations of faces and voices in a multimodal system. Features extracted from the two-branch network are then used to train a classifier for speaker recognition. We evaluate our proposed framework on the large-scale audio-visual dataset VoxCeleb1. Our results show that adding facial information improves speaker recognition performance. Moreover, our results indicate that there is an overlap between facial and vocal information.
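The two-branch idea described above can be sketched as follows: each modality has its own branch projecting into a shared embedding space, and the resulting embeddings can be compared or concatenated for a downstream classifier. This is a minimal pure-Python sketch with linear branches and illustrative dimensions; the branch architecture, embedding size, and training objective are assumptions, not details taken from the paper.

```python
import math
import random

random.seed(0)

def linear(x, W):
    # One fully connected layer: matrix-vector product.
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def l2_normalize(v):
    # Unit-normalize so embeddings live on a common hypersphere.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

class TwoBranchNet:
    """Projects face and voice features into a shared embedding space.

    Dimensions below are illustrative; real systems would use CNN
    features for both modalities and learned (trained) weights.
    """

    def __init__(self, face_dim, voice_dim, embed_dim):
        self.W_face = [[random.gauss(0.0, 0.1) for _ in range(face_dim)]
                       for _ in range(embed_dim)]
        self.W_voice = [[random.gauss(0.0, 0.1) for _ in range(voice_dim)]
                        for _ in range(embed_dim)]

    def forward(self, face_feat, voice_feat):
        f = l2_normalize(linear(face_feat, self.W_face))
        v = l2_normalize(linear(voice_feat, self.W_voice))
        return f, v

def cosine_similarity(a, b):
    # Both inputs are already unit-norm, so the dot product suffices.
    return sum(x * y for x, y in zip(a, b))

# Forward pass for one (face, voice) pair with random input features.
net = TwoBranchNet(face_dim=512, voice_dim=256, embed_dim=128)
face = [random.gauss(0.0, 1.0) for _ in range(512)]
voice = [random.gauss(0.0, 1.0) for _ in range(256)]
f_emb, v_emb = net.forward(face, voice)
sim = cosine_similarity(f_emb, v_emb)
```

In a full system the joint embeddings (or their concatenation) would be fed to a speaker classifier, and the branches would be trained so that embeddings of the same identity score high similarity across modalities.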