The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where actors often intentionally disguise their voices to play a character. We make the following three contributions: (i) We collect a novel, challenging speaker recognition dataset called VoxMovies, with speech for 856 identities from almost 4000 movie clips. VoxMovies contains utterances with varying emotion, accents and background noise, and therefore comprises an entirely different domain to the interview-style, emotionally calm utterances in current speaker recognition datasets such as VoxCeleb; (ii) We provide a number of domain adaptation evaluation sets, and benchmark the performance of state-of-the-art speaker recognition models on these evaluation pairs. We demonstrate that both speaker verification and identification performance drop steeply on this new data, showing the challenge in transferring models across domains; and finally (iii) We show that simple domain adaptation paradigms improve performance, but that there is still large room for improvement.