Recent years have seen a surge of interest in establishing associations between faces and voices in cross-modal biometric applications, alongside speaker recognition. Inspired by this, we introduce the challenging task of establishing the association between faces and voices across multiple languages spoken by the same set of persons. The aim of this paper is to answer two closely related questions: "Is face-voice association language-independent?" and "Can a speaker be recognised irrespective of the spoken language?". These two questions are important for understanding the effectiveness of multilingual biometric systems and for advancing their development. To answer them, we collected a Multilingual Audio-Visual dataset containing human speech clips of $154$ identities with $3$ language annotations, extracted from various videos uploaded online. Extensive experiments on the three splits of the proposed dataset have been performed to investigate and answer these novel research questions, and their results clearly point out the relevance of the multilingual problem.