With the advent of digital technology, crimes and legal disputes increasingly involve some form of speech recording in which the identity of a speaker is questioned [1]. In the face of this situation, the field of forensic speaker identification seeks to shed light on the problem by quantifying how strongly a speech recording can be attributed to a particular person relative to a population. In this work, we explore the use of speech embeddings obtained by training a CNN with the triplet loss. In particular, we focus on the Spanish language, which has not been extensively studied. We propose extracting the embeddings from speech spectrogram samples, explore several configurations of these spectrograms, and finally quantify the quality of the embeddings. We also show some limitations of our data setting, which is predominantly composed of male speakers. Finally, we propose two approaches to calculate the Likelihood Ratio given our speech embeddings, and we show that the triplet loss is a good alternative for creating speech embeddings for forensic speaker identification.
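For readers unfamiliar with the triplet loss mentioned in the abstract, the following is a minimal NumPy sketch of its standard formulation, not the authors' implementation. The function name `triplet_loss`, the squared Euclidean distance, and the margin value of 0.2 are assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean triplet loss over a batch of embeddings of shape (batch, dim).

    anchor and positive are assumed to come from the same speaker,
    negative from a different speaker. The margin of 0.2 is an
    illustrative default, not a value taken from the paper.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    # Squared Euclidean distance between each anchor and its positive/negative.
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    # Hinge: penalize triplets where the negative is not at least
    # `margin` farther from the anchor than the positive.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


# Toy 2-D embeddings (hypothetical values, for illustration only).
a = [[0.0, 0.0]]
same_speaker = [[0.0, 0.0]]
other_speaker = [[1.0, 0.0]]
well_separated = triplet_loss(a, same_speaker, other_speaker)
badly_separated = triplet_loss(a, other_speaker, same_speaker)
```

In this toy case the well-separated triplet incurs zero loss, while swapping the positive and negative yields a positive loss, which is the signal that pulls same-speaker embeddings together during training.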