Multi-source Domain Adaptation for Text-independent Forensic Speaker Recognition

Adapting speaker recognition systems to new environments is a widely used technique for improving a well-performing model, learned from large-scale data, on a task-specific small-scale data scenario. However, previous studies focus on single-domain adaptation, which neglects a more practical scenario in which training data are collected from multiple acoustic domains, as is needed in forensic scenarios. Audio analysis for forensic speaker recognition poses unique challenges for model training with multi-domain data, due to location/scenario uncertainty and the diversity mismatch between reference and naturalistic field recordings. It is also difficult to train complex neural network architectures directly on small-scale domain-specific data, due to domain mismatch and performance loss. Fine-tuning is a commonly used adaptation method that retrains the model with weights initialized from a well-trained model. As an alternative, this study proposes three novel adaptation methods, based on domain adversarial training, discrepancy minimization, and moment matching, to further improve adaptation performance across multiple acoustic domains. A comprehensive set of experiments is conducted to demonstrate that: 1) diverse acoustic environments do affect speaker recognition performance, a finding that could advance research in audio forensics; 2) domain adversarial training learns discriminative features that are also invariant to shifts between domains; 3) discrepancy-minimizing adaptation achieves effective performance simultaneously across multiple acoustic domains; and 4) moment-matching adaptation combined with dynamic distribution alignment also significantly improves speaker recognition performance on each domain compared to all other systems, especially on the noisy LENA-field domain.
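As a rough illustration of the moment-matching idea named in the abstract, the sketch below computes a discrepancy between per-domain embedding statistics that an adaptation loss could try to minimize. It is not the authors' implementation, which the abstract does not specify: the function names `moments` and `moment_distance`, the choice of raw moments up to order 2, the L2 gap, and the toy domain names are all assumptions made for illustration.

```python
from itertools import combinations


def moments(features, order):
    """Per-dimension raw moment E[x**order] over a batch of embeddings."""
    n = len(features)
    dim = len(features[0])
    return [sum(row[d] ** order for row in features) / n for d in range(dim)]


def moment_distance(domains, max_order=2):
    """Sum over moment orders of the L2 gap between domain moments,
    averaged over all pairs of domains (hypothetical formulation)."""
    pairs = list(combinations(range(len(domains)), 2))
    total = 0.0
    for order in range(1, max_order + 1):
        per_domain = [moments(feats, order) for feats in domains]
        for i, j in pairs:
            gap = sum((a - b) ** 2 for a, b in zip(per_domain[i], per_domain[j]))
            total += gap ** 0.5
    return total / len(pairs)


# Toy 2-dimensional "speaker embeddings" for three hypothetical acoustic domains.
clean = [[0.0, 1.0], [2.0, 1.0]]
noisy = [[1.0, 1.0], [1.0, 1.0]]    # same mean as `clean`, different spread
shifted = [[3.0, 1.0], [5.0, 1.0]]  # different mean and spread

print(moment_distance([clean, noisy]))
print(moment_distance([clean, shifted]))
print(moment_distance([clean, noisy, shifted]))
```

In a training setup, a term like this would typically be added to the speaker classification loss so that embeddings from different domains are pushed towards similar statistics; how the terms are weighted is left open here.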


