Creating universal speaker encoders that are robust to different acoustic conditions and speech durations remains a major challenge. In our observations, systems trained on short speech segments perform best for short-phrase speaker verification, while systems trained on long segments are superior for long-segment verification. A system trained simultaneously on pooled short and long speech segments does not give optimal verification results and usually degrades performance on both short and long segments. This paper addresses the problem of creating universal speaker encoders for speech segments of different durations. We describe a simple recipe for training a universal speaker encoder that can be applied to any selected neural network architecture. In our evaluation of wav2vec-TDNN based systems on the NIST SRE and VoxCeleb1 benchmarks, the proposed universal encoder improves speaker verification across different enrollment and test speech segment durations. The key feature of the proposed encoder is that it has the same inference time as the selected neural network architecture.