This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) Challenge 2020. Task 1 is a text-dependent speaker verification task, in which both the speaker and the spoken phrase must be verified. The submitted systems are built on TDNN-based and ResNet-based front-end architectures, in which frame-level features are aggregated with various pooling methods (e.g., statistical, self-attentive, and GhostVLAD pooling). Although these conventional pooling methods yield embeddings with sufficient speaker-dependent information, our experiments show that the embeddings often lack phrase-dependent information. To mitigate this problem, we propose new pooling and score compensation methods that leverage a CTC-based automatic speech recognition (ASR) model to take the lexical content into account. Both methods improved over the conventional techniques, and the best performance was achieved by fusing all of the systems we evaluated, which yielded 0.0785% MinDCF and 2.23% EER on the challenge's evaluation subset.
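As a point of reference for the conventional pooling mentioned above, the following is a minimal sketch of statistical pooling only, not of the proposed ASR-informed pooling or score compensation. It assumes the common formulation in which the per-dimension mean and standard deviation over time are concatenated; the function name `statistics_pooling` and the plain-list representation of the features are illustrative assumptions, not taken from the submitted systems.

```python
import math
from typing import List, Sequence


def statistics_pooling(frames: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate T frame-level vectors of size D into one vector of size 2*D.

    The result is the per-dimension mean followed by the per-dimension
    (population) standard deviation, both computed over the time axis.
    This is an illustrative sketch, not the authors' implementation.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    num_frames = len(frames)
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same dimension")

    # Mean of each feature dimension across all frames.
    means = [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]
    # Population variance of each feature dimension across all frames.
    variances = [
        sum((frame[d] - means[d]) ** 2 for frame in frames) / num_frames
        for d in range(dim)
    ]
    stds = [math.sqrt(v) for v in variances]
    # Concatenate the two statistics into a fixed-size utterance-level vector.
    return means + stds


if __name__ == "__main__":
    # Two frames of two-dimensional features (toy values).
    print(statistics_pooling([[1.0, 2.0], [3.0, 2.0]]))
```

Because the output size depends only on the feature dimension and not on the number of frames, utterances of any duration map to a vector of the same size. The statistics are also invariant to frame order, which is consistent with the observation that such embeddings can lose phrase-dependent information.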