Incremental improvements in the accuracy of Convolutional Neural Networks are usually achieved through the use of deeper and more complex models trained on larger datasets. However, enlarging datasets and models increases computation and storage costs and cannot be done indefinitely. In this work, we seek to improve the identification and verification accuracy of a text-independent speaker recognition system without using extra data or deeper and more complex models, by augmenting the training and testing data, finding the optimal dimensionality of the embedding space, and using more discriminative loss functions. Results of experiments on the VoxCeleb dataset suggest that: (i) simple repetition and random time-reversal of utterances can reduce prediction errors by up to 18%; (ii) lower-dimensional embeddings are more suitable for verification; (iii) the proposed logistic margin loss function leads to unified embeddings with state-of-the-art identification and competitive verification accuracies.
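The augmentation in finding (i) can be illustrated with a minimal sketch; the function name, probability parameter, and list-based waveform representation below are illustrative assumptions, not the paper's actual implementation.

```python
import random

def augment_utterance(samples, reverse_prob=0.5):
    """Sketch of the abstract's augmentation: repeat the utterance,
    then time-reverse it with some probability (hypothetical API)."""
    # Simple repetition: concatenate the utterance with itself.
    x = samples + samples
    # Random time-reversal applied to the repeated signal.
    if random.random() < reverse_prob:
        x = x[::-1]
    return x

# Example usage on a toy waveform (a list of amplitude samples).
wave = [0.1, -0.2, 0.3]
out = augment_utterance(wave)
```

Both transforms preserve the utterance's content while doubling its length and varying its temporal direction, which is the source of the extra training diversity.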