EfficientTDNN: Efficient Architecture Search for Speaker Recognition in the Wild

Speaker recognition refers to audio biometrics that utilize acoustic characteristics. These systems have emerged as an essential means of authenticating identity in areas such as smart homes, general business interactions, e-commerce applications, and forensics. The mismatch between development and real-world data shifts the speaker embedding space and severely degrades the performance of speaker recognition. Extensive efforts have been devoted to addressing speaker recognition in the wild, but these often neglect computation and storage requirements. In this work, we propose an efficient time-delay neural network (EfficientTDNN) based on neural architecture search to improve inference efficiency while maintaining recognition accuracy. The proposed EfficientTDNN comprises three phases: supernet design, progressive training, and architecture search. First, we borrow the design of TDNN to construct a supernet that enables sampling subnets with different depth, kernel, and width. Second, the supernet is progressively trained with multi-condition data augmentation to mitigate interference between subnets and overcome the challenge of optimizing a huge search space. Third, an accuracy predictor and an efficiency estimator are used in the architecture search to derive a specialized subnet under given efficiency constraints. Experimental results on the VoxCeleb dataset show that EfficientTDNN achieves 1.55% equal error rate (EER) and 0.138 detection cost function (DCF$_{0.01}$) with 565M multiply-accumulate operations (MACs), as well as 0.96% EER and 0.108 DCF$_{0.01}$ with 1.46G MACs. Comprehensive investigations suggest that the trained supernet generalizes to subnets not sampled during training and obtains a favorable trade-off between accuracy and efficiency.
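The third phase, searching for a subnet under an efficiency constraint, can be illustrated with a minimal sketch. It is not the authors' implementation: the search-space values, the input size (80 features, 200 frames), the toy `proxy_accuracy` score, and the use of plain random search are all assumptions made for illustration. Only the general idea comes from the abstract: subnets vary in depth, kernel, and width, an efficiency estimator (here a MAC count) filters candidates, and an accuracy predictor ranks the survivors.

```python
import random

# Hypothetical search space; these are NOT the values used by EfficientTDNN.
DEPTHS = [2, 3, 4]          # number of 1-D convolution (TDNN-style) layers
KERNELS = [1, 3, 5]         # kernel size of each layer
WIDTHS = [128, 256, 512]    # output channels of each layer


def sample_subnet(rng):
    """Sample one subnet as a list of (kernel, width) pairs, one per layer."""
    depth = rng.choice(DEPTHS)
    return [(rng.choice(KERNELS), rng.choice(WIDTHS)) for _ in range(depth)]


def estimate_macs(subnet, in_channels=80, frames=200):
    """Efficiency estimator: MACs of stacked 1-D convolutions.

    Assumes the frame count is preserved by every layer, so each layer
    costs frames * kernel * c_in * c_out multiply-accumulates.
    """
    total, c_in = 0, in_channels
    for kernel, width in subnet:
        total += frames * kernel * c_in * width
        c_in = width
    return total


def proxy_accuracy(subnet):
    """Toy stand-in for an accuracy predictor (higher is better).

    A real predictor would be fitted on measured subnet accuracies;
    this one simply rewards larger kernels and wider layers.
    """
    return sum(kernel * width for kernel, width in subnet)


def random_search(budget_macs, trials=200, seed=0):
    """Return the best-scoring sampled subnet whose MACs fit the budget."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(trials):
        subnet = sample_subnet(rng)
        if estimate_macs(subnet) > budget_macs:
            continue  # violates the efficiency constraint
        score = proxy_accuracy(subnet)
        if score > best_score:
            best, best_score = subnet, score
    return best


if __name__ == "__main__":
    best = random_search(budget_macs=50_000_000)
    print(best, estimate_macs(best) if best else None)
```

Calling `random_search` with a larger `budget_macs` admits deeper and wider subnets, which mirrors the accuracy/efficiency trade-off reported in the abstract; a search with a learned predictor or an evolutionary strategy could replace the random sampling without changing the structure of the loop.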

Related content

Speaker recognition

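Speaker recognition, also called voiceprint recognition (VPR), is a biometric technology that automatically determines a speaker's identity from the speaker-specific information contained in speech, using computers and modern information recognition techniques. The goal of speaker recognition research is to extract speaker-discriminative features from speech and to build effective models and systems that identify speakers automatically and accurately.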