Wake word (WW) spotting is challenging in the far field, not only because of interference during signal transmission but also because of the complexity of acoustic environments. Traditional WW model training requires a large amount of in-domain, WW-specific data with substantial human annotation, so it is hard to build WW models without such data. In this paper we present data-efficient solutions that address the challenges in WW modeling, such as domain mismatch, noisy conditions, and limited annotation. Our proposed system combines a multi-condition training pipeline with stratified data augmentation, which improves model robustness to a variety of predefined acoustic conditions, and a semi-supervised learning pipeline that accurately extracts WW and confusable examples from an untranscribed speech corpus. Starting from only 10 hours of domain-mismatched WW audio, we enlarge and enrich the training dataset by 20-100 times to capture the acoustic complexity. Our experiments on real user data show that the proposed solutions achieve performance comparable to a production-grade model while saving 97\% of the WW-specific data collection and 86\% of the annotation bandwidth.
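The stratified augmentation idea can be sketched as follows: each predefined acoustic condition forms a stratum with its own noise pool, sampling weight, and SNR range, and clean WW audio is mixed with noise drawn per stratum. This is a minimal illustrative sketch, not the paper's actual pipeline; the strata layout and the names `mix_at_snr` and `augment` are assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise clips
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def augment(speech, strata, rng):
    """Draw one stratum by weight, then a noise clip and an SNR within that stratum."""
    names = list(strata)
    weights = np.array([strata[n]["weight"] for n in names], dtype=float)
    name = rng.choice(names, p=weights / weights.sum())
    lo, hi = strata[name]["snr_db"]
    snr_db = rng.uniform(lo, hi)
    noises = strata[name]["noises"]
    noise = noises[rng.integers(len(noises))]
    return mix_at_snr(speech, noise, snr_db), name, snr_db
```

Because conditions are sampled per stratum rather than from one pooled SNR distribution, rare but important conditions (e.g. very low SNR) can be guaranteed a fixed share of the augmented data.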