Trigger-word detection serves as the entry point for a user's interaction with voice assistants. However, supporting a particular word as a trigger word requires a large amount of data collection, augmentation, and labelling for that word, which makes adding new trigger words a tedious and time-consuming process. To address this, we explore the use of contrastive learning as a pre-training task that helps the detection model generalize to different words and noise conditions. We explore supervised contrastive techniques and also propose a novel self-supervised training technique using chunked words from long sentence audios. We show that both the supervised and the new self-supervised contrastive pre-training techniques achieve results comparable to traditional classification pre-training on new trigger words when less data is available.
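As a rough illustration of the kind of contrastive pre-training objective described above, here is a minimal PyTorch sketch of an NT-Xent-style loss, where a positive pair could be two augmented views of the same chunked word segment. The `encoder`, augmentation functions, and temperature value are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of an NT-Xent-style contrastive loss (an assumption about the
# training objective, not the paper's confirmed implementation).
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two views of the same word chunks."""
    batch = z1.size(0)
    # L2-normalize and stack both views: rows 0..B-1 are view 1, B..2B-1 view 2.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = torch.matmul(z, z.T) / temperature       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarity
    # The positive for index i is the other view of the same chunk: (i + B) mod 2B.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)

# Hypothetical usage with an assumed audio encoder and two augmentations
# of the same chunked word audio:
# loss = nt_xent_loss(encoder(augment_a(chunks)), encoder(augment_b(chunks)))
```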