Voice trigger detection is an important task that enables a voice assistant to be activated when a target user speaks a keyword phrase. A detector is typically trained on speech data without regard to speaker information and then used for voice trigger detection. However, such a speaker-independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker-independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score between the embeddings of the enrollment utterances and that of a test utterance. The personalized embedding allows the model to adapt to the target speaker's speech when computing the voice trigger score, hence improving voice trigger detection accuracy. Experimental results show that the proposed approach achieves a 38% relative reduction in false rejection rate (FRR) compared to a baseline speaker-independent voice trigger model.
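
The personalized scoring step can be illustrated with a minimal sketch. The abstract does not say which similarity measure is used or how multiple enrollment embeddings are combined, so the cosine similarity and the mean-pooling of enrollment embeddings below are assumptions, and the function names `cosine_similarity` and `personalized_score` are hypothetical. The embeddings themselves would come from the decoder; here they are plain lists of floats.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Degenerate case: treat an all-zero embedding as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def personalized_score(
    enrollment_embeddings: Sequence[Sequence[float]],
    test_embedding: Sequence[float],
) -> float:
    """Score a test utterance against a speaker's enrollment utterances.

    Assumption (not stated in the abstract): the enrollment embeddings are
    averaged into a single speaker profile, which is then compared to the
    test embedding with cosine similarity.
    """
    if not enrollment_embeddings:
        raise ValueError("at least one enrollment embedding is required")
    dim = len(test_embedding)
    if any(len(e) != dim for e in enrollment_embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(enrollment_embeddings)
    # Mean-pool the enrollment embeddings dimension by dimension.
    profile = [sum(e[i] for e in enrollment_embeddings) / count for i in range(dim)]
    return cosine_similarity(profile, test_embedding)
```

Under these assumptions, a test embedding pointing in the same direction as the averaged enrollment profile scores close to 1, and an orthogonal one scores 0. A deployed system would compare this score to a tuned threshold, possibly in combination with the encoder's speaker-independent score; the abstract gives no details on that combination.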