In this work, we define barge-in verification as a supervised learning task in which audio-only information is used to classify user spoken dialogue into true and false barge-ins. Following the success of pre-trained models, we use low-level speech representations from a self-supervised representation learning model for our downstream classification task. Further, we propose a novel technique to infuse lexical information directly into speech representations, improving the domain-specific language information implicitly learned during pre-training. Experiments conducted on spoken dialogue data show that our proposed model, trained to validate barge-ins entirely from speech representations, is 38% faster and achieves a 4.5% relative F1-score improvement over a baseline LSTM model that uses both audio and Automatic Speech Recognition (ASR) 1-best hypotheses. On top of this, our best proposed model with lexically infused representations and contextual features provides a further 5.7% relative improvement in F1 score, though it is only 22% faster than the baseline.
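To make the audio-only setup concrete, the sketch below shows one plausible form of such a classifier: a pre-trained self-supervised speech encoder whose frame-level representations are pooled and passed to a binary head. The abstract does not name the encoder or the head architecture, so wav2vec 2.0 (via HuggingFace `transformers`), mean pooling, and the class name `BargeInClassifier` are all illustrative assumptions, not the authors' exact model.

```python
# Minimal sketch of an audio-only barge-in classifier built on
# self-supervised speech representations. Assumptions: wav2vec 2.0
# as the encoder and a mean-pooled linear head; the paper's actual
# architecture may differ.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class BargeInClassifier(nn.Module):  # hypothetical name
    def __init__(self, pretrained: str = "facebook/wav2vec2-base"):
        super().__init__()
        # Pre-trained self-supervised speech encoder (frozen or fine-tuned).
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained)
        hidden = self.encoder.config.hidden_size
        # Binary head: true vs. false barge-in.
        self.head = nn.Linear(hidden, 2)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), 16 kHz mono audio.
        feats = self.encoder(waveform).last_hidden_state  # (B, T, H)
        pooled = feats.mean(dim=1)                        # pool over time
        return self.head(pooled)                          # (B, 2) logits


# Usage: logits over {false barge-in, true barge-in} for 1 s of audio.
logits = BargeInClassifier()(torch.randn(1, 16000))
```

Because the classifier never consults ASR output, inference avoids waiting on an ASR 1-best hypothesis, which is consistent with the latency gains reported above.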