We present a novel multi-modal unspoken punctuation prediction system for the English language that combines acoustic and text features. We demonstrate for the first time that, by relying exclusively on synthetic data generated with a prosody-aware text-to-speech system, we can outperform a model trained on expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use. This is achieved by leveraging hash-based embeddings of automatic speech recognition text output in conjunction with acoustic features as input to a quasi-recurrent neural network, keeping the model size small and latency low.
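To make the described architecture concrete, the following is a minimal sketch in PyTorch of the overall idea: hashed token embeddings of ASR text concatenated with aligned acoustic features and fed to a quasi-recurrent neural network (QRNN) that emits a punctuation label per token. This is not the authors' released implementation; the class names, feature dimensions, bucket count, label set, and hashing scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn


class QRNNLayer(nn.Module):
    """Single QRNN layer with fo-pooling (Bradbury et al., 2017)."""

    def __init__(self, input_dim: int, hidden_dim: int, kernel_size: int = 2):
        super().__init__()
        # Causal 1-D convolution producing candidate, forget, and output gates.
        self.conv = nn.Conv1d(input_dim, 3 * hidden_dim, kernel_size,
                              padding=kernel_size - 1)
        self.hidden_dim = hidden_dim

    def forward(self, x):                      # x: (batch, time, input_dim)
        gates = self.conv(x.transpose(1, 2))   # (batch, 3*hidden, time + pad)
        gates = gates[..., :x.size(1)]         # trim causal padding
        z, f, o = gates.chunk(3, dim=1)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        # Sequential fo-pooling: c_t = f_t * c_{t-1} + (1 - f_t) * z_t
        c = torch.zeros(x.size(0), self.hidden_dim, device=x.device)
        hs = []
        for t in range(x.size(1)):
            c = f[..., t] * c + (1.0 - f[..., t]) * z[..., t]
            hs.append(o[..., t] * c)
        return torch.stack(hs, dim=1)          # (batch, time, hidden_dim)


class PunctuationPredictor(nn.Module):
    """Hashed text embeddings + acoustic features -> QRNN -> punctuation tags."""

    def __init__(self, num_hash_buckets=4096, text_dim=64, acoustic_dim=40,
                 hidden_dim=128, num_labels=4):
        super().__init__()
        # Hash-based embedding: tokens are mapped to bucket ids outside the
        # model (e.g. hash(token) % num_hash_buckets), avoiding a large
        # vocabulary table and keeping the on-device footprint small.
        self.text_emb = nn.Embedding(num_hash_buckets, text_dim)
        self.qrnn = QRNNLayer(text_dim + acoustic_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, hashed_tokens, acoustic_feats):
        # hashed_tokens: (batch, time) int64
        # acoustic_feats: (batch, time, acoustic_dim), aligned per token
        x = torch.cat([self.text_emb(hashed_tokens), acoustic_feats], dim=-1)
        return self.classifier(self.qrnn(x))   # (batch, time, num_labels)


# Example: 8 ASR tokens, each aligned with a 40-dim acoustic feature vector.
model = PunctuationPredictor()
tokens = torch.randint(0, 4096, (1, 8))
acoustics = torch.randn(1, 8, 40)
logits = model(tokens, acoustics)              # (1, 8, 4) punctuation logits
```

The convolutional gates plus element-wise pooling are what keep the QRNN cheaper than an LSTM at inference time, which, together with the hashed embedding table, is consistent with the small-model, low-latency claim above.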