Recent automatic speech recognition (ASR) systems have been moving towards end-to-end architectures whose components can be trained jointly. Several recently proposed techniques have enabled this trend: feature extraction with CNNs, context capture and acoustic feature modeling with RNNs, automatic alignment of input and output sequences using Connectionist Temporal Classification (CTC), and the replacement of traditional n-gram language models with RNN language models. Automatic punctuation has long attracted interest in both text and speech-to-text settings. However, there has been little work on incorporating it into the emerging neural-network-based end-to-end speech recognition systems, partly because of the lack of English speech corpora with punctuated transcripts. In this study, we propose a method to generate punctuated transcripts for the TEDLIUM dataset using transcripts available from ted.com. We also propose an end-to-end ASR system that outputs words and punctuation marks concurrently from speech signals. By combining Damerau-Levenshtein distance and slot error rate into a single metric, DLev-SER, we make it possible to measure punctuation error rate when the hypothesis text is not perfectly aligned with the reference. Compared with previous methods, our model reduces slot error rate from 0.497 to 0.341.
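The abstract does not spell out how DLev-SER is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that punctuation marks are separate tokens, that the alignment is the restricted (optimal string alignment) variant of Damerau-Levenshtein distance, and that every non-matching edit operation touching a punctuation token counts as one slot error, normalized by the number of punctuation tokens in the reference. The names `PUNCT`, `osa_align`, and `dlev_ser` are hypothetical.

```python
# Assumed punctuation inventory; the source does not list the marks it uses.
PUNCT = {".", ",", "?", "!"}


def osa_align(ref, hyp):
    """Restricted Damerau-Levenshtein (optimal string alignment) over tokens.

    Returns (distance, ops), where each op is (name, ref_tokens, hyp_tokens)
    and name is one of "match", "sub", "del", "ins", "swap".
    """
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
            if (i > 1 and j > 1 and ref[i - 1] == hyp[j - 2]
                    and ref[i - 2] == hyp[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition

    # Backtrace one optimal path to recover the edit operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
                and d[i][j] == d[i - 1][j - 1]):
            ops.append(("match", ref[i - 1:i], hyp[j - 1:j]))
            i, j = i - 1, j - 1
        elif (i > 1 and j > 1 and ref[i - 1] == hyp[j - 2]
                and ref[i - 2] == hyp[j - 1]
                and d[i][j] == d[i - 2][j - 2] + 1):
            ops.append(("swap", ref[i - 2:i], hyp[j - 2:j]))
            i, j = i - 2, j - 2
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("sub", ref[i - 1:i], hyp[j - 1:j]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", ref[i - 1:i], []))
            i -= 1
        else:
            ops.append(("ins", [], hyp[j - 1:j]))
            j -= 1
    ops.reverse()
    return d[n][m], ops


def dlev_ser(ref, hyp, punct=PUNCT):
    """Slot error rate over punctuation, using the alignment above.

    Assumption: each non-matching operation that involves a punctuation
    token on either side counts as exactly one slot error.
    """
    n_slots = sum(tok in punct for tok in ref)
    if n_slots == 0:
        raise ValueError("reference contains no punctuation slots")
    _, ops = osa_align(ref, hyp)
    errors = sum(
        1 for name, r, h in ops
        if name != "match" and any(tok in punct for tok in r + h)
    )
    return errors / n_slots


if __name__ == "__main__":
    reference = "hello , how are you ?".split()
    hypothesis = "hello how are you .".split()
    print(dlev_ser(reference, hypothesis))
```

Under these assumptions, a misrecognized word does not count as a punctuation error by itself. It only changes how the surrounding tokens are aligned, which is the situation the metric is meant to handle.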