Many real-world applications need to convert speech files into text with high accuracy under limited resources. This paper proposes a method for quickly transcribing a large speech database using a Transformer-based end-to-end model. Transformers have advanced the state of the art in many fields, including speech recognition, but they are not easy to apply to long sequences. This paper proposes and tests several techniques to speed up the recognition of real-world speech: parallelizing recognition with batched beam search, detecting end of speech based on connectionist temporal classification (CTC), restricting the CTC prefix score, and splitting long utterances into short segments. Experiments are conducted on a real-world Korean speech recognition task. Results on an 8-hour test corpus show that the proposed system converts the speech into text in less than 3 minutes with a 10.73% character error rate, which is 27.1% lower in relative terms than that of a conventional DNN-HMM based recognition system.
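The reported figures rest on three simple quantities: character error rate, relative error reduction, and speed relative to real time. The following is a minimal Python sketch of how such quantities are commonly computed. It assumes the usual definition of character error rate as Levenshtein distance divided by reference length; the abstract does not state the exact scoring rules (for example, how whitespace is handled), and all function names here are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction by which the proposed error rate is lower than the baseline."""
    return (baseline - proposed) / baseline


def speedup_over_real_time(audio_seconds: float, decode_seconds: float) -> float:
    """How many times faster than real time the audio was processed."""
    return audio_seconds / decode_seconds
```

For instance, `speedup_over_real_time(8 * 3600, 3 * 60)` evaluates to 160.0, so transcribing an 8-hour corpus in under 3 minutes corresponds to processing more than 160 times faster than real time.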