This paper presents work on restoring punctuation in transcripts generated by multilingual ASR systems. The focus languages are English, Mandarin, and Malay, three of the most widely spoken languages in Singapore. To the best of our knowledge, this is the first system that tackles punctuation restoration for these three languages simultaneously. Traditional approaches usually treat the task as sequence labeling; this work instead adopts a slot-filling approach that predicts the presence and type of punctuation mark at each word boundary. The approach is similar to the masked-language-model objective used during BERT pre-training, but instead of predicting masked words, our model predicts masked punctuation. Additionally, we find that segmenting Mandarin text with Jieba, rather than relying solely on the built-in SentencePiece tokenizer of XLM-R, significantly improves performance on Mandarin transcripts. Experimental results on the English and Mandarin IWSLT2022 datasets and on Malay news show that the proposed approach achieves state-of-the-art results for Mandarin with a 73.8% F1-score, while maintaining reasonable F1-scores for English and Malay, i.e., 74.7% and 78% respectively. Our source code, which allows reproducing the results and building a simple web-based demonstration application, is available on GitHub.
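The slot-filling formulation described above can be illustrated with a minimal sketch: a mask token is inserted at every word boundary, the model classifies each slot into a punctuation label (or "none"), and the predicted marks are reattached. All names here (the label set, the mask token, the helper functions) are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of punctuation restoration as slot-filling.
# The label set and mask token below are assumptions for illustration;
# the paper's model would fill each slot with a learned classifier.

PUNCT_LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]  # assumed label inventory
MASK = "<punct>"

def insert_punct_slots(words):
    """Insert a mask token after every word. The model then predicts the
    punctuation class at each slot ('O' = no punctuation), analogous to
    BERT's masked-language-model objective but over punctuation marks."""
    slots = []
    for w in words:
        slots.append(w)
        slots.append(MASK)
    return slots

def apply_predictions(words, labels):
    """Reattach predicted punctuation marks to the unpunctuated words."""
    marks = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}
    return " ".join(w + marks[l] for w, l in zip(words, labels))

words = "how are you today".split()
slots = insert_punct_slots(words)   # one mask slot after each word
restored = apply_predictions(words, ["O", "O", "O", "QUESTION"])
# restored == "how are you today?"
```

For Mandarin, the word boundaries themselves are not given by whitespace, which is why the paper's use of Jieba word segmentation (rather than XLM-R's subword tokenizer alone) matters: the slots must fall at linguistically meaningful word boundaries.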