Automatic Speech Recognition (ASR) systems typically produce unpunctuated transcripts, which are hard to read. Building a punctuation restoration system is also challenging for low-resource languages, especially in domain-specific applications. In this paper, we propose a Spanish punctuation restoration system designed for a real-time customer support transcription service. To address the sparsity of Spanish transcript data in the customer support domain, we introduce two transfer-learning-based strategies: (1) domain adaptation using out-of-domain Spanish text data, and (2) cross-lingual transfer learning leveraging in-domain English transcript data. Our experimental results show that these strategies improve the accuracy of the Spanish punctuation restoration system.