Automatic speech recognition (ASR) for low-resource languages improves the access of linguistic minorities to the technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for Hong Kong Cantonese by creating a new Cantonese dataset. Our dataset, the Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size, and availability. We further conduct experiments with the Fairseq S2T Transformer, a state-of-the-art ASR model, on the largest existing dataset, Common Voice zh-HK, and on our proposed MDCC; the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.