Code-switching is a speech phenomenon in which a speaker switches languages during a conversation. Despite the spontaneous nature of code-switching in conversational spoken language, most existing work collects code-switching data from read speech rather than spontaneous speech. ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality Mandarin Chinese-English code-switching corpus built on spontaneous multi-turn conversational dialogue sources collected in Hong Kong. We report ASCEND's design and the procedure for collecting the speech data, including annotations. ASCEND consists of 10.62 hours of clean speech collected from 23 bilingual speakers of Chinese and English. Furthermore, we conduct baseline experiments using pre-trained wav2vec 2.0 models, achieving a best performance of 22.69% character error rate and 27.05% mixed error rate.
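To make the two reported metrics concrete, the following is a minimal sketch of how a character error rate and a mixed error rate could be computed for a Mandarin-English transcript. The abstract does not define the mixed error rate, so the code assumes a common convention: each Chinese character and each English word counts as one token. The function names (`mixed_tokens`, `char_tokens`, `edit_distance`, `error_rate`, `mer`, `cer`) are illustrative and are not taken from the ASCEND release.

```python
import re

# Assumption: one token per CJK character, or per run of Latin letters,
# digits, and apostrophes. Punctuation and whitespace are ignored here.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def mixed_tokens(text):
    """Split text into Chinese characters and lowercased English words."""
    return _TOKEN.findall(text.lower())


def char_tokens(text):
    """Split text into individual lowercased characters, dropping whitespace."""
    return [c for c in text.lower() if not c.isspace()]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def mer(ref, hyp):
    """Mixed error rate under the tokenization assumed above."""
    return error_rate(mixed_tokens(ref), mixed_tokens(hyp))


def cer(ref, hyp):
    """Character error rate over all non-whitespace characters."""
    return error_rate(char_tokens(ref), char_tokens(hyp))
```

For example, with the reference `我想去 shopping mall` and the hypothesis `我想去 shopping more`, the mixed tokenization yields five reference tokens with one substitution, so `mer` returns 0.2. Corpus-level figures such as those in the abstract would typically sum edits and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.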