In this paper, we propose an end-to-end Mandarin tone classification method for continuous speech utterances that takes both the spectrogram and short-term context information as input. Both spectrograms and context segment features are used to train the tone classifier. We first divide the spectrogram frames into syllable segments using forced-alignment results produced by an ASR model. We then extract short-term segment features to capture context information across multiple syllables. Feeding both the spectrogram and the short-term context segment features into an end-to-end model significantly improves performance. Experiments on a large-scale open-source Mandarin speech dataset evaluate the proposed method. Results show that it improves classification accuracy from 79.5% to 92.6% on the AISHELL3 database.
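The segmentation and context-feature steps described above can be sketched as follows. This is a minimal illustration, assuming a frames-by-bins spectrogram and per-syllable (start, end) alignment times in seconds; the function names, the 10 ms frame shift, and the mean-pooled neighbour features are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def segment_spectrogram(spec, alignments, frame_shift=0.01):
    """Slice a spectrogram (num_frames x num_bins) into per-syllable
    segments, using forced-alignment (start, end) times in seconds.
    frame_shift is the hop between frames (assumed 10 ms here)."""
    segments = []
    for start, end in alignments:
        s = int(round(start / frame_shift))
        e = int(round(end / frame_shift))
        segments.append(spec[s:e])
    return segments

def context_features(segments, idx, width=1):
    """Illustrative short-term context: mean-pool each neighbouring
    syllable segment within +/- width and concatenate the results.
    Out-of-range neighbours are zero-padded."""
    num_bins = segments[idx].shape[1]
    feats = []
    for j in range(idx - width, idx + width + 1):
        if 0 <= j < len(segments) and len(segments[j]) > 0:
            feats.append(segments[j].mean(axis=0))
        else:
            feats.append(np.zeros(num_bins))
    return np.concatenate(feats)
```

In a full pipeline, the per-syllable spectrogram slice and its context vector would both be fed to the end-to-end tone classifier; the pooling scheme here is only a stand-in for the paper's segment features.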