One of the biggest challenges in designing mispronunciation detection models is the unavailability of labeled L2 speech data. To overcome this data scarcity, we introduce SpeechBlender -- a fine-grained data augmentation pipeline for generating mispronunciation errors. SpeechBlender uses a variety of masks to target different regions of a phonetic unit, and uses mixing factors to linearly interpolate raw speech signals while generating erroneous pronunciation instances. The masks facilitate smooth blending of the signals, thus generating more effective samples than the `Cut/Paste' method. We show the effectiveness of our augmentation technique in a phoneme-level pronunciation quality assessment task, leveraging only a good-pronunciation dataset. With SpeechBlender augmentation, we observe a 3% and a 2% increase in Pearson correlation coefficient (PCC) over the no-augmentation and goodness-of-pronunciation augmentation scenarios, respectively, on the Speechocean762 test set. Moreover, we observe a 2% rise in PCC when comparing our single-task phoneme-level mispronunciation detection model with a multi-task learning model that uses multiple-granularity information.
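The mask-weighted linear interpolation described above can be illustrated with a minimal sketch. It is not the SpeechBlender implementation: the function names `smooth_mask` and `blend`, the linear-ramp mask shape, and the requirement that the two segments already have equal length are all assumptions made for illustration.

```python
import numpy as np


def smooth_mask(length, start, end, ramp):
    """Hypothetical mask: 1 inside [start, end), with linear ramps of
    `ramp` samples on either side so the blend has no abrupt edges."""
    mask = np.zeros(length)
    mask[start:end] = 1.0
    if ramp > 0:
        up = np.linspace(0.0, 1.0, ramp, endpoint=False)
        lo = max(start - ramp, 0)
        mask[lo:start] = up[ramp - (start - lo):]   # ramp up into the region
        hi = min(end + ramp, length)
        mask[end:hi] = up[::-1][: hi - end]         # ramp down out of it
    return mask


def blend(good, donor, mask, lam):
    """Linearly interpolate two equal-length raw signals.

    Where mask == 0 the good-pronunciation signal is untouched; where
    mask == 1 the output is (1 - lam) * good + lam * donor.
    """
    good = np.asarray(good, dtype=float)
    donor = np.asarray(donor, dtype=float)
    if good.shape != donor.shape or good.shape != mask.shape:
        raise ValueError("good, donor and mask must have the same shape")
    weight = lam * mask
    return (1.0 - weight) * good + weight * donor


# Toy example: constant signals stand in for two phoneme segments.
good = np.ones(8)
donor = np.zeros(8)
mask = smooth_mask(8, start=3, end=5, ramp=2)
blended = blend(good, donor, mask, lam=0.5)
```

With a hard 0/1 mask and a mixing factor of 1, the same function reduces to a plain cut-and-paste of the donor segment, which is the baseline that the smooth masks are meant to improve on.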