Stuttering is a speech disorder in which the natural flow of speech is interrupted by blocks, repetitions, or prolongations of syllables, words, and phrases. Most existing automatic speech recognition (ASR) interfaces perform poorly on stuttered utterances, mainly due to a lack of matched training data. Synthesizing stuttered speech thus presents an opportunity to improve ASR for this type of speech. We describe Stutter-TTS, an end-to-end neural text-to-speech model capable of synthesizing diverse types of stuttering utterances. We develop a simple yet effective prosody-control strategy whereby additional tokens are introduced into the source text during training to represent specific stuttering characteristics. By choosing the position of the stutter tokens, Stutter-TTS allows word-level control of where stuttering occurs in the synthesized utterance. We are able to synthesize stutter events with high accuracy (F1-scores between 0.63 and 0.84, depending on stutter type). By fine-tuning an ASR model on synthetic stuttered speech, we reduce word error rate by 5.7% relative on stuttered utterances, with only minor (<0.2% relative) degradation on fluent utterances.
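The token-based control described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the token spellings (`<block>`, `<rep>`, `<prolong>`), the choice to place a token immediately before the affected word, and the helper name `insert_stutter_tokens` do not come from the source, which does not state the actual token inventory or placement rule.

```python
# Hypothetical token inventory: the source does not specify the real symbols.
STUTTER_TOKENS = {
    "block": "<block>",
    "repetition": "<rep>",
    "prolongation": "<prolong>",
}


def insert_stutter_tokens(text, events):
    """Return `text` with a stutter token placed before each selected word.

    `events` maps a 0-based word index to a stutter type
    ("block", "repetition", or "prolongation").
    Placing the token before the word is an assumption of this sketch.
    """
    words = text.split()
    for index, kind in events.items():
        if not 0 <= index < len(words):
            raise ValueError(f"word index {index} is out of range")
        if kind not in STUTTER_TOKENS:
            raise ValueError(f"unknown stutter type: {kind!r}")

    annotated = []
    for index, word in enumerate(words):
        if index in events:
            annotated.append(STUTTER_TOKENS[events[index]])
        annotated.append(word)
    return " ".join(annotated)


if __name__ == "__main__":
    # Mark a repetition on "call" and a block on "tomorrow".
    print(insert_stutter_tokens("please call me tomorrow",
                                {1: "repetition", 3: "block"}))
    # prints: please <rep> call me <block> tomorrow
```

In a setup like this, the annotated string would be the text-side input to the TTS model, so moving a token to a different word index is what changes where the stutter event is requested.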