Diadochokinetic speech tasks (DDK), in which participants repeatedly produce syllables, are commonly used as part of the assessment of speech motor impairments. These studies rely on manual analyses that are time-intensive, subjective, and provide only a coarse-grained picture of speech. This paper presents two deep neural network models that automatically segment consonants and vowels from unannotated, untranscribed speech. Both models work on the raw waveform and use convolutional layers for feature extraction. The first model is based on an LSTM classifier followed by fully connected layers, while the second model adds more convolutional layers followed by fully connected layers. The segmentations predicted by the models are used to obtain measures of speech rate and sound duration. Results on a dataset of young healthy individuals show that our LSTM model outperforms the current state-of-the-art systems and performs comparably to trained human annotators. Moreover, the LSTM model also achieves results comparable to trained human annotators when evaluated on an unseen dataset of older individuals with Parkinson's Disease.
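The abstract describes the two architectures only at a high level: a convolutional front end over the raw waveform, followed either by an LSTM and fully connected layers or by further convolutional layers and fully connected layers, producing frame-level consonant/vowel decisions. The sketch below illustrates what such models might look like, assuming PyTorch; the layer sizes, kernel widths, and the three-way vowel/consonant/silence label set are illustrative assumptions, not the paper's actual configuration.

```python
# A minimal sketch of the two segmenter architectures described above.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Convolutional feature extraction over the raw waveform."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=16), nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=5, stride=4), nn.ReLU(),
        )

    def forward(self, wav):                  # wav: (batch, samples)
        x = self.net(wav.unsqueeze(1))       # -> (batch, feat_dim, frames)
        return x.transpose(1, 2)             # -> (batch, frames, feat_dim)

class LSTMSegmenter(nn.Module):
    """First model: conv front end -> LSTM -> fully connected classifier."""
    def __init__(self, feat_dim=128, hidden=128, n_classes=3):
        super().__init__()
        self.frontend = ConvFrontEnd(feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_classes))

    def forward(self, wav):
        h, _ = self.lstm(self.frontend(wav))
        return self.fc(h)                    # per-frame class logits

class ConvSegmenter(nn.Module):
    """Second model: additional conv layers in place of the LSTM, then FC."""
    def __init__(self, feat_dim=128, n_classes=3):
        super().__init__()
        self.frontend = ConvFrontEnd(feat_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                nn.Linear(feat_dim, n_classes))

    def forward(self, wav):
        x = self.frontend(wav).transpose(1, 2)
        x = self.conv(x).transpose(1, 2)
        return self.fc(x)                    # per-frame class logits

# Per-frame predictions can then be collapsed into consonant/vowel segments,
# from which speech-rate and sound-duration measures are derived.
logits = LSTMSegmenter()(torch.randn(1, 16000))  # one second at 16 kHz
labels = logits.argmax(dim=-1)                   # frame-wise class indices
```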