The adoption of advanced deep learning (DL) architectures in stuttering detection (SD) tasks is challenging due to the limited size of the available datasets. To this end, this work introduces the application of speech embeddings extracted with pre-trained deep models trained on massive audio datasets for different tasks. In particular, we explore audio representations obtained using the emphasized channel attention, propagation, and aggregation time-delay neural network (ECAPA-TDNN) and the Wav2Vec2.0 model, trained on the VoxCeleb and LibriSpeech datasets, respectively. After extracting the embeddings, we benchmark several traditional classifiers, such as k-nearest neighbor, Gaussian naive Bayes, and a neural network, on the stuttering detection task. In comparison to a standard SD system trained only on the limited SEP-28k dataset, we obtain a relative improvement of 16.74% in overall accuracy over the baseline. Finally, we show that combining the two embeddings and concatenating multiple layers of Wav2Vec2.0 can further improve SD performance by up to 1% and 2.64%, respectively.
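The pipeline described above (pre-trained embeddings, optionally concatenated, fed to a traditional classifier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random vectors below are hypothetical stand-ins for ECAPA-TDNN embeddings (typically 192-dim) and Wav2Vec2.0 embeddings (768-dim per layer), and the k value and data sizes are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_embeddings(n, dim, label):
    # Hypothetical stand-in for pre-trained embeddings: random vectors
    # with a class-dependent mean simulate fluent (0) vs. stuttered (1)
    # clips. In the actual system these would come from ECAPA-TDNN and
    # Wav2Vec2.0 forward passes over the SEP-28k audio.
    return rng.normal(loc=label * 0.5, scale=1.0, size=(n, dim))

ecapa = np.vstack([fake_embeddings(50, 192, 0), fake_embeddings(50, 192, 1)])
w2v2 = np.vstack([fake_embeddings(50, 768, 0), fake_embeddings(50, 768, 1)])
y = np.array([0] * 50 + [1] * 50)

# "Combining two embeddings" here means simple feature concatenation.
X = np.concatenate([ecapa, w2v2], axis=1)

def knn_predict(X_train, y_train, X_test, k=5):
    """Minimal k-nearest-neighbor classifier (Euclidean distance)."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

# Quick train/test split for a sanity check on the synthetic data.
idx = rng.permutation(len(y))
train, test = idx[:80], idx[80:]
acc = (knn_predict(X[train], y[train], X[test]) == y[test]).mean()
print(f"k-NN accuracy on synthetic concatenated embeddings: {acc:.2f}")
```

Concatenating multiple Wav2Vec2.0 layers would follow the same pattern: stack each layer's vector along the feature axis before classification.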