The scarcity of training data and the large speaker variation in dysarthric speech lead to poor accuracy and poor speaker generalization in spoken language understanding systems for dysarthric speech. Working at the level of the speech features, we focus on improving model generalization with limited dysarthric data. Factorized Hierarchical Variational Auto-Encoders (FHVAEs), trained without supervision, have shown an advantage in disentangling content and speaker representations. Earlier work showed that dysarthria is reflected in both of these feature vectors (content and speaker). Here, we add adversarial training to bridge the gap between the control and dysarthric speech data domains. Using weak supervision, we extract dysarthria-invariant and speaker-invariant features. The extracted features are evaluated on a spoken language understanding task and yield higher accuracy on unseen speakers with more severe dysarthria than features from the basic FHVAE model or plain filterbanks.
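The abstract does not say how the adversarial training is implemented. As a purely illustrative sketch, the snippet below assumes a gradient-reversal-style setup: a domain classifier tries to tell control from dysarthric speech using the encoded feature, while the encoder receives the negated, scaled gradient so that its features become harder to classify by domain. The scalar "encoder" weight `w`, classifier weight `v`, scale `lam`, and all function names are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def domain_bce(p: float, d: int) -> float:
    """Binary cross-entropy for a domain label d (0 = control, 1 = dysarthric)."""
    eps = 1e-12  # guards against log(0)
    return -(d * math.log(p + eps) + (1 - d) * math.log(1.0 - p + eps))


def adversarial_grads(w: float, v: float, x: float, d: int, lam: float):
    """Return (domain loss, reversed gradient for w, ordinary gradient for v).

    Assumed toy model (not the paper's architecture):
      z = w * x            scalar "encoder"
      p = sigmoid(v * z)   domain classifier, P(dysarthric | z)
    """
    z = w * x
    p = sigmoid(v * z)
    loss = domain_bce(p, d)
    dloss_dlogit = p - d                   # d(BCE)/d(logit) for a sigmoid output
    grad_v = dloss_dlogit * z              # classifier descends on the domain loss
    grad_w = -lam * dloss_dlogit * v * x   # encoder gets the negated, scaled gradient
    return loss, grad_w, grad_v


def sgd_step(w: float, v: float, x: float, d: int, lam: float, lr: float):
    """One plain gradient-descent step using the gradients above."""
    _, grad_w, grad_v = adversarial_grads(w, v, x, d, lam)
    return w - lr * grad_w, v - lr * grad_v
```

Because `grad_w` has the opposite sign of the true gradient of the domain loss, a descent step on `w` increases that loss, which is the sense in which the encoder is pushed toward domain-invariant features. In a full system this term would be combined with the FHVAE training objective, which is omitted here.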