Current findings show that pre-trained wav2vec 2.0 models can be successfully used as feature extractors for speaker-based discrimination tasks. We demonstrate that latent representations extracted at different layers of a pre-trained wav2vec 2.0 system can be effectively used for binary classification of various types of pathological speech. For this purpose, we examine the pathologies laryngectomy, oral squamous cell carcinoma, Parkinson's disease, and cleft lip and palate. The results show that pathological and healthy voices can be distinguished well, especially with latent representations from the lower layers, with accuracies ranging from 77.2% for Parkinson's disease to 100% for laryngectomy classification. However, cross-pathology and cross-healthy tests show that the trained classifiers appear to be biased: recognition rates vary considerably when there is a mismatch between training and out-of-domain test data, e.g., in age, spoken content, or acoustic conditions.
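To illustrate the general setup of extracting layer-wise latent representations from a pre-trained wav2vec 2.0 model and feeding them to a binary pathological-vs-healthy classifier, the following is a minimal sketch. It assumes the HuggingFace transformers implementation; the checkpoint name, mean pooling over time, the chosen layer index, the file names, and the linear SVM are illustrative assumptions rather than the exact configuration used in this work.

```python
# Minimal sketch: layer-wise wav2vec 2.0 features for binary
# pathological-vs-healthy classification. Checkpoint, pooling, layer
# index and the SVM classifier are illustrative assumptions.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.svm import SVC

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def layer_embedding(wav_path: str, layer: int) -> torch.Tensor:
    """Return the time-averaged hidden state of one wav2vec 2.0 layer."""
    waveform, sr = torchaudio.load(wav_path)
    # Downmix to mono and resample to the 16 kHz rate wav2vec 2.0 expects
    waveform = torchaudio.functional.resample(waveform.mean(dim=0), sr, 16_000)
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = model(inputs.input_values, output_hidden_states=True)
    # hidden_states[0] is the feature-projection output; 1..N are transformer layers
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

# Hypothetical file list; labels: 1 = pathological, 0 = healthy control
train_files = [("patient_001.wav", 1), ("control_001.wav", 0)]
X = torch.stack([layer_embedding(f, layer=3) for f, _ in train_files]).numpy()
y = [label for _, label in train_files]

clf = SVC(kernel="linear").fit(X, y)
```

A lower layer index (as in the sketch) corresponds to the lower latent representations that, according to the results above, separate pathological from healthy voices particularly well.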