Approximately 1.2% of the world's population has impaired voice production. As a result, automatic dysphonic voice detection has attracted considerable academic and clinical interest. However, existing methods for automated voice assessment often fail to generalize beyond their training conditions or to other related applications. In this paper, we propose a deep learning framework for generating acoustic feature embeddings that are sensitive to vocal quality and robust across different corpora. We train the model jointly with a contrastive loss combined with a classification loss, and apply data warping to the input voice samples to improve robustness. Empirical results demonstrate that our method not only achieves high in-corpus and cross-corpus classification accuracy but also generates embeddings that are sensitive to voice quality and robust across corpora. We also compare our results against three baseline methods on clean data and on three degraded variants of the in-corpus and cross-corpus datasets, and show that the proposed model consistently outperforms the baselines.
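The abstract does not give the exact form of the joint objective, but a common instantiation combines a supervised contrastive term over the embedding space with a cross-entropy term over the classifier logits. The PyTorch sketch below is a minimal illustration under that assumption; the `JointLoss` class and the `alpha` and `temperature` parameters are hypothetical names introduced here, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLoss(nn.Module):
    """Weighted sum of a supervised contrastive loss on embeddings and a
    cross-entropy classification loss on logits (a hypothetical form; the
    paper's exact objective may differ)."""

    def __init__(self, alpha: float = 0.5, temperature: float = 0.1):
        super().__init__()
        self.alpha = alpha
        self.temperature = temperature
        self.ce = nn.CrossEntropyLoss()

    def contrastive(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # L2-normalize so pairwise similarities are cosine similarities.
        z = F.normalize(embeddings, dim=1)
        sim = z @ z.t() / self.temperature
        n = z.size(0)
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        # Positives: same-label pairs, excluding each sample with itself.
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
        # Row-wise log-softmax over all non-self pairs (max-shifted for stability).
        logits = sim - sim.max(dim=1, keepdim=True).values.detach()
        exp = torch.exp(logits) * (~self_mask).float()
        log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True))
        # Average log-probability of positives, per anchor that has positives.
        pos_counts = pos_mask.sum(dim=1)
        valid = pos_counts > 0
        per_anchor = -(log_prob * pos_mask.float()).sum(dim=1)[valid] / pos_counts[valid]
        return per_anchor.mean()

    def forward(self, embeddings, logits, labels):
        return (self.alpha * self.contrastive(embeddings, labels)
                + (1.0 - self.alpha) * self.ce(logits, labels))

# Usage with dummy data: 8 samples, 128-dim embeddings, 2 classes.
loss_fn = JointLoss(alpha=0.5)
emb, logits = torch.randn(8, 128), torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = loss_fn(emb, logits, labels)
```

Weighting the two terms with a single `alpha` keeps the trade-off between embedding structure and classification accuracy as one tunable knob.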
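Similarly, "data warping" is not specified further in the abstract; one widely used warping-style augmentation for speech is SpecAugment-type random time and frequency masking on the mel spectrogram. The torchaudio sketch below is a hypothetical stand-in for whatever warping the paper actually applies; the sample rate, mask widths, and the `warp` helper are illustrative choices, not the paper's.

```python
import torch
import torchaudio.transforms as T

# Hypothetical warping pipeline (the paper's exact operations are not
# stated in the abstract): mel spectrogram followed by SpecAugment-style
# random time and frequency masking.
to_mel = T.MelSpectrogram(sample_rate=16000, n_mels=80)
mask_time = T.TimeMasking(time_mask_param=30)
mask_freq = T.FrequencyMasking(freq_mask_param=15)

def warp(waveform: torch.Tensor) -> torch.Tensor:
    """Return a randomly masked mel spectrogram of a (channels, samples) waveform."""
    return mask_time(mask_freq(to_mel(waveform)))

# Usage: one second of 16 kHz audio.
augmented = warp(torch.randn(1, 16000))
```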