Besides the well-known classification task, neural networks are now frequently applied to generate or transform data such as images and audio signals. In such tasks, conventional loss functions such as the mean squared error (MSE) may not give satisfactory results. One way to improve the perceptual quality of the generated signals is to increase their similarity to real signals, where the similarity is evaluated by a discriminator network. The combination of the generator and discriminator networks is called a Generative Adversarial Network (GAN). Here, we evaluate this adversarial training framework on the articulatory-to-acoustic mapping task, where the goal is to reconstruct the speech signal from a recording of the movement of the articulatory organs. As the generator, we apply a 3D convolutional network that gave us good results in an earlier study. To turn it into a GAN, we extend the conventional MSE training loss with an adversarial loss component provided by a discriminator network. For evaluation, we report several objective speech quality metrics, including the Perceptual Evaluation of Speech Quality (PESQ) and the Mel-Cepstral Distortion (MCD). Our results indicate that adding the adversarial training loss brings a slight but consistent improvement in all of these metrics.
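As a rough illustration of what extending the MSE training loss with an adversarial component can look like, the sketch below combines the two terms for the generator. It is only a minimal sketch under stated assumptions: the abstract does not give the exact form of the adversarial term or its weight, so the non-saturating cross-entropy form, the function names, and the `adv_weight` parameter are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np


def mse_loss(generated, target):
    """Mean squared error between generated and target spectral features."""
    return float(np.mean((generated - target) ** 2))


def adversarial_loss(disc_prob_on_generated, eps=1e-7):
    """Generator-side adversarial term, -log D(G(x)), averaged over samples.

    Assumption: the discriminator outputs the probability that its input
    is a real signal. The term is small when the discriminator is fooled.
    """
    p = np.clip(disc_prob_on_generated, eps, 1.0 - eps)
    return float(-np.mean(np.log(p)))


def generator_loss(generated, target, disc_prob_on_generated, adv_weight=0.01):
    """MSE loss extended with a weighted adversarial component.

    adv_weight is a hypothetical hyperparameter; its value here is arbitrary.
    """
    return mse_loss(generated, target) + adv_weight * adversarial_loss(
        disc_prob_on_generated
    )
```

For example, calling `generator_loss(generated, target, disc_probs)` with a batch of generated and target feature frames plus the discriminator's outputs on the generated frames returns a single scalar. With `adv_weight=0` this reduces to plain MSE training, which makes the adversarial contribution easy to ablate.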