Supervised speech enhancement relies on parallel databases of degraded speech signals and their clean reference signals during training. This requirement precludes the use of real-world degraded speech data, for which no clean references exist, even though such data may better represent the conditions in which these systems are deployed. In this paper, we explore methods that enable supervised speech enhancement systems to train on real-world degraded speech data. Specifically, we propose a semi-supervised approach for speech enhancement in which we first train a modified vector-quantized variational autoencoder to solve a source separation task. We then use this trained autoencoder to further train an enhancement network on real-world noisy speech data by computing a triplet-based unsupervised loss function. Experiments show promising results for incorporating real-world data in training speech enhancement systems.
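The abstract does not say how the triplets are formed or which distance is used, so the following is only a minimal sketch of a generic triplet margin loss, not the authors' loss function. It assumes the anchor, positive, and negative are fixed-length embedding vectors (for instance, ones produced by the trained autoencoder), that the distance is Euclidean, and that a margin of 1.0 is used; the function names `euclidean` and `triplet_margin_loss` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchors, positives, negatives, margin=1.0):
    """Mean hinge triplet loss over a batch of embedding triplets.

    Assumption (not from the paper): each argument is a list of
    equal-length vectors, and the loss for one triplet is
    max(d(anchor, positive) - d(anchor, negative) + margin, 0).
    """
    losses = [
        max(euclidean(a, p) - euclidean(a, n) + margin, 0.0)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(losses) / len(losses)
```

The loss is zero once each anchor is closer to its positive than to its negative by at least the margin, which is what lets it act as a training signal without clean reference signals.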