Recently, deep neural network (DNN)-based speech enhancement (SE) systems have been used with great success. During training, such systems require clean speech data, ideally in large quantities, covering a variety of acoustic conditions and many different speaker characteristics, and recorded at a given sampling rate (e.g., 48 kHz for fullband SE). However, obtaining such clean speech data is not straightforward, especially when only publicly available datasets are considered. At the same time, a lot of material for automatic speech recognition (ASR) with the desired acoustic, speaker, and sampling-rate characteristics is publicly available, except that it is not clean: it also contains background noise, which is often even desired so that ASR systems become noise-robust. Hence, using such data to train SE systems is not straightforward. In this paper, we propose two improvements for training SE systems on noisy speech data. First, we propose several modifications of the loss functions that make them robust against noisy speech targets. In particular, computing the median over the sample axis before averaging over time-frequency bins makes it possible to use such data. Second, we propose a noise augmentation scheme for mixture-invariant training (MixIT), which allows it to be used in such scenarios as well. For our experiments, we use the Mozilla Common Voice dataset, and we show that our robust loss function improves PESQ by up to 0.19 compared to a system trained in the traditional way. Similarly, for MixIT we see an improvement of up to 0.27 in PESQ when using our proposed noise augmentation.
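The median-before-averaging idea can be illustrated with a minimal NumPy sketch. It assumes that the "sample axis" is the mini-batch dimension and that the per-bin error is a squared difference of magnitude spectrograms; neither detail is stated in the abstract, and the function names `median_over_samples_loss` and `mean_loss` are hypothetical, not the authors' implementation.

```python
import numpy as np


def median_over_samples_loss(estimate, target):
    """Robust loss sketch: median over the sample axis, then mean over T-F bins.

    Both inputs are assumed to have shape (samples, freq, time).
    Treating axis 0 as the "sample axis" is an assumption of this sketch.
    """
    err = (estimate - target) ** 2       # per-bin squared error
    per_bin = np.median(err, axis=0)     # median across samples for each T-F bin
    return float(per_bin.mean())         # average over time-frequency bins


def mean_loss(estimate, target):
    """Conventional baseline: plain mean of the squared error over all axes."""
    return float(np.mean((estimate - target) ** 2))


# Toy data: five samples, one of which has a heavily corrupted target.
estimate = np.zeros((5, 4, 3))
target = np.zeros((5, 4, 3))
target[0] = 10.0                         # stands in for a noisy speech target

robust = median_over_samples_loss(estimate, target)
baseline = mean_loss(estimate, target)
print(robust, baseline)
```

In this toy case the corrupted sample dominates the plain mean, while the per-bin median ignores it because the majority of samples in each bin agree with the estimate. Whether this carries over to real noisy targets depends on how often a given time-frequency bin is contaminated across the batch.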