Speech Enhancement (SE) systems typically operate on monaural input and are used in applications such as voice communications and capture cleanup for user-generated content. Recent advances and changes in the devices used for these applications are likely to increase the amount of two-channel content they must handle. However, SE systems are typically designed for monaural input; stereo results produced using trivial methods such as channel-independent or mid-side processing may be unsatisfactory and can include substantial speech distortions. To address this, we propose a system that creates a novel representation of stereo signals called Custom Mid-Side Signals (CMSS). CMSS allow the benefits of mid-side signals for center-panned speech to be extended to a much larger class of input signals. This in turn allows any existing monaural SE system to operate as an efficient stereo system by processing the custom mid signal. We describe how the parameters needed for CMSS can be efficiently estimated by a component of the spatio-level filtering source separation system. Subjective listening tests using state-of-the-art deep learning-based SE systems on stereo content with various speech mixing styles show that CMSS processing leads to improved speech quality at approximately half the cost of channel-independent processing.
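For orientation, the conventional mid-side baseline mentioned above can be sketched as follows. This is a minimal illustration of plain mid-side processing, not of CMSS, whose construction is not specified in this abstract. The function names, the `side_gain` parameter, and the placeholder enhancer are all hypothetical; any monaural SE system could be passed in as `enhance_mono`.

```python
import numpy as np


def to_mid_side(left, right):
    """Convert a left/right pair to conventional mid/side signals."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side


def from_mid_side(mid, side):
    """Invert to_mid_side: recover left/right from mid/side."""
    return mid + side, mid - side


def enhance_mid_side(left, right, enhance_mono, side_gain=1.0):
    """Run a monaural enhancer once, on the mid signal only.

    enhance_mono is any callable mapping a 1-D array to a 1-D array of the
    same length. side_gain is an assumed, illustrative control for how much
    of the unprocessed side signal is kept.
    """
    mid, side = to_mid_side(left, right)
    mid_enhanced = enhance_mono(mid)  # single enhancer call
    return from_mid_side(mid_enhanced, side_gain * side)


def enhance_channel_independent(left, right, enhance_mono):
    """Baseline for comparison: one enhancer call per channel."""
    return enhance_mono(left), enhance_mono(right)


def placeholder_enhancer(x):
    """Hypothetical stand-in for a real SE model; it only scales the input."""
    return 0.5 * x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(8)
    # Center-panned speech: identical in both channels, so the side is zero.
    out_left, out_right = enhance_mid_side(speech, speech, placeholder_enhancer)
    print(np.allclose(out_left, out_right))
```

For center-panned speech the side signal carries no speech, so enhancing the mid signal alone is sufficient, and the enhancer runs once rather than twice. When speech is panned off-center it leaks into the side signal, which is the case the custom mid signal is intended to cover.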