This paper presents recent progress on integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Compared with the previous ESPnet-SE work, numerous features have been added, including recent state-of-the-art speech enhancement models with their respective training and evaluation recipes. Importantly, a new interface has been designed to flexibly combine speech enhancement front-ends with other tasks, including automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU). To showcase such integration, we performed experiments on carefully designed synthetic datasets for noisy-reverberant multi-channel ST and SLU tasks, which can be used as benchmark corpora for future research. In addition to these new tasks, we also use CHiME-4 and WSJ0-2Mix to benchmark multi- and single-channel SE approaches. Results show that the integration of SE front-ends with back-end tasks is a promising research direction even for tasks besides ASR, especially in the multi-channel scenario. The code is available online at https://github.com/ESPnet/ESPnet. The multi-channel ST and SLU datasets, which are another contribution of this work, are released on HuggingFace.