Although recent advances in deep learning technology have boosted automatic speech recognition (ASR) performance in the single-talker case, it remains difficult to recognize multi-talker speech in which many voices overlap. One conventional approach to this problem is to cascade a speech separation or target speech extraction front-end with an ASR back-end. However, the extra computation cost of the front-end module is a critical barrier to quick response, especially for streaming ASR. In this paper, we propose a target-speaker ASR (TS-ASR) system that implicitly integrates the target speech extraction functionality within a streaming end-to-end (E2E) ASR system, i.e., the recurrent neural network-transducer (RNNT). Our system uses an idea similar to that adopted for target speech extraction, but implements it directly at the level of the RNNT encoder. This allows TS-ASR to be realized without the extra computation cost of a front-end. This study differs from prior work on E2E TS-ASR in two major ways: we investigate streaming models and base our study on Conformer models, whereas prior studies used RNN-based systems and considered only offline processing. We confirm in experiments that our TS-ASR achieves recognition performance comparable to that of conventional cascade systems in the offline setting, while reducing computation costs and realizing streaming TS-ASR.
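To make the core idea concrete, the following is a minimal sketch (not the authors' code) of conditioning an ASR encoder on a target-speaker embedding so that extraction happens inside the encoder rather than in a separate front-end. It assumes a multiplicative, SpeakerBeam-style adaptation of intermediate encoder features; LSTM layers stand in for the Conformer blocks used in the paper, and all module names and sizes are illustrative assumptions.

```python
# Sketch only: speaker-conditioned encoder for TS-ASR.
# The multiplicative adaptation layer and layer sizes are assumptions,
# not the authors' exact architecture.
import torch
import torch.nn as nn


class SpeakerConditionedEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=256, spk_dim=128,
                 num_lower=2, num_upper=2):
        super().__init__()
        # Lower encoder layers process the mixture before speaker conditioning.
        self.lower = nn.LSTM(feat_dim, hidden_dim, num_layers=num_lower,
                             batch_first=True)
        # Project the target-speaker embedding to a per-dimension scaling vector.
        self.spk_proj = nn.Linear(spk_dim, hidden_dim)
        # Upper encoder layers refine the speaker-conditioned representation.
        self.upper = nn.LSTM(hidden_dim, hidden_dim, num_layers=num_upper,
                             batch_first=True)

    def forward(self, feats, spk_emb):
        # feats:   (batch, time, feat_dim) acoustic features of the mixture
        # spk_emb: (batch, spk_dim) embedding of the target (enrolled) speaker
        h, _ = self.lower(feats)
        # Element-wise adaptation of intermediate features by the speaker
        # embedding, applied inside the encoder instead of a separate front-end.
        scale = torch.sigmoid(self.spk_proj(spk_emb)).unsqueeze(1)  # (B, 1, H)
        h = h * scale
        h, _ = self.upper(h)
        # The output would be fed to the RNNT joint network together with the
        # prediction network output (omitted here).
        return h


# Toy usage: one 3-second utterance (300 frames) and one enrollment embedding.
enc = SpeakerConditionedEncoder()
feats = torch.randn(1, 300, 80)
spk_emb = torch.randn(1, 128)
out = enc(feats, spk_emb)
print(out.shape)  # torch.Size([1, 300, 256])
```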