Automatic Speech Recognition (ASR) systems commonly degrade when the acoustic conditions at test time differ from those seen in training. Hence, it is essential to make ASR systems robust against environmental distortions such as background noise and reverberation. In a multi-stream paradigm, improving robustness requires handling both a variety of unseen single-stream conditions and inter-stream dynamics. Previously, a practical two-stage training strategy was proposed for multi-stream end-to-end ASR, in which Stage-2 builds the multi-stream model on features from the Stage-1 Universal Feature Extractor (UFE). In this paper, as an extension, we introduce a two-stage augmentation scheme focusing on mismatch scenarios: Stage-1 Augmentation addresses the variety of single-stream inputs with data augmentation techniques; Stage-2 Time Masking applies temporal masks to the UFE features of randomly selected streams to simulate diverse stream combinations. During inference, we also present adaptive Connectionist Temporal Classification (CTC) fusion with the help of hierarchical attention mechanisms. Experiments were conducted on two datasets, DIRHA and AMI, each treated as a multi-stream scenario. Compared with the previous training strategy, we report substantial improvements, with relative word error rate reductions of 29.7-59.3% across several unseen stream combinations.
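To make the Stage-2 Time Masking idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each stream's UFE output is a list of frames (each frame a list of floats), that masked frames are set to zero, and that streams are selected independently with a fixed probability. The function name `stream_time_mask` and the parameters `stream_prob`, `num_masks`, and `max_width` are hypothetical choices made for illustration; the abstract does not specify them.

```python
import random


def stream_time_mask(streams, rng, stream_prob=0.5, num_masks=2, max_width=20):
    """Zero out random time spans in randomly selected streams.

    streams: list of streams; each stream is a list of frames,
             and each frame is a list of floats (a feature vector).
    rng:     a random.Random instance, passed in for reproducibility.
    Returns (masked_streams, chosen_indices). The input is not modified.

    Assumptions (not taken from the paper): masked frames are zeros,
    each stream is selected independently with probability stream_prob,
    and mask widths are drawn uniformly from 0..max_width.
    """
    # Copy every frame so that the caller's features stay untouched.
    masked = [[frame[:] for frame in stream] for stream in streams]

    # Select the streams that will receive temporal masks.
    chosen = [i for i in range(len(streams)) if rng.random() < stream_prob]

    for i in chosen:
        num_frames = len(masked[i])
        for _ in range(num_masks):
            # randint is inclusive at both ends.
            width = rng.randint(0, min(max_width, num_frames))
            start = rng.randint(0, num_frames - width)
            for t in range(start, start + width):
                masked[i][t] = [0.0] * len(masked[i][t])

    return masked, chosen


if __name__ == "__main__":
    rng = random.Random(0)
    # Two toy streams of 50 frames with 2-dimensional features.
    toy = [[[1.0, 2.0] for _ in range(50)] for _ in range(2)]
    masked, chosen = stream_time_mask(toy, rng, stream_prob=0.5, max_width=10)
    print("masked streams:", chosen)
```

Masking whole time spans in only some streams means the downstream fusion sees examples where one stream is locally uninformative while the others remain intact, which is one plausible way to read "simulate diverse stream combinations".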