Psychoacoustic studies have shown that locally time-reversed (LTR) speech, i.e., speech in which the signal samples are time-reversed within each short segment, can be accurately recognised by human listeners. This study asks how well a state-of-the-art automatic speech recognition (ASR) system performs on LTR speech. The underlying objective is to explore the feasibility of using LTR speech in the training of end-to-end (E2E) ASR models as a form of data augmentation to improve recognition performance. The investigation starts with experiments to understand the effect of LTR speech on general-purpose ASR. LTR speech with reversed segment durations of 5 ms to 50 ms is rendered and evaluated. For ASR training data augmentation with LTR speech, training sets are created by combining natural speech with different partitions of LTR speech. The efficacy of data augmentation is confirmed by ASR results on speech corpora in various languages and speaking styles. ASR on LTR speech with reversed segment durations of 15 ms to 30 ms is found to have a lower error rate than with other segment durations. Data augmentation with LTR speech in this duration range achieves satisfactory and consistent improvement in ASR performance.
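The local time reversal described above can be illustrated with a minimal sketch. It assumes a mono waveform held in a NumPy array, non-overlapping segments, and a trailing partial segment that is simply reversed as well; the abstract does not specify these details, and the function name `locally_time_reverse` and its parameters are hypothetical, not taken from the study.

```python
import numpy as np


def locally_time_reverse(signal, sample_rate, segment_ms):
    """Reverse the samples inside each consecutive segment of `segment_ms`.

    Assumptions (not stated in the abstract): segments do not overlap,
    and a shorter final segment is reversed like any other.
    """
    signal = np.asarray(signal)
    # Segment length in samples, at least one sample.
    seg_len = max(1, int(round(sample_rate * segment_ms / 1000.0)))
    out = np.empty_like(signal)
    for start in range(0, len(signal), seg_len):
        end = start + seg_len
        # Slicing clips at the array end, so the last chunk may be shorter.
        out[start:end] = signal[start:end][::-1]
    return out


# Toy input: 10 samples at 1 kHz, reversed in 4 ms (4-sample) segments.
toy = np.arange(10)
toy_ltr = locally_time_reverse(toy, sample_rate=1000, segment_ms=4)
```

With a 16 kHz recording, a 20 ms segment corresponds to 320 samples, so values in the 15 ms to 30 ms range reported above map to 240 to 480 samples per reversed segment. Applying the function twice with the same settings restores the original signal, which is a convenient sanity check.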