One limitation of end-to-end automatic speech recognition (ASR) frameworks is that performance degrades when train-test utterance lengths are mismatched. In this paper, we propose a random utterance concatenation (RUC) method to alleviate the train-test utterance length mismatch for short-video speech recognition. Specifically, we are motivated by the observation that our human-transcribed training utterances of short-video spontaneous speech tend to be much shorter (~3 seconds on average), while test utterances produced by the voice activity detection front-end are much longer (~10 seconds on average). Such a mismatch leads to sub-optimal performance. Experimentally, the proposed RUC method achieves its best word error rate reduction (WERR) with roughly a three-fold increase in training data size and two utterances concatenated per augmented example. In practice, the proposed method consistently outperforms strong baseline models, achieving a 3.64% average WERR across 14 languages.
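To make the augmentation concrete, the following is a minimal sketch of random utterance concatenation as described above: short training utterances are randomly sampled in pairs and joined (audio and transcript alike) until the corpus is about three times its original size. The function name, parameters, and data representation are illustrative assumptions, not the paper's exact implementation.

```python
import random
import numpy as np

def random_utterance_concatenation(utterances, num_concat=2,
                                   expansion_factor=3, seed=0):
    """Sketch of RUC (assumed interface): `utterances` is a list of
    (waveform, transcript) pairs; returns the original corpus plus
    randomly concatenated utterances, ~expansion_factor times larger."""
    rng = random.Random(seed)
    augmented = list(utterances)  # keep the original short utterances
    target = expansion_factor * len(utterances)
    while len(augmented) < target:
        # Draw `num_concat` utterances at random (two, per the abstract).
        picks = rng.sample(utterances, num_concat)
        wave = np.concatenate([w for w, _ in picks])  # join audio
        text = " ".join(t for _, t in picks)          # join transcripts
        augmented.append((wave, text))
    return augmented
```

With ~3-second training utterances, concatenating two at a time yields ~6-second examples, closer to the ~10-second VAD-segmented test utterances, which is the length-matching effect the method targets.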