Large-vocabulary Audio-visual Speech Recognition in Noisy Environments

Audio-visual speech recognition (AVSR) can significantly improve the recognition rates of small-vocabulary systems compared to their audio-only counterparts. For large-vocabulary systems, however, difficulties remain, such as unsatisfactory video recognition accuracy, that make it hard to improve over audio-only baselines. In this paper, we specifically consider such scenarios, focusing on the large-vocabulary task of the LRS2 database, where audio-only performance is far superior to video-only accuracy, making this an interesting and challenging setup for multi-modal integration. To address the inherent difficulties, we propose a new fusion strategy: a recurrent integration network is trained to fuse the state posteriors of multiple single-modality models, guided by a set of model-based and signal-based stream reliability measures. During decoding, this network is used for stream integration within a hybrid recognizer, where it can cope with the time-variant reliability and information content of its multiple feature inputs. We compare the results with end-to-end AVSR systems as well as with competitive hybrid baseline models, finding that the new fusion strategy shows superior results, on average even outperforming oracle dynamic stream weighting, which has so far marked the (realistically unachievable) upper bound for standard stream weighting. Even though the pure lipreading performance is low, audio-visual integration is helpful under all conditions: clean, noisy, and reverberant. On average, the new system achieves a relative word error rate reduction of 42.18% compared to the audio-only model, indicating that the proposed integration approach is highly effective.
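
For orientation, the following is a minimal sketch of conventional stream weighting, the baseline family that the abstract contrasts with the proposed network. It assumes the common log-linear form, in which per-stream state log-posteriors are combined with per-frame weights and then renormalized. It is not the recurrent integration network itself, which learns the fusion from reliability measures. The function names `fuse_frame` and `relative_wer_reduction` are hypothetical, and all numbers are illustrative values, not results from the paper.

```python
import math


def fuse_frame(stream_log_posteriors, weights):
    """Log-linear fusion of state log-posteriors for a single frame.

    stream_log_posteriors: one list of per-state log-probabilities per stream
    weights: one non-negative weight per stream for this frame (assumed given)
    Returns the fused log-posteriors, renormalized to sum to one.
    """
    num_states = len(stream_log_posteriors[0])
    # Weighted sum of log-posteriors per state (product of powers in the linear domain).
    fused = [
        sum(w * lp[s] for w, lp in zip(weights, stream_log_posteriors))
        for s in range(num_states)
    ]
    # Renormalize with a numerically stable log-sum-exp.
    peak = max(fused)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in fused))
    return [v - log_norm for v in fused]


def relative_wer_reduction(wer_baseline, wer_system):
    """Relative word error rate reduction in percent."""
    return 100.0 * (wer_baseline - wer_system) / wer_baseline


# Illustrative posteriors over three states for one frame (made-up values).
audio = [math.log(p) for p in (0.7, 0.2, 0.1)]
video = [math.log(p) for p in (0.4, 0.5, 0.1)]

# Trusting audio more keeps the audio stream's preferred state.
audio_heavy = fuse_frame([audio, video], weights=(0.8, 0.2))
# Trusting only video switches to the video stream's preferred state.
video_only = fuse_frame([audio, video], weights=(0.0, 1.0))

print(audio_heavy.index(max(audio_heavy)))
print(video_only.index(max(video_only)))
print(relative_wer_reduction(10.0, 5.0))
```

In dynamic stream weighting the weights change from frame to frame, and the oracle variant chooses them with knowledge of the reference. The approach described in the abstract replaces this fixed combination rule with a trained recurrent network.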
