For researchers in AI, machine learning, deep learning, data mining, or mathematics: the following is the title and abstract of a paper, together with its translation. Title: Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

This paper addresses Audio-Visual Speech Recognition (AVSR) under multimodal input corruption, where both the audio and the visual inputs are corrupted, a setting that previous research has not addressed well. Previous studies have focused on complementing corrupted audio inputs with clean visual inputs, assuming that clean visual inputs are available. In real life, however, clean visual inputs are not always accessible and can themselves be corrupted by occluded lip regions or noise. Thus, we first show that previous AVSR models are in fact not robust to corruption of both input streams, audio and visual, compared to uni-modal models. Then, we design multimodal input corruption modeling to develop robust AVSR models. Lastly, we propose a novel AVSR framework, the Audio-Visual Reliability Scoring module (AV-RelScore), that is robust to corrupted multimodal inputs. AV-RelScore can determine which input modal stream is reliable for the prediction and can exploit the more reliable streams when predicting. The effectiveness of the proposed method is evaluated with comprehensive experiments on the popular benchmark databases LRS2 and LRS3. We also show that the reliability scores obtained by AV-RelScore reflect the degree of corruption well and lead the proposed model to focus on the reliable multimodal representations.
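
The abstract does not say how the corruption modeling or the reliability scoring is implemented, so the following is only a toy sketch of the two ideas under stated assumptions, not the paper's AV-RelScore. It assumes a video frame is a small 2-D grid of pixel values, audio is a list of samples, and each stream's reliability is a sigmoid of a logit that a learned network would normally predict. All function names and parameters here (`occlude_frame`, `add_noise`, `fuse`, `audio_logit`, `visual_logit`) are hypothetical.

```python
import math
import random


def occlude_frame(frame, top, left, size, fill=0.0):
    """Return a copy of a 2-D frame with a square patch overwritten (toy occlusion)."""
    out = [row[:] for row in frame]
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[r]))):
            out[r][c] = fill
    return out


def add_noise(samples, sigma, rng):
    """Add zero-mean Gaussian noise to a list of samples (toy audio corruption)."""
    return [s + rng.gauss(0.0, sigma) for s in samples]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(audio_feat, visual_feat, audio_logit, visual_logit):
    """Scale each stream by a reliability score in (0, 1), then concatenate.

    Assumption: in a trained model the logits would be predicted from the
    features themselves; here they are passed in by hand.
    """
    a_score = sigmoid(audio_logit)
    v_score = sigmoid(visual_logit)
    fused = [a_score * x for x in audio_feat] + [v_score * x for x in visual_feat]
    return fused, a_score, v_score


# Usage: corrupt both streams, then fuse with hand-set reliability logits.
rng = random.Random(0)
frame = [[1.0] * 4 for _ in range(4)]
occluded = occlude_frame(frame, top=1, left=1, size=2)
noisy_audio = add_noise([0.0, 0.5, -0.5], sigma=0.1, rng=rng)
fused, a_score, v_score = fuse([1.0, 2.0], [3.0, 4.0],
                               audio_logit=2.0, visual_logit=-2.0)
```

With these hand-set logits the audio stream receives the higher score, so its features dominate the fused vector. A model in the spirit of the abstract would instead learn to lower the score of whichever stream is corrupted.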

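Translation:

Title: Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

Abstract: This paper studies audio-visual speech recognition (AVSR) under multimodal input corruption, where both the audio input and the visual input are corrupted, a setting that earlier research directions have not handled well. Earlier work mainly asked how clean visual input can compensate for corrupted audio input, assuming that clean visual input is available. In practice, clean visual input is not always available and may itself be corrupted by occluded lip regions or noise. We therefore first show that, compared with uni-modal models, earlier AVSR models are in fact not robust when both input streams are corrupted. We then design multimodal input corruption modeling to develop robust AVSR models. Finally, we propose a new AVSR framework, the Audio-Visual Reliability Scoring module (AV-RelScore), which is robust to corrupted multimodal inputs. AV-RelScore can determine which input modal stream is reliable for the prediction and which is not, and can use the more reliable streams to make the prediction. We evaluate the effectiveness of the proposed method with comprehensive experiments on the popular benchmark databases LRS2 and LRS3. We also show that the reliability scores produced by AV-RelScore reflect the degree of corruption well and make the proposed model focus on the reliable multimodal representations.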