Lipreading, also known as visual speech recognition, aims to identify speech content from video by analyzing the visual deformations of the lips and nearby facial regions. One of the significant obstacles for research in this field is the lack of proper datasets for a wide variety of languages: so far, lipreading methods have focused only on English or Chinese. In this paper, we introduce a naturally distributed, large-scale benchmark for lipreading in the Russian language, named LRWR, which contains 235 classes and 135 speakers. We provide a detailed description of the dataset collection pipeline and the dataset statistics. We also present a comprehensive comparison of the currently popular lipreading methods on LRWR and conduct a detailed analysis of their performance. The results demonstrate differences between the benchmarked languages and suggest several promising directions for fine-tuning lipreading models. Building on these findings, we also achieve new state-of-the-art results on the LRW benchmark.