In recent years, significant progress has been made in automatic lip reading, but these methods rely on large-scale datasets that do not exist for many low-resource languages. In this paper, we present a new multipurpose audio-visual dataset for Persian. The dataset consists of almost 220 hours of video from 1760 speakers. Beyond lip reading, the dataset is also suitable for automatic speech recognition, audio-visual speech recognition, and speaker recognition, and it is the first large-scale lip reading dataset in Persian. We provide a baseline method for each of these tasks. In addition, we propose a technique for detecting visemes (the visual equivalents of phonemes) in Persian. The visemes obtained by this method improve lip reading accuracy by a relative 7% over previously proposed visemes, and the technique can be applied to other languages as well.