Inspired by the way humans comprehend speech multi-modally, various audio-visual datasets have been constructed. However, most existing datasets focus on English, introduce dependencies on various prediction models during dataset preparation, and contain only a small number of multi-view videos. To mitigate these limitations, we recently developed the Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset, the largest among publicly available audio-visual speech datasets. The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers, recorded in a studio setup from nine different viewpoints and under various noise conditions. We also provide pre-trained baseline models for two tasks: audio-visual speech recognition and lip reading. We conducted experiments with these models to verify the effectiveness of multi-modal and multi-view training over uni-modal and frontal-view-only training. We expect the OLKAVS dataset to facilitate multi-modal research in broader areas such as Korean speech recognition, speaker recognition, pronunciation-level classification, and mouth motion analysis.