OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

Inspired by the multi-modal way humans comprehend speech, researchers have constructed various audio-visual datasets. However, most existing datasets focus on English, depend on various prediction models during dataset preparation, and contain only a small number of multi-view videos. To mitigate these limitations, we recently developed the Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset, which is the largest among publicly available audio-visual speech datasets. The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers, recorded in a studio setup from nine different viewpoints and under various noise conditions. We also provide pre-trained baseline models for two tasks, audio-visual speech recognition and lip reading. We conducted experiments with these models to verify the effectiveness of multi-modal and multi-view training over uni-modal and frontal-view-only training. We expect the OLKAVS dataset to facilitate multi-modal research in broader areas such as Korean speech recognition, speaker recognition, pronunciation level classification, and mouth motion analysis.
