Despite tremendous progress in natural language processing using deep learning techniques in recent years, sign language production and comprehension have advanced very little. One critical barrier is the lack of large-scale datasets available to the public, owing to the prohibitive cost of generating labeled data. Efforts to provide public data for American Sign Language (ASL) comprehension have yielded two datasets, comprising more than a thousand video clips. These datasets are large enough to enable a meaningful start to deep learning research on sign languages but are far too small to lead to any solution that can be practically deployed. So far, there is still no suitable dataset for ASL production. We propose a system that can generate large-scale datasets for continuous ASL. It is suitable for general ASL processing and is particularly useful for ASL production. The continuous ASL dataset contains English-labeled human articulations in condensed body pose data formats. To better serve the research community, we are releasing the first version of our ASL dataset, which contains 30k sentences and 416k words, with a vocabulary of 18k words, totaling 104 hours. This is the largest continuous sign language dataset published to date in terms of video duration. We also describe a system that can evolve and expand the dataset to incorporate better data processing techniques and more content when available. It is our hope that releasing this ASL dataset and the sustainable dataset generation system to the public will propel better deep learning research in ASL natural language processing.