Personal assistants, automatic speech recognizers, and dialogue understanding systems are becoming increasingly critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots over very-high-frequency radio channels. Incorporating these novel technologies into ATC, a low-resource domain, requires large-scale annotated datasets for developing data-driven AI systems. Two examples of such systems are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research in the challenging ATC field, which has lagged behind due to a lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotation of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, a signal-to-noise ratio estimate, and an English language detection score per sample. Both are available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we offer for free at https://www.atco2.org/data. We expect the ATCO2 corpus to foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.