Automatic Speech Recognition (ASR) systems for air traffic control are generally trained by pooling Air Traffic Controller (ATCO) and pilot data into a single set. This practice is motivated by the fact that pilots' voice communications are scarcer than ATCOs'. Because of this data imbalance and other factors (e.g., varying acoustic conditions), speech from ATCOs is usually recognized more accurately than speech from pilots. Automatically identifying speaker roles is a challenging task, especially for noisy voice recordings collected with Very High Frequency (VHF) receivers, or when the push-to-talk (PTT) signal is unavailable, i.e., when both audio channels are mixed. In this work, we propose to (1) automatically segment ATCO and pilot data with an intuitive approach that exploits ASR transcripts and (2) subsequently treat the automatic recognition of ATCOs' and pilots' speech as two separate tasks. Our work is performed on VHF audio data with high noise levels, i.e., signal-to-noise ratios (SNR) below 15 dB, as such data is recognized to be helpful for various speech-based machine-learning tasks. For the speaker role identification task, the module is a simple yet efficient knowledge-based system that exploits a grammar defined by the International Civil Aviation Organization (ICAO). The system accepts text as input, either manually verified annotations or automatically generated transcripts. The developed approach achieves an average speaker role identification accuracy of about 83%. Finally, we show that training acoustic models for ASR separately (i.e., one model for ATCOs and one for pilots) or using a multitask approach is well suited to noisy data and outperforms the traditional ASR system in which all data is pooled together.
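To make the idea of a text-based, knowledge-driven speaker role identifier more concrete, the following is a minimal sketch in Python. It is not the grammar or system described above: the cue lists `ATCO_CUES` and `PILOT_CUES` and the function `identify_role` are hypothetical, chosen only for illustration. The sketch assumes that controllers tend to issue instructions in the imperative ("descend", "contact"), whereas pilots tend to read them back in the progressive form ("descending") or use acknowledgements such as "wilco".

```python
import re

# Hypothetical cue words, for illustration only; NOT the ICAO-based
# grammar used in the work summarized above.
ATCO_CUES = {"descend", "climb", "turn", "reduce", "maintain",
             "contact", "squawk", "identified"}
PILOT_CUES = {"descending", "climbing", "turning", "reducing",
              "maintaining", "wilco", "request", "requesting"}


def identify_role(transcript: str) -> str:
    """Guess the speaker role from a transcript (manual or ASR output).

    Counts cue words for each role and returns "ATCO", "pilot", or
    "unknown" when the counts are tied (including when no cue is found).
    """
    tokens = re.findall(r"[a-z]+", transcript.lower())
    atco_score = sum(token in ATCO_CUES for token in tokens)
    pilot_score = sum(token in PILOT_CUES for token in tokens)
    if atco_score > pilot_score:
        return "ATCO"
    if pilot_score > atco_score:
        return "pilot"
    return "unknown"


if __name__ == "__main__":
    print(identify_role("lufthansa three two one descend flight level eight zero"))
    print(identify_role("descending flight level eight zero lufthansa three two one"))
```

Because such a classifier operates on text alone, it can be applied to manually verified annotations and to automatically generated transcripts alike; a realistic system would need a far richer rule set than this word-counting heuristic.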