To evaluate attention-based neural ASR under noisy conditions, the current trend is to present the model with many hours of varied noisy speech and measure the overall word/phoneme error rate (WER/PER). It remains unclear, however, how these models perform in a cocktail-party setup in which two or more speakers are active. In this paper, we present mixtures of speech signals to a popular attention-based neural ASR model, Listen, Attend and Spell (LAS), at different target-to-interference ratios (TIR) and measure the phoneme error rate. In particular, we investigate in detail which phoneme is predicted when two phonemes are mixed; in this fashion we build a model that gives the most probable predictions for each phoneme. We found a 65% relative increase in PER when LAS was presented with mixed speech signals at TIR = 0 dB, with performance approaching the unmixed scenario at TIR = 30 dB. Our results show that the model, when presented with mixed phoneme signals, tends to predict the phonemes that have higher accuracies in the evaluation of the original (unmixed) phoneme signals.
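As an illustration of the mixing step described above, the following is a minimal sketch of how two signals could be combined at a chosen TIR. It assumes TIR is defined as the ratio of mean signal powers, in dB, over the overlapping samples; the abstract does not specify the exact definition or implementation used, and the function names `mean_power`, `mix_at_tir`, and `measure_tir_db` are hypothetical.

```python
import math


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_tir(target, interferer, tir_db):
    """Mix two signals so the target-to-interference ratio equals tir_db.

    Assumption: TIR is the ratio of mean powers in dB, computed over the
    overlapping samples. The interferer is rescaled; the target is unchanged.
    Returns (mixture, scaled_interferer).
    """
    n = min(len(target), len(interferer))
    if n == 0:
        raise ValueError("both signals must be non-empty")
    target = list(target[:n])
    interferer = list(interferer[:n])

    p_target = mean_power(target)
    p_interf = mean_power(interferer)
    if p_target == 0 or p_interf == 0:
        raise ValueError("signals must have non-zero power")

    # Solve 10*log10(p_target / (gain^2 * p_interf)) = tir_db for gain.
    gain = math.sqrt(p_target / (p_interf * 10 ** (tir_db / 10)))
    scaled = [gain * x for x in interferer]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, scaled


def measure_tir_db(target, interferer):
    """Target-to-interference ratio in dB, from mean powers."""
    return 10 * math.log10(mean_power(target) / mean_power(interferer))


if __name__ == "__main__":
    # Synthetic tones stand in for two speech signals.
    target = [math.sin(0.10 * i) for i in range(1000)]
    interferer = [0.3 * math.cos(0.37 * i) for i in range(1200)]
    for tir in (0, 10, 20, 30):
        mixture, scaled = mix_at_tir(target, interferer, tir)
        print(tir, round(measure_tir_db(target, scaled), 6))
```

Under this definition, TIR = 0 dB means the two signals have equal power in the mixture, while at TIR = 30 dB the interferer carries one thousandth of the target's power, which is consistent with performance approaching the unmixed case.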