Commonly used speech corpora do not adequately challenge academic and commercial ASR systems. In particular, these corpora lack the metadata needed for detailed analysis and WER measurement. In response, we present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. This corpus is intended to benchmark ASR systems in the wild, with particular attention to named entity recognition. We benchmark four commercial ASR models, two internal models built with open-source tools, and an open-source LibriSpeech model, and discuss their differences in performance on Earnings-21. Using our recently released fstalign tool, we provide a candid analysis of each model's recognition capabilities under different partitions. Our analysis finds that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage. Earnings-21 bridges academic and commercial ASR system evaluation and enables further research on entity modeling and WER on real-world audio.