In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present in the acoustic signal but absent in transcription. Here we propose a new STT task: end-to-end neural transcription with fully formatted text for target labels. We present baseline Conformer-based models trained on a corpus of 5,000 hours of professionally transcribed earnings calls, achieving a character error rate (CER) of 1.7%. As a contribution to the STT research community, we release the corpus free for non-commercial use at https://datasets.kensho.com/datasets/scribe.
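Because the targets are fully formatted, casing and punctuation count toward the reported character error rate. As a rough illustration only (this is a minimal sketch based on the standard Levenshtein definition, not the paper's evaluation code), CER can be computed as the character-level edit distance between hypothesis and reference, normalized by reference length:

```python
# Minimal CER sketch: character-level Levenshtein distance divided by reference length.
# Illustrative only; not the authors' evaluation implementation.

def char_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming edit distance between the two character sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitution (free if characters match)
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)

# With fully formatted targets, dropped capitalization and punctuation are errors too:
print(char_error_rate("Revenue grew 5% in Q3.", "revenue grew 5 percent in Q3"))
```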