Spoken language understanding (SLU) tasks are usually solved by first transcribing an utterance with automatic speech recognition (ASR) and then feeding the output to a text-based model. Recent advances in self-supervised representation learning for speech data have focused on improving the ASR component. We investigate whether representation learning for speech has matured enough to replace ASR in SLU. We compare learned speech features from wav2vec 2.0, state-of-the-art ASR transcripts, and the ground-truth text as input for a novel speech-based named entity recognition task, a cardiac arrest detection task on real-world emergency calls, and two existing SLU benchmarks. We show that learned speech features are superior to ASR transcripts on the three classification tasks. For machine translation, ASR transcripts are still the better choice. We highlight the intrinsic robustness of wav2vec 2.0 representations to out-of-vocabulary words as key to the better performance.
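To make the comparison concrete, below is a minimal sketch of the two input pipelines contrasted in the abstract: (a) ASR transcripts fed to a text-based model versus (b) learned wav2vec 2.0 features used directly. It assumes the HuggingFace transformers library; the checkpoint name, mean pooling, and overall setup are illustrative assumptions rather than the exact configuration used in the experiments.

```python
# Sketch of the two SLU input pipelines (illustrative, not the paper's exact setup):
# (a) ASR transcript -> text-based model, (b) learned wav2vec 2.0 features -> classifier.
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-base-960h"  # assumed checkpoint for illustration
processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)


def asr_transcript(waveform: np.ndarray, sampling_rate: int = 16000) -> str:
    """Pipeline (a): transcribe the utterance, then hand the text to a text-based model."""
    asr = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT)
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = asr(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]


def speech_features(waveform: np.ndarray, sampling_rate: int = 16000) -> torch.Tensor:
    """Pipeline (b): skip ASR and use the learned wav2vec 2.0 representations directly."""
    encoder = Wav2Vec2Model.from_pretrained(CHECKPOINT)
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(inputs.input_values).last_hidden_state  # (1, frames, 768)
    # Mean pooling over time gives a single utterance-level vector for a classifier.
    return hidden.mean(dim=1)


# Usage: `waveform` is a 1-D float array sampled at 16 kHz (e.g. loaded with soundfile).
```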