As dialogue systems become more interactional and social, accurate automatic speech recognition (ASR) of conversational speech is of increasing importance. This shifts the focus from short, spontaneous, task-oriented dialogues to the much more complex casual face-to-face conversations. However, collecting and annotating such conversations is time-consuming, and data are sparse for this speaking style. This paper presents ASR experiments with read and conversational Austrian German as the target. To deal with the limited resources available for conversational German and, at the same time, with the large variation among speakers in pronunciation characteristics, we improve a Kaldi-based ASR system by incorporating a (large) knowledge-based pronunciation lexicon, while exploring different data-based methods to restrict the number of pronunciation variants for each lexical entry. We achieve a best word error rate (WER) of 0.4% on Austrian German read speech and a best average WER of 48.5% on conversational speech. We find that using our best pronunciation lexicon yields performance similar to that obtained by increasing the size of the language-model training data by approximately 360% to 760%. Our findings indicate that for low-resource scenarios, despite the general trend in speech technology towards using data-based methods only, knowledge-based approaches are a successful, efficient method.
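To make the idea of restricting pronunciation variants with a data-based criterion more concrete, the following is a minimal sketch and not the method used in the paper. It assumes that a count is available for how often each variant of a word was observed (for example, from a forced alignment), and it keeps only the most frequent variants per lexical entry. The function name `prune_lexicon`, the parameters `max_variants` and `min_rel_freq`, and the example pronunciations are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple


def prune_lexicon(
    lexicon: Dict[str, List[str]],
    variant_counts: Dict[Tuple[str, str], int],
    max_variants: int = 3,
    min_rel_freq: float = 0.1,
) -> Dict[str, List[str]]:
    """Keep at most `max_variants` pronunciations per word.

    A variant is kept only if its relative frequency among all observed
    variants of that word is at least `min_rel_freq` (hypothetical criterion).
    """
    pruned: Dict[str, List[str]] = {}
    for word, variants in lexicon.items():
        if not variants:
            continue  # nothing to prune for an empty entry
        counts = {v: variant_counts.get((word, v), 0) for v in variants}
        total = sum(counts.values())
        if total == 0:
            # Word never observed: fall back to the first listed variant,
            # assumed here to be the canonical pronunciation.
            pruned[word] = [variants[0]]
            continue
        # Stable sort: ties keep the order given in the lexicon.
        ranked = sorted(variants, key=lambda v: counts[v], reverse=True)
        kept = [v for v in ranked if counts[v] / total >= min_rel_freq]
        # Always keep at least the most frequent variant.
        pruned[word] = kept[:max_variants] or [ranked[0]]
    return pruned


# Illustrative input; the phone strings are made up for this sketch.
example_lexicon = {
    "haben": ["h a: b @ n", "h a: b m", "h a m"],
    "nicht": ["n I C t", "n E t"],
}
example_counts = {
    ("haben", "h a: b @ n"): 2,
    ("haben", "h a: b m"): 10,
    ("haben", "h a m"): 8,
    ("nicht", "n E t"): 5,
}
pruned_lexicon = prune_lexicon(example_lexicon, example_counts, max_variants=2)
```

Both thresholds trade coverage of speaker variation against confusability between entries. Suitable values would have to be tuned on held-out data.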