Recent research using pre-trained transformer models suggests that just 10 minutes of transcribed speech may be enough to fine-tune such a model for automatic speech recognition (ASR) -- at least if we can also leverage vast amounts of text data (803 million tokens). But is that much text data necessary? We study the use of different amounts of text data, both for creating a lexicon that constrains ASR decoding to possible words (e.g. *dogz vs. dogs), and for training larger language models that bias the system toward probable word sequences (e.g. too dogs vs. two dogs). We perform experiments using 10 minutes of transcribed speech from English (to replicate prior work) and two additional pairs of languages differing in the availability of supplemental text data: Gronings and Frisian (~7.5M token corpora available), and Besemah and Nasal (only small lexica available). For all languages, we found that using only a lexicon did not appreciably improve ASR performance. For Gronings and Frisian, however, lexica and language models derived from 'novel-length' 80k token subcorpora reduced the word error rate (WER) to 39% on average. Our findings suggest that, where a text corpus in the upper tens of thousands of tokens or more is available, fine-tuning a transformer model with just tens of minutes of transcribed speech holds promise for obtaining human-correctable transcriptions near the 30% WER rule of thumb.
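
To make the decoding setup concrete, the sketch below shows one way to combine a fine-tuned wav2vec 2.0-style model with a word list (the lexicon) and an n-gram language model at decode time. It is a minimal illustration under assumed tooling: the checkpoint path, file names, and the choice of pyctcdecode with a KenLM model are our own placeholders, not the paper's exact pipeline, and pyctcdecode applies the word list as a soft unigram prior rather than a hard lexicon constraint.

```python
# Illustrative sketch only: lexicon- and LM-assisted CTC decoding for a
# fine-tuned wav2vec 2.0 model. Checkpoint name, file paths, and the use of
# pyctcdecode/KenLM are assumptions, not the paper's exact toolchain.
import soundfile as sf
import torch
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "path/to/finetuned-wav2vec2"   # hypothetical fine-tuned checkpoint
LEXICON_PATH = "words.txt"                # one licit word form per line
KENLM_PATH = "lm.arpa"                    # n-gram LM trained on the text corpus

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

# Character vocabulary in index order, as pyctcdecode expects.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# The word list plays the role of the lexicon; the KenLM model biases
# decoding toward probable word sequences. Drop either one to ablate it.
with open(LEXICON_PATH, encoding="utf-8") as f:
    lexicon = [line.strip() for line in f if line.strip()]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path=KENLM_PATH,  # set to None for a lexicon-only run
    unigrams=lexicon,
)

# Decode a single utterance.
speech, sr = sf.read("utterance.wav")
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits[0]
log_probs = torch.log_softmax(logits, dim=-1).numpy()
print(decoder.decode(log_probs))
```

Swapping in different lexica or language models (e.g. ones built from 80k-token subcorpora versus the full corpus) only changes the files passed to the decoder, which is what makes the text-data ablations in the paper straightforward to run.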