In this paper, we present a transcribed corpus of meetings of the LIBE committee of the European Parliament, totalling 3.6 million running words. The meetings of parliamentary committees of the EU are a potentially valuable source of information for political scientists, but the data is not readily available because it is disclosed only as speech recordings with limited metadata. The meetings are in English, spoken partly by non-native speakers and partly by interpreters. We investigated which Automatic Speech Recognition (ASR) model is most appropriate for creating an accurate text transcription of the audio recordings, in order to make their content available for research and analysis. We focused on unsupervised domain adaptation of the ASR pipeline. Building on the transformer-based Wav2vec2.0 model, we experimented with multiple acoustic models, multiple language models, and the addition of domain-specific terms. We found that a domain-specific acoustic model and a domain-specific language model substantially improve the ASR output, reducing the word error rate (WER) from 28.22 to 17.95. Using domain-specific terms in the decoding stage did not improve ASR quality in terms of WER. Initial topic modelling results indicate that the corpus is useful for downstream analysis tasks. We release the resulting corpus and our analysis pipeline for future research.
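For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` and the example sentences are illustrative assumptions, not part of the released pipeline, and the text normalisation applied before scoring is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenisation is a plain
    whitespace split, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, adapted: float) -> float:
    """Fraction by which the adapted score improves on the baseline."""
    return (baseline - adapted) / baseline


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the committee meets today",
                          "the committee meet today"))
    # Relative improvement implied by the two WER figures reported above.
    print(round(relative_reduction(28.22, 17.95), 4))
```

Applied to the two reported figures, `relative_reduction(28.22, 17.95)` gives a relative WER reduction of roughly 36%, which is a convenient way to compare the adapted pipeline against the baseline.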