We present the Swiss Parliaments Corpus (SPC), an automatically aligned corpus of Swiss German speech paired with Standard German text. This first version of the corpus is based on publicly available data from the Bernese cantonal parliament and comprises 293 hours of data. It was created using a novel forced sentence alignment procedure and an alignment quality estimator, which can be used to trade off corpus size against quality. We trained Automatic Speech Recognition (ASR) models as baselines on different subsets of the data and achieved a Word Error Rate (WER) of 0.278 and a BLEU score of 0.586 on the SPC test set. The corpus is freely available for download.
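For readers unfamiliar with the WER metric reported above, the following is a minimal sketch of its standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not state which text normalization or tooling was used for the reported score.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly;
    any normalization used for the reported results is not reproduced here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One reference word ("ein") is missing from the hypothesis.
    print(word_error_rate("das ist ein test", "das ist test"))
```

Under this definition, a WER of 0.278 corresponds to roughly 28 word-level errors per 100 reference words. Because insertions are counted, the value can exceed 1.0 for very poor hypotheses.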