In this case study, we trained and published a state-of-the-art open-source model for German Automatic Speech Recognition (ASR) to evaluate the current potential of this technology for use in the larger context of Digital Humanities and cultural heritage indexing. Along with this paper, we publish our wav2vec2-based speech-to-text model and evaluate its performance on a corpus of historical recordings that we assembled, comparing it against commercial cloud-based and proprietary services. While our model achieves moderate results, proprietary cloud services fare significantly better. Our results show that recognition rates above 90 percent can currently be achieved; however, these numbers drop quickly once recordings have limited audio quality or contain non-everyday or outdated language. A major issue is the wide variety of dialects and accents in the German language. Nevertheless, this paper highlights that the currently available recognition quality is high enough to address various use cases in the Digital Humanities. We argue that ASR will become a key technology for the documentation and analysis of audio-visual sources, and we identify an array of important questions that the DH community and cultural heritage stakeholders will have to address in the near future.