Transcription of legal proceedings is essential to enabling access to justice. However, speech transcription is an expensive and slow process. In this paper we describe part of a combined research and industrial project to build an automated transcription tool designed specifically for the justice sector in the UK. We explain the challenges involved in transcribing courtroom hearings and the Natural Language Processing (NLP) techniques we employ to tackle them. We show that fine-tuning a generic off-the-shelf pre-trained Automatic Speech Recognition (ASR) system with an in-domain language model, and infusing common phrases extracted with a collocation detection model, not only improves the Word Error Rate (WER) of the transcribed hearings but also avoids critical errors specific to the legal jargon and terminology commonly used in British courts.
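For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the ASR output, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not the evaluation code used in the project; the function name `word_error_rate` and the example sentences are hypothetical, and text normalisation is reduced to lowercasing and whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercasing + whitespace tokenisation is enough for this sketch.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word in a six-word reference.
reference = "the defendant was remanded in custody"
hypothesis = "the defendant was demanded in custody"
wer = word_error_rate(reference, hypothesis)
```

In this illustrative pair a single substitution ("remanded" misrecognised as "demanded") yields a WER of only 1/6, yet it changes the legal meaning of the sentence. That is the kind of critical, domain-specific error that an aggregate WER figure can understate.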