Automatic Speech Recognition (ASR) converts spoken language into text, facilitating interaction between humans and machines. One of its most common applications is Speech-To-Text (STT) transcription, which simplifies user workflows by turning spoken words into written text. In the medical field, STT has the potential to significantly reduce the workload of clinicians who currently rely on typists to transcribe their voice recordings. However, developing an STT model for the medical domain is challenging because sufficiently large medical speech and text datasets are scarce. To address this issue, we propose a medical-domain text correction method that modifies the output of a general-purpose STT system using the Vision Language Pre-training (VLP) method. VLP combines textual and visual information, allowing text to be corrected based on image knowledge. Our extensive experiments demonstrate that the proposed method yields quantitatively and clinically significant improvements in STT performance in the medical field. We further show that multi-modal understanding of image and text information outperforms single-modal understanding that uses text information alone.