We propose a semi-supervised learning method for building end-to-end rich transcription-style automatic speech recognition (RT-ASR) systems from a small-scale rich transcription-style dataset and a large-scale common transcription-style dataset. Spontaneous speech tasks often include various speech phenomena such as fillers, word fragments, laughter, and coughs. While common transcriptions do not explicitly mark these phenomena, rich transcriptions convert them into special phenomenon tokens in addition to textual tokens. In previous studies, the textual and phenomenon tokens were estimated simultaneously in an end-to-end manner. However, it is difficult to build accurate RT-ASR systems because large-scale rich transcription-style datasets are often unavailable. To solve this problem, our training method uses a limited rich transcription-style dataset and a common transcription-style dataset simultaneously. The key process in our semi-supervised learning is to convert the common transcription-style dataset into a pseudo-rich transcription-style dataset. To this end, we introduce style tokens that control whether phenomenon tokens are generated into transformer-based autoregressive modeling. We use this modeling both for generating the pseudo-rich transcription-style dataset and for building the RT-ASR system from the pseudo and original rich transcription-style datasets. Our experiments on spontaneous ASR tasks showed the effectiveness of the proposed method.
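To make the style-token mechanism concrete, the following is a minimal sketch (not the authors' implementation) of how a style token prepended to the decoder target sequence can control whether phenomenon tokens are emitted. The token names "<rich>", "<common>", "<filler>", and "<eos>" are illustrative assumptions; decoding a common-style utterance with the "<rich>" style token is how pseudo-rich labels would be produced.

```python
# Sketch: style-token conditioning of an autoregressive target sequence.
# Token names ("<rich>", "<common>", "<filler>", "<eos>") are hypothetical.

def build_target_sequence(words, phenomena, style):
    """Prepend a style token; keep phenomenon tokens only for the rich style."""
    tokens = [style]
    for word, phenomenon in zip(words, phenomena):
        if phenomenon is not None and style == "<rich>":
            tokens.append(phenomenon)  # e.g. "<filler>", "<laugh>"
        tokens.append(word)
    tokens.append("<eos>")
    return tokens


# Common-style supervision: phenomenon annotations are dropped.
print(build_target_sequence(["well", "I", "think"], ["<filler>", None, None], "<common>"))
# ['<common>', 'well', 'I', 'think', '<eos>']

# Rich-style supervision, or pseudo-rich labels obtained by decoding with "<rich>".
print(build_target_sequence(["well", "I", "think"], ["<filler>", None, None], "<rich>"))
# ['<rich>', '<filler>', 'well', 'I', 'think', '<eos>']
```

Under this view, a single transformer decoder can be trained on both datasets: the common transcription-style data contributes targets conditioned on "<common>", the rich transcription-style data on "<rich>", and pseudo-rich targets for the large common-style corpus are generated by forcing the "<rich>" style token at inference time.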