Automatic Speech Recognition (ASR) systems can be trained to achieve remarkable performance given large amounts of manually transcribed speech, but large labeled data sets can be difficult or expensive to acquire for all languages of interest. In this paper, we review the research literature to identify models and ideas that could lead to fully unsupervised ASR, including unsupervised segmentation of the speech signal, unsupervised mapping from speech segments to text, and semi-supervised models with nominal amounts of labeled examples. The objective of the study is to identify the limitations of what can be learned from speech data alone and to understand the minimum requirements for speech recognition. Identifying these limitations would help optimize the resources and efforts in ASR development for low-resource languages.