Most existing approaches to disfluency detection rely heavily on human-annotated data, which is expensive to obtain in practice. To tackle this training-data bottleneck, we investigate methods for combining multiple self-supervised tasks, i.e., supervised tasks where the data can be collected without manual labeling. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled news data, and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words, and (ii) a sentence classification task to distinguish original sentences from grammatically incorrect ones. We then combine these two tasks to jointly train a single network. The pre-trained network is subsequently fine-tuned using human-annotated disfluency detection training data. Experimental results on the commonly used English Switchboard test set show that our approach achieves performance competitive with previous systems (trained on the full dataset) while using less than 1% (1,000 sentences) of the training data. Trained on the full dataset, our method significantly outperforms previous methods, reducing the error by 21% on English Switchboard.
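The pseudo-data construction step can be made concrete with a short sketch. The following is a minimal illustration of corrupting clean sentences by random word insertion and deletion, and deriving the labels for both pre-training tasks; the function and parameter names (make_pseudo_example, p_add, p_del) and the corruption probabilities are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of pseudo training data construction, assuming
# whitespace-tokenized news sentences. Names and probabilities are
# illustrative, not the paper's exact recipe.
import random

def make_pseudo_example(sentence, vocab, p_add=0.15, p_del=0.15):
    """Randomly add or delete words to create a noisy sentence.

    Returns (noisy_tokens, tags, label):
      tags  -- per-token 0/1 sequence for the tagging task
               (1 = token was randomly inserted, i.e. added noise)
      label -- 0/1 for the sentence-classification task
               (1 = sentence was corrupted, 0 = original)
    """
    tokens = sentence.split()
    noisy, tags = [], []
    corrupted = False
    for tok in tokens:
        if random.random() < p_add:           # insert a random vocabulary word
            noisy.append(random.choice(vocab))
            tags.append(1)
            corrupted = True
        if random.random() < p_del:           # drop the original word entirely
            corrupted = True
            continue
        noisy.append(tok)
        tags.append(0)                         # retained words are not added noise
    return noisy, tags, int(corrupted)

vocab = ["well", "uh", "you", "know", "like", "the", "a"]
print(make_pseudo_example("the cat sat on the mat", vocab))
```

Note that only inserted words can receive tag 1: deleted words no longer appear in the noisy sentence, so deletion affects only the sentence-level label.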
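To illustrate how the two self-supervised tasks can be trained jointly over a shared encoder, here is a minimal PyTorch sketch. The BiLSTM encoder, the class and head names, and the equal weighting of the two losses are assumptions made for illustration; the paper's actual architecture and loss weighting may differ.

```python
# Minimal sketch of joint training on the two self-supervised tasks over a
# shared encoder. Architecture details here are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskDisfluencyNet(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.tag_head = nn.Linear(2 * dim, 2)  # per-token: added noise vs. original
        self.cls_head = nn.Linear(2 * dim, 2)  # per-sentence: corrupted vs. original

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))    # (batch, seq_len, 2*dim)
        tag_logits = self.tag_head(h)                 # tagging task
        cls_logits = self.cls_head(h.mean(dim=1))     # sentence classification task
        return tag_logits, cls_logits

def joint_loss(tag_logits, cls_logits, tag_labels, cls_labels):
    ce = nn.CrossEntropyLoss()
    # Sum the two self-supervised objectives; equal weighting is an assumption.
    return ce(tag_logits.reshape(-1, 2), tag_labels.reshape(-1)) + ce(cls_logits, cls_labels)
```

After pre-training on the pseudo data with this joint objective, the shared encoder (with the tagging head) would be fine-tuned on the human-annotated disfluency detection data.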