Disfluency detection is a critical task in real-time dialogue systems. Despite its importance, it remains relatively unexplored, mainly due to a lack of appropriate datasets. Existing datasets also suffer from several issues, including class imbalance, which can significantly degrade model performance on rare classes, as we demonstrate in this paper. To address this, we propose LARD, a method for generating complex and realistic artificial disfluencies with little effort. The method handles three of the most common types of disfluencies: repetitions, replacements, and restarts. In addition, we release a new large-scale dataset with disfluencies that can be used for four different tasks: disfluency detection, classification, extraction, and correction. Experimental results on the LARD dataset demonstrate that the data produced by the proposed method can be effectively used for detecting and removing disfluencies, while also addressing limitations of existing datasets.