LARD: Large-scale Artificial Disfluency Generation

Disfluency detection is a critical task in real-time dialogue systems. Despite its importance, it remains relatively unexplored, mainly because appropriate datasets are lacking. Existing datasets also suffer from several problems, including class imbalance, which can significantly degrade model performance on rare classes, as demonstrated in this paper. To this end, we propose LARD, a method for generating complex and realistic artificial disfluencies with little effort. The proposed method handles three of the most common types of disfluency: repetitions, replacements, and restarts. In addition, we release a new large-scale dataset of disfluencies that can be used for four different tasks: disfluency detection, classification, extraction, and correction. Experimental results on the LARD dataset demonstrate that data produced by the proposed method can be used effectively for detecting and removing disfluencies, while also addressing limitations of existing datasets.
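
To make the idea of an artificial disfluency concrete, here is a minimal sketch of the simplest of the three types, a repetition, injected into a fluent sentence. It is an illustration only and not the LARD algorithm: the function names `add_repetition`, `random_repetition`, and `remove_disfluency`, the whitespace tokenization, and the 0/1 labeling scheme (1 marks tokens that a correction step should delete) are all assumptions made for this example.

```python
import random


def add_repetition(tokens, start, length):
    """Duplicate tokens[start:start+length] in place to mimic a repetition.

    Returns the disfluent token list and per-token labels, where 1 marks
    the first (redundant) copy of the span and 0 marks fluent tokens.
    The labeling convention is an assumption of this sketch.
    """
    if length < 1 or start < 0 or start + length > len(tokens):
        raise ValueError("span out of range")
    span = tokens[start:start + length]
    disfluent = tokens[:start] + span + tokens[start:]
    labels = [0] * start + [1] * length + [0] * (len(tokens) - start)
    return disfluent, labels


def random_repetition(tokens, rng, max_len=3):
    """Pick a random span of 1..max_len tokens and repeat it."""
    length = rng.randint(1, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    return add_repetition(tokens, start, length)


def remove_disfluency(tokens, labels):
    """Correction step: drop every token labeled as disfluent."""
    return [tok for tok, lab in zip(tokens, labels) if lab == 0]


if __name__ == "__main__":
    fluent = "i want a flight to boston".split()
    noisy, labels = random_repetition(fluent, random.Random(0))
    print(" ".join(noisy))
    print(labels)
    print(" ".join(remove_disfluency(noisy, labels)))
```

The same labels could serve the four tasks named above under this sketch's assumptions: any 1 in the sequence signals detection, the generator's type gives the classification target, the labeled span is the extraction target, and deleting it yields the correction.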
