Building a natural language dataset requires caution, since word semantics are sensitive to subtle changes in the text and to the definition of the annotated concept. This sensitivity appears in generative tasks such as question answering and dialogue generation, as well as in tasks that produce a categorization-based corpus, such as topic classification or sentiment analysis. Open-domain conversations involve two or more crowdworkers freely conversing about any topic, and collecting such data is particularly difficult for two reasons: 1) the dataset should be ``crafted'' rather than ``obtained'' due to privacy concerns, and 2) dialogues created for pay may differ from how crowdworkers behave in real-world settings. In this study, we tackle these issues in creating a large-scale open-domain persona dialogue corpus, where ``persona'' means that the conversation is carried out by several actors, each with a fixed persona, and user-side workers drawn from an unspecified crowd.