Natural language dialogue systems have attracted increasing attention in recent years. Since many dialogue models are data-driven, high-quality datasets are essential to these systems. In this paper, we introduce Pchatbot, a large-scale dialogue dataset consisting of two subsets collected from Weibo and judicial forums, respectively. To adapt the raw data to dialogue systems, we carefully normalize it through processes such as anonymization, deduplication, segmentation, and filtering. The scale of Pchatbot is significantly larger than that of existing Chinese datasets, which can benefit data-driven models. Moreover, current dialogue datasets for personalized chatbots usually contain only a few persona sentences or attributes. In contrast, Pchatbot provides anonymized user IDs and timestamps for both posts and responses. This enables the development of personalized dialogue models that learn implicit user personality directly from a user's dialogue history. Our preliminary experimental study benchmarks several state-of-the-art dialogue models to provide a comparison for future work. The dataset is publicly available on GitHub.