Traditional toxicity detection models have focused on single utterances, without a deeper understanding of context. We introduce CONDA, a new dataset for in-game toxic language detection that supports joint intent classification and slot filling, the core tasks of Natural Language Understanding (NLU). The dataset consists of 45K utterances from 12K conversations, drawn from the chat logs of 1.9K completed Dota 2 matches. We propose a robust dual semantic-level toxicity framework that handles utterance-level and token-level patterns together with rich contextual chat history. Accompanying the dataset is a thorough in-game toxicity analysis, which provides a comprehensive understanding of context at the utterance, token, and dual levels. Inspired by NLU, we also apply its metrics to the toxicity detection tasks to assess both toxicity and game-specific aspects. We evaluate strong NLU models on CONDA and report fine-grained results for the different intent classes and slot classes. Furthermore, we examine how well our dataset covers the nature of toxicity by comparing it with other toxicity datasets.
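To make the dual-level setup concrete, the following is a minimal sketch of how an utterance might carry one intent label plus one slot label per token, and how NLU-style metrics could be computed over such annotations. It is an illustration under stated assumptions rather than the dataset's actual schema or evaluation code: the label names (`TOXIC`, `OTHER`, `T`, `P`, `S`, `O`), the example utterances, and the choice of metrics (intent accuracy, token-level slot F1, and joint accuracy) are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """One chat line with dual-level annotation (hypothetical schema)."""
    tokens: List[str]
    slots: List[str]  # token level: one slot label per token
    intent: str       # utterance level: one intent label


def intent_accuracy(gold: List[Utterance], pred: List[Utterance]) -> float:
    """Fraction of utterances whose intent label is predicted correctly."""
    return sum(g.intent == p.intent for g, p in zip(gold, pred)) / len(gold)


def slot_f1(gold: List[Utterance], pred: List[Utterance], outside: str = "O") -> float:
    """Token-level micro F1 over slot labels, ignoring the 'outside' label."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for gs, ps in zip(g.slots, p.slots):
            if ps != outside and ps == gs:
                tp += 1
            else:
                if ps != outside:
                    fp += 1  # predicted a slot that is wrong
                if gs != outside:
                    fn += 1  # missed (or mislabeled) a gold slot
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def joint_accuracy(gold: List[Utterance], pred: List[Utterance]) -> float:
    """Fraction of utterances with the intent and every slot label correct."""
    correct = sum(
        g.intent == p.intent and g.slots == p.slots for g, p in zip(gold, pred)
    )
    return correct / len(gold)


# Invented examples; the labels are placeholders, not the dataset's tag set.
gold = [
    Utterance(["gg", "wp"], ["S", "S"], "OTHER"),
    Utterance(["you", "idiot"], ["P", "T"], "TOXIC"),
]
pred = [
    Utterance(["gg", "wp"], ["S", "S"], "OTHER"),
    Utterance(["you", "idiot"], ["O", "T"], "TOXIC"),  # one slot missed
]

print(intent_accuracy(gold, pred))
print(slot_f1(gold, pred))
print(joint_accuracy(gold, pred))
```

In this sketch the joint score is stricter than either single-level score: the second utterance has the right intent but one wrong slot, so it counts as an error at the dual level.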