The rise of social media has led to a growing volume of comments on online forums. However, many comments are still invalid in the sense that they are not informative for users. Moreover, such comments can be toxic and harmful to people. In this paper, we create a dataset for constructive and toxic speech detection, named UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection dataset), with 10,000 human-annotated comments. For these tasks, we propose a system for constructive and toxic speech detection built on PhoBERT, the state-of-the-art transfer learning model in Vietnamese NLP. With this system, we obtain F1-scores of 78.59% and 59.40% for classifying constructive and toxic comments, respectively. In addition, we implement various baseline models, including traditional machine learning and deep neural network-based models, to evaluate the dataset. These results can support several tasks on online discussions and the development of a framework for automatically identifying the constructiveness and toxicity of Vietnamese social media comments.