The rise of social media has led to a growing volume of comments on online forums. However, many of these comments are invalid in the sense that they are not informative for users. Moreover, such comments are also quite toxic and harmful to people. In this paper, we create a dataset for constructive and toxic speech detection, named UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection dataset), with 10,000 human-annotated comments. For these tasks, we propose a system for constructive and toxic speech detection built on PhoBERT, a state-of-the-art transfer learning model for Vietnamese NLP. With this system, we achieve F1-scores of 78.59% and 59.40% for identifying constructive and toxic comments, respectively. In addition, to assess the dataset objectively, we implement a variety of baseline models, including traditional machine learning and deep neural network-based models. These results can help address some problems in online discussions and support the development of a framework for automatically identifying constructiveness and toxicity in Vietnamese social media comments.