Offensive language is pervasive in social media. Individuals frequently take advantage of the perceived anonymity of computer-mediated communication to engage in behavior that many of them would not consider in real life. The automatic identification of offensive content online is an important task that has gained increasing attention in recent years. This task can be modeled as a supervised classification problem in which systems are trained on a dataset of posts annotated for the presence of some form(s) of abusive or offensive content. The objective of this study is to describe a classification system built for SemEval-2019 Task 6: OffensEval. The system classifies a tweet as either offensive or not offensive (Sub-task A) and further classifies offensive tweets into categories (Sub-tasks B \& C). We trained machine learning and deep learning models, combined with data preprocessing and sampling techniques, to obtain the best results. The models discussed include Naive Bayes, SVM, Logistic Regression, Random Forest, and LSTM.
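To make the supervised-classification framing concrete, the following is a minimal sketch of one of the classical baselines named above (Logistic Regression over TF-IDF features) for Sub-task A. It is not the authors' exact pipeline; the file name and column names (\texttt{olid-training-v1.0.tsv}, \texttt{tweet}, \texttt{subtask\_a}) are assumptions for illustration, and class weighting merely stands in for the sampling step mentioned in the abstract.

\begin{verbatim}
# Sketch of a Sub-task A baseline: TF-IDF features + Logistic Regression.
# The data path and column names below are assumptions, not the paper's setup.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

# Hypothetical input: a tab-separated file with a 'tweet' text column
# and a 'subtask_a' label column (OFF / NOT).
data = pd.read_csv("olid-training-v1.0.tsv", sep="\t")
X_train, X_test, y_train, y_test = train_test_split(
    data["tweet"], data["subtask_a"],
    test_size=0.2, random_state=42, stratify=data["subtask_a"])

# TF-IDF over word uni/bigrams, then a linear classifier; class_weight
# compensates for the offensive/not-offensive label imbalance.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
model.fit(X_train, y_train)
print("macro-F1:", f1_score(y_test, model.predict(X_test), average="macro"))
\end{verbatim}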