With the rise of voice chat rooms, a vast source of data can be made available to the research community for natural language processing tasks. Moderators in voice chat rooms actively monitor discussions and remove participants who use offensive language. However, this moderation makes hate speech detection even more difficult, since some participants try to find creative ways to articulate hate speech. Hate speech detection is therefore challenging on new social media platforms such as Clubhouse. To the best of our knowledge, all existing hate speech datasets have been collected from text sources such as Twitter. In this paper, we take the first step toward collecting a significant dataset from Clubhouse, a rising star in the social media industry. We analyze the collected instances statistically using Google Perspective scores. Our experiments show that Perspective scores can outperform Bag of Words and Word2Vec as high-level text features.