The SATLab team proposes a system for automatically identifying hate speech and offensive content in tweets. It is based on a classical supervised algorithm fed only with character n-grams and is thus completely language-agnostic. After optimization of the feature weighting and the classifier parameters, it reached, in the multilingual HASOC 2021 challenge, a medium performance level in English, a language for which deep learning approaches relying on many external linguistic resources are easy to develop, but a far better level for the two less-resourced languages, Hindi and Marathi. It even ranks first when performance is averaged over the three tasks in these languages, outperforming many deep learning approaches. These results suggest that it is an interesting reference level against which to evaluate the benefits of more complex approaches, such as deep learning or the use of complementary resources.
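To illustrate why character n-gram features are language-agnostic, here is a minimal sketch of the extraction step only. It is not the SATLab implementation: the function name `char_ngrams`, the n-gram range, and the use of raw counts are assumptions made for illustration, and the abstract does not specify the actual feature weighting or classifier.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max in text.

    Hypothetical helper: no tokenizer, lexicon, or other
    language-specific resource is involved, only a sliding window
    over the characters of the string.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the string.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# The same function applies unchanged to any script.
english_features = char_ngrams("so rude", 2, 3)
hindi_features = char_ngrams("नमस्ते दुनिया", 2, 3)
```

Counts like these would then be weighted and passed to a supervised classifier. Both of those steps are left out here because the abstract only states that they were optimized, not how.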