With the ever-increasing availability of digital information, toxic content is also on the rise, which makes detecting this type of language important. We tackle the problem by combining a state-of-the-art pre-trained language model (CharacterBERT) with a traditional bag-of-words technique. Because such content is full of toxic words that are not written according to their dictionary spelling, attending to individual characters is crucial. We therefore use CharacterBERT to extract features based on the characters of each word. It includes a CharacterCNN module that learns character embeddings from the context, which are then fed into the well-known BERT architecture. The bag-of-words method, in turn, improves on this by making sure that some frequently used toxic words are labeled accordingly.
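The combination described above can be illustrated with a minimal sketch. The abstract does not say how the two signals are merged, so everything here is an assumption made for illustration: the lexicon contents, the 0.5 threshold, the per-token "OR" rule, and the names `TOXIC_LEXICON`, `tokenize`, and `label_tokens` are hypothetical, and `score_token` is a stand-in for whatever toxicity score a CharacterBERT-based model would produce.

```python
import re
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical lexicon of frequently used toxic words (not taken from the source).
TOXIC_LEXICON: FrozenSet[str] = frozenset({"idiot", "stupid", "moron"})


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (letters, digits, underscore)."""
    return re.findall(r"\w+", text.lower())


def label_tokens(
    text: str,
    score_token: Callable[[str], float],
    lexicon: FrozenSet[str] = TOXIC_LEXICON,
    threshold: float = 0.5,
) -> List[Tuple[str, bool]]:
    """Mark a token as toxic if it is in the lexicon OR the model score
    reaches the threshold. The OR rule is an assumed way of combining them."""
    labels = []
    for token in tokenize(text):
        in_lexicon = token in lexicon                   # bag-of-words signal
        model_says_toxic = score_token(token) >= threshold  # model signal
        labels.append((token, in_lexicon or model_says_toxic))
    return labels


def fake_model_score(token: str) -> float:
    """Placeholder for a character-aware model; it only knows one misspelling."""
    return 0.9 if token == "st00pid" else 0.1


if __name__ == "__main__":
    for token, toxic in label_tokens("You are an idiot and st00pid", fake_model_score):
        print(token, toxic)
```

In this sketch the lexicon catches the dictionary spelling "idiot" even though the placeholder model scores it low, while the misspelled "st00pid" is caught only by the model score. That mirrors the division of labor the abstract describes, without claiming to reproduce the authors' actual pipeline.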