In most cases, word embeddings are learned only from raw tokens or, in some cases, from lemmas. This includes pre-trained language models such as BERT. To investigate the potential of capturing deeper relations between lexical items and structures, and to filter out redundant information, we propose to preserve morphological, syntactic and other types of linguistic information by combining them with the raw tokens or lemmas. This means, for example, including part-of-speech or dependency information in the lexical features, so that the word embeddings are trained on these combined units rather than on raw tokens alone. The same method could later be applied to the pre-training of large language models and possibly enhance their performance. This would aid in tackling problems that are more demanding in terms of linguistic representation, such as the detection of cyberbullying.
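As a minimal sketch of this idea (not the implementation used in this work), the snippet below trains embeddings on token|POS combinations with gensim's Word2Vec; the toy pre-tagged corpus and the "|" separator are illustrative assumptions, and dependency labels could be appended to the units in the same way.

```python
# Sketch: train word embeddings on token|POS combinations instead of raw tokens.
# Assumes gensim is installed; the tagged corpus below is a toy example.
from gensim.models import Word2Vec

# Pre-tagged toy corpus: each sentence is a list of (token, POS) pairs.
tagged_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

# Combine each token with its POS tag into a single lexical unit, e.g. "dog|NOUN".
combined_sentences = [
    [f"{token}|{pos}" for token, pos in sentence]
    for sentence in tagged_corpus
]

# Train embeddings on the combined units rather than on raw tokens.
model = Word2Vec(
    sentences=combined_sentences,
    vector_size=50,   # small dimensionality for the toy example
    window=2,
    min_count=1,
    sg=1,             # skip-gram
)

print(model.wv["dog|NOUN"][:5])           # embedding of a combined unit
print(model.wv.most_similar("dog|NOUN"))  # neighbours in the combined space
```

In this setup, the same surface form with different parts of speech (e.g. "bark|NOUN" vs. "bark|VERB") receives distinct vectors, which is precisely the additional linguistic distinction the combined features are meant to preserve.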