The relationships between words in a sentence often tell us more about the underlying semantic content of a document than the individual words themselves. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II. These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings into a single system. In short, our approach makes three main contributions: (i) a set of techniques that fully integrate word embeddings and lexical chains; (ii) a more robust semantic representation that considers the latent relations between words in a document; and (iii) lightweight word embedding models that can be extended to any natural-language task. We assess the knowledge captured by pre-trained models to evaluate their robustness on the document classification task. The proposed techniques are tested against seven word embedding algorithms, using five different machine learning classifiers over six document classification scenarios. Our results show that the integration of lexical chains and word embedding representations sustains state-of-the-art results, even against more complex systems.
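To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' actual implementation) of how lexical chains and word embeddings can be combined: consecutive words are greedily grouped into a chain while each new word's embedding stays close to the chain's running centroid, and each chain is then averaged into a single vector that represents one semantic thread of the document. The toy embeddings, the `chain_document` function, and the similarity threshold are all illustrative assumptions.

```python
# Hypothetical sketch: chaining words by embedding similarity, then
# averaging each chain into one vector. The 2-d toy embeddings below
# are invented for illustration only.
import math

def cosine(u, v):
    # Cosine similarity between two vectors; 0.0 if either is zero-length.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean(vectors):
    # Component-wise average of a non-empty list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def chain_document(words, embeddings, threshold=0.5):
    """Greedily extend the current chain while each new word remains
    similar to the chain's running centroid; otherwise close the chain
    and start a new one. Returns one averaged vector per chain."""
    chains, current = [], []
    for w in words:
        v = embeddings[w]
        if not current or cosine(v, mean(current)) >= threshold:
            current.append(v)
        else:
            chains.append(mean(current))
            current = [v]
    if current:
        chains.append(mean(current))
    return chains

# Toy embeddings: "cat"/"dog" point one way, "stock"/"market" another.
emb = {"cat": [1.0, 0.1], "dog": [0.9, 0.2],
       "stock": [0.1, 1.0], "market": [0.2, 0.9]}
vecs = chain_document(["cat", "dog", "stock", "market"], emb)
# Two chains emerge: one averaging the animal words, one the finance words.
```

A real system would assign each chain centroid a label from a lexical database (e.g. a WordNet synset) and feed the chain vectors, rather than raw word vectors, to a downstream classifier.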