Modelling taxonomic and thematic relatedness is important for building AI with comprehensive natural language understanding. The goal of this paper is to learn more about how taxonomic information is structurally encoded in embeddings. To do this, we design a new hypernym-hyponym probing task and perform a comparative probing study of taxonomic and thematic SGNS and GloVe embeddings. Our experiments indicate that both types of embeddings encode some taxonomic information, but that the amount of this information and the geometric properties of its encoding are independently related to both the encoder architecture and the embedding training data. Specifically, we find that only taxonomic embeddings carry taxonomic information in their norm, which is determined by the underlying distribution in the data.