The development of state-of-the-art (SOTA) Natural Language Processing (NLP) systems has steadily established new techniques for absorbing the statistics of linguistic data. These techniques often trace well-known constructs from traditional theories, and we study these connections to close gaps around key NLP methods and to orient future work. To this end, we introduce an analytic model of the statistics learned by seminal algorithms (including GloVe and Word2Vec), and derive insights for systems that use these algorithms and co-occurrence statistics in general. In this work, we derive, to the best of our knowledge, the first known solution to Word2Vec's softmax-optimized skip-gram algorithm. This result presents exciting potential for future development as a direct solution to a deep learning (DL) language model's (LM's) matrix factorization. Here, however, we use the solution to demonstrate the seemingly universal existence of a property of word vectors that allows for the prophylactic discernment of biases in data, prior to their absorption by DL models. To qualify our work, we conduct an analysis of independence, i.e., of the density of statistical dependencies in co-occurrence models, which in turn yields insight into the partial fulfillment of the distributional hypothesis by co-occurrence statistics.
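For reference, the softmax-optimized skip-gram objective referred to above is the standard Word2Vec formulation; the notation below ($v_w$, $u_c$, $V$) is illustrative and not the paper's own. For a center word $w$ with input vector $v_w$ and a context word $c$ with output vector $u_c$ drawn from a vocabulary $V$, training maximizes the corpus log-likelihood

\[
\mathcal{L} \;=\; \sum_{(w,c)} \log P(c \mid w),
\qquad
P(c \mid w) \;=\; \frac{\exp\!\left(u_c^{\top} v_w\right)}{\sum_{c' \in V} \exp\!\left(u_{c'}^{\top} v_w\right)},
\]

so any optimum ties the bilinear form $u_c^{\top} v_w$ to the corpus's co-occurrence statistics; it is this low-rank matrix factorization to which a closed-form solution would apply.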