Despite the recent successes of deep learning in natural language processing (NLP), there remains widespread usage of and demand for techniques that do not rely on machine learning. The advantage of these techniques is their interpretability and low cost when compared to frequently opaque and expensive machine learning models. Although they may not be as performant in all cases, they are often sufficient for common and relatively simple problems. In this paper, we aim to modernize these older methods while retaining their advantages by extending approaches from categorical or bag-of-words representations to word embedding representations in the latent space. First, we show that entropy and Kullback-Leibler divergence can be efficiently estimated using word embeddings and use this estimation to compare text across several categories. Next, we recast Zipf's law, the heavy-tailed distribution frequently observed in the categorical space, into the latent space. Finally, we look to improve upon the Jaccard similarity measure for sentence suggestion by introducing a new method of identifying similar sentences based on the set cover problem. We compare the performance of this algorithm against several baselines including Word Mover's Distance and the Levenshtein distance.
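To make the first contribution concrete, the following is a minimal sketch of how entropy and KL divergence might be estimated directly from word-embedding samples using k-nearest-neighbor (Kozachenko-Leonenko-style) estimators. This is an illustrative assumption, not the paper's exact estimator; the function names and the hypothetical `embed` helper are placeholders.

```python
# Hypothetical sketch: k-NN estimators for differential entropy and KL divergence
# over word-embedding samples. Illustration only; the paper's estimator may differ.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln


def knn_entropy(X, k=3):
    """Estimate differential entropy of samples X (n x d word embeddings)."""
    n, d = X.shape
    tree = cKDTree(X)
    # Distance to the k-th nearest neighbor, excluding the point itself.
    r = tree.query(X, k=k + 1)[0][:, k]
    # Log-volume of the d-dimensional unit ball.
    log_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_ball + d * np.mean(np.log(r + 1e-12))


def knn_kl_divergence(X, Y, k=3):
    """Estimate KL(P || Q) from samples X ~ P (n x d) and Y ~ Q (m x d)."""
    n, d = X.shape
    m = Y.shape[0]
    rho = cKDTree(X).query(X, k=k + 1)[0][:, k]    # k-NN distance within X
    nu = cKDTree(Y).query(X, k=k)[0][:, k - 1]     # k-NN distance from X to Y
    return d * np.mean(np.log((nu + 1e-12) / (rho + 1e-12))) + np.log(m / (n - 1))


# Usage sketch: compare two categories of text via their word-embedding clouds.
# `embed(docs)` is a hypothetical helper returning an (n, d) array of embeddings.
# kl = knn_kl_divergence(embed(docs_category_a), embed(docs_category_b))
```

Such estimators avoid fitting a parametric density in the latent space, which is one plausible reading of "efficiently estimated"; the comparison across categories then reduces to evaluating the divergence between the corresponding embedding samples.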