The word mover's distance (WMD) is a fundamental technique for measuring the similarity of two documents. The crux of WMD is that it takes advantage of the underlying geometry of the word space by employing an optimal transport formulation. The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by significant margins on various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performance of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we employ appropriate preprocessing, i.e., L1 normalization. In addition, we introduce an analogy between WMD and L1-normalized BOW and find that not only the performance of WMD but also its distance values resemble those of BOW in high-dimensional spaces.
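To make the preprocessing step concrete, the following is a minimal sketch of L1-normalized BOW, not the evaluation code of the paper. The function names `l1_normalized_bow` and `l1_distance`, the toy documents, and the choice of the L1 distance for comparing vectors are all assumptions made for illustration.

```python
from collections import Counter


def l1_normalized_bow(tokens, vocabulary):
    """Return a BOW vector over `vocabulary` whose entries sum to 1.

    Hypothetical helper: word counts are divided by the total count,
    so each document becomes a probability distribution over words.
    """
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary)
    if total == 0:
        # No in-vocabulary words: return the zero vector.
        return [0.0 for _ in vocabulary]
    return [counts[w] / total for w in vocabulary]


def l1_distance(u, v):
    """Sum of absolute coordinate differences (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Toy documents (illustrative only).
doc_a = "obama speaks to the media".split()
doc_b = doc_a * 2  # same content, twice as long
doc_c = "the band gave a concert".split()

vocabulary = sorted(set(doc_a) | set(doc_b) | set(doc_c))

# Raw counts are sensitive to document length.
raw_a = [Counter(doc_a)[w] for w in vocabulary]
raw_b = [Counter(doc_b)[w] for w in vocabulary]
raw_dist_ab = l1_distance(raw_a, raw_b)

# After L1 normalization, length no longer matters.
bow_a = l1_normalized_bow(doc_a, vocabulary)
bow_b = l1_normalized_bow(doc_b, vocabulary)
bow_c = l1_normalized_bow(doc_c, vocabulary)
norm_dist_ab = l1_distance(bow_a, bow_b)
norm_dist_ac = l1_distance(bow_a, bow_c)
```

In this sketch, repeating a document changes its raw count vector but not its normalized vector, so `norm_dist_ab` is zero while `raw_dist_ab` is not. Because WMD also treats each document as a normalized distribution over words, comparing it against unnormalized BOW mixes the effect of normalization with the effect of optimal transport.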