The Word Mover's Distance (WMD) measures the semantic dissimilarity between two text documents by computing the cost of optimally moving all words of a source/query document to the most similar words of a target document. Computing WMD between two documents is costly because it requires solving an $O(V^3 \log V)$ optimization problem, where $V$ is the number of unique words in the document. Fortunately, WMD can be framed as an Earth Mover's Distance (EMD), for which the algorithmic complexity can be reduced to $O(V^2)$ by adding an entropy penalty to the optimization problem and solving it with the Sinkhorn-Knopp algorithm. Additionally, the computation can be made highly parallel by adopting a batching approach, i.e., computing the WMD of a single query document against multiple target documents at once. Sinkhorn WMD is a key kernel in many ML/NLP applications and is usually implemented in Python. However, a straightforward Python implementation may leave significant performance on the table, even though it may internally call optimized C++ BLAS routines. We present a new sparse Parallel Algorithm for Sinkhorn-Knopp Word-mover's Distance that computes the semantic distance of one document to many other documents by adopting the $O(V^2)$ EMD algorithm. We algorithmically transform the dense, compute-heavy $O(V^2)$ EMD version into an equivalent sparse one using new fused SDDMM-SpMM (sparse selection of dense-dense matrix multiplication, followed by sparse-dense matrix multiplication) kernels. We implemented and optimized this algorithm for two very different architectures: the new Intel Programmable Integrated Unified Memory Architecture (PIUMA) and Intel Xeon CPUs. We show that our implementations reach close to peak performance on both platforms.
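The dense-to-sparse transformation described above can be illustrated with a small NumPy sketch. This is not the authors' optimized kernel: the function names, the default values of `lam` and `n_iter`, and the toy data are assumptions made for illustration, and plain NumPy indexing stands in for the fused SDDMM-SpMM kernels. The sketch assumes a query histogram `r` over its own words, a matrix `c` whose columns are target-document histograms over the vocabulary, and a ground-cost matrix `M` between query words and vocabulary words. The sparse variant evaluates the product `K.T @ u` only at positions where a word actually occurs in a target document, which is the SDDMM idea, and then multiplies by the resulting sparse matrix, which is the SpMM idea.

```python
import numpy as np


def sinkhorn_wmd_dense(r, c, M, lam=1.0, n_iter=15):
    """Batched Sinkhorn WMD of one query against N targets (dense form).

    r: (v_r,) query histogram, strictly positive, sums to 1.
    c: (V, N) target histograms, one column per document.
    M: (v_r, V) ground cost between query words and vocabulary words.
    lam, n_iter: illustrative defaults, not values taken from the paper.
    """
    K = np.exp(-lam * M)
    K_over_r = K / r[:, None]
    x = np.full((r.size, c.shape[1]), 1.0 / r.size)
    for _ in range(n_iter):
        u = 1.0 / x
        v = c / (K.T @ u)      # dense (V, N) product, mostly wasted work
        x = K_over_r @ v
    u = 1.0 / x
    v = c / (K.T @ u)
    return np.sum(u * ((K * M) @ v), axis=0)


def sinkhorn_wmd_sparse(r, c, M, lam=1.0, n_iter=15):
    """Same iteration, touching only the nonzero entries of c."""
    K = np.exp(-lam * M)
    K_over_r = K / r[:, None]
    rows, cols = np.nonzero(c)  # word rows[k] occurs in document cols[k]
    vals = c[rows, cols]
    n_docs = c.shape[1]

    def sddmm(u):
        # (K.T @ u)[i, j] evaluated only where c[i, j] != 0.
        return np.einsum("kn,kn->n", K[:, rows], u[:, cols])

    def spmm(A, v_vals):
        # A @ v, where v is sparse with pattern (rows, cols).
        out = np.zeros((A.shape[0], n_docs))
        np.add.at(out.T, cols, (A[:, rows] * v_vals).T)
        return out

    x = np.full((r.size, n_docs), 1.0 / r.size)
    for _ in range(n_iter):
        u = 1.0 / x
        x = spmm(K_over_r, vals / sddmm(u))
    u = 1.0 / x
    return np.sum(u * spmm(K * M, vals / sddmm(u)), axis=0)


def toy_problem(seed=0, V=40, N=6, d=8, v_r=5):
    """Random embeddings and histograms, purely hypothetical test data."""
    rng = np.random.default_rng(seed)
    E = rng.normal(size=(V, d))                  # word embeddings
    sel = rng.choice(V, size=v_r, replace=False)  # words in the query
    r = rng.random(v_r) + 0.1
    r /= r.sum()
    c = rng.random((V, N)) * (rng.random((V, N)) < 0.2)
    c[rng.integers(V, size=N), np.arange(N)] += 0.5  # no empty document
    c /= c.sum(axis=0)
    M = np.linalg.norm(E[sel][:, None, :] - E[None, :, :], axis=2)
    return r, c, M
```

Calling both functions on `toy_problem()` should give matching distance vectors of length `N`, since the two versions perform the same arithmetic and differ only in skipping entries where `c` is zero. The saving grows with vocabulary size, because each target document typically contains only a small fraction of the vocabulary.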