The Word Mover's Distance (WMD) is a metric that measures the semantic dissimilarity between two text documents by computing the minimum cost of moving all words of a source (query) document to the most similar words of a target document. Computing the WMD between two documents is expensive because it requires solving an optimization problem that costs \(O(V^3 \log V)\), where \(V\) is the number of unique words in the document. Fortunately, the WMD can be framed as the Earth Mover's Distance (EMD), also known as the Optimal Transportation Distance, for which it has been shown that the algorithmic complexity can be reduced to \(O(V^2)\) by adding an entropy penalty to the optimization problem. A similar idea can be adapted to compute the WMD efficiently. Additionally, the computation can be made highly parallel by computing the WMD of a single query document against multiple target documents at once (e.g., checking whether a given tweet is similar to any other tweet posted on the same day). In this paper, we present a shared-memory parallel Sinkhorn-Knopp algorithm that computes the WMD of one document against many other documents by adopting the \(O(V^2)\) EMD algorithm. We used algorithmic transformations to change the original dense, compute-heavy kernel into a sparse compute kernel and obtained a \(67\times\) speedup on \(96\) cores of a state-of-the-art Intel\textregistered{} 4-socket Cascade Lake machine relative to its sequential run. Our parallel algorithm is over \(700\times\) faster than a naive parallel Python implementation that internally uses optimized matrix library calls.
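To make the entropy-regularized formulation concrete, the following is a minimal, dependency-free sketch of Sinkhorn-Knopp scaling applied to one query document against several target documents. It is an illustration under stated assumptions, not the implementation described in this paper: the function names (`sinkhorn_wmd`, `one_to_many`), the regularization strength `reg`, the fixed iteration count `n_iter`, the Euclidean word-to-word cost, and the toy two-dimensional embeddings are all hypothetical choices made for the example. Documents are stored as sparse dictionaries mapping a word index to its normalized weight, so each target only touches the words it actually contains.

```python
import math


def sinkhorn_wmd(query, target, embeddings, reg=10.0, n_iter=50):
    """Approximate the WMD between two sparse word histograms.

    query, target: dicts {word_index: weight}, weights sum to 1 (assumed).
    embeddings:    sequence of word vectors indexed by word_index (assumed).
    reg:           entropy-regularization strength (hypothetical default).
    """
    q_words = sorted(query)
    t_words = sorted(target)  # only words present in the target are visited
    r = [query[i] for i in q_words]
    c = [target[j] for j in t_words]

    # Pairwise Euclidean cost between query words and target words.
    cost = [[math.dist(embeddings[i], embeddings[j]) for j in t_words]
            for i in q_words]
    # Gibbs kernel induced by the entropy penalty.
    kernel = [[math.exp(-reg * d) for d in row] for row in cost]

    m, n = len(r), len(c)
    v = [1.0] * n
    for _ in range(n_iter):
        # Row scaling: make the plan's row sums match the query weights.
        u = [r[i] / sum(kernel[i][j] * v[j] for j in range(n))
             for i in range(m)]
        # Column scaling: make the column sums match the target weights.
        v = [c[j] / sum(kernel[i][j] * u[i] for i in range(m))
             for j in range(n)]

    # Cost of the scaled transport plan diag(u) * kernel * diag(v).
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(m) for j in range(n))


def one_to_many(query, targets, embeddings, **kwargs):
    """Score one query against many targets; each target is independent."""
    return [sinkhorn_wmd(query, t, embeddings, **kwargs) for t in targets]


# Toy data (hypothetical): four words embedded in the plane.
embeddings = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (1.0, 3.0)]
query = {0: 0.5, 1: 0.5}
targets = [
    {0: 0.5, 1: 0.5},  # same words as the query
    {2: 0.5, 3: 0.5},  # words far from the query
    {1: 1.0},          # shares one word with the query
]
distances = one_to_many(query, targets, embeddings)
```

Because the targets are scored independently, the loop in `one_to_many` is the natural place to distribute work across cores. The result is an entropy-smoothed approximation, so it approaches the exact transport cost only as `reg` grows, and very large values of `reg` can underflow the kernel in floating point.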