A common form of MapReduce application involves discovering relationships between certain pairs of inputs. Similarity joins are a good example of this type of problem, which we call a "some-pairs" problem. In the framework of Afrati et al. (VLDB 2013), algorithms are measured by the tradeoff between reducer size (the maximum number of inputs a reducer can handle) and replication rate (the average number of reducers to which an input must be sent). There are two obvious approaches to solving some-pairs problems in general. We show that, in the worst case, no general-purpose MapReduce algorithm can beat both of these algorithms. We then explore a recursive algorithm for solving some-pairs problems, along with heuristics for beating the lower bound on common instances of the some-pairs class of problems.
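To make the two measures concrete, here is a minimal sketch in Python. It is an illustration only, not code from the paper. It assumes a some-pairs instance is given as a list of required input pairs and that an algorithm is described by a hypothetical `assignment` dictionary mapping each reducer to the set of inputs sent to it. The function names and the toy data are assumptions made for this example.

```python
from itertools import combinations


def reducer_size(assignment):
    """Largest number of inputs received by any single reducer."""
    return max((len(inputs) for inputs in assignment.values()), default=0)


def replication_rate(assignment, num_inputs):
    """Average number of reducers to which each input is sent."""
    total_copies = sum(len(inputs) for inputs in assignment.values())
    return total_copies / num_inputs


def covers_all_pairs(assignment, required_pairs):
    """True if every required pair meets at some reducer."""
    covered = set()
    for inputs in assignment.values():
        # Every unordered pair of inputs at the same reducer can be compared there.
        covered.update(frozenset(pair) for pair in combinations(inputs, 2))
    return all(frozenset(pair) in covered for pair in required_pairs)


# Toy instance (hypothetical): four inputs, three pairs that must be compared.
required_pairs = [("a", "b"), ("b", "c"), ("c", "d")]
assignment = {0: {"a", "b", "c"}, 1: {"c", "d"}}

print(reducer_size(assignment))                      # 3
print(replication_rate(assignment, num_inputs=4))    # 1.25
print(covers_all_pairs(assignment, required_pairs))  # True
```

In this toy assignment only input `c` is sent to two reducers, so five copies are spread over four inputs and the replication rate is 1.25. Allowing larger reducers generally lets an algorithm send fewer copies of each input, which is the tradeoff the framework measures.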