Fuzzy similarity join is an important database operator widely used in practice. So far, the research community has focused exclusively on optimizing fuzzy-join \textit{scalability}. However, practitioners today also struggle to optimize fuzzy-join \textit{quality}: they face a daunting space of parameters (e.g., distance functions, distance thresholds, and tokenization options) and often have to resort to manual trial and error to program these parameters. This key challenge of automatically generating high-quality fuzzy-join programs has received surprisingly little attention thus far. In this work, we study the problem of "auto-programming" fuzzy joins. Leveraging a geometric interpretation of distance functions, we develop an unsupervised \textsc{Auto-FuzzyJoin} framework that can infer suitable fuzzy-join programs on given input tables, without requiring explicit human input such as labeled training data. Using \textsc{Auto-FuzzyJoin}, users only need to provide two input tables $L$ and $R$ and a desired precision target $\tau$ (say 0.9). \textsc{Auto-FuzzyJoin} leverages the fact that one of the inputs is a reference table to automatically program fuzzy joins that meet the precision target $\tau$ in expectation, while maximizing fuzzy-join recall (defined as the number of correctly joined records). Experiments on both existing benchmarks and a new benchmark with 50 fuzzy-join tasks created from Wikipedia data suggest that the proposed \textsc{Auto-FuzzyJoin} significantly outperforms existing unsupervised approaches, and is surprisingly competitive even against supervised approaches (e.g., Magellan and DeepMatcher) when 50\% of ground-truth labels are used as training data.
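To make the notion of a fuzzy-join "program" concrete, the following is a minimal sketch of a single hand-configured fuzzy join, not the \textsc{Auto-FuzzyJoin} algorithm itself. It assumes one particular configuration (word tokenization, Jaccard distance, and a fixed threshold); the function names, the toy tables, and the ground-truth pairs are all hypothetical and chosen only for illustration. It shows how the threshold alone trades precision against recall, which is the kind of choice the framework aims to make automatically.

```python
from typing import Callable, List, Set, Tuple


def word_tokens(s: str) -> Set[str]:
    """Hypothetical tokenization option: lowercase, whitespace-separated words."""
    return set(s.lower().split())


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Jaccard distance between two token sets (0.0 means identical sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def fuzzy_join(
    left: List[str],
    right: List[str],
    tokenize: Callable[[str], Set[str]],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Join each record of `right` to its closest record in the reference
    table `left`, keeping the pair only if the distance is <= threshold."""
    left_tokens = [tokenize(l) for l in left]
    pairs = []
    for r in right:
        r_tok = tokenize(r)
        dists = [jaccard_distance(lt, r_tok) for lt in left_tokens]
        best = min(range(len(left)), key=dists.__getitem__)
        if dists[best] <= threshold:
            pairs.append((left[best], r))
    return pairs


def evaluate(pairs, truth) -> Tuple[float, int]:
    """Return (precision, recall), with recall counted as the number of
    correctly joined records, following the definition in the abstract."""
    correct = sum(1 for p in pairs if p in truth)
    precision = correct / len(pairs) if pairs else 1.0
    return precision, correct


# Toy data (assumed for illustration only).
left_table = ["2012 Olympic Games", "University of Washington", "Microsoft Research"]
right_table = ["Olympic Games 2012", "Univ. of Washington", "Google Research"]
ground_truth = {
    ("2012 Olympic Games", "Olympic Games 2012"),
    ("University of Washington", "Univ. of Washington"),
}

for threshold in (0.3, 0.5, 0.7):
    joined = fuzzy_join(left_table, right_table, word_tokens, threshold)
    precision, recall = evaluate(joined, ground_truth)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall}")
```

On this toy input, loosening the threshold first recovers an additional correct pair and then admits a false positive ("Google Research" joined to "Microsoft Research"), so a precision target $\tau$ effectively bounds how loose the program may be. A different tokenizer or distance function can be passed in the same way, which is what makes the search space large in practice.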