A core operation in data discovery is finding tables that can be joined with a given table. Real-world tables include both unary and n-ary join keys. However, existing table discovery systems are optimized for unary joins and are ineffective and slow in the presence of n-ary keys. In this paper, we introduce MATE, a table discovery system that leverages a novel hash-based index to enable n-ary join discovery through a space-efficient super key. We design a filtering layer that uses a novel hash function, XASH. This hash function encodes the syntactic features of all column values and aggregates them into a super key, which allows the system to efficiently prune tables with non-joinable rows. Our join discovery system prunes up to 1000x more false positives and achieves over 60x faster table discovery than the state of the art.
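To make the super-key idea concrete, here is a minimal sketch of row-level filtering with an aggregated bit mask. It is not XASH: the per-value hash below is a generic Bloom-filter-style stand-in built on SHA-256, whereas the abstract states that XASH encodes syntactic features of the values. The names `value_hash`, `super_key`, `may_contain`, and `candidate_rows`, the 128-bit width, and the choice of three bits per value are all assumptions made for illustration.

```python
import hashlib

BITS = 128  # assumed width of the per-row super key


def value_hash(value: str, bits: int = BITS, k: int = 3) -> int:
    """Map one cell value to a bit mask with at most k bits set.

    Generic Bloom-style stand-in; the paper's XASH is a different function.
    """
    mask = 0
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % bits
        mask |= 1 << position
    return mask


def super_key(row, bits: int = BITS) -> int:
    """Aggregate the hashes of all cell values in a row with bitwise OR."""
    key = 0
    for value in row:
        key |= value_hash(str(value), bits)
    return key


def may_contain(row_key: int, query_values, bits: int = BITS) -> bool:
    """Return False only if the row cannot contain every query value.

    True means "possibly joinable": false positives can occur,
    false negatives cannot.
    """
    query_key = super_key(query_values, bits)
    return (row_key & query_key) == query_key


def candidate_rows(table, query_values):
    """Keep the rows whose super key passes the filter for an n-ary key."""
    return [row for row in table if may_contain(super_key(row), query_values)]


table = [
    ["Berlin", "Germany", 2021],
    ["Paris", "France", 2021],
]
# The first row contains both query values, so it always survives the filter.
survivors = candidate_rows(table, ["Berlin", "Germany"])
```

In a real index the super keys would be computed once and stored alongside the rows rather than recomputed per query, and the surviving rows would still need an exact comparison, since the filter only rules rows out.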