Data discovery systems help users identify relevant data among large table collections. Users express their discovery needs with a program or a set of keywords. Programs can express complex queries, but writing them requires expertise. Keyword search is accessible to a larger audience but limits the types of queries supported. A promising alternative is learned discovery systems, which find tables given natural language questions. Unfortunately, these systems require a training dataset for each table collection, and because collecting training data is expensive, their adoption is limited. In this paper, we introduce a self-supervised approach to assemble training datasets and train learned discovery systems without human intervention. Doing so requires addressing several challenges, including the design of self-supervised strategies for data discovery, table representation strategies to feed to the models, and relevance models that work well with the synthetically generated questions. We combine all the above contributions into a system, S2LD, that solves the problem end to end. The evaluation results demonstrate that the new techniques outperform state-of-the-art approaches on well-known benchmarks. All in all, the technique is a stepping stone towards building learned discovery systems. The code is open-sourced at https://github.com/TheDataStation/open_table_discovery.
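As a rough illustration of what assembling training data without human labels could look like, the sketch below samples cells from a table and turns them into template questions, each paired with the table it came from as the positive example. This is not the S2LD pipeline: the table layout, the function name `synthesize_pairs`, and the question template are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical table layout (an assumption, not the format used by S2LD):
# {"id": str, "title": str, "columns": [str, ...], "rows": [[value, ...], ...]}
Table = Dict[str, object]


def synthesize_pairs(table: Table, n: int, rng: random.Random) -> List[Tuple[str, str]]:
    """Build n synthetic (question, table_id) pairs from one table.

    Each question is produced from a fixed template over one sampled row and
    two distinct sampled columns. The source table is the positive label.
    """
    columns = table["columns"]
    rows = table["rows"]
    # A question needs a key column and a target column, plus at least one row.
    if len(columns) < 2 or not rows:
        return []

    pairs = []
    for _ in range(n):
        row = rng.choice(rows)
        key_idx, target_idx = rng.sample(range(len(columns)), 2)
        question = (
            f"What is the {columns[target_idx]} where "
            f"{columns[key_idx]} is {row[key_idx]} in {table['title']}?"
        )
        pairs.append((question, table["id"]))
    return pairs


# Usage: a toy table with made-up values.
demo_table = {
    "id": "t1",
    "title": "city parks",
    "columns": ["park", "district", "area_ha"],
    "rows": [["Elm Park", "North", 12], ["Oak Green", "South", 7]],
}
pairs = synthesize_pairs(demo_table, n=3, rng=random.Random(0))
for question, table_id in pairs:
    print(table_id, "|", question)
```

Pairs of this kind could then serve as positives when training a relevance model, with questions generated from other tables acting as negatives. Template questions are far less varied than real user questions, which is one reason the relevance model has to cope with synthetically generated text.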