We study the problem of set discovery where given a few example tuples of a desired set, we want to find the set in a collection of sets. A challenge is that the example tuples may not uniquely identify a set, and a large number of candidate sets may be returned. Our focus is on interactive exploration to set discovery where additional example tuples from the candidate sets are shown and the user either accepts or rejects them as members of the target set. The goal is to find the target set with the least number of user interactions. The problem can be cast as an optimization problem where we want to find a decision tree that can guide the search to the target set with the least number of questions to be answered by the user. We propose a general algorithm that is capable of reaching an optimal solution and two variations of it that strike a balance between the quality of a solution and the running time. We also propose a novel pruning strategy that safely reduces the search space without introducing false negatives. We evaluate the efficiency and the effectiveness of our algorithms through an extensive experimental study using both real and synthetic datasets and comparing them to previous approaches in the literature. We show that our pruning strategy reduces the running time of the search algorithms by 2-5 orders of magnitude.
翻译:我们研究设定发现的问题, 给一个理想的数据集几个示例, 我们想要在一组集合中找到一组。 挑战在于, 示例tumples可能不会独有地识别一组数据集, 并且可能返回大量候选数据集。 我们的重点是互动探索, 以在显示候选数据集中更多示例图, 用户接受或拒绝它们作为设定目标成员的情况下设定发现 。 目标是找到设定的目标, 用户互动次数最少 。 问题可以作为一个优化问题出现。 我们想找到一个决策树, 指导对一组目标的搜索, 其回答的问题最少。 我们提出一个总算算算法, 能够达成最佳解决方案的质量与运行时间之间的平衡。 我们还提出一个新的调整战略, 安全地减少搜索空间, 而不会引入虚假的负值 。 我们通过使用真实和合成的数据集进行广泛的实验研究来评估我们的算法的效率和效果, 并将它们与先前的文献方法进行比较。 我们提出一个总算法, 通过运行时间序列2 来降低搜索的规模 。