Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes, and data marketplaces creates an opportunity to augment data and boost the performance of those tasks. However, augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing solutions do not leverage the synergy between discovery and augmentation, and therefore underexploit the available data. In this paper, we introduce METAM, a novel goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process. To select candidates efficiently, METAM leverages properties of i) the data, ii) the utility function, and iii) the solution set size. We establish theoretical guarantees for METAM and validate them empirically on a broad set of tasks. Overall, we demonstrate the promise of goal-oriented data discovery for modern data science applications.