Researchers often face data fusion problems, where multiple data sources are available, each capturing a distinct subset of variables. While problem formulations typically take the data as given, in practice, data acquisition can be an ongoing process. In this paper, we aim to estimate any functional of a probabilistic model (e.g., a causal effect) as efficiently as possible, by deciding, at each time, which data source to query. We propose online moment selection (OMS), a framework in which structural assumptions are encoded as moment conditions. The optimal action at each step depends, in part, on the very moments that identify the functional of interest. Our algorithms balance exploration with choosing the best action as suggested by current estimates of the moments. We propose two selection strategies: (1) explore-then-commit (OMS-ETC) and (2) explore-then-greedy (OMS-ETG), proving that both achieve zero asymptotic regret as assessed by MSE. We instantiate our setup for average treatment effect estimation, where structural assumptions are given by a causal graph and data sources may include subsets of mediators, confounders, and instrumental variables.
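To make the explore-then-commit idea concrete, here is a minimal toy sketch, not the paper's OMS-ETC algorithm. It assumes a hypothetical setting in which the target is a product `a * b`, source A returns noisy draws with mean `a`, and source B returns noisy draws with mean `b`. Under the delta method, the variance of the product estimate is approximately `b^2 sd_a^2 / n_a + a^2 sd_b^2 / n_b`, so the best split of the query budget depends on the unknown `a` and `b` themselves. This mirrors the abstract's point that the optimal action depends on the moments that identify the target. The names `optimal_fraction`, `explore_then_commit`, `draw_a`, `draw_b`, and `explore_frac` are illustrative assumptions.

```python
import random
import statistics


def optimal_fraction(a, b, sd_a, sd_b):
    """Share of queries for source A that minimizes the delta-method variance
    b^2 sd_a^2 / k + a^2 sd_b^2 / (1 - k) of the product estimate a * b."""
    weight_a = abs(b) * sd_a
    weight_b = abs(a) * sd_b
    if weight_a + weight_b == 0:
        return 0.5  # no information either way: split evenly
    return weight_a / (weight_a + weight_b)


def explore_then_commit(draw_a, draw_b, budget, explore_frac=0.2):
    """Toy explore-then-commit over two hypothetical data sources.

    draw_a / draw_b are zero-argument callables returning one sample each.
    Returns (estimate of a * b, queries sent to A, queries sent to B).
    """
    if budget < 4:
        raise ValueError("budget must be at least 4 to estimate both variances")

    # Explore: split an initial slice of the budget evenly across sources.
    n_explore = min(budget, max(4, int(budget * explore_frac)))
    xs = [draw_a() for _ in range(n_explore // 2)]
    ys = [draw_b() for _ in range(n_explore - n_explore // 2)]

    # Plug current estimates into the allocation rule.
    k_hat = optimal_fraction(
        statistics.mean(xs), statistics.mean(ys),
        statistics.stdev(xs), statistics.stdev(ys),
    )

    # Commit: spend the remaining budget so the overall share of source A
    # is as close to k_hat as the exploration phase still allows.
    remaining = budget - n_explore
    target_a = round(k_hat * budget)
    extra_a = min(max(target_a - len(xs), 0), remaining)
    extra_b = remaining - extra_a
    xs += [draw_a() for _ in range(extra_a)]
    ys += [draw_b() for _ in range(extra_b)]

    return statistics.mean(xs) * statistics.mean(ys), len(xs), len(ys)


if __name__ == "__main__":
    rng = random.Random(0)
    estimate, n_a, n_b = explore_then_commit(
        lambda: rng.gauss(2.0, 4.0),  # hypothetical source A: mean 2, sd 4
        lambda: rng.gauss(1.0, 0.5),  # hypothetical source B: mean 1, sd 0.5
        budget=2000,
    )
    print(estimate, n_a, n_b)
```

In this toy configuration the noisier source A should receive most of the budget, since the variance-minimizing share for A is 0.8 under the true parameters. A greedy variant in the spirit of explore-then-greedy would instead recompute `k_hat` repeatedly as new samples arrive, rather than fixing it once after exploration.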