This paper identifies and addresses dynamic selection problems that arise in online learning algorithms with endogenous data. In a contextual multi-armed bandit model, we show that a novel bias (self-fulfilling bias) arises because the endogeneity of the data influences the decisions chosen, which in turn affects the distribution of the future data to be collected and analyzed. We propose a class of algorithms that correct for this bias by incorporating instrumental variables into leading online learning algorithms. These algorithms converge to the true parameter values while attaining low (logarithmic-like) regret. We further prove a central limit theorem for statistical inference on the parameters of interest. To establish these theoretical properties, we develop a general technique that untangles the interdependence between data and actions.
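As a rough illustration of why instrumental variables help with endogenous data, the following minimal sketch compares an ordinary least squares estimate with an instrumental-variable estimate in a static, one-dimensional linear model. It is not the bandit algorithm proposed in the paper: the data-generating process, the variances, and the function names `simulate`, `ols_estimate`, and `iv_estimate` are all assumptions made for this example.

```python
import random


def simulate(n, theta, seed=0):
    """Draw n samples from an assumed linear model with an endogenous covariate.

    z is the instrument: it shifts x but is independent of the noise u.
    e is an unobserved confounder that enters both x and u.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)              # instrument
        e = rng.gauss(0.0, 1.0)              # unobserved confounder
        x = z + e                            # endogenous covariate
        u = 0.8 * e + rng.gauss(0.0, 0.5)    # noise correlated with x through e
        y = theta * x + u                    # observed outcome
        rows.append((z, x, y))
    return rows


def ols_estimate(rows):
    """Least squares slope without intercept: sum(x*y) / sum(x*x)."""
    return sum(x * y for _, x, y in rows) / sum(x * x for _, x, _ in rows)


def iv_estimate(rows):
    """Scalar instrumental-variable estimate: sum(z*y) / sum(z*x)."""
    return sum(z * y for z, _, y in rows) / sum(z * x for z, x, _ in rows)


if __name__ == "__main__":
    data = simulate(n=20000, theta=2.0, seed=0)
    print("OLS estimate:", ols_estimate(data))
    print("IV estimate: ", iv_estimate(data))
```

Under these assumed variances, the least squares estimate is biased upward by about cov(x, u) / var(x) = 0.8 / 2 = 0.4 in large samples, while the instrumental-variable estimate stays close to the true value of theta. In the bandit setting studied in the paper, the additional difficulty is that the estimates also drive which actions are taken, so the bias feeds back into the data that are collected next.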