The causal bandit problem integrates causal inference with multi-armed bandits. Pure exploration of causal bandits is the following online learning task: given a causal graph whose underlying distributions are unknown, in each round we can choose either to intervene on one variable or to do no intervention, and we observe the random outcomes of all random variables. The goal is, using as few rounds as possible, to output an intervention that gives the best (or almost best) expected outcome on the reward variable $Y$ with probability at least $1-\delta$, where $\delta$ is a given confidence level. We provide the first gap-dependent, fully adaptive pure exploration algorithms for three types of causal models: parallel graphs, general graphs with a small number of backdoor parents, and binary generalized linear models. Our algorithms improve on both prior causal bandit algorithms, which are not adaptive to reward gaps, and prior adaptive pure exploration algorithms, which do not utilize the special features of causal bandits.
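To make the interaction protocol concrete, the following is a minimal sketch of the pure-exploration setting on a parallel graph, where the $X_i$ are independent parents of the reward $Y$. The environment class `ParallelGraphEnv` and the uniform-sampling baseline are illustrative assumptions only; they are not the gap-dependent adaptive algorithms of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class ParallelGraphEnv:
    """Hypothetical parallel-graph model: X_i ~ Bernoulli(p_i) independently,
    and the binary reward Y has mean w @ X (kept in [0, 1])."""
    def __init__(self, n):
        self.n = n
        self.p = rng.uniform(0.2, 0.8, size=n)       # P(X_i = 1)
        self.w = rng.uniform(0.0, 1.0 / n, size=n)   # reward weights

    def step(self, intervention=None):
        """intervention: None (pure observation) or (i, v) meaning do(X_i = v).
        Returns the outcomes of all variables plus the reward Y."""
        x = (rng.random(self.n) < self.p).astype(float)
        if intervention is not None:
            i, v = intervention
            x[i] = v
        y = float(rng.random() < self.w @ x)
        return x, y

def best_intervention_uniform(env, rounds_per_arm=2000):
    """Naive non-adaptive baseline: sample every do(X_i = v) equally often and
    return the empirically best intervention (no adaptivity to reward gaps)."""
    arms = [(i, v) for i in range(env.n) for v in (0, 1)]
    means = {}
    for arm in arms:
        ys = [env.step(arm)[1] for _ in range(rounds_per_arm)]
        means[arm] = np.mean(ys)
    return max(means, key=means.get), means

if __name__ == "__main__":
    env = ParallelGraphEnv(n=5)
    arm, means = best_intervention_uniform(env)
    print("chosen intervention do(X_%d = %d), est. mean %.3f" % (arm[0], arm[1], means[arm]))
```

A gap-dependent adaptive algorithm would instead allocate more samples to interventions whose empirical means are close to the best one, so that the total number of rounds scales with the reward gaps rather than with a uniform worst-case budget.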