The combinatorial pure exploration of causal bandits is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we either choose a subset of variables to intervene on or perform no intervention, and then observe the random outcomes of all random variables, with the goal of outputting, in as few rounds as possible, an intervention that yields the best (or almost best) expected outcome on the reward variable $Y$ with probability at least $1-\delta$, where $\delta$ is a given confidence level. We provide the first gap-dependent and fully adaptive pure exploration algorithms for two types of causal models -- the binary generalized linear model (BGLM) and general graphs. For BGLM, our algorithm is the first designed specifically for this setting and achieves polynomial sample complexity, whereas all existing algorithms for general graphs either have sample complexity exponential in the graph size or rely on unreasonable assumptions. For general graphs, our algorithm provides a significant improvement in sample complexity, and it nearly matches the lower bound we prove. Our algorithms achieve this improvement through a novel integration of prior causal bandit algorithms and prior adaptive pure exploration algorithms: the former utilize the rich observational feedback in causal bandits but are not adaptive to reward gaps, while the latter have the opposite issue.
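As one possible formalization of the PAC-style objective described above (the notation $\mu_a$, $a^*$, $\mathcal{A}$, and the tolerance $\varepsilon$ are introduced here for illustration and are not taken verbatim from the abstract), the algorithm outputs an intervention $\hat{a}$ satisfying
\[
\mu_a := \mathbb{E}\big[Y \mid do(a)\big], \qquad a^* := \operatorname*{arg\,max}_{a \in \mathcal{A}} \mu_a, \qquad \Pr\big[\mu_{\hat{a}} \ge \mu_{a^*} - \varepsilon\big] \ge 1-\delta,
\]
where $\mathcal{A}$ denotes the set of allowed interventions (including the empty intervention) and $\varepsilon \ge 0$ captures the "almost best" slack; the exact-best case corresponds to $\varepsilon = 0$.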