This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model to minimize the cumulative regret with respect to the best intervention in hindsight. This is naturally posed as a causal bandit problem. The focus is on causal bandits for linear structural equation models (SEMs) and soft interventions. The graph's structure is assumed to be known, and the graph has $N$ nodes. For each node, two linear mechanisms are assumed, one observational and one under a soft intervention, giving rise to $2^N$ possible interventions. Existing causal bandit algorithms assume that at least the interventional distributions of the reward node's parents are fully specified. However, there are $2^N$ such distributions (one corresponding to each intervention), and acquiring them becomes prohibitive even in moderate-sized graphs. This paper dispenses with the assumption of knowing these distributions. Two algorithms are proposed, one for the frequentist (UCB-based) setting and one for the Bayesian (Thompson Sampling-based) setting. The key idea of both algorithms is to avoid directly estimating the $2^N$ reward distributions and instead to estimate the parameters that fully specify the SEMs (linear in $N$) and use them to compute the rewards. Under boundedness assumptions on the noise and the parameter space, the cumulative regret of both algorithms scales as $\tilde{\cal O} ((2d)^L L \sqrt{T})$, where $d$ is the graph's maximum degree and $L$ is the length of its longest causal path.
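To make the key idea concrete, the following is a minimal sketch (not the authors' implementation) of how the expected reward of any of the $2^N$ interventions can be computed from the SEM parameters alone, rather than from separately estimated interventional distributions. The names `B_obs`, `B_int`, and `nu` (a noise-mean vector), as well as the example weights, are illustrative assumptions; the actual UCB- and Thompson-Sampling-based algorithms would additionally maintain confidence sets or posterior samples over these parameters.

```python
import numpy as np

def expected_reward(B_obs, B_int, nu, a, reward_node=-1):
    """Expected value of the reward node under soft-intervention pattern `a`.

    B_obs, B_int : (N, N) observational / interventional edge-weight matrices
                   of the linear SEM (entry [i, j] = weight of edge i -> j);
                   both are hypothetical parameter estimates.
    nu           : (N,) mean of the exogenous noise (an assumption here).
    a            : (N,) boolean vector, a[j] = True if node j is intervened on.
    """
    N = B_obs.shape[0]
    # Column j of the effective weight matrix comes from the interventional
    # mechanism iff node j is soft-intervened; otherwise it is observational.
    B_a = np.where(a[np.newaxis, :], B_int, B_obs)
    # Linear SEM: X = B_a^T X + eps  =>  E[X] = (I - B_a^T)^{-1} E[eps],
    # which is well defined for a DAG.
    mean_X = np.linalg.solve(np.eye(N) - B_a.T, nu)
    return mean_X[reward_node]

# Example (hypothetical numbers): 3-node chain 1 -> 2 -> 3, node 2 intervened.
N = 3
B_obs = np.array([[0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.8],
                  [0.0, 0.0, 0.0]])
B_int = 2.0 * B_obs                     # assumed post-intervention weights
nu = np.ones(N)                         # assumed noise means
a = np.array([False, True, False])
print(expected_reward(B_obs, B_int, nu, a))   # reward at the last node
```

The point of the sketch is that only the $2 N$ mechanism parameters (plus noise statistics) are stored and estimated; the reward of every intervention pattern is then derived by recombining the corresponding columns, rather than by estimating $2^N$ reward distributions directly.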