This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model to minimize cumulative regret with respect to the best intervention in hindsight. This is naturally posed as a causal bandit problem. The focus is on causal bandits for linear structural equation models (SEMs) and soft interventions. The graph's structure is assumed to be known and to have $N$ nodes. Each node is assumed to have two linear mechanisms, one observational and one governing a soft intervention, giving rise to $2^N$ possible interventions. The majority of existing causal bandit algorithms assume that at least the interventional distributions of the reward node's parents are fully specified. However, there are $2^N$ such distributions (one per intervention), and acquiring all of them becomes prohibitive even in moderate-sized graphs. This paper dispenses with the assumption of knowing these distributions or their marginals. Two algorithms are proposed, one for the frequentist (UCB-based) setting and one for the Bayesian (Thompson sampling-based) setting. The key idea of both algorithms is to avoid directly estimating the $2^N$ reward distributions and instead estimate the parameters that fully specify the SEMs (whose number is linear in $N$) and use them to compute the rewards. For both algorithms, under boundedness assumptions on the noise and the parameter space, the cumulative regret scales as $\tilde{\cal O} (d^{L+\frac{1}{2}} \sqrt{NT})$, where $d$ is the graph's maximum degree and $L$ is the length of its longest causal path. Additionally, a minimax lower bound of $\Omega(d^{\frac{L}{2}-2}\sqrt{T})$ is presented, which suggests that the achievable and lower bounds conform in their scaling behavior with respect to the horizon $T$ and the graph parameters $d$ and $L$.
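To make the key idea concrete, below is a minimal sketch (not the paper's algorithm) of how the expected reward of any of the $2^N$ interventions can be computed by forward propagation through estimated SEM parameters, so the per-intervention reward distributions never need to be estimated. The names (`B_obs`, `B_int`, `nu`, `action`), the topological ordering of nodes, the nonzero noise means, and the choice of the last node as the reward node are all illustrative assumptions.

```python
import numpy as np

def expected_reward(B_obs, B_int, nu, action):
    """Expected value of the reward node (assumed here to be node N-1, a sink)
    under a soft intervention, computed from linear SEM parameters.

    B_obs, B_int : (N, N) weight matrices; column i holds node i's
                   observational / interventional incoming edge weights.
                   Nodes are assumed topologically ordered, so both
                   matrices are strictly upper-triangular.
    nu           : (N,) vector of noise means (assumed nonzero here so
                   that interventions change the expected reward).
    action       : (N,) binary vector; action[i] = 1 selects node i's
                   interventional mechanism, 0 its observational one.
    """
    action = np.asarray(action)
    N = len(nu)
    # Per-node mechanism selection: column i of B_a is node i's active weights.
    B_a = np.where(action[None, :] == 1, B_int, B_obs)
    # Node means satisfy mu = B_a^T mu + nu; with a topological ordering this
    # triangular system is solved by a single forward pass.
    mu = np.zeros(N)
    for i in range(N):
        mu[i] = B_a[:, i] @ mu + nu[i]
    return mu[-1]  # expected value of the reward node

# Example: 3-node chain 0 -> 1 -> 2, reward node 2, intervening on nodes 1 and 2.
B_obs = np.array([[0., 0.5, 0.], [0., 0., 0.5], [0., 0., 0.]])
B_int = np.array([[0., 1.0, 0.], [0., 0., 1.0], [0., 0., 0.]])
nu = np.array([1.0, 0.0, 0.0])
print(expected_reward(B_obs, B_int, nu, [0, 1, 1]))  # 1.0
```

Under these assumptions, only the $2N$ weight columns and the noise means need to be learned; every intervention's expected reward then follows from this single forward pass, which is what lets the proposed algorithms sidestep the $2^N$ interventional distributions.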