Learning good interventions in a causal graph can be modelled as a stochastic multi-armed bandit problem with side-information. First, we study this problem when interventions are more expensive than observations and a budget is specified. If there are no backdoor paths from an intervenable node to the reward node, we propose an algorithm to minimize simple regret that optimally trades off observations and interventions based on the cost of intervention. We also propose an algorithm that accounts for the cost of interventions, utilizes causal side-information, and minimizes the expected cumulative regret without exceeding the budget. Our cumulative-regret minimization algorithm outperforms standard algorithms that do not take side-information into account. Finally, we study the problem of learning the best interventions in general graphs without a budget constraint, and give an algorithm that achieves constant expected cumulative regret in terms of the instance parameters when the parent distribution of the reward variable under each intervention is known. Our results are experimentally validated and compared to the best-known bounds in the current literature.
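To make the budgeted setting concrete, here is a minimal Python sketch of a toy causal bandit in which observations cost one unit and interventions cost k units. The independent-cause "no-backdoor" model, the fixed 50/50 budget split, and the least-sampled refinement rule are illustrative assumptions, not the paper's algorithm or its optimal allocation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "no-backdoor" model: independent binary causes X_1..X_n and a reward
# Y ~ Bernoulli(w . x).  With independent causes there is no backdoor path,
# so the observational mean E[Y | X_i = 1] equals the interventional mean
# E[Y | do(X_i = 1)]: cheap observations carry information about interventions.
n = 5
q = rng.uniform(0.1, 0.9, size=n)      # P(X_i = 1) in the observational regime
w = rng.dirichlet(np.ones(n)) * 0.9    # reward weights, sum < 1
mu = w + (w @ q - w * q)               # E[Y | do(X_i = 1)] for every arm i

def sample_y(x):
    return int(rng.random() < w @ x)

def observe():
    """One observational sample (x, y); costs 1 unit of budget."""
    x = (rng.random(n) < q).astype(int)
    return x, sample_y(x)

def intervene(i):
    """One sample of Y under do(X_i = 1); costs k units of budget."""
    x = (rng.random(n) < q).astype(int)
    x[i] = 1
    return sample_y(x)

B, k = 300.0, 10.0   # total budget; interventions are k times as expensive

# Phase 1: spend half the budget on cheap observations.  (Choosing this
# split as a function of k is the observation/intervention trade-off the
# abstract refers to; 50/50 is an illustrative placeholder.)
counts, hits, spent = np.zeros(n), np.zeros(n), 0.0
while spent + 1 <= B / 2:
    x, y = observe()
    counts += x
    hits += y * x
    spent += 1

# Phase 2: spend the rest on interventions, refining the arms with the
# fewest samples, then recommend the best arm (simple-regret objective).
while spent + k <= B:
    i = int(np.argmin(counts))
    hits[i] += intervene(i)
    counts[i] += 1
    spent += k

est = hits / np.maximum(counts, 1)
best = int(np.argmax(est))
print(f"recommend do(X_{best}=1); simple regret = {mu.max() - mu[best]:.3f}")
```

Running the sketch prints a recommended intervention and its simple regret; the substance of the paper lies in choosing the observation/intervention split optimally as a function of the intervention cost k, which this placeholder split does not attempt.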