In a sequential decision-making problem, structural dependencies among the reward distributions associated with the arms make it challenging to identify a subset of alternatives that guarantees the optimal collective outcome. Thus, besides the rewards of individual actions, learning the causal relations is essential to improve the decision-making strategy. To solve this two-fold learning problem, we develop the 'combinatorial semi-bandit framework with causally related rewards', where we model the causal relations by a directed graph in a stationary structural equation model. Each nodal observation in the graph signal comprises the corresponding base arm's instantaneous reward and an additional term resulting from the causal influences of the other base arms' rewards. The objective is to maximize the long-term average payoff, which is a linear function of the base arms' rewards and depends strongly on the network topology. To achieve this objective, we propose a policy that determines the causal relations by learning the network topology and simultaneously exploits this knowledge to optimize the decision-making process. We establish a sublinear regret bound for the proposed algorithm. Numerical experiments on synthetic and real-world datasets demonstrate the superior performance of our method compared to several benchmarks.
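To make the reward model concrete, the following is a minimal sketch of how the nodal observations and the linear payoff could be generated under a stationary structural equation model of the kind described above. The adjacency matrix `A`, the Bernoulli instantaneous rewards, and all variable names here are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: N base arms connected by a weighted directed graph.
# A[i, j] is the assumed weight of the causal edge from arm j to arm i;
# the stationary SEM requires I - A to be invertible.
N = 4
A = np.array([
    [0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.4, 0.0],
])

def nodal_observations(x, A):
    # Stationary structural equation model: z = A @ z + x,
    # so the observed graph signal is z = (I - A)^{-1} @ x.
    return np.linalg.solve(np.eye(len(x)) - A, x)

# One round: instantaneous base-arm rewards drawn from Bernoulli distributions
# whose means are unknown to the learner (an assumption for this sketch).
mu = rng.uniform(0.2, 0.8, size=N)
x = rng.binomial(1, mu).astype(float)

# A super arm is a subset of base arms, encoded as a 0/1 action vector.
action = np.zeros(N)
action[[0, 3]] = 1.0

z = nodal_observations(x, A)
payoff = action @ z  # linear payoff; the topology enters through (I - A)^{-1}
print(f"observed signal {z}, round payoff {payoff:.3f}")
```

In this sketch, the two-fold learning problem corresponds to estimating both the unknown means `mu` and the unknown matrix `A` from the observed signals `z`, since the payoff of any chosen subset depends on both.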