In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain when executed in an environment formalized as a multi-armed bandit. In this paper, we focus on the linear bandit setting with heteroscedastic reward noise. This is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the estimate of the target policy's value. We term this policy-weighted least squares estimation and use this formulation to derive the optimal behavior policy for data collection. We then propose a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the optimal behavior policy, and we derive its regret with respect to the optimal behavior policy. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
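To make the setting concrete, the following is a minimal sketch (not the paper's implementation) of policy-weighted least squares value estimation in a heteroscedastic linear bandit. The arm features `X`, per-arm noise levels `sigmas`, target policy `pi`, and behavior policy `b` are illustrative placeholders; the quantity `phi @ A_inv @ phi` at the end is the variance term that an optimal design over behavior policies would seek to minimize.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 3, 5, 2000

X = rng.normal(size=(K, d))              # arm feature vectors (illustrative)
theta_true = rng.normal(size=d)          # unknown reward parameter
sigmas = rng.uniform(0.1, 1.0, size=K)   # heteroscedastic per-arm noise std
pi = np.full(K, 1.0 / K)                 # target policy (uniform, for illustration)
b = np.full(K, 1.0 / K)                  # behavior policy used to collect data

# Collect data by sampling arms from the behavior policy.
arms = rng.choice(K, size=n, p=b)
rewards = X[arms] @ theta_true + rng.normal(size=n) * sigmas[arms]

# Weighted least squares: weight each sample by its inverse noise variance.
w = 1.0 / sigmas[arms] ** 2
A = (X[arms] * w[:, None]).T @ X[arms]   # sum_t x_t x_t^T / sigma_t^2
c = (X[arms] * w[:, None]).T @ rewards   # sum_t x_t r_t / sigma_t^2
theta_hat = np.linalg.solve(A, c)

# Policy-weighted value estimate of the target policy.
phi = pi @ X                             # sum_a pi(a) x_a
v_hat = phi @ theta_hat
v_true = phi @ theta_true

# Variance of the value estimate, phi^T A^{-1} phi: the term an optimal
# data-collection design over behavior policies would minimize.
var_hat = phi @ np.linalg.solve(A, phi)
print(v_hat, v_true, var_hat)
```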