We are concerned with the problem of hyperparameter selection for fitted Q-evaluation (FQE). FQE is one of the state-of-the-art methods for offline policy evaluation (OPE), which is essential to reinforcement learning without environment simulators. However, like other OPE methods, FQE is not itself hyperparameter-free, which undermines its utility in real-life applications. We address this issue by proposing a framework of approximate hyperparameter selection (AHS) for FQE, which defines a notion of optimality (called the selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods, each of which has different characteristics such as distribution-mismatch tolerance and time complexity. We also confirm in experiments that the error bound given by the theory matches empirical observations.
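For concreteness, below is a minimal sketch of the FQE procedure (iteratively regressing a Q-function onto bootstrapped targets under the target policy) for a discrete action space. It is not the authors' implementation; the regressor class, dataset format, discount factor, and iteration count are illustrative assumptions.

```python
# Minimal illustrative sketch of fitted Q-evaluation (FQE), assuming a
# discrete action space and an sklearn-style regressor. Not the authors' code.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def fqe(transitions, target_policy, n_actions, gamma=0.99, n_iters=50):
    """Estimate Q^pi of `target_policy` from logged transitions.

    transitions: list of (state, action, reward, next_state) tuples, where
                 states are 1-D feature vectors and actions are ints.
    target_policy: callable state -> probability vector over actions.
    """
    s = np.array([t[0] for t in transitions])
    a = np.array([t[1] for t in transitions])
    r = np.array([t[2] for t in transitions])
    s_next = np.array([t[3] for t in transitions])

    # Regression features: state concatenated with a one-hot action encoding.
    def featurize(states, actions):
        one_hot = np.eye(n_actions)[actions]
        return np.hstack([states, one_hot])

    x = featurize(s, a)
    q = None  # current Q-function estimate (a fitted regressor)

    for _ in range(n_iters):
        if q is None:
            targets = r  # first iteration: regress on immediate rewards
        else:
            # Bootstrapped target: r + gamma * E_{a'~pi}[Q_k(s', a')]
            pi_next = np.array([target_policy(sn) for sn in s_next])
            q_next = np.column_stack([
                q.predict(featurize(s_next, np.full(len(s_next), act)))
                for act in range(n_actions)
            ])
            targets = r + gamma * np.sum(pi_next * q_next, axis=1)
        q = GradientBoostingRegressor().fit(x, targets)

    # Evaluating q at initial states with actions drawn from the target
    # policy yields the OPE estimate of the policy value.
    return q
```

The regressor class, the number of iterations, and the discount treatment in this sketch are precisely the kind of hyperparameters whose selection the proposed AHS framework is meant to address.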