A default assumption in reinforcement learning and optimal control is that experience arrives at discrete time points on a fixed clock cycle. Many applications, however, involve continuous systems where the time discretization is not fixed but instead can be managed by a learning algorithm. By analyzing Monte-Carlo value estimation for LQR systems in both finite-horizon and infinite-horizon settings, we uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently with respect to time discretization, which implies that there is an optimal choice for the temporal resolution that depends on the data budget. These findings show how adapting the temporal resolution can provably improve value estimation quality in LQR systems from finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and several non-linear environments.
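To make the trade-off concrete, below is a minimal sketch (not the paper's implementation) of Monte-Carlo value estimation for a one-dimensional finite-horizon LQR system under a fixed data budget. All quantities here are hypothetical choices for illustration: the dynamics and cost parameters `a, b, q, r`, the policy gain `k`, the horizon, the cost-noise level `sigma`, and the budget. Coarsening the step size `delta` increases discretization (approximation) error, while refining it leaves fewer rollouts per budget and inflates statistical error, so the estimation error is typically minimized at an intermediate resolution.

```python
# Hypothetical illustration of the approximation/statistical trade-off in
# Monte-Carlo value estimation for a 1-D finite-horizon LQR system.
import numpy as np

rng = np.random.default_rng(0)
a, b, q, r = -0.5, 1.0, 1.0, 0.1    # assumed continuous-time dynamics/cost parameters
k = 0.8                             # fixed linear policy u = -k * x
x0, horizon, sigma = 1.0, 5.0, 0.3  # initial state, horizon T, cost-noise std


def mc_value_estimate(delta, budget):
    """Monte-Carlo estimate of the finite-horizon value at x0.

    `delta` is the time discretization; `budget` is the total number of
    noisy reward observations allowed, so the number of rollouts shrinks
    as delta shrinks (each rollout needs more steps).
    """
    steps = int(horizon / delta)
    n_rollouts = max(1, budget // steps)
    returns = []
    for _ in range(n_rollouts):
        x, total = x0, 0.0
        for _ in range(steps):
            u = -k * x
            # Euler discretization of dx/dt = a x + b u: its bias grows with
            # delta (approximation error), while the per-step cost noise
            # accumulates over longer rollouts (statistical error).
            cost = (q * x**2 + r * u**2) * delta + sigma * np.sqrt(delta) * rng.normal()
            total += cost
            x = x + (a * x + b * u) * delta
        returns.append(total)
    return np.mean(returns)


def reference_value(fine_delta=1e-4):
    """Near-exact value from a very fine, noise-free deterministic rollout."""
    steps = int(horizon / fine_delta)
    x, total = x0, 0.0
    for _ in range(steps):
        u = -k * x
        total += (q * x**2 + r * u**2) * fine_delta
        x = x + (a * x + b * u) * fine_delta
    return total


v_ref = reference_value()
for delta in [1.0, 0.5, 0.1, 0.05, 0.01]:
    errs = [abs(mc_value_estimate(delta, budget=2_000) - v_ref) for _ in range(50)]
    print(f"delta={delta:5.2f}  mean abs error={np.mean(errs):.4f}")
```

Under these assumptions, very coarse steps are dominated by discretization bias and very fine steps by Monte-Carlo variance, so the best `delta` depends on the budget, mirroring the abstract's claim.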