Policy learning is a quickly growing area. As robotics and computers come to control more of day-to-day life, their error rates must be controlled and minimized. Many policy learning and bandit methods come with provable error rates. We prove a regret upper bound and convergence for the Deep Epsilon Greedy method, which selects actions using a neural network's predictions. We also show that the Epsilon Greedy method's regret upper bound is minimized with cube-root exploration. In experiments on the real-world dataset MNIST, we construct a nonlinear reinforcement learning problem. Under both high and low noise, some methods converge and others do not, in agreement with our proof of convergence.
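As a rough illustration of the action-selection rule summarized above, the sketch below shows an epsilon-greedy step that exploits a neural network's per-arm reward predictions and explores with the cube-root schedule ε_t = t^(-1/3). The function name, the generic `model.predict` interface, and the argument names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def deep_epsilon_greedy_action(model, context, t, n_arms, rng):
    """Minimal sketch of a Deep Epsilon Greedy action-selection step.

    With probability eps_t = t^(-1/3) (cube-root exploration), pick a
    uniformly random arm; otherwise pick the arm whose reward the
    network predicts to be highest for the given context.
    Assumes `model.predict(context)` returns one predicted reward per arm.
    """
    eps_t = t ** (-1.0 / 3.0)           # cube-root exploration schedule
    if rng.random() < eps_t:
        return int(rng.integers(n_arms))  # explore: uniform random arm
    preds = model.predict(context)        # predicted reward for each arm
    return int(np.argmax(preds))          # exploit: greedy arm
```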