Policy learning is a quickly growing area. As robotics and computers come to control more of daily life, their error rates must be minimized and controlled. Many policy learning methods exist, along with the provable error rates that accompany them. We show a regret bound and prove convergence for the Deep Epsilon Greedy method, which selects actions using a neural network's predictions. In experiments on the real-world dataset MNIST, we construct a nonlinear reinforcement learning problem. We observe that, under both high and low noise, some methods converge and some do not, in agreement with our proof of convergence.
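The epsilon-greedy action rule named above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: with probability epsilon a random action is taken, otherwise the action with the highest predicted reward is chosen. The `predicted_rewards` vector stands in for a neural network's output; the function name and the example estimates are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(predicted_rewards, epsilon):
    """Epsilon-greedy selection: explore uniformly at random with
    probability epsilon, otherwise exploit the arm whose predicted
    reward is highest (in Deep Epsilon Greedy, these predictions
    would come from a neural network)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(predicted_rewards)))
    return int(np.argmax(predicted_rewards))

# Hypothetical reward estimates for 3 actions, e.g. a network's output.
estimates = np.array([0.1, 0.7, 0.2])
action = epsilon_greedy_action(estimates, epsilon=0.1)
```

With `epsilon=0` the rule is pure exploitation (always the argmax); as epsilon grows, exploration increases, which is the trade-off the paper's regret analysis quantifies.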