This work examines the behavior of reinforcement learning agents in personalization environments, detailing how policy entropy differs with the type of learning algorithm employed. We demonstrate that Policy Optimization agents tend to converge to low-entropy policies during training, which in practice causes them to prioritize certain actions while avoiding others. Conversely, we show that Q-Learning agents are far less susceptible to this behavior and generally maintain high-entropy policies throughout training, which is often preferable in real-world applications. We provide a wide range of numerical experiments, together with theoretical justification, showing that these differences in entropy stem from the type of learning employed.
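As a minimal illustration of the quantity compared above (this sketch is not from the paper; the example distributions are hypothetical), the entropy of a discrete policy at a given state is the Shannon entropy of its action distribution. A peaked distribution, of the kind a Policy Optimization agent may converge to, has low entropy; a near-uniform distribution, of the kind a Q-Learning agent tends to maintain, has high entropy:

```python
import numpy as np

def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    p = np.asarray(action_probs, dtype=float)
    p = p[p > 0]  # drop zero-probability actions (0 * log 0 = 0 by convention)
    return float(-np.sum(p * np.log(p)))

# Hypothetical peaked policy: mass concentrated on one action -> low entropy.
peaked = [0.94, 0.02, 0.02, 0.02]

# Hypothetical uniform policy: mass spread across actions -> high entropy.
uniform = [0.25, 0.25, 0.25, 0.25]

print(policy_entropy(peaked))   # low entropy
print(policy_entropy(uniform))  # high entropy, log(4) for 4 actions
```

In practice the paper's comparison would track this quantity (averaged over states) across training steps for each algorithm.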