We extend temporal-difference (TD) learning in order to obtain risk-sensitive, model-free reinforcement learning algorithms. This extension can be regarded as a modification of the Rescorla-Wagner rule, in which the (sigmoidal) stimulus is taken to be the event of either over- or underestimating the TD target. As a result, one obtains a stochastic approximation rule for estimating the free energy from i.i.d. samples generated by a Gaussian distribution with unknown mean and variance. Since the Gaussian free energy is known to be a certainty equivalent sensitive to both the mean and the variance, the learning rule has applications in risk-sensitive decision-making.
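As an illustration, the sketch below shows one stochastic approximation rule of this general kind; the exponential-error form of the update and all parameter values (alpha, beta, mu, sigma) are assumptions for illustration, not the specific rule derived here.

```python
import numpy as np

# Minimal sketch (hypothetical rule and parameters, not the paper's exact update):
# estimate the Gaussian free energy
#     F = -(1/beta) * log E[exp(-beta * X)] = mu - beta * sigma**2 / 2
# from i.i.d. samples via the stochastic approximation
#     F <- F + (alpha / beta) * (1 - exp(-beta * (x - F))),
# whose fixed point satisfies E[exp(-beta * (X - F))] = 1.
rng = np.random.default_rng(seed=0)
mu, sigma, beta = 1.0, 1.0, 0.5   # mean and variance are unknown to the learner
alpha = 0.01                      # learning rate
F = 0.0                           # running free-energy estimate

for _ in range(200_000):
    x = rng.normal(mu, sigma)     # i.i.d. Gaussian sample (the "TD target")
    F += (alpha / beta) * (1.0 - np.exp(-beta * (x - F)))

print(F)   # fluctuates around the closed form mu - beta * sigma**2 / 2 = 0.75
```

The update is asymmetric: errors below the current estimate are weighted more heavily than errors above it, which is why the fixed point is a risk-averse certainty equivalent rather than the mean.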