In this paper, we propose a new reinforcement learning (RL) algorithm, called encoding distributional soft actor-critic (E-DSAC), for decision-making in autonomous driving. Unlike existing RL-based decision-making methods, E-DSAC handles a variable number of surrounding vehicles and removes the need for manually pre-designed sorting rules, which improves both policy performance and generality. We first develop an encoding distributional policy iteration (DPI) framework by embedding a permutation-invariant module in the distributional RL framework; this module employs a feature neural network (NN) to encode the indicators of each vehicle. We prove that the proposed encoding DPI framework has important properties in terms of convergence and global optimality. Next, based on the encoding DPI framework, we derive the E-DSAC algorithm by adding a gradient-based update rule for the feature NN to the policy evaluation process of the DSAC algorithm. We then design a multi-lane driving task and a corresponding reward function to verify the effectiveness of the proposed algorithm. Results show that the policy learned by E-DSAC achieves efficient, smooth, and relatively safe autonomous driving in the designed scenario, and that its final policy performance is about three times that of DSAC. Its effectiveness has also been verified in real vehicle experiments.
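The permutation-invariant encoding described above can be illustrated with a minimal sketch. It assumes sum pooling over per-vehicle features and a small two-layer feature network; the abstract does not specify either choice, and the names `init_feature_net`, `feature_net`, and `encode_vehicles` are hypothetical, not taken from the E-DSAC implementation.

```python
import numpy as np

def init_feature_net(in_dim, hidden_dim, out_dim, seed=0):
    """Random weights for a small two-layer feature network (illustrative only)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def feature_net(params, x):
    """Encode each row of x (one surrounding vehicle) independently."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def encode_vehicles(params, vehicles):
    """Map an (n, in_dim) array of vehicle indicators to a fixed-size vector.

    Assumption: the per-vehicle features are combined by summation, which
    makes the result independent of vehicle order and of the count n.
    """
    out_dim = params["b2"].shape[0]
    if vehicles.shape[0] == 0:
        # No surrounding vehicles: return an all-zero encoding.
        return np.zeros(out_dim)
    return feature_net(params, vehicles).sum(axis=0)
```

Because every vehicle passes through the same network and the outputs are summed, reordering the rows of `vehicles` leaves the encoding unchanged up to floating-point rounding, so no sorting rule is needed. The output dimension also stays fixed as the number of vehicles varies, so the encoding could be concatenated with ego-vehicle features and passed to the policy and value networks.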