Integrated and Adaptive Guidance and Control for Endoatmospheric Missiles via Reinforcement Learning

From arXiv. This is version 2, in which performance improved substantially owing to a bug fix in the seeker code. This version also examines performance with an infrared (IR) seeker.

We apply the meta reinforcement learning framework to optimize an integrated and adaptive guidance and flight control system for an air-to-air missile, implementing the system as a deep recurrent neural network (the policy). The policy maps observations directly to commanded rates of change for the missile's control surface deflections. The observations are derived with minimal processing from three sources: the computationally stabilized line-of-sight unit vector measured by a strapdown radar seeker, the rotational velocity estimated from rate gyros, and the control surface deflection angles. The system induces intercept trajectories against a maneuvering target that satisfy control constraints on fin deflection angles and path constraints on look angle and load. We test the optimized system in a six-degree-of-freedom simulator that includes a non-linear radome model and a strapdown seeker model. Through extensive simulation, we demonstrate that the system can adapt to a large flight envelope and to off-nominal flight conditions that include perturbation of aerodynamic coefficient parameters and center of pressure locations. Moreover, we find that the system is robust to the parasitic attitude loop induced by radome refraction, imperfect seeker stabilization, and sensor scale factor errors. We also compare our system's performance to a longitudinal model of proportional navigation coupled with a three-loop autopilot, and find that our system outperforms the benchmark by a large margin. Additional experiments investigate the impact of removing the recurrent layer from the policy and value function networks, and performance with an infrared seeker.
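For readers unfamiliar with the benchmark guidance law, the following is a minimal sketch of textbook planar proportional navigation, in which the commanded lateral acceleration is the navigation gain times the closing speed times the line-of-sight rotation rate. It is not the paper's benchmark implementation, which couples the guidance law to a three-loop autopilot in a longitudinal model. The function name `pn_accel_command`, the navigation gain of 3, and the example geometry are all assumptions chosen for illustration.

```python
import math


def pn_accel_command(r_rel, v_rel, nav_gain=3.0):
    """Planar proportional navigation: a_cmd = N * Vc * lambda_dot.

    r_rel: target position minus missile position, (x, y) in metres.
    v_rel: target velocity minus missile velocity, (x, y) in m/s.
    nav_gain: navigation gain N (3.0 is an assumed, illustrative value).
    Returns the commanded acceleration normal to the line of sight, in m/s^2.
    """
    rx, ry = r_rel
    vx, vy = v_rel
    rng = math.hypot(rx, ry)
    if rng == 0.0:
        raise ValueError("zero range: line of sight is undefined")
    # Line-of-sight rotation rate in rad/s (2-D cross product over range squared).
    los_rate = (rx * vy - ry * vx) / (rng * rng)
    # Closing speed is the negative of the range rate.
    closing_speed = -(rx * vx + ry * vy) / rng
    return nav_gain * closing_speed * los_rate


# Hypothetical geometry: target 1 km ahead, closing at 300 m/s, drifting at 10 m/s.
a_cmd = pn_accel_command((1000.0, 0.0), (-300.0, 10.0))
```

With this geometry the line-of-sight rate is 0.01 rad/s and the closing speed is 300 m/s, so the command is about 9 m/s². The learned policy described above differs in kind: it outputs commanded rates of change for the fin deflections directly rather than an acceleration command passed to a separate autopilot.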
