Finding Nash equilibrial policies for two-player differential games requires solving Hamilton-Jacobi-Isaacs (HJI) PDEs. Recent studies have circumvented the curse of dimensionality in solving such PDEs, with applications to human-robot interaction (HRI), by adopting self-supervised (physics-informed) neural networks as universal value approximators. This paper extends the previous SOTA from zero-sum games with continuous values to general-sum games with discontinuous values, where the discontinuity arises from discontinuities in the players' losses. We show that, lacking a convergence proof and generalization analysis for discontinuous losses, the existing self-supervised learning technique fails to generalize and raises safety concerns in an autonomous driving application. Our solution is to first pre-train the value network on supervised Nash equilibria, and then refine it by minimizing a loss that combines the supervised data with the PDE and boundary conditions. Importantly, the demonstrated advantage of the proposed method over purely supervised and purely self-supervised approaches depends on a careful choice of the neural activation function: among $\texttt{relu}$, $\texttt{sin}$, and $\texttt{tanh}$, we show that $\texttt{tanh}$ is the only choice that achieves both strong generalization and safe performance. We conjecture that $\texttt{tanh}$, like $\texttt{sin}$, preserves continuity of the value and its gradient, which suffices for the convergence of learning, while, like $\texttt{relu}$, remaining expressive enough to approximate discontinuous value landscapes. Lastly, we apply our method to approximating control policies for an incomplete-information interaction and demonstrate its contribution to safe interactions.
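To make the hybrid training scheme concrete, the following is a minimal sketch (not the authors' released code) of the loss described above, assuming PyTorch: a $\texttt{tanh}$ value network is first pre-trained on supervised Nash-equilibrium values (e.g., with `w_pde = w_bc = 0`), then refined with the combined loss. The names `ValueNet`, `hamiltonian`, `terminal_cost`, and `hybrid_loss` are hypothetical, and the Hamiltonian and terminal cost are illustrative placeholders rather than the paper's actual game dynamics.

```python
# Sketch of the hybrid supervised + physics-informed loss; placeholders are
# marked below and must be replaced with the problem-specific quantities.
import torch
import torch.nn as nn


class ValueNet(nn.Module):
    """V_theta(t, x) with tanh activations, the activation found to work best."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, t: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([t, x], dim=-1))


def hamiltonian(x: torch.Tensor, v_x: torch.Tensor) -> torch.Tensor:
    # Placeholder H(x, dV/dx); replace with the game's min-max Hamiltonian.
    return (v_x * x).sum(dim=-1, keepdim=True)


def terminal_cost(x: torch.Tensor) -> torch.Tensor:
    # Placeholder g(x) for the terminal condition V(T, x) = g(x).
    return x.norm(dim=-1, keepdim=True)


def hybrid_loss(model, t, x, v_star, t_T, x_T, w_pde=1.0, w_bc=1.0):
    # 1) Supervised term on precomputed Nash equilibrium values v_star.
    loss_data = ((model(t, x) - v_star) ** 2).mean()

    # 2) HJI residual dV/dt + H(x, dV/dx) = 0 at collocation points.
    t_c = t.detach().clone().requires_grad_(True)
    x_c = x.detach().clone().requires_grad_(True)
    v_c = model(t_c, x_c)
    v_t, v_x = torch.autograd.grad(v_c.sum(), (t_c, x_c), create_graph=True)
    loss_pde = ((v_t + hamiltonian(x_c, v_x)) ** 2).mean()

    # 3) Terminal boundary condition V(T, x) = g(x).
    loss_bc = ((model(t_T, x_T) - terminal_cost(x_T)) ** 2).mean()

    return loss_data + w_pde * loss_pde + w_bc * loss_bc
```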