Recent studies show that machine intelligence-enabled systems are vulnerable to test cases arising from either adversarial manipulation or natural distribution shifts. This has raised serious concerns about deploying machine learning algorithms in real-world applications, especially in safety-critical domains such as autonomous driving (AD). At the same time, traditional AD testing on naturalistic scenarios requires hundreds of millions of driving miles, because safety-critical scenarios are high-dimensional and rare in the real world. As a result, several approaches to AD evaluation have been explored, but they are usually based on different simulation platforms, types of safety-critical scenarios, scenario generation algorithms, and driving route variations. Thus, despite a large amount of effort in AD testing, it remains challenging to compare and understand the effectiveness and efficiency of different testing scenario generation algorithms and testing mechanisms under similar conditions. In this paper, we aim to provide the first unified platform, SafeBench, which integrates different types of safety-critical testing scenarios, scenario generation algorithms, and other variations such as driving routes and environments. In addition, we implement 4 deep reinforcement learning-based AD algorithms with 4 types of input (e.g., bird's-eye view, camera) to perform fair comparisons on SafeBench. We find that our generated testing scenarios are indeed more challenging, and we observe a trade-off between the performance of AD agents under benign and safety-critical testing scenarios. We believe that SafeBench, as a unified platform for large-scale and effective autonomous driving testing, will motivate the development of new testing scenario generation and safe AD algorithms. SafeBench is available at https://safebench.github.io.