TakeAD: Preference-based Post-optimization for End-to-end Autonomous Driving with Expert Takeover Data

Existing end-to-end autonomous driving methods typically rely on imitation learning (IL) but face a key challenge: the misalignment between open-loop training and closed-loop deployment. This misalignment often triggers driver-initiated takeovers and system disengagements during closed-loop execution. How to leverage the expert takeover data from these disengagement scenarios to effectively expand the IL policy's capability remains a valuable yet unexplored problem. In this paper, we propose TakeAD, a novel preference-based post-optimization framework that fine-tunes a pre-trained IL policy on this disengagement data to enhance closed-loop driving performance. First, we design an efficient expert takeover data collection pipeline inspired by human takeover mechanisms in real-world autonomous driving systems. Then, the post-optimization framework integrates iterative Dataset Aggregation (DAgger) for imitation learning with Direct Preference Optimization (DPO) for preference alignment. The DAgger stage equips the policy with the fundamental capability to handle disengagement states through direct imitation of expert interventions. Subsequently, the DPO stage refines the policy's behavior to better align with expert preferences in disengagement scenarios. Through multiple iterations, the policy progressively learns recovery strategies for disengagement states, thereby mitigating the gap between open-loop training and closed-loop deployment. Experiments on the closed-loop Bench2Drive benchmark demonstrate our method's effectiveness compared with pure IL methods, and comprehensive ablations confirm the contribution of each component.
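
To make the preference-alignment step concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not taken from the paper. The abstract does not state the exact loss used, so the pairing is an assumption: the expert takeover trajectory is treated as the "chosen" sample and the policy's own trajectory that led to disengagement as the "rejected" sample. The function name `dpo_loss`, its arguments, and the default `beta` are hypothetical and chosen only for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Assumption (not from the paper): "chosen" is the expert takeover
    trajectory and "rejected" is the policy trajectory that triggered
    the disengagement. All inputs are trajectory log-probabilities.
    """
    # How much more the fine-tuned policy favors each trajectory
    # than the frozen reference (pre-trained IL) policy does.
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy has moved toward the expert trajectory and away from its own.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the chosen trajectory relative to the reference, and rises above it when the shift goes the other way.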
