Collecting large-scale naturalistic driving data is essential for training robust autonomous driving planners. However, real-world datasets often contain a substantial amount of repetitive, low-value samples, which inflate storage costs while contributing little to policy learning. To address this issue, we propose an information-theoretic data pruning method that effectively reduces the training data volume without compromising model performance. Our approach estimates the information entropy of the trajectory distribution in the driving data and, in a model-agnostic manner, iteratively selects high-value samples that preserve the statistical characteristics of the original dataset. From a theoretical perspective, we show that maximizing trajectory entropy constrains the Kullback-Leibler divergence between the pruned subset and the original data distribution, thereby preserving generalization ability. Comprehensive experiments on the NuPlan benchmark with a large-scale imitation learning framework demonstrate that the proposed method reduces the dataset size by up to 40% while maintaining closed-loop performance. This work provides a lightweight and theoretically grounded approach to scalable data management and efficient policy learning in autonomous driving systems.
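To make the selection procedure concrete, the following is a minimal sketch of greedy entropy-maximizing pruning, not the paper's actual implementation. It assumes each driving sample has already been discretized into a trajectory bin (e.g., via clustering of ego-trajectory features); the function names `greedy_entropy_pruning`, `trajectory_entropy`, and the parameters `bin_ids` and `keep_ratio` are hypothetical and introduced only for illustration.

```python
import numpy as np


def trajectory_entropy(counts):
    """Shannon entropy (in nats) of a discretized trajectory distribution."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))


def greedy_entropy_pruning(bin_ids, keep_ratio=0.6, seed=0):
    """Iteratively pick samples whose trajectory bin most increases the
    entropy of the selected subset's trajectory distribution.

    bin_ids:    array of shape (N,), the discretized trajectory bin of each
                sample (e.g., from clustering ego trajectories).
    keep_ratio: fraction of the dataset to retain (0.6 corresponds to a
                40% reduction, as reported in the abstract).
    """
    rng = np.random.default_rng(seed)
    n_bins = int(bin_ids.max()) + 1
    n_keep = int(len(bin_ids) * keep_ratio)

    # Candidate indices per bin, shuffled so ties are broken randomly.
    pools = [list(rng.permutation(np.where(bin_ids == b)[0])) for b in range(n_bins)]
    counts = np.zeros(n_bins)
    selected = []

    for _ in range(n_keep):
        base = trajectory_entropy(counts) if counts.sum() > 0 else 0.0
        best_bin, best_gain = -1, -np.inf
        # Evaluate the entropy gain of adding one sample from each bin.
        for b in range(n_bins):
            if not pools[b]:
                continue
            counts[b] += 1
            gain = trajectory_entropy(counts) - base
            counts[b] -= 1
            if gain > best_gain:
                best_gain, best_bin = gain, b
        counts[best_bin] += 1
        selected.append(pools[best_bin].pop())

    return np.array(selected)
```

In this sketch the greedy step simply balances the bin counts of the retained subset, which is one straightforward way to keep the pruned trajectory distribution close (in KL divergence) to a coverage-preserving target; the paper's actual selection criterion and trajectory featurization may differ.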