Are All Data Necessary? Efficient Data Pruning for Large-scale Autonomous Driving Dataset via Trajectory Entropy Maximization

Collecting large-scale naturalistic driving data is essential for training robust autonomous driving planners. However, real-world datasets often contain many repetitive, low-value samples, which inflate storage costs while contributing little to policy learning. To address this issue, we propose an information-theoretic data pruning method that reduces the training data volume without compromising model performance. Our approach evaluates the information entropy of the trajectory distribution of the driving data and iteratively selects high-value samples that preserve the statistical characteristics of the original dataset in a model-agnostic manner. From a theoretical perspective, we show that maximizing trajectory entropy constrains the Kullback-Leibler divergence between the pruned subset and the original data distribution, thereby maintaining generalization ability. Comprehensive experiments on the NuPlan benchmark with a large-scale imitation learning framework demonstrate that the proposed method can reduce the dataset size by up to 40% while maintaining closed-loop performance. This work provides a lightweight and theoretically grounded approach to scalable data management and efficient policy learning in autonomous driving systems.
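
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of what greedy entropy-driven pruning could look like, not the authors' algorithm. It assumes a hypothetical descriptor (`trajectory_bin`, the grid cell containing a trajectory's final point), a hypothetical cell size, and a histogram estimate of the Shannon entropy; none of these choices come from the paper.

```python
import math
from collections import Counter, defaultdict


def trajectory_bin(traj, cell=5.0):
    """Hypothetical descriptor: grid cell containing the trajectory's last (x, y) point."""
    x, y = traj[-1]
    return (math.floor(x / cell), math.floor(y / cell))


def histogram_entropy(bins):
    """Shannon entropy (in nats) of the empirical distribution over bin labels."""
    counts = Counter(bins)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def greedy_entropy_prune(trajs, keep, cell=5.0):
    """Select `keep` trajectory indices, one at a time, maximizing histogram entropy."""
    if not 0 <= keep <= len(trajs):
        raise ValueError("keep must be between 0 and len(trajs)")
    # Group the indices of the trajectories by their bin.
    pools = defaultdict(list)
    for i, traj in enumerate(trajs):
        pools[trajectory_bin(traj, cell)].append(i)
    counts = {b: 0 for b in pools}
    selected = []
    while len(selected) < keep:
        # Since n*log(n) is convex, adding a sample to the currently least-filled
        # bin yields the largest entropy of the updated histogram, so the greedy
        # step reduces to picking that bin among those with samples left.
        best = min((b for b in pools if pools[b]), key=lambda b: counts[b])
        selected.append(pools[best].pop())
        counts[best] += 1
    return sorted(selected)


# Toy data (assumed): 90 near-identical trajectories plus 10 rarer ones.
trajs = [[(0.0, 0.0), (1.0, 1.0)]] * 90 + [
    [(0.0, 0.0), (10.0 * k, 0.0)] for k in range(1, 11)
]
kept = greedy_entropy_prune(trajs, keep=60)  # drop 40% of the samples
full_entropy = histogram_entropy([trajectory_bin(t) for t in trajs])
kept_entropy = histogram_entropy([trajectory_bin(trajs[i]) for i in kept])
```

On this toy set the pruned subset keeps every rare trajectory and discards only repeats of the common one, so its histogram entropy is higher than that of the full set. Note that this sketch only pushes the entropy up; how the paper balances that against staying close to the original distribution is not described in the abstract.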

