Training vision-based autonomous driving systems in the real world can be inefficient and impractical. Vehicle simulation makes it possible to learn in a virtual world and then transfer the acquired skills to real-world scenarios. Certain features, such as the relative distance to road edges and to other vehicles over time, are consistent across virtual and real visual domains. These visual elements are intuitively crucial for human decision making during driving. We hypothesize that these spatio-temporal factors can also be used in transfer learning to improve generalization across domains. First, we propose a CNN+LSTM transfer learning framework to extract the spatio-temporal features representing vehicle dynamics from scenes. Next, we conduct an ablation study to quantitatively estimate the significance of various features in the decisions of driving systems. We observe that physically interpretable factors are highly correlated with network decisions, while representational differences between scenes are not. Finally, based on the results of our ablation study, we propose a transfer learning pipeline that uses saliency maps and physical features extracted from a source model to enhance the performance of a target model. Training of our network is initialized with learned weights from the CNN and LSTM latent features, which capture the intrinsic physics of the moving vehicle with respect to its surroundings and are transferred from one domain to another. Our experiments show that the proposed transfer learning framework generalizes better to unseen domains than a baseline CNN model on a binary classification task.
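The idea behind the ablation study can be illustrated with a minimal sketch. The abstract does not specify the ablation procedure, so the code below swaps in a generic mean-ablation on a toy linear classifier: each feature in turn is replaced by its dataset mean, and the resulting drop in accuracy is taken as a rough measure of how much the decision depends on that feature. All names, weights, and data values (`ablation_drops`, `SAMPLES`, `WEIGHTS`, the two hypothetical features) are assumptions made for illustration and are not taken from the paper.

```python
# Toy mean-ablation study (illustrative only; not the paper's model or data).


def predict(weights, bias, features):
    """Return a binary decision from a fixed linear classifier."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score >= 0 else 0


def accuracy(weights, bias, samples, labels):
    """Fraction of samples whose predicted decision matches the label."""
    correct = sum(predict(weights, bias, s) == y for s, y in zip(samples, labels))
    return correct / len(samples)


def ablation_drops(weights, bias, samples, labels):
    """Accuracy drop when each feature in turn is replaced by its dataset mean."""
    baseline = accuracy(weights, bias, samples, labels)
    n_features = len(samples[0])
    drops = []
    for i in range(n_features):
        mean_i = sum(s[i] for s in samples) / len(samples)
        # Overwrite feature i with its mean so it carries no per-sample information.
        ablated = [s[:i] + [mean_i] + s[i + 1:] for s in samples]
        drops.append(baseline - accuracy(weights, bias, ablated, labels))
    return drops


# Hypothetical per-frame features:
#   [signed relative distance to the road edge, scene style score]
# The first stands in for a physically interpretable factor, the second
# for a representational difference between scenes.
SAMPLES = [
    [-0.8, 0.5],
    [-0.4, -0.9],
    [-0.2, 0.7],
    [0.3, -0.6],
    [0.6, 0.9],
    [0.9, -0.3],
]
LABELS = [0, 0, 0, 1, 1, 1]  # hypothetical binary driving decision
WEIGHTS = [3.0, 0.1]         # assumed, hand-picked weights (no training here)
BIAS = 0.0

if __name__ == "__main__":
    names = ["road-edge distance", "scene style"]
    for name, drop in zip(names, ablation_drops(WEIGHTS, BIAS, SAMPLES, LABELS)):
        print(f"{name}: accuracy drop {drop:.2f}")
```

On this toy data, ablating the distance feature lowers accuracy by 0.50, while ablating the style feature leaves it unchanged. This mirrors, in a deliberately simplified form, the kind of contrast the abstract reports between physically interpretable factors and representational differences between scenes.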