Autonomous driving requires efficient reasoning about the location and appearance of the different agents in the scene, which aids in downstream tasks such as object detection, object tracking, and path planning. The past few years have witnessed a surge in approaches that combine the different task-based modules of the classic self-driving stack into an End-to-End (E2E) trainable learning system. These approaches replace perception, prediction, and sensor fusion modules with a single contiguous module with a shared latent-space embedding, from which one extracts a human-interpretable representation of the scene. One of the most popular representations is the Bird's-eye View (BEV), which expresses the location of different traffic participants in the ego vehicle frame from a top-down view. However, a BEV does not capture the chromatic appearance information of the participants. To overcome this limitation, we propose a novel representation that captures the appearance and occupancy information of the various traffic participants from an array of monocular cameras covering a 360 deg field of view (FOV). We use learned image embeddings of all camera images to generate a BEV of the scene at any instant that captures both the appearance and occupancy of the scene, which can aid in downstream tasks such as object tracking and executing language-based commands. We test the efficacy of our approach on a synthetic dataset generated from CARLA. The code, dataset, and results can be found at https://rebrand.ly/APP OCC-results.
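To make the described pipeline concrete, below is a minimal sketch, not the paper's implementation, of how embeddings from multiple surround-view monocular cameras could be fused into a single BEV grid carrying both an occupancy channel and RGB appearance channels. The module names, layer sizes, and the simple flatten-and-project fusion are illustrative assumptions only.

```python
# Illustrative sketch (assumed architecture, not the authors' model): fuse learned
# embeddings from N surround-view cameras into a BEV grid with occupancy + appearance.
import torch
import torch.nn as nn

class AppearanceOccupancyBEV(nn.Module):
    def __init__(self, num_cams=6, feat_dim=64):
        super().__init__()
        # Shared per-camera image encoder producing a compact latent embedding.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        # Fuse the embeddings of all cameras into one latent BEV feature map.
        self.fuse = nn.Linear(num_cams * feat_dim * 8 * 8, feat_dim * 8 * 8)
        # Decode the fused latent into a 64x64 BEV: 1 occupancy + 3 appearance channels.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 4, 4, stride=2, padding=1),
        )

    def forward(self, images):
        # images: (B, num_cams, 3, H, W), cameras jointly covering the 360 deg FOV.
        b, n, c, h, w = images.shape
        feats = self.encoder(images.view(b * n, c, h, w)).view(b, -1)
        bev_latent = self.fuse(feats).view(b, -1, 8, 8)
        out = self.decoder(bev_latent)               # (B, 4, 64, 64)
        occupancy = torch.sigmoid(out[:, :1])        # per-cell occupancy probability
        appearance = torch.sigmoid(out[:, 1:])       # per-cell RGB appearance
        return occupancy, appearance

# Example usage with random surround-view images.
model = AppearanceOccupancyBEV()
occ, app = model(torch.rand(2, 6, 3, 128, 128))
print(occ.shape, app.shape)  # torch.Size([2, 1, 64, 64]) torch.Size([2, 3, 64, 64])
```

The appearance channels are what distinguish this representation from a standard occupancy-only BEV: each grid cell carries color information about the traffic participant occupying it, which downstream trackers or language-grounded planners could exploit.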