Vision-based Transformers have found wide application in the perception modules of autonomous driving, where they predict accurate 3D bounding boxes owing to their strong capability for modeling long-range dependencies between visual features. However, Transformers, initially designed for language models, have mostly been optimized for accuracy rather than for the inference-time budget. For a safety-critical system like autonomous driving, real-time inference on the on-board computer is an absolute necessity, which places the object detection algorithm under a very tight run-time budget. In this paper, we evaluate a variety of strategies to optimize the inference time of Vision-Transformer-based object detection methods while keeping a close watch on any performance variation. Our chosen metric for these strategies is joint accuracy-runtime optimization. Moreover, for the actual inference-time analysis, we profile our strategies at float32 and float16 precision with TensorRT, the format most commonly used in industry for deploying machine learning networks on edge devices. We show that our strategies improve inference time by 63% at the cost of a mere 3% performance drop for the problem statement defined in the evaluation section. These strategies bring the inference time of Vision Transformer detectors below even that of traditional single-image CNN detectors such as FCOS. We recommend that practitioners use these techniques to deploy hefty Transformer-based multi-view networks on budget-constrained robotic platforms.
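The float32-to-float16 trade-off mentioned above can be illustrated with a minimal sketch. This is not the paper's TensorRT pipeline; it is a generic NumPy illustration, with a random matrix standing in for a detector's weights, of why half precision halves the memory footprint while introducing only a small quantization error:

```python
import numpy as np

# Hypothetical stand-in for a detector's weight tensor (assumption:
# standard-normal values, as real weight magnitudes are typically small).
rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 1024)).astype(np.float32)

# Casting to half precision halves the storage (and bandwidth) cost.
half = weights.astype(np.float16)
print("float32 size (MB):", weights.nbytes / 1e6)
print("float16 size (MB):", half.nbytes / 1e6)

# Round-trip error from quantizing to half precision stays small,
# which is why the accuracy drop under fp16 deployment is usually minor.
err = np.abs(weights - half.astype(np.float32)).max()
print("max abs round-trip error:", err)
```

Deployment stacks such as TensorRT apply the same idea per-layer, with calibration to guard against the reduced dynamic range of float16.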