Many perception systems in mobile computing, autonomous navigation, and AR/VR face strict compute constraints that are particularly challenging for high-resolution input images. Previous works propose nonuniform downsamplers that "learn to zoom" in on salient image regions, reducing compute while retaining task-relevant image information. However, for tasks with spatial labels (such as 2D/3D object detection and semantic segmentation), such deformations may harm performance. In this work, which we call LZU, we "learn to zoom" in on the input image, compute spatial features, and then "unzoom" to revert any deformations. To enable efficient and differentiable unzooming, we approximate the zooming warp with a piecewise bilinear mapping that is invertible. LZU can be applied to any task with 2D spatial input and any model with 2D spatial features, and we demonstrate this versatility by evaluating on a variety of tasks and datasets: object detection on Argoverse-HD, semantic segmentation on Cityscapes, and monocular 3D object detection on nuScenes. Interestingly, we observe boosts in performance even when high-resolution sensor data is unavailable, implying that LZU can be used to "learn to upsample" as well.
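To make the zoom/unzoom idea concrete, the following is a minimal 1D sketch of an invertible piecewise-linear warp (the 2D piecewise bilinear case factors similarly along each axis). It is an illustrative assumption, not the paper's exact formulation: a saliency profile is converted into a monotone sampling map so that output samples land more densely in salient regions, and the same map is inverted exactly by linear interpolation to "unzoom". The function names `separable_zoom_map` and `unzoom` are hypothetical.

```python
import numpy as np

def separable_zoom_map(saliency_1d, n_out):
    """Build a monotone piecewise-linear sampling map from a 1D saliency profile.

    Returns output coordinates u in [0, 1] and input coordinates t(u):
    sampling the input at t(u) places more samples where saliency is high.
    (Illustrative sketch only; the actual LZU warp is 2D piecewise bilinear.)
    """
    w = np.asarray(saliency_1d, dtype=float) + 1e-3  # keep density strictly positive
    cdf = np.concatenate([[0.0], np.cumsum(w)])
    cdf /= cdf[-1]                                   # monotone map [0, 1] -> [0, 1]
    x = np.linspace(0.0, 1.0, len(cdf))
    u = np.linspace(0.0, 1.0, n_out)
    t = np.interp(u, cdf, x)                         # invert the CDF: dense where w is large
    return u, t

def unzoom(u, t, queries):
    """Invert the monotone piecewise-linear map t(u) by linear interpolation."""
    return np.interp(queries, t, u)
```

Because the map is monotone and piecewise linear, its inverse is again piecewise linear and cheap to evaluate, which is what makes the unzoom step both efficient and differentiable.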