Efficient processing of high-resolution video streams is safety-critical for many robotics applications such as autonomous driving. To maintain real-time performance, many practical systems downsample the video stream, but this can hurt downstream tasks such as (small) object detection. Instead, we take inspiration from biological vision systems, which allocate more foveal "pixels" to salient parts of the scene. We introduce FOVEA, an approach for intelligent downsampling that ensures salient image regions remain "magnified" in the downsampled output. Given a high-resolution image, FOVEA applies a differentiable resampling layer to produce a small fixed-size image canvas, which is processed by a differentiable vision module (e.g., an object detection network), whose output is then differentiably backward-mapped onto the original image coordinates. The key idea is to resample such that background pixels make room for salient pixels of interest. To keep the overall pipeline efficient, FOVEA relies on cheap and readily available saliency cues, including dataset-specific spatial priors and temporal priors computed from object predictions in the recent past. On the autonomous driving datasets Argoverse-HD and BDD100K, our proposed method boosts detection AP over standard Faster R-CNN, both with and without fine-tuning. With no noticeable increase in compute, we more than double accuracy on small objects without degrading performance on large objects. Finally, FOVEA sets a new record for streaming AP (from 17.8 to 23.0 on a GTX 1080 Ti GPU), a metric designed to capture both accuracy and latency.
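The pipeline described above hinges on a saliency-guided, differentiable image warp. Below is a minimal PyTorch-style sketch of such a warp, assuming a Gaussian-attraction formulation in the spirit of saliency-sampler layers; the function `saliency_to_grid`, its parameters (e.g., `sigma`), and the toy saliency map are illustrative assumptions, not the paper's exact resampling layer or inverse box mapping.

```python
import torch
import torch.nn.functional as F

def saliency_to_grid(saliency, out_h, out_w, sigma=0.3):
    """Turn a coarse saliency map into a sampling grid that allocates more
    output pixels to salient regions (illustrative approximation only).

    saliency: (B, 1, h, w) non-negative weights; higher = more magnified.
    Returns a grid of shape (B, out_h, out_w, 2) in [-1, 1] for grid_sample.
    """
    B, _, h, w = saliency.shape
    device = saliency.device

    # Normalized coordinates of the coarse saliency grid (source locations).
    ys = torch.linspace(-1, 1, h, device=device)
    xs = torch.linspace(-1, 1, w, device=device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")            # (h, w)

    # Normalized coordinates of the output (warped canvas) pixels.
    oy = torch.linspace(-1, 1, out_h, device=device)
    ox = torch.linspace(-1, 1, out_w, device=device)
    out_y, out_x = torch.meshgrid(oy, ox, indexing="ij")      # (out_h, out_w)

    # Gaussian attraction: each output pixel samples a saliency-weighted
    # average of nearby source locations, so salient regions end up covering
    # more output pixels (i.e., they are "magnified").
    s = saliency.view(B, 1, 1, h * w)                          # (B, 1, 1, hw)
    dy = out_y.view(1, out_h, out_w, 1) - gy.view(1, 1, 1, -1)
    dx = out_x.view(1, out_h, out_w, 1) - gx.view(1, 1, 1, -1)
    k = torch.exp(-(dx ** 2 + dy ** 2) / (2 * sigma ** 2))     # (1, oh, ow, hw)
    w_k = s * k                                                 # (B, oh, ow, hw)
    denom = w_k.sum(-1) + 1e-6
    src_x = (w_k * gx.view(1, 1, 1, -1)).sum(-1) / denom        # (B, oh, ow)
    src_y = (w_k * gy.view(1, 1, 1, -1)).sum(-1) / denom
    return torch.stack([src_x, src_y], dim=-1)                  # (B, oh, ow, 2)

# Usage: warp a high-res frame into a small fixed-size canvas, run the
# detector on the canvas, then map predicted boxes back through the
# (known) grid to original-image coordinates.
frame = torch.rand(1, 3, 600, 960)                              # high-res input (toy size)
saliency = torch.ones(1, 1, 15, 24)                             # e.g., spatial/temporal prior
saliency[:, :, 5:10, 8:16] += 4.0                               # emphasize a salient region
grid = saliency_to_grid(saliency, out_h=150, out_w=240)
canvas = F.grid_sample(frame, grid, align_corners=True)         # (1, 3, 150, 240)
# detections = detector(canvas); boxes are then inverse-mapped via `grid`.
```

Because the sampling grid is known, predictions made on the warped canvas can be mapped back to original-image coordinates by inverting the same grid, which is what allows the whole pipeline to remain differentiable end to end.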