We introduce a generic framework that reduces the computational cost of object detection while retaining accuracy for scenarios where objects of varied sizes appear in high-resolution images. Detection progresses in a coarse-to-fine manner, first on a down-sampled version of the image and then on a sequence of higher-resolution regions identified as likely to improve detection accuracy. Built upon reinforcement learning, our approach consists of a model (R-net) that uses coarse detection results to predict the potential accuracy gain of analyzing a region at a higher resolution, and another model (Q-net) that sequentially selects regions to zoom in on. Experiments on the Caltech Pedestrians dataset show that our approach reduces the number of processed pixels by over 50% without a drop in detection accuracy. The merits of our approach become more significant on a high-resolution test set collected from the YFCC100M dataset, where our approach maintains high detection performance while reducing the number of processed pixels by about 70% and the detection time by over 50%.
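The coarse-to-fine loop described above can be sketched as follows. This is a minimal illustration only: the function names (`coarse_detect`, `predict_gain`, `select_regions`) are hypothetical stand-ins, and the learned R-net and Q-net are replaced by simple heuristics (uncertainty-based gain, greedy selection under a pixel budget), not the paper's trained models.

```python
def coarse_detect(image_lowres):
    # Placeholder for a detector run on the down-sampled image:
    # returns (region, confidence) pairs over a fixed grid of regions.
    # Region format: (x, y, width, height).
    return [((x, y, 64, 64), 0.5) for x in (0, 64) for y in (0, 64)]

def predict_gain(region, coarse_score):
    # Stand-in for the R-net: estimate the accuracy gain of re-analyzing
    # `region` at full resolution. Here we simply assume uncertain coarse
    # detections (confidence near 0.5) benefit the most from zooming in.
    return 1.0 - abs(coarse_score - 0.5) * 2.0

def select_regions(scored_regions, pixel_budget):
    # Stand-in for the Q-net: sequentially pick the highest-gain region
    # until the pixel budget is exhausted. The real Q-net learns this
    # selection policy with reinforcement learning.
    chosen = []
    for (x, y, w, h), gain in sorted(scored_regions, key=lambda c: -c[1]):
        if w * h <= pixel_budget:
            chosen.append((x, y, w, h))
            pixel_budget -= w * h
    return chosen

def zoom_in_detect(image_lowres, pixel_budget):
    # Coarse pass, gain prediction, then sequential region selection;
    # a fine detector would then run only on the chosen regions.
    coarse = coarse_detect(image_lowres)
    scored = [(region, predict_gain(region, score)) for region, score in coarse]
    return select_regions(scored, pixel_budget)

# With a budget of two 64x64 regions, only two zoom-ins are selected,
# so the remaining pixels are never processed at full resolution.
regions = zoom_in_detect(None, pixel_budget=2 * 64 * 64)
```

The pixel budget makes the cost/accuracy trade-off explicit: tightening it reduces the number of processed pixels, which is how the abstract's 50-70% pixel reductions are obtained in spirit, though the actual method learns which regions to select rather than using a fixed heuristic.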