The human eye contains two types of photoreceptors: rods and cones. Rods are responsible for monochrome vision and cones for color vision. The number of rods is much higher than that of cones, which suggests that most human visual processing is done in monochrome. An event camera reports changes in pixel intensity and is therefore analogous to rods; event and color cameras in computer vision correspond to rods and cones in human vision. Humans can notice objects moving in their peripheral vision (far right and left) but cannot classify them (think of someone passing by on your far left or right: this can draw your attention without your knowing who they are). Thus, rods act as a region proposal network (RPN) in human vision, and an event camera can play the same role in deep learning. Two-stage object detectors such as Mask R-CNN consist of a backbone for feature extraction and an RPN. Currently, the RPN uses a brute-force approach, scoring candidate bounding boxes across the whole image to detect objects. Generating region proposals this way takes considerable computation time, making two-stage detectors inconvenient for fast applications. This work replaces the RPN in Detectron2's Mask R-CNN with an event camera that generates proposals for moving objects, saving time and reducing computation. The proposed approach is faster than standard two-stage detectors while achieving comparable accuracy.
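To make the idea concrete, below is a minimal sketch (not the authors' implementation) of how event-camera output could stand in for the RPN: events accumulated over a short time window are thresholded into a binary activity map, connected components of that map become box proposals, and the proposals are wrapped in Detectron2 `Instances` with the fields (`proposal_boxes`, `objectness_logits`) that Mask R-CNN's ROI heads expect. The event format (pixel coordinate arrays), the use of SciPy connected components, the `min_pixels` noise filter, and the function name `events_to_proposals` are all illustrative assumptions.

```python
import numpy as np
import torch
from scipy import ndimage
from detectron2.structures import Boxes, Instances


def events_to_proposals(xs, ys, height, width, min_pixels=20):
    """Turn one time window of events (pixel coordinates xs, ys) into box proposals."""
    # Accumulate events into a binary activity map of the sensor resolution.
    activity = np.zeros((height, width), dtype=np.uint8)
    activity[ys, xs] = 1

    # Group active pixels into connected components; each component is one proposal.
    labels, _ = ndimage.label(activity)
    boxes = []
    for region in ndimage.find_objects(labels):
        if region is None:
            continue
        y_slice, x_slice = region
        h = y_slice.stop - y_slice.start
        w = x_slice.stop - x_slice.start
        if h * w < min_pixels:  # drop tiny, likely-noise components
            continue
        # XYXY box covering the component.
        boxes.append([x_slice.start, y_slice.start, x_slice.stop, y_slice.stop])

    box_tensor = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)

    # Wrap the boxes as Detectron2 Instances with the fields ROI heads expect.
    proposals = Instances((height, width))
    proposals.proposal_boxes = Boxes(box_tensor)
    proposals.objectness_logits = torch.ones(len(box_tensor))  # uniform confidence
    return proposals
```

These `Instances` could then be fed to the ROI heads of a Detectron2 Mask R-CNN in place of the RPN's output; the exact integration point depends on the model configuration.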