Unlike visible cameras, which record intensity images frame by frame, the biologically inspired event camera produces a stream of asynchronous, sparse events with much lower latency. In practice, visible cameras better perceive texture details and slow motion, while event cameras are free from motion blur and have a larger dynamic range, enabling them to work well under fast motion and low illumination. The two sensors can therefore complement each other to achieve more reliable object tracking. In this work, we propose a large-scale Visible-Event benchmark (termed VisEvent), addressing the lack of a realistic, large-scale dataset for this task. Our dataset consists of 820 video pairs captured under low-illumination, high-speed, and background-clutter scenarios, and is divided into a training subset and a testing subset containing 500 and 320 videos, respectively. Based on VisEvent, we transform the event flows into event images and construct more than 30 baseline methods by extending current single-modality trackers into dual-modality versions. More importantly, we build a simple but effective tracking algorithm with a proposed cross-modality transformer, achieving more effective feature fusion between visible and event data. Extensive experiments on the proposed VisEvent dataset and two simulated datasets (i.e., OTB-DVS and VOT-DVS) validate the effectiveness of our model. The dataset and source code will be available at our project page: \url{https://sites.google.com/view/viseventtrack/}.
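The conversion from asynchronous event flows to dense event images mentioned above can be illustrated with a minimal sketch. This assumes each event is an (x, y, timestamp, polarity) tuple with polarity in {-1, +1} and simply accumulates polarities over a time window into a 2D frame; the exact representation and normalization used by VisEvent may differ.

```python
import numpy as np

def events_to_image(events, height, width):
    """Accumulate a chunk of asynchronous events into a dense event image.

    `events` is assumed to be an (N, 4) array of (x, y, timestamp, polarity)
    rows with polarity in {-1, +1}. This is only an illustrative sketch,
    not the paper's exact preprocessing.
    """
    img = np.zeros((height, width), dtype=np.float32)
    for x, y, _, p in events:
        img[int(y), int(x)] += p  # signed accumulation of polarities

    # Normalize the signed accumulation into [0, 255] so the result can
    # be fed to a conventional frame-based tracker.
    if img.max() > img.min():
        img = (img - img.min()) / (img.max() - img.min())
    return (img * 255).astype(np.uint8)

# Toy example: three events on a 4x4 sensor.
ev = np.array([[0, 0, 0.01, 1], [1, 2, 0.02, -1], [0, 0, 0.03, 1]])
frame = events_to_image(ev, 4, 4)
print(frame.shape)  # (4, 4)
```

Pixels with many positive events end up bright, pixels with negative events dark, and inactive pixels sit in between, which is one common way to make event data consumable by RGB-pretrained trackers.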