Unlike visible cameras, which record intensity images frame by frame, the biologically inspired event camera produces a stream of asynchronous and sparse events with much lower latency. In practice, visible cameras better perceive texture details and slow motion, while event cameras are free from motion blur and have a larger dynamic range, which enables them to work well under fast motion and low illumination. Therefore, the two sensors can complement each other to achieve more reliable object tracking. In this work, we propose a large-scale Visible-Event benchmark (termed VisEvent), motivated by the lack of a realistic and large-scale dataset for this task. Our dataset consists of 820 video pairs captured under low-illumination, high-speed, and background-clutter scenarios, and it is divided into a training subset and a testing subset, which contain 500 and 320 videos, respectively. Based on VisEvent, we transform the event flows into event images and construct more than 30 baseline methods by extending current single-modality trackers into dual-modality versions. More importantly, we further build a simple but effective tracking algorithm by proposing a cross-modality transformer to achieve more effective feature fusion between visible and event data. Extensive experiments on the proposed VisEvent dataset, FE108, and two simulated datasets (i.e., OTB-DVS and VOT-DVS) validate the effectiveness of our model. The dataset and source code have been released at our project page: \url{https://sites.google.com/view/viseventtrack/}.
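The abstract mentions transforming event flows into event images before feeding them to frame-based trackers. A minimal sketch of one common recipe is to accumulate an event chunk into a per-polarity count image; the `(x, y, t, polarity)` event layout and the two-channel output are assumptions for illustration, since the abstract does not specify the exact representation used:

```python
import numpy as np

def events_to_image(events, height, width):
    """Accumulate a chunk of events into a 2-channel count image.

    `events`: (N, 4) array of (x, y, t, polarity) rows, with polarity
    in {0, 1}. This layout is an assumption; the paper only states
    that event flows are converted into event images.
    Returns a (2, height, width) float array: channel 0 counts
    negative-polarity events, channel 1 counts positive ones.
    """
    img = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    p = events[:, 3].astype(np.int64)
    # np.add.at handles repeated (p, y, x) indices correctly,
    # so multiple events at the same pixel accumulate.
    np.add.at(img, (p, y, x), 1.0)
    return img
```

The resulting image can then be processed by the same convolutional backbone as the visible frame, which is what makes extending single-modality trackers to dual-modality versions straightforward.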