Event cameras sense intensity changes asynchronously and produce event streams with high dynamic range and low latency. This has inspired research endeavors that utilize events to guide the challenging video super-resolution (VSR) task. In this paper, we make the first attempt to address the novel problem of achieving VSR at arbitrary scales by taking advantage of the high temporal resolution property of events. This is hampered by the difficulty of representing the spatial-temporal information of events when guiding VSR. To this end, we propose a novel framework that unifies the spatial-temporal interpolation of events with VSR. Our key idea is to learn implicit neural representations from queried spatial-temporal coordinates and features from both RGB frames and events. Our method consists of three parts. Specifically, the Spatial-Temporal Fusion (STF) module first learns 3D features from events and RGB frames. Then, the Temporal Filter (TF) module unlocks more explicit motion information from the events near the queried timestamp and generates 2D features. Lastly, the Spatial-Temporal Implicit Representation (STIR) module recovers the SR frame at arbitrary resolutions from the outputs of these two modules. In addition, we collect a real-world dataset with spatially aligned events and RGB frames. Extensive experiments show that our method significantly surpasses prior arts and achieves VSR with arbitrary scales, e.g., ×6.5. Code and dataset are available at https://vlis2022.github.io/cvpr23/egvsr.
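To make the three-stage pipeline concrete, the sketch below shows one plausible PyTorch reading of the abstract: STF fuses frames and events into 3D features, TF extracts 2D features from events near the query time, and STIR decodes an RGB value at each queried (x, y, t) coordinate so the output resolution can be arbitrary (e.g., ×6.5). The module names follow the abstract, but all layer choices, channel counts, the voxel-grid event representation, and the coordinate-sampling scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class STF(nn.Module):
    """Spatial-Temporal Fusion: fuse RGB frames and events into 3D features."""
    def __init__(self, rgb_ch=3, ev_ch=5, feat_ch=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv3d(rgb_ch + ev_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_ch, feat_ch, kernel_size=3, padding=1),
        )

    def forward(self, rgb, events):            # both: (B, C, T, H, W)
        return self.fuse(torch.cat([rgb, events], dim=1))


class TF(nn.Module):
    """Temporal Filter: 2D motion features from events near the queried time."""
    def __init__(self, ev_ch=5, feat_ch=64):
        super().__init__()
        self.filt = nn.Sequential(
            nn.Conv2d(ev_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1),
        )

    def forward(self, events_near_t):          # (B, C, H, W)
        return self.filt(events_near_t)


class STIR(nn.Module):
    """Spatial-Temporal Implicit Representation: decode RGB at queried coords."""
    def __init__(self, feat_ch=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_ch + 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3),              # RGB value at each query
        )

    def forward(self, feat_3d, feat_2d, coords):
        # coords: (B, N, 3) normalized (x, y, t) queries in [-1, 1].
        grid_3d = coords[:, None, None, :, :]                      # (B, 1, 1, N, 3)
        f3 = F.grid_sample(feat_3d, grid_3d, align_corners=True)   # (B, C, 1, 1, N)
        f3 = f3.squeeze(2).squeeze(2).permute(0, 2, 1)             # (B, N, C)
        grid_2d = coords[:, None, :, :2]                           # (B, 1, N, 2)
        f2 = F.grid_sample(feat_2d, grid_2d, align_corners=True)   # (B, C, 1, N)
        f2 = f2.squeeze(2).permute(0, 2, 1)                        # (B, N, C)
        return self.mlp(torch.cat([f3, f2, coords], dim=-1))       # (B, N, 3)


if __name__ == "__main__":
    # Query a dense (x, y, t) grid at a non-integer target scale, e.g. x6.5.
    B, T, H, W, scale = 1, 5, 32, 32, 6.5
    rgb = torch.rand(B, 3, T, H, W)
    events = torch.rand(B, 5, T, H, W)         # assumed voxel-grid event tensor
    ev_near_t = torch.rand(B, 5, H, W)         # events near the queried timestamp
    stf, tf, stir = STF(), TF(), STIR()
    feat_3d = stf(rgb, events)                 # (B, 64, T, H, W)
    feat_2d = tf(ev_near_t)                    # (B, 64, H, W)
    Hs, Ws = int(H * scale), int(W * scale)
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, Hs), torch.linspace(-1, 1, Ws), indexing="ij")
    t = torch.zeros_like(xs)                   # query the frame at t = 0 (normalized)
    coords = torch.stack([xs, ys, t], dim=-1).reshape(1, -1, 3)
    sr = stir(feat_3d, feat_2d, coords).reshape(B, Hs, Ws, 3)   # SR frame at x6.5
```

Because STIR is queried per coordinate rather than through a fixed upsampling layer, the same trained network can be evaluated on any output grid, which is what enables super-resolution at arbitrary (including non-integer) scales.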