We present simple video-specific autoencoders that enable human-controllable video exploration. This includes a wide variety of analytic tasks such as (but not limited to) spatial and temporal super-resolution, spatial and temporal editing, object removal, video textures, average video exploration, and correspondence estimation within and across videos. Prior work has looked at each of these problems independently and proposed a different formulation for each. In this work, we observe that a simple autoencoder trained from scratch on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks. These tasks are enabled by two key observations: (1) the latent codes learned by the autoencoder capture the spatial and temporal properties of that video, and (2) the autoencoder can project out-of-sample inputs onto the video-specific manifold. For example, (1) interpolating latent codes enables temporal super-resolution and user-controllable video textures; (2) manifold reprojection enables spatial super-resolution, object removal, and denoising without training for any of these tasks. Importantly, a two-dimensional visualization of the latent codes via principal component analysis acts as a tool for users to both visualize and intuitively control video edits. Finally, we quantitatively compare our approach with prior art and find that, without any supervision or task-specific knowledge, it performs comparably to supervised approaches trained specifically for a task.
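The three mechanisms named above (latent interpolation, manifold reprojection, and a 2D PCA view of the latent codes) can be illustrated with a small sketch. It rests on several assumptions that are not from the source: the abstract does not specify the autoencoder's architecture, so a linear (PCA) stand-in replaces the trained autoencoder; random arrays stand in for per-frame latent codes; interpolation is assumed to be linear; and all function names (`interpolate_latents`, `pca_2d`, `fit_linear_autoencoder`, `reproject`) are hypothetical. Decoding the interpolated codes, which would yield in-between frames, is not shown.

```python
import numpy as np


def interpolate_latents(z_a, z_b, num_intermediate):
    """Return `num_intermediate` codes evenly spaced strictly between z_a and z_b.

    Assumption: linear interpolation. Decoding these codes with a trained
    decoder (not shown) would give in-between frames.
    """
    ts = np.linspace(0.0, 1.0, num_intermediate + 2)[1:-1]  # drop both endpoints
    return np.stack([(1.0 - t) * z_a + t * z_b for t in ts])


def pca_2d(latents):
    """Project N latent codes (shape N x D) onto their top two principal components."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # shape (N, 2): one 2D point per frame


def fit_linear_autoencoder(frames, k):
    """Hypothetical linear stand-in for the autoencoder: a k-dim PCA subspace of the frames."""
    mean = frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:k]


def reproject(x, mean, basis):
    """Encode then decode, i.e. project x onto the subspace learned from the video."""
    code = (x - mean) @ basis.T  # "encoder": coordinates in the subspace
    return mean + code @ basis   # "decoder": back to pixel space


rng = np.random.default_rng(0)

# Placeholder per-frame latent codes; a real system would obtain these from its encoder.
latents = rng.normal(size=(30, 64))
print(interpolate_latents(latents[0], latents[1], 3).shape)  # (3, 64)
print(pca_2d(latents).shape)                                 # (30, 2)

# Toy "video": 50 flattened 20-pixel frames lying on a 3-dimensional subspace.
frames = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 20))
mean, basis = fit_linear_autoencoder(frames, k=3)
clean = 0.5 * frames[0] + 0.5 * frames[1]      # an unseen frame on the same subspace
noisy = clean + 0.1 * rng.normal(size=20)      # out-of-sample (noisy) input
print("error before:", np.linalg.norm(noisy - clean))
print("error after: ", np.linalg.norm(reproject(noisy, mean, basis) - clean))
```

In this linear stand-in, reprojection is an orthogonal projection, so it can only shrink the noise component. A nonlinear autoencoder trained on a real video would instead project onto a curved manifold. The linear substitute is chosen only because it makes the encode-then-decode step easy to verify.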