This paper studies video inpainting detection, which aims to localize inpainted regions in a video both spatially and temporally. In particular, we introduce VIDNet, a Video Inpainting Detection Network, which contains a two-stream encoder-decoder architecture with an attention module. To reveal artifacts introduced during compression, VIDNet additionally takes in Error Level Analysis frames to augment RGB frames, producing multimodal features at different levels with an encoder. To explore spatial and temporal relationships, these features are further decoded by a Convolutional LSTM to predict masks of inpainted regions. In addition, to detect whether a pixel is inpainted, we present a quad-directional local attention module that borrows information from neighboring pixels along four directions. Extensive experiments are conducted to validate our approach. We demonstrate, among other things, that VIDNet not only outperforms alternative inpainting detection methods by clear margins but also generalizes well to novel videos unseen during training.
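To make the two-stream design concrete, the following is a minimal, hedged sketch of the overall pipeline: an RGB stream and an ELA stream are encoded separately, their features are fused, and a convolutional LSTM decodes the sequence into per-pixel inpainting masks. All names (`TwoStreamInpaintingDetector`, `ConvLSTMCell`, `feat_ch`), layer counts, and the fusion scheme are illustrative assumptions, not the authors' exact VIDNet configuration; the quad-directional local attention module and the ELA computation are omitted for brevity.

```python
# Illustrative sketch only; sizes and fusion are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLSTMCell(nn.Module):
    """Standard convolutional LSTM cell (generic formulation, not VIDNet-specific)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class TwoStreamInpaintingDetector(nn.Module):
    """RGB + ELA streams -> fused features -> ConvLSTM decoding over time."""
    def __init__(self, feat_ch=64):
        super().__init__()
        # Hypothetical lightweight encoders; the paper uses deeper backbones.
        def encoder():
            return nn.Sequential(
                nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
        self.rgb_enc = encoder()
        self.ela_enc = encoder()
        self.lstm = ConvLSTMCell(2 * feat_ch, feat_ch)
        self.head = nn.Conv2d(feat_ch, 1, 1)  # per-pixel inpainting logit

    def forward(self, rgb_seq, ela_seq):
        # rgb_seq, ela_seq: (B, T, 3, H, W) video clips
        B, T, _, H, W = rgb_seq.shape
        h = c = None
        masks = []
        for t in range(T):
            f_rgb = self.rgb_enc(rgb_seq[:, t])
            f_ela = self.ela_enc(ela_seq[:, t])
            fused = torch.cat([f_rgb, f_ela], dim=1)  # simple concatenation fusion
            if h is None:
                h = torch.zeros(B, self.lstm.hid_ch, *fused.shape[-2:], device=fused.device)
                c = torch.zeros_like(h)
            h, c = self.lstm(fused, (h, c))
            masks.append(torch.sigmoid(self.head(h)))
        # Upsample predicted masks back to the input resolution.
        masks = torch.stack(masks, dim=1)  # (B, T, 1, H/4, W/4)
        masks = F.interpolate(masks.flatten(0, 1), size=(H, W),
                              mode="bilinear", align_corners=False)
        return masks.view(B, T, 1, H, W)
```

A usage note under the same assumptions: feeding two `(B, T, 3, H, W)` tensors (RGB frames and their ELA counterparts) yields a `(B, T, 1, H, W)` tensor of per-frame inpainting masks, which can be trained with a per-pixel binary cross-entropy loss against ground-truth masks.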