We propose replacing scene text in videos using deep style transfer and learned photometric transformations. Building on recent progress in still-image text replacement, we present extensions that alter text while preserving the appearance and motion characteristics of the original video. Compared to still-image text replacement, our method addresses additional challenges introduced by video, namely effects induced by changing lighting, motion blur, diverse variations in camera-object pose over time, and the need to preserve temporal consistency. We decompose the problem into three steps. First, the text in all frames is normalized to a frontal pose using a spatio-temporal transformer network. Second, the text is replaced in a single reference frame using a state-of-the-art still-image text replacement method. Finally, the new text is transferred from the reference to the remaining frames using a novel learned image transformation network that captures lighting and blur effects in a temporally consistent manner. Results on synthetic and challenging real videos show realistic text transfer, competitive quantitative and qualitative performance, and superior inference speed relative to alternatives. We introduce new synthetic and real-world datasets with paired text objects. To the best of our knowledge, this is the first attempt at deep video text replacement.
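To make the three-step pipeline concrete, the following is a minimal PyTorch sketch of its control flow only: normalize all frames to a frontal pose, edit one reference frame with a still-image method, then propagate the edit with a per-frame photometric transfer network. The module designs here (FrontalizationSTN, PhotometricTransfer) and the still_image_editor callable are hypothetical placeholders for illustration, not the paper's actual networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrontalizationSTN(nn.Module):
    """Hypothetical stand-in for the spatio-temporal transformer:
    regresses one affine warp per frame to frontalize the text region."""
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, 7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 6),
        )

    def forward(self, frames):  # frames: (T, 3, H, W)
        theta = self.loc(frames).view(-1, 2, 3)  # one 2x3 affine per frame
        grid = F.affine_grid(theta, frames.size(), align_corners=False)
        return F.grid_sample(frames, grid, align_corners=False)

class PhotometricTransfer(nn.Module):
    """Hypothetical transfer net: conditions the edited reference on each
    original frame to reproduce that frame's lighting and blur."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, edited_ref, frames):
        # Broadcast the single edited reference across all T frames,
        # concatenate with each frame's appearance cues, and re-render.
        ref = edited_ref.expand_as(frames)
        return self.net(torch.cat([ref, frames], dim=1))

def replace_text_in_video(frames, new_text, still_image_editor):
    """Three-step pipeline sketch. `still_image_editor` stands in for any
    still-image text replacement method applied to one reference frame."""
    stn, transfer = FrontalizationSTN(), PhotometricTransfer()
    frontal = stn(frames)                                   # 1) pose normalization
    edited_ref = still_image_editor(frontal[:1], new_text)  # 2) edit reference frame
    return transfer(edited_ref, frontal)                    # 3) per-frame transfer

# Usage on random data, with an identity editor as a placeholder:
frames = torch.rand(8, 3, 64, 256)       # 8-frame clip of a cropped text region
dummy_editor = lambda ref, text: ref     # swap in a real still-image method here
out = replace_text_in_video(frames, "NEW", dummy_editor)
print(out.shape)                         # torch.Size([8, 3, 64, 256])
```

In this sketch the temporal-consistency and blur modeling the abstract describes would live inside PhotometricTransfer's training objective; the untrained modules above only illustrate the data flow between the three stages.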