Multiscale video transformers have been explored in a wide variety of vision tasks. To date, however, multiscale processing has been confined to the encoder or decoder alone. We present a unified multiscale encoder-decoder transformer focused on dense prediction tasks in videos. Multiscale representation at both the encoder and decoder yields key benefits: implicit extraction of spatiotemporal features (i.e., without reliance on input optical flow), temporal consistency at encoding, and coarse-to-fine detection of high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we propose a transductive learning scheme through many-to-many label propagation to provide temporally consistent predictions. We showcase our Multiscale Encoder-Decoder Video Transformer (MED-VT) on Automatic Video Object Segmentation (AVOS) and actor/action segmentation, where we outperform state-of-the-art approaches on multiple benchmarks using only raw images, with no optical flow input.
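To make the many-to-many label propagation idea concrete, below is a minimal, hypothetical sketch of propagating per-token class logits across all frames of a clip via feature affinities. This is not the authors' implementation: the tensor shapes, the temperature `tau`, and the use of a single propagation step are illustrative assumptions.

```python
# Hypothetical sketch of many-to-many label propagation over video tokens.
# Shapes, temperature `tau`, and the single-step update are assumptions for
# illustration; they are not taken from the MED-VT paper.
import torch
import torch.nn.functional as F

def propagate_labels(feats: torch.Tensor, logits: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Refine per-token class logits by mixing labels across all frames at once.

    feats:  (T, N, C) token features for T frames with N tokens each.
    logits: (T, N, K) initial per-token class logits (e.g., decoder outputs).
    Every token aggregates labels from every token in every frame
    (many-to-many), weighted by pairwise feature affinity.
    """
    T, N, C = feats.shape
    f = F.normalize(feats.reshape(T * N, C), dim=-1)   # flatten time and space, L2-normalize
    affinity = (f @ f.t()) / tau                       # (T*N, T*N) cosine similarities
    weights = affinity.softmax(dim=-1)                 # row-normalized propagation weights
    refined = weights @ logits.reshape(T * N, -1)      # propagate labels across all tokens
    return refined.reshape(T, N, -1)

# Toy usage: 4 frames, 256 tokens per frame, 2 classes (foreground/background).
feats = torch.randn(4, 256, 64)
logits = torch.randn(4, 256, 2)
refined = propagate_labels(feats, logits)
print(refined.shape)  # torch.Size([4, 256, 2])
```

Because every token attends to every other token's label, predictions for the same object across frames are pulled toward agreement, which is the temporal-consistency effect the transductive scheme is after.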