This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (except for patch and positional embeddings), and that spacetime-agnostic random masking performs best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to the information redundancy of the data. A high masking ratio leads to a large speedup, e.g., >4x in wall-clock time or more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
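
To illustrate the spacetime-agnostic random masking described above, the following sketch (ours, not the paper's released code) assumes PyTorch and a tensor of already-patchified spacetime tokens of shape [batch, num_patches, dim]; the function and variable names are hypothetical. At a 90% masking ratio, only 10% of tokens are kept and fed to the encoder, independent of their (t, h, w) position.

```python
import torch

def random_spacetime_masking(patch_tokens, mask_ratio=0.9):
    """Spacetime-agnostic random masking: keep a random subset of the
    flattened spacetime patch tokens, regardless of their position in time
    or space. Returns the visible tokens, a binary mask, and the indices
    needed to restore the original token order for the decoder."""
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))  # e.g. 10% of tokens at a 90% ratio

    noise = torch.rand(B, N, device=patch_tokens.device)   # per-token scores
    ids_shuffle = torch.argsort(noise, dim=1)               # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)          # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(
        patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D)
    )

    # mask: 0 = visible (encoded), 1 = masked (to be reconstructed in pixels)
    mask = torch.ones(B, N, device=patch_tokens.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)

    return visible, mask, ids_restore
```

Because the encoder only processes the small visible subset, raising the masking ratio shrinks the encoder's input sequence accordingly, which is the source of the large wall-clock speedup noted above.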