The success of deep learning models has led to their adaptation and adoption by prominent video understanding methods. The majority of these approaches encode features in a joint space-time modality for which the inner workings and learned representations are difficult to visually interpret. We propose LEArned Preconscious Synthesis (LEAPS), an architecture-agnostic method for synthesizing videos from the internal spatiotemporal representations of models. Using a stimulus video and a target class, we prime a fixed space-time model and iteratively optimize a video initialized with random noise. We incorporate additional regularizers to improve the feature diversity of the synthesized videos as well as the cross-frame temporal coherence of motions. We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of spatiotemporal convolutional and attention-based architectures trained on Kinetics-400, which to the best of our knowledge has not been previously accomplished.
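To make the described procedure concrete, the following is a minimal sketch of the kind of optimization loop the abstract outlines: a frozen space-time model is primed with a stimulus video, and a noise-initialized video is iteratively optimized toward a target class with regularizers for temporal coherence and feature diversity. All function names, hyperparameters, and the specific form of each regularizer are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def leaps_sketch(model, stimulus, target_class, steps=2000, lr=0.1,
                 tv_weight=1e-4, div_weight=1e-2):
    """Hypothetical sketch of stimulus-primed video synthesis.

    model:    frozen spatiotemporal classifier taking input of shape (B, C, T, H, W)
    stimulus: stimulus video tensor used to prime the fixed model
    """
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)

    # Prime the fixed space-time model with the stimulus video
    # (e.g. to populate internal statistics later regularizers could match).
    with torch.no_grad():
        _ = model(stimulus)

    # Video initialized with random noise and optimized directly.
    video = torch.randn_like(stimulus).requires_grad_(True)
    optimizer = torch.optim.Adam([video], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(video)

        # Class objective: drive the synthesized video toward the target class.
        cls_loss = F.cross_entropy(logits, torch.tensor([target_class]))

        # Cross-frame temporal-coherence regularizer (assumed form):
        # penalize abrupt changes between consecutive frames.
        temporal_tv = (video[:, :, 1:] - video[:, :, :-1]).abs().mean()

        # Feature-diversity term (placeholder): in practice this would act on
        # internal features rather than raw pixels.
        diversity = video.std()

        loss = cls_loss + tv_weight * temporal_tv - div_weight * diversity
        loss.backward()
        optimizer.step()

    return video.detach()
```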