Sensor fusion can significantly improve the performance of many computer vision tasks. However, traditional fusion approaches are either not data-driven, and thus can neither exploit prior knowledge nor find regularities in a given dataset, or they are restricted to a single application. We overcome this shortcoming by presenting FusionVAE, a novel deep hierarchical variational autoencoder that can serve as a basis for many fusion tasks. Our approach generates diverse image samples conditioned on multiple noisy, occluded, or only partially visible input images. We derive and optimize a variational lower bound on the conditional log-likelihood of FusionVAE. To assess the fusion capabilities of our model thoroughly, we created three novel image fusion datasets based on popular computer vision datasets. Our experiments show that FusionVAE learns a representation of aggregated information that is relevant to fusion tasks. The results demonstrate that our approach significantly outperforms traditional methods. Furthermore, we discuss the advantages and disadvantages of different design choices.
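To make the objective concrete: the paper optimizes a variational lower bound (ELBO) on the conditional log-likelihood. The sketch below is a minimal, illustrative version of such a conditional ELBO for a single latent level with diagonal Gaussian posterior and prior; it is not the paper's hierarchical objective, and all function names are our own. The bound is the expected reconstruction log-likelihood minus the KL divergence between the condition-dependent approximate posterior q(z|x, y) and the prior p(z|y).

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ).

    Closed form for diagonal Gaussians; zero when both distributions match.
    """
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def conditional_elbo(x_target, x_recon, mu_q, logvar_q, mu_p, logvar_p):
    """Single-sample conditional ELBO (illustrative, not FusionVAE's exact bound).

    Reconstruction term: Gaussian log-likelihood with unit variance,
    dropping the additive constant. The KL term regularizes the
    condition-dependent posterior toward the (learned) conditional prior.
    """
    recon_ll = -0.5 * np.sum((x_target - x_recon) ** 2)
    return recon_ll - kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

With a perfect reconstruction and matching posterior/prior parameters, the bound evaluates to zero; training would maximize this quantity (or minimize its negative) over the encoder, decoder, and conditional-prior networks. FusionVAE's actual bound extends this idea across a hierarchy of latent groups and an aggregated representation of multiple input images.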