Multimodal VAEs seek to model the joint distribution over heterogeneous data (e.g.\ vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, the MEME, that avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing -- something that most existing approaches either cannot handle, or handle only to a limited extent. We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes on the MNIST-SVHN (image-image) and CUB (image-text) datasets. We also contrast the quality of the representations learnt through mutual supervision against standard approaches, and observe interesting trends in their ability to capture relatedness between data.
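For context, the "explicit combinations" referenced above fuse per-modality recognition posteriors directly, e.g.\ via a product of experts. The following is a minimal sketch of such a product-of-experts fusion under the assumption of diagonal-Gaussian unimodal posteriors; names and shapes are illustrative and not taken from the paper.

```python
import torch

def product_of_experts(mus, logvars):
    """Fuse diagonal-Gaussian unimodal posteriors N(mu_m, sigma_m^2) into a
    single joint posterior by multiplying their densities.

    mus, logvars: tensors of shape (num_modalities, batch, latent_dim).
    Returns the mean and log-variance of the product distribution.
    """
    precisions = torch.exp(-logvars)             # 1 / sigma_m^2 for each expert
    joint_precision = precisions.sum(dim=0)      # precisions add under a product of Gaussians
    joint_var = 1.0 / joint_precision
    joint_mu = (mus * precisions).sum(dim=0) * joint_var  # precision-weighted mean
    return joint_mu, torch.log(joint_var)
```

MEME, by contrast, avoids constructing such an explicit joint posterior: each modality's encoder supervises the other's, so information is shared implicitly rather than through a fused recognition distribution.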