We extend multi-modal scene understanding, for the first time, to include free-hand scene sketches. This uniquely results in a trilogy of scene data modalities (sketch, text, and photo), where each offers a unique perspective on scene understanding, and together they enable a series of novel scene-specific applications across discriminative (retrieval) and generative (captioning) tasks. Our key objective is to learn a common three-way embedding space that enables many-to-many modality interactions (e.g., sketch+text $\rightarrow$ photo retrieval). Importantly, we leverage information bottleneck theory to achieve this goal, where we (i) decouple \textit{intra-modality information} by minimising the mutual information between modality-specific and modality-agnostic components via a conditional invertible neural network, and (ii) align \textit{cross-modality information} by maximising the mutual information between their modality-agnostic components using InfoNCE, with a dedicated multi-head attention mechanism that allows many-to-many modality interactions. We spell out several insights on the complementarity of each modality for scene understanding, and study for the first time a series of scene-specific applications such as joint sketch- and text-based image retrieval and sketch captioning.
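As an illustrative sketch of the cross-modal alignment in (ii), the standard InfoNCE objective over a batch can be written as below; the notation (modality-agnostic embeddings $z^{a}, z^{b}$ from two modalities, batch size $N$, similarity $\mathrm{sim}$, temperature $\tau$) is generic and assumed here for exposition, not quoted from the paper.
\begin{equation}
\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(z_i^{a}, z_i^{b})/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(z_i^{a}, z_j^{b})/\tau\big)}
\end{equation}
Minimising this loss maximises a lower bound on the mutual information between the paired modality-agnostic components, which is what drives the three modalities towards the common embedding space.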