In this work we introduce a cross-modal image retrieval system that accepts both text and sketches as query modalities. A cross-modal deep network architecture is formulated to jointly model the sketch and text input modalities as well as the image output modality, learning a common embedding between text and images and between sketches and images. In addition, an attention model selectively focuses on the different objects in the image, enabling retrieval with multiple objects in the query. Experiments show that the proposed method achieves the best performance on both single- and multiple-object image retrieval on standard datasets.
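A common embedding of this kind is typically trained with a ranking objective that pulls a query and its matching image together in the shared space and pushes mismatched pairs apart. The following NumPy sketch illustrates the general idea; the feature dimensions, the linear projections, and the triplet-loss formulation are illustrative assumptions, not the paper's exact architecture or objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, W):
    """Linearly project features into the shared space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Hypothetical dimensions: 300-d text features, 512-d image features,
# mapped into a 128-d common embedding space (assumed values).
W_text = rng.normal(size=(300, 128))
W_img = rng.normal(size=(512, 128))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet ranking loss on cosine distance: the matching image should be
    closer to the query embedding than the non-matching one by a margin."""
    d_pos = 1.0 - np.sum(anchor * positive, axis=-1)
    d_neg = 1.0 - np.sum(anchor * negative, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

# One toy triplet: a text (or sketch) query, its matching image,
# and a mismatched image, all embedded into the shared space.
q = project(rng.normal(size=(1, 300)), W_text)
img_pos = project(rng.normal(size=(1, 512)), W_img)
img_neg = project(rng.normal(size=(1, 512)), W_img)
loss = triplet_loss(q, img_pos, img_neg)
```

At retrieval time the same projections embed the query and all database images, and results are ranked by cosine similarity in the shared space.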