Sometimes the meaning conveyed by an image goes beyond the list of objects it contains; the image may instead express a powerful message intended to influence its viewers. Inferring this message requires reasoning about the relationships between the objects, as well as general common-sense knowledge about the components. In this paper, we use a scene graph, a graph representation of an image, to capture its visual components. In addition, we generate a knowledge graph from facts extracted from ConceptNet to reason about objects and attributes. To detect the symbols, we propose a neural network framework named SKG-Sym. The framework first generates representations of the image's scene graph and of its knowledge graph using a Graph Convolution Network. It then fuses the two representations and uses an MLP to classify them. We further extend the network with an attention mechanism that learns the importance of each graph representation. We evaluate our methods on a dataset of advertisements and compare them with baseline symbolism classification methods (ResNet and VGG). Results show that our methods outperform ResNet in terms of F-score, and that the attention-based variant is competitive with VGG while having much lower model complexity.
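The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graphs, feature sizes, random weights, mean pooling, dot-product attention, the single linear layer standing in for the MLP, and the sigmoid multi-label output are all hypothetical choices made only to show the data flow (two graph encoders, fusion, classification).

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 used by a standard GCN layer."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj_norm, feats, weight):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    h = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in h]


def mean_pool(node_feats):
    """Collapse node embeddings into one graph-level vector (assumed readout)."""
    n = len(node_feats)
    return [sum(col) / n for col in zip(*node_feats)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(graph_vecs, query):
    """Weight each graph vector by a softmax over dot products with a query vector."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in graph_vecs]
    weights = softmax(scores)
    fused = [sum(w * vec[k] for w, vec in zip(weights, graph_vecs))
             for k in range(len(query))]
    return fused, weights


def random_matrix(rows, cols, rng):
    """Random weights stand in for trained parameters in this sketch."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Toy scene graph (3 nodes in a chain) and toy knowledge graph (4 nodes).
scene_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
know_adj = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
scene_x = random_matrix(3, 4, rng)  # 4-dim node features (hypothetical)
know_x = random_matrix(4, 4, rng)

w_scene = random_matrix(4, 8, rng)  # separate encoder weights per graph
w_know = random_matrix(4, 8, rng)
query = [rng.uniform(-0.5, 0.5) for _ in range(8)]  # attention query (hypothetical)
w_out = random_matrix(8, 5, rng)  # 5 hypothetical symbol labels

scene_vec = mean_pool(gcn_layer(normalize_adjacency(scene_adj), scene_x, w_scene))
know_vec = mean_pool(gcn_layer(normalize_adjacency(know_adj), know_x, w_know))

fused, attn = attention_fuse([scene_vec, know_vec], query)
logits = matmul([fused], w_out)[0]
probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # assumed multi-label output
```

Here `attn` holds the two weights assigned to the scene-graph and knowledge-graph representations, which is the quantity an attention-based variant would learn during training. Dropping `attention_fuse` in favor of a plain concatenation or sum of `scene_vec` and `know_vec` would correspond to a simpler fusion without attention.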