Diagram object detection is a key basis for practical applications such as textbook question answering. Because diagrams consist mainly of simple lines and color blocks, their visual features are sparser than those of natural images. In addition, diagrams usually express diverse knowledge, so they contain many low-frequency object categories. As a result, traditional data-driven detection models are not well suited to diagrams. In this work, we propose a gestalt-perception transformer model (GPTR) for diagram object detection, which is based on an encoder-decoder architecture. Gestalt perception comprises a series of laws that explain human perception: the human visual system tends to perceive patches in an image that are similar, close, or connected without abrupt directional changes as a single perceptual whole. Inspired by these ideas, we build a gestalt-perception graph in the transformer encoder, with diagram patches as nodes and the relationships between patches as edges. The graph aims to group patches into objects via the laws of similarity, proximity, and smoothness encoded in these edges, so that meaningful objects can be detected effectively. Experimental results demonstrate that the proposed GPTR achieves the best results on the diagram object detection task. Our model also obtains results comparable to those of competing methods on natural image object detection.
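To make the grouping idea concrete, the following is a minimal toy sketch of a patch graph whose edge weights combine similarity, proximity, and smoothness cues. It is not the GPTR encoder: the function names `gestalt_affinity` and `group_patches`, the cosine, Gaussian, and orientation-agreement formulas, the product used to combine them, and the `sigma` and `threshold` values are all assumptions chosen for illustration, and grouping is done here by simple connected components rather than inside a transformer.

```python
import numpy as np


def gestalt_affinity(features, centers, angles, sigma=1.5):
    """Combine three illustrative gestalt cues into one patch-to-patch affinity.

    All three cue definitions are assumptions made for this sketch.
    """
    # Similarity: cosine similarity between patch feature vectors, clipped to [0, 1].
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    similarity = np.clip(unit @ unit.T, 0.0, 1.0)

    # Proximity: Gaussian falloff with the distance between patch centres.
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    proximity = np.exp(-(dist ** 2) / (2 * sigma ** 2))

    # Smoothness: agreement of dominant patch orientations (radians),
    # so that abrupt directional changes receive a low weight.
    smoothness = np.abs(np.cos(angles[:, None] - angles[None, :]))

    return similarity * proximity * smoothness


def group_patches(affinity, threshold=0.5):
    """Label connected components of the thresholded affinity graph (union-find)."""
    n = affinity.shape[0]
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if affinity[i, j] >= threshold:
                parent[find(i)] = find(j)

    # Map each root to a compact group id in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Hypothetical toy input: two horizontal patches and two vertical patches.
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
centers = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
angles = np.array([0.0, 0.0, np.pi / 2, np.pi / 2])

labels = group_patches(gestalt_affinity(features, centers, angles))
print(labels)  # [0, 0, 1, 1]
```

In this toy input the first two patches share features, lie close together, and have the same orientation, so they fall into one group, while the other pair forms a second group. Multiplying the cues means a pair must satisfy all three laws at once; a weighted sum would be an equally plausible choice for a sketch like this.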