PQA: Perceptual Question Answering

Perceptual organization remains one of the very few established theories of the human visual system. It underpinned many seminal pre-deep-learning works on segmentation and detection, yet research on it has declined rapidly since the field shifted toward learning deep models. Of the limited attempts since, most aimed to interpret complex visual scenes using perceptual organization rules. This has, however, proven sub-optimal, since models were unable to effectively capture the visual complexity of real-world imagery. In this paper, we rejuvenate the study of perceptual organization by advocating two positional changes: (i) we examine purposefully generated synthetic data instead of complex real imagery, and (ii) we ask machines to synthesize novel, perceptually valid patterns instead of explaining existing data. Our overall answer lies in the introduction of a novel visual challenge: perceptual question answering (PQA). Upon observing example perceptual question-answer pairs, the goal in PQA is to solve similar questions by generating answers entirely from scratch (see Figure 1). Our first contribution is therefore the first dataset of perceptual question-answer pairs, each generated specifically for a particular Gestalt principle. We then borrow insights from human psychology to design an agent that casts perceptual organization as a self-attention problem, in which a proposed grid-to-grid mapping network directly generates answer patterns from scratch. Experiments show that our agent outperforms a selection of naive and strong baselines. A human study, however, indicates that our agent uses astronomically more data to learn than an average human does, necessitating future research (with or without our dataset).
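
The abstract gives no architectural detail, so the following is only a minimal sketch of what "grid-to-grid mapping via self-attention" could look like, not the authors' network. It assumes a question is an H x W grid of color indices, treats every cell as one token, and uses a single attention head with untrained random weights. All function names, the `dim` parameter, and the position encoding are hypothetical choices made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def grid_tokens(grid, num_colors):
    """Flatten an H x W grid into one token per cell.

    Each token is a one-hot color vector followed by normalized (row, col)
    coordinates, so attention can use both appearance and position.
    This encoding is an assumption, not taken from the paper.
    """
    h, w = len(grid), len(grid[0])
    tokens = []
    for r, row in enumerate(grid):
        for c, color in enumerate(row):
            vec = [0.0] * num_colors
            vec[color] = 1.0
            tokens.append(vec + [r / max(h - 1, 1), c / max(w - 1, 1)])
    return tokens


def random_matrix(rows, cols, rng):
    """Untrained stand-in for a learned projection matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def grid_to_grid(grid, num_colors, dim=8, seed=0):
    """Map a question grid to an answer grid of the same shape.

    Returns (answer_grid, attention_matrix). Weights are random, so the
    answer is meaningless until the projections are trained.
    """
    rng = random.Random(seed)
    h, w = len(grid), len(grid[0])
    x = grid_tokens(grid, num_colors)
    d_in = num_colors + 2
    w_q, w_k, w_v = (random_matrix(d_in, dim, rng) for _ in range(3))
    w_out = random_matrix(dim, num_colors, rng)

    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # every cell scores every other cell
    attn = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    mixed = matmul(attn, v)  # each cell becomes a weighted mix of all cells
    logits = matmul(mixed, w_out)  # per-cell scores over output colors

    flat = [max(range(num_colors), key=lambda i: row[i]) for row in logits]
    answer = [flat[r * w:(r + 1) * w] for r in range(h)]
    return answer, attn


if __name__ == "__main__":
    question = [[0, 1, 0], [1, 1, 0]]
    answer, attn = grid_to_grid(question, num_colors=2)
    print(answer)
```

The point of the sketch is the data flow: every output cell is predicted from a weighted combination of all input cells, which is one way a model could express grouping cues across the whole grid. A trained version would learn the projection matrices from question-answer pairs rather than sampling them at random.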
