In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) we introduce a novel transformer-based architecture for generalised visual object counting, termed Counting Transformer (CounTR), which explicitly captures the similarity between image patches, or between image patches and the given "exemplars", via the attention mechanism; (2) we adopt a two-stage training regime that first pre-trains the model with self-supervised learning, followed by supervised fine-tuning; (3) we propose a simple, scalable pipeline for synthesizing training images with a large number of instances, or with instances from different semantic categories, explicitly forcing the model to make use of the given "exemplars"; (4) we conduct thorough ablation studies on the large-scale counting benchmark FSC-147, and demonstrate state-of-the-art performance in both zero-shot and few-shot settings.
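To make contribution (1) concrete, the sketch below illustrates the core idea of exemplar-conditioned attention: image patch tokens act as queries, and exemplar tokens act as keys/values, so the similarity scores directly compare each patch against the given exemplars. This is a minimal single-head numpy illustration of the mechanism, not the paper's actual CounTR implementation (which uses a full transformer with learned projections); all function and variable names here are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def exemplar_cross_attention(patch_feats, exemplar_feats):
    """Single-head cross-attention sketch.

    patch_feats:    (N, d) image patch tokens (queries)
    exemplar_feats: (K, d) exemplar tokens (keys and values)
    Returns (N, d) exemplar-conditioned patch features.
    """
    d = patch_feats.shape[-1]
    # (N, K): explicit similarity between every patch and every exemplar.
    scores = patch_feats @ exemplar_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ exemplar_feats

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 64))    # e.g. 14x14 grid of patch tokens
exemplars = rng.standard_normal((3, 64))    # 3-shot: three exemplar tokens
out = exemplar_cross_attention(patches, exemplars)
print(out.shape)  # (196, 64)
```

In the zero-shot setting the exemplar tokens would be absent, and the model falls back on self-attention among the image patches themselves.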
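Contribution (3) refers to synthesizing training images that contain many instances, or instances from several categories, so that the model cannot ignore the exemplars. A minimal sketch of the underlying collage idea is shown below, assuming fixed-size object crops tiled on a grid; this is only illustrative of the concept, not the paper's actual synthesis pipeline, and the function name is our own.

```python
import numpy as np

def mosaic(crops, grid=(2, 2)):
    """Tile equal-sized object crops (H, W, 3) into one synthetic image.

    The ground-truth count of the synthetic image is simply the number
    of pasted instances; mixing crops from different source categories
    forces the model to rely on the exemplars to decide what to count.
    Illustrative sketch only -- assumes all crops share one shape.
    """
    rows = []
    for r in range(grid[0]):
        row_crops = crops[r * grid[1]:(r + 1) * grid[1]]
        rows.append(np.concatenate(row_crops, axis=1))
    return np.concatenate(rows, axis=0)

# Four 32x32 dummy "object" crops, one per synthetic instance.
crops = [np.full((32, 32, 3), i, dtype=np.uint8) for i in range(4)]
img = mosaic(crops)
print(img.shape)  # (64, 64, 3); synthetic count = 4
```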