In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for objects with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr, in which candidate objects are first proposed by an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes; additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster R-CNN-type model end-to-end with knowledge distillation, which performs class-agnostic object proposal and classification of semantic categories and attributes, with classifiers generated by a text encoder; finally, (iv) we conduct extensive experiments on the VAW, MS-COCO, LSA, and OVAD datasets, and show that recognition of semantic categories and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attribute prediction largely outperforms existing approaches that treat the two tasks independently, demonstrating strong generalization ability to novel attributes and categories.
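The core mechanism in (i) and (iii) — scoring region features against classifiers produced by a text encoder — can be sketched as follows. This is a minimal illustration with random stand-in embeddings, not the paper's actual implementation; the variable names, dimensions, and thresholds are all assumptions for exposition.

```python
import numpy as np

def normalize(x, axis=-1):
    # L2-normalize so that dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
dim = 512  # assumed embedding dimension

# Stand-ins for a frozen text encoder's embeddings of category and
# attribute prompts (e.g., "a photo of a {category}"); in the real
# system these come from the CLIP text encoder.
category_names = ["cat", "car", "chair"]
attribute_names = ["red", "furry", "wooden"]
cat_classifiers = normalize(rng.standard_normal((len(category_names), dim)))
attr_classifiers = normalize(rng.standard_normal((len(attribute_names), dim)))

# Visual features of class-agnostic region proposals (e.g., from an
# offline RPN in the two-stage variant).
region_feats = normalize(rng.standard_normal((5, dim)))

# Cosine-similarity logits against the text-derived classifiers.
cat_logits = region_feats @ cat_classifiers.T
attr_logits = region_feats @ attr_classifiers.T

# Categories are mutually exclusive (argmax per region); attributes are
# multi-label (independent sigmoid with an assumed 0.5 threshold).
cat_pred = cat_logits.argmax(axis=1)
attr_pred = 1.0 / (1.0 + np.exp(-attr_logits)) > 0.5

print(cat_pred.shape, attr_pred.shape)  # (5,) (5, 3)
```

Because the classifiers are just text embeddings, extending to novel categories or attributes only requires encoding new names, which is what makes the open-vocabulary setting possible.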