Visual attributes constitute a large portion of the information contained in a scene. Objects can be described using a wide variety of attributes which portray their visual appearance (color, texture), geometry (shape, size, posture), and other intrinsic properties (state, action). Existing work is mostly limited to the study of attribute prediction in specific domains. In this paper, we introduce VAW, a large-scale in-the-wild visual attribute prediction dataset consisting of over 927K attribute annotations for over 260K object instances. Formally, object attribute prediction is a multi-label classification problem in which all attributes that apply to an object must be predicted. Our dataset poses significant challenges to existing methods due to the large number of attributes, label sparsity, data imbalance, and object occlusion. To this end, we propose several techniques that systematically tackle these challenges, including a base model that utilizes both low- and high-level CNN features with multi-hop attention, reweighting and resampling techniques, a novel negative label expansion scheme, and a novel supervised attribute-aware contrastive learning algorithm. Using these techniques, we achieve nearly 3.7 mAP and 5.7 overall F1 points of improvement over the current state of the art. Further details about the VAW dataset can be found at http://vawdataset.com/.
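To make the multi-label formulation with sparse labels concrete, the sketch below shows a masked binary cross-entropy loss, a common way (not necessarily the one used in this paper) to train a multi-label attribute classifier when most attribute labels are unannotated. The function name and label encoding (1 = positive, 0 = explicit negative, -1 = unannotated) are illustrative assumptions.

```python
import math

def masked_bce_loss(logits, labels):
    """Multi-label binary cross-entropy that ignores unannotated attributes.

    logits: per-attribute raw scores for one object instance.
    labels: 1 = positive, 0 = explicit negative, -1 = unannotated (ignored).
    This label encoding is an illustrative assumption, not the paper's API.
    """
    total, count = 0.0, 0
    for z, y in zip(logits, labels):
        if y < 0:
            continue  # skip attributes with no annotation (label sparsity)
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns the logit into a probability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        count += 1
    return total / max(count, 1)  # average only over annotated attributes
```

The reweighting and resampling techniques mentioned above would extend such a loss, e.g. by scaling each annotated term by an inverse-frequency weight for its attribute class.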