Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass level, aiming to enumerate objects that match a textual expression specifying both a class and a distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed at class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information carried by other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize objects, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek out and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state of the art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test) and improving localization F1 by 7% and 8%, respectively. Code will be made available.
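To make the two mechanisms concrete, below is a minimal PyTorch sketch of what a dual-query decoder layer could look like. All module and tensor names (DualQueryDecoderLayer, w2c_q, w2s_q) and the concatenation-based fusion are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DualQueryDecoderLayer(nn.Module):
    """Hypothetical decoder layer with separate w2c and w2s query streams."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        # what-to-count queries attend to image features to localize objects
        self.w2c_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # where-to-see queries attend to the same features for attribute evidence
        self.w2s_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, w2c_q, w2s_q, img_feats):
        # w2c_q, w2s_q: (B, N, D) query embeddings; img_feats: (B, HW, D)
        w2c_out, _ = self.w2c_attn(w2c_q, img_feats, img_feats)  # class-level evidence
        w2s_out, _ = self.w2s_attn(w2s_q, img_feats, img_feats)  # attribute-region evidence
        # fuse the two streams so subclass scoring can draw on both cues
        fused = self.fuse(torch.cat([w2c_out, w2s_out], dim=-1))
        return w2c_q + w2c_out, w2s_q + w2s_out, fused
```

Likewise, one way a "repulsive force" could enter label assignment is as an extra cost term in Hungarian matching that penalizes pairing a prediction with a ground truth of a different subclass. The cost weights and the particular repulsion definition below (one minus the probability the prediction assigns to the ground truth's subclass) are assumptions for illustration, not the paper's SSM formulation.

```python
import torch
from scipy.optimize import linear_sum_assignment

def subclass_separable_match(pred_pts, pred_scores, pred_sub_logits,
                             gt_pts, gt_sub, w_loc=1.0, w_cls=1.0, w_rep=0.5):
    """Illustrative one-to-one assignment with a repulsive cost term.

    pred_pts (P, 2), pred_scores (P,), pred_sub_logits (P, S),
    gt_pts (G, 2), gt_sub (G,) long tensor of subclass indices.
    """
    loc_cost = torch.cdist(pred_pts, gt_pts)                         # (P, G) L2 distance
    cls_cost = -pred_scores.unsqueeze(1).expand(-1, gt_pts.size(0))  # prefer confident preds
    sub_prob = pred_sub_logits.softmax(dim=-1)                       # (P, S)
    # Repulsion: probability mass placed on subclasses other than the GT's,
    # discouraging cross-subclass matches during assignment.
    rep_cost = 1.0 - sub_prob[:, gt_sub]                             # (P, G)
    cost = w_loc * loc_cost + w_cls * cls_cost + w_rep * rep_cost
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(rows), torch.as_tensor(cols)
```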