Open-vocabulary object detection has been greatly advanced by the recent development of vision-language pretrained models, which help recognize novel objects given only their semantic categories. Prior works mainly focus on transferring knowledge to object proposal classification and employ class-agnostic box and mask prediction. In this work, we propose CondHead, a principled dynamic network design to better generalize box regression and mask segmentation in the open-vocabulary setting. The core idea is to conditionally parameterize the network heads on the semantic embedding, so that the model is guided by class-specific knowledge to better detect novel categories. Specifically, CondHead is composed of two streams of network heads: the dynamically aggregated head and the dynamically generated head. The former is instantiated with a set of static heads that are conditionally aggregated; these heads are optimized as experts and are expected to learn sophisticated prediction. The latter is instantiated with dynamically generated parameters and encodes general class-specific information. With such a conditional design, the detection model is bridged by the semantic embedding to offer strongly generalizable class-wise box and mask prediction. Our method brings significant improvement to state-of-the-art open-vocabulary object detection methods with very minor overhead; e.g., it surpasses a RegionCLIP model by 3.0 detection AP on novel categories, with only 1.1% more computation.
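To make the two-stream idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each head is a single linear layer, that the expert heads are mixed by a softmax gate over the semantic embedding, that the generated head comes from one linear generator, and that the two streams are combined by summation; the abstract specifies none of these details. All names (`CondHeadSketch`, `gate`, `generator`, and so on) are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols, scale=0.1):
    """Gaussian-initialised matrix standing in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class CondHeadSketch:
    """Hypothetical two-stream box head conditioned on a class embedding."""

    def __init__(self, feat_dim, embed_dim, out_dim=4, num_experts=3, seed=0):
        rng = random.Random(seed)
        self.feat_dim = feat_dim
        self.out_dim = out_dim
        # Stream 1: static expert heads plus a gate that scores each expert.
        self.experts = [random_matrix(rng, out_dim, feat_dim)
                        for _ in range(num_experts)]
        self.gate = random_matrix(rng, num_experts, embed_dim)
        # Stream 2: a generator mapping the embedding to flat head weights.
        self.generator = random_matrix(rng, out_dim * feat_dim, embed_dim)

    def expert_weights(self, embedding):
        """Mixing coefficients over experts (assumed softmax gating)."""
        return softmax(matvec(self.gate, embedding))

    def aggregated_head(self, embedding):
        """Weighted sum of the static expert weight matrices."""
        coeffs = self.expert_weights(embedding)
        return [[sum(c * expert[i][j] for c, expert in zip(coeffs, self.experts))
                 for j in range(self.feat_dim)]
                for i in range(self.out_dim)]

    def generated_head(self, embedding):
        """Head weights produced directly from the embedding."""
        flat = matvec(self.generator, embedding)
        return [flat[i * self.feat_dim:(i + 1) * self.feat_dim]
                for i in range(self.out_dim)]

    def predict(self, roi_feature, embedding):
        """Box deltas: sum of both streams (the sum is an assumption)."""
        agg = matvec(self.aggregated_head(embedding), roi_feature)
        gen = matvec(self.generated_head(embedding), roi_feature)
        return [a + g for a, g in zip(agg, gen)]


# Toy usage: the same RoI feature yields class-specific box deltas.
head = CondHeadSketch(feat_dim=8, embed_dim=5)
roi_feature = [0.5] * 8
cat_embedding = [1.0, 0.0, 0.0, 0.0, 0.0]
dog_embedding = [0.0, 1.0, 0.0, 0.0, 0.0]
box_cat = head.predict(roi_feature, cat_embedding)
box_dog = head.predict(roi_feature, dog_embedding)
```

In this sketch the only class-dependent input is the semantic embedding, so a novel category needs no new parameters: its embedding selects a mixture of the existing experts and produces its own generated head. A real detector would use deeper heads, operate on batched RoI feature maps, and add a mask branch built the same way.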