Leveraging large-scale data can introduce performance gains on many computer vision tasks. Unfortunately, this does not hold for object detection when a single model is trained jointly on multiple datasets. We observe two main obstacles: taxonomy differences and bounding box annotation inconsistency, which introduce domain gaps across datasets and prevent joint training. In this paper, we show that these two challenges can be effectively addressed by simply adapting object queries on the language embeddings of categories per dataset. We design a detection hub that dynamically adapts queries on category embeddings according to the distribution of each dataset. Unlike previous methods that attempted to learn a joint embedding for all datasets, our adaptation method uses the language embeddings as semantic centers for common categories, while learning a semantic bias toward the specific categories of each dataset to handle annotation differences and bridge the domain gaps. These improvements enable us to train a single detector end-to-end on multiple datasets simultaneously and fully exploit their combined data. Experiments on joint training across multiple datasets demonstrate significant performance gains over separately fine-tuned individual detectors.
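The adaptation idea above can be illustrated with a minimal sketch: shared language embeddings of category names serve as semantic centers, each dataset learns an additive semantic bias, and object queries are conditioned on the biased category embeddings. All names, shapes, and the attention-style pooling here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch (not the paper's real API): adapt shared object queries
# on per-dataset category embeddings. Embedding values are random placeholders
# standing in for a text encoder's output.
rng = np.random.default_rng(0)

num_queries, embed_dim = 100, 256
datasets = {"coco": 80, "objects365": 365}  # dataset name -> number of categories

# Shared language embeddings of category names: the common semantic centers.
lang_embed = {name: rng.normal(size=(k, embed_dim)) for name, k in datasets.items()}

# Learnable per-dataset semantic bias, initialized to zero so training starts
# from the shared centers and only drifts to absorb annotation differences.
dataset_bias = {name: np.zeros((k, embed_dim)) for name, k in datasets.items()}

# Shared object queries of the single detector.
queries = rng.normal(size=(num_queries, embed_dim))

def adapt_queries(queries, dataset):
    """Condition queries on biased category embeddings via attention-style pooling."""
    cat = lang_embed[dataset] + dataset_bias[dataset]   # (K, D) centers + dataset bias
    attn = queries @ cat.T / np.sqrt(queries.shape[1])  # (Q, K) scaled similarity
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)             # softmax over categories
    return queries + attn @ cat                         # residual query update

# The same query set is adapted differently per dataset during joint training.
adapted_coco = adapt_queries(queries, "coco")
adapted_o365 = adapt_queries(queries, "objects365")
```

Because the bias is additive and zero-initialized, common categories stay anchored to the shared semantic centers while each dataset's head is free to compensate for its own taxonomy and box-annotation style.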