The task of Human-Object Interaction (HOI) detection targets fine-grained visual parsing of humans interacting with their environment, enabling a broad range of applications. Prior work has demonstrated the benefits of effective architecture design and integration of relevant cues for more accurate HOI detection. However, the design of an appropriate pre-training strategy for this task remains underexplored by existing approaches. To address this gap, we propose Relational Language-Image Pre-training (RLIP), a strategy for contrastive pre-training that leverages both entity and relation descriptions. To make effective use of such pre-training, we make three technical contributions: (1) a new Parallel entity detection and Sequential relation inference (ParSe) architecture that enables the use of both entity and relation descriptions during holistically optimized pre-training; (2) a synthetic data generation framework, Label Sequence Extension, that expands the scale of language data available within each minibatch; (3) mechanisms to account for ambiguity, Relation Quality Labels and Relation Pseudo-Labels, to mitigate the influence of ambiguous/noisy samples in the pre-training data. Through extensive experiments, we demonstrate the benefits of these contributions, collectively termed RLIP-ParSe, for improved zero-shot, few-shot and fine-tuning HOI detection performance as well as increased robustness to learning from noisy annotations. Code will be available at https://github.com/JacobYuan7/RLIP.
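To make the pre-training objective concrete, below is a minimal sketch of a relational vision-language contrastive loss in the spirit of RLIP: visual features for entities and for entity pairs (relations) are aligned with text embeddings of entity and relation descriptions via cosine-similarity logits. All names, shapes, the temperature value, and the choice of a multi-label binary cross-entropy are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def relational_contrastive_loss(entity_feats, relation_feats,
                                entity_text_embs, relation_text_embs,
                                entity_targets, relation_targets,
                                temperature=0.07):
    """Hypothetical sketch: align visual entity/relation features
    with text embeddings of entity/relation descriptions.

    entity_feats:       (N_e, D) visual features of detected entities
    relation_feats:     (N_r, D) visual features of entity pairs
    entity_text_embs:   (C_e, D) text embeddings of entity descriptions
    relation_text_embs: (C_r, D) text embeddings of relation descriptions
    entity_targets:     (N_e, C_e) multi-hot ground-truth matches
    relation_targets:   (N_r, C_r) multi-hot ground-truth matches
    """
    # Cosine-similarity logits between vision and language embeddings.
    e_logits = F.normalize(entity_feats, dim=-1) @ F.normalize(entity_text_embs, dim=-1).T
    r_logits = F.normalize(relation_feats, dim=-1) @ F.normalize(relation_text_embs, dim=-1).T
    # Multi-label BCE lets one region/pair match several descriptions.
    e_loss = F.binary_cross_entropy_with_logits(e_logits / temperature, entity_targets)
    r_loss = F.binary_cross_entropy_with_logits(r_logits / temperature, relation_targets)
    return e_loss + r_loss
```

Under this view, Label Sequence Extension would amount to appending extra sampled entity/relation descriptions as negatives to `entity_text_embs` and `relation_text_embs` within each minibatch, enlarging `C_e` and `C_r` without changing the loss itself.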