We propose HOI Transformer to tackle human-object interaction (HOI) detection in an end-to-end manner. Current approaches either decouple the HOI task into separate stages of object detection and interaction classification, or introduce a surrogate interaction problem. In contrast, HOI Transformer streamlines the HOI pipeline by eliminating the need for many hand-designed components. It reasons about the relations between objects and humans from global image context and directly predicts HOI instances in parallel. A quintuple matching loss is introduced to supervise HOI predictions in a unified way. Our method is conceptually much simpler than previous methods and demonstrates improved accuracy. Without bells and whistles, HOI Transformer achieves $26.61\%$ $AP$ on HICO-DET and $52.9\%$ $AP_{role}$ on V-COCO, surpassing previous methods. We hope our approach will serve as a simple and effective alternative for HOI tasks. Code is available at https://github.com/bbepoch/HoiTransformer.
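To make the idea of matching predicted HOI instances against ground truth more concrete, the following is a minimal sketch of a one-to-one matching step over quintuples. It is not the authors' implementation. It assumes a DETR-style bipartite matching, an illustrative quintuple layout (human confidence, object class, interaction class, human box, object box), unweighted cost terms, and boxes in normalized (cx, cy, w, h) form. The names `pair_cost` and `match` are hypothetical, and the exhaustive search over permutations stands in for a Hungarian solver and only suits toy sizes.

```python
from itertools import permutations


def l1(box_a, box_b):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def pair_cost(pred, gt):
    """Cost of assigning one prediction to one ground-truth quintuple.

    Assumed (illustrative) cost: negative class probabilities plus
    L1 box distances for the human box and the object box.
    """
    cls_cost = -(
        pred["human_prob"]
        + pred["object_probs"][gt["object_cls"]]
        + pred["action_probs"][gt["action_cls"]]
    )
    box_cost = l1(pred["human_box"], gt["human_box"]) + l1(
        pred["object_box"], gt["object_box"]
    )
    return cls_cost + box_cost


def match(preds, gts):
    """Return sorted (pred_idx, gt_idx) pairs with minimal total cost.

    Brute force over permutations; requires len(preds) >= len(gts).
    """
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return sorted((p, g) for g, p in enumerate(best_perm))


# Toy data: two ground-truth HOI instances, three parallel predictions.
gts = [
    {"human_box": (0.2, 0.2, 0.1, 0.3), "object_box": (0.5, 0.5, 0.2, 0.2),
     "object_cls": 1, "action_cls": 0},
    {"human_box": (0.8, 0.3, 0.1, 0.4), "object_box": (0.7, 0.7, 0.1, 0.1),
     "object_cls": 0, "action_cls": 1},
]
preds = [
    # Prediction 0 closely resembles ground truth 1.
    {"human_box": (0.8, 0.3, 0.1, 0.4), "object_box": (0.7, 0.7, 0.1, 0.1),
     "human_prob": 0.9, "object_probs": [0.8, 0.2], "action_probs": [0.1, 0.9]},
    # Prediction 1 is a low-confidence "no interaction" slot.
    {"human_box": (0.5, 0.5, 0.5, 0.5), "object_box": (0.1, 0.1, 0.1, 0.1),
     "human_prob": 0.1, "object_probs": [0.5, 0.5], "action_probs": [0.5, 0.5]},
    # Prediction 2 closely resembles ground truth 0.
    {"human_box": (0.2, 0.2, 0.1, 0.3), "object_box": (0.5, 0.5, 0.2, 0.2),
     "human_prob": 0.9, "object_probs": [0.1, 0.9], "action_probs": [0.9, 0.1]},
]

matches = match(preds, gts)
print(matches)  # [(0, 1), (2, 0)]
```

In this sketch, matched pairs would receive the classification and box losses, while unmatched predictions (here prediction 1) would be trained toward a "no interaction" label. Scoring all five components in a single cost is what lets one matching step supervise the whole HOI instance rather than the human and the object separately.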