Existing referring understanding tasks tend to involve detecting a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide multi-object tracking. To the best of our knowledge, this is the first work to predict an arbitrary number of referent objects in videos. To push RMOT forward, we construct a benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Furthermore, we develop a transformer-based architecture, TransRMOT, to tackle the new task in an online manner; it achieves impressive detection performance and outperforms other counterparts.