Retrieving tracked vehicles by natural language descriptions plays a critical role in smart city construction. The task is to find, from a set of vehicles tracked in surveillance videos, the best match for a given text description. Existing works generally solve it with a dual-stream framework, which consists of a text encoder, a visual encoder and a cross-modal loss function. Although some progress has been made, these methods fail to fully exploit information at different levels of granularity. To tackle this issue, we propose OMG, a novel framework for the natural language-based vehicle retrieval task, which Observes Multiple Granularities with respect to visual representation, textual representation and objective functions. For the visual representation, target features, context features and motion features are encoded separately. For the textual representation, one global embedding, three local embeddings and a color-type prompt embedding are extracted to represent semantic features at various granularities. Finally, the overall framework is optimized by a cross-modal multi-granularity contrastive loss function. Experiments demonstrate the effectiveness of our method. OMG significantly outperforms all previous methods and ranks 9th in Track 2 of the 6th AI City Challenge. The code is available at https://github.com/dyhBUPT/OMG.
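To make the idea of a cross-modal multi-granularity contrastive loss concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each granularity (for example target, context and motion on the visual side; global and local on the textual side) yields one embedding per sample, and that the loss is a symmetric InfoNCE term averaged over every visual-textual pair of granularities. The function names, the granularity keys, the pairing scheme and the temperature value are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def multi_granularity_loss(visual, textual, temperature=0.1):
    """Average a symmetric InfoNCE term over every (visual, textual) pair.

    `visual` and `textual` map a granularity name to a list of embeddings,
    one per sample, with index i referring to the same vehicle in both.
    The all-pairs averaging is an assumption of this sketch.
    """
    losses = []
    for v_feats in visual.values():
        for t_feats in textual.values():
            t2v = info_nce(t_feats, v_feats, temperature)
            v2t = info_nce(v_feats, t_feats, temperature)
            losses.append(0.5 * (t2v + v2t))
    return sum(losses) / len(losses)


# Toy batch of two vehicles with 2-D embeddings (made-up numbers).
visual = {
    "target": [[1.0, 0.0], [0.0, 1.0]],
    "context": [[1.0, 0.1], [0.1, 1.0]],
    "motion": [[0.9, 0.2], [0.2, 0.9]],
}
textual = {
    "global": [[1.0, 0.0], [0.0, 1.0]],
    "local": [[0.8, 0.1], [0.1, 0.8]],
}
# Swapping the two descriptions pairs each text with the wrong vehicle.
textual_swapped = {name: feats[::-1] for name, feats in textual.items()}

aligned = multi_granularity_loss(visual, textual)
shuffled = multi_granularity_loss(visual, textual_swapped)
print(aligned < shuffled)
```

In this toy setup the correctly paired batch yields a lower loss than the batch with swapped descriptions, which is the behavior a contrastive objective is meant to reward. A real training setup would compute the same quantity on batched tensors from the visual and text encoders.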