Grounded Situation Recognition (GSR) is the task of not only classifying the salient action (verb) in an image, but also predicting the entities (nouns) that fill the verb's semantic roles and their locations in the image. Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture. The attention mechanism of our model enables accurate verb classification by effectively capturing high-level semantic features of an image, and allows the model to flexibly handle the complicated, image-dependent relations between entities for improved noun classification and localization. Our model is the first Transformer architecture for GSR, and achieves the state of the art in every evaluation metric on the SWiG benchmark. Our code is available at https://github.com/jhcho99/gsrtr.
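
To make the task definition concrete, the following is a minimal sketch of what a single GSR prediction might look like as a data structure: one verb, plus a noun and an optional bounding box for each semantic role. The class names, the role names, the box convention, and the example values are all hypothetical and chosen for illustration only; they are not taken from our released code or from the SWiG annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedNoun:
    """Entity filling one semantic role, with an optional location."""
    noun: str
    box: Optional[Box] = None  # None: entity has no box (assumed convention)


@dataclass
class GroundedSituation:
    """One GSR output: a verb plus a grounded noun per semantic role."""
    verb: str
    roles: Dict[str, GroundedNoun] = field(default_factory=dict)

    def grounded_roles(self) -> List[str]:
        """Return the names of roles whose entity has a bounding box."""
        return [name for name, g in self.roles.items() if g.box is not None]


# Hypothetical prediction for an image of a person feeding a horse.
example = GroundedSituation(
    verb="feeding",
    roles={
        "agent": GroundedNoun("person", (12.0, 30.0, 180.0, 400.0)),
        "food": GroundedNoun("hay"),  # predicted noun without a box
        "eater": GroundedNoun("horse", (200.0, 50.0, 480.0, 420.0)),
    },
)
```

Under this sketch, verb classification fills `verb`, noun classification fills each `noun`, and localization fills each `box`.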