Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images. However, their performance is unreliable on images affected by challenging factors such as heavy occlusion and motion blur. In this work, we propose to recognize human attributes from video frames, which allows us to make full use of temporal information. Specifically, we formulate video-based PAR as a vision-language fusion problem and adopt the pre-trained big model CLIP to extract feature embeddings of the given video frames. To better utilize the semantic information, we take the attribute list as another input and transform the attribute words/phrases into corresponding sentences via split, expand, and prompt operations. Then, the text encoder of CLIP is utilized for language embedding. The averaged visual tokens and text tokens are concatenated and fed into a fusion Transformer for multi-modal interactive learning. The enhanced tokens are then fed into a classification head for pedestrian attribute prediction. Extensive experiments on a large-scale video-based PAR dataset fully validate the effectiveness of our proposed framework.
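To make the pipeline concrete, the following is a minimal sketch of the described architecture, not the authors' released code. It assumes OpenAI's CLIP package and PyTorch; the prompt template, fusion-Transformer depth, and head design are illustrative choices rather than the paper's exact settings.

```python
# Sketch of video-based PAR via CLIP + fusion Transformer (illustrative, not the official code).
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP


class VideoPARFusion(nn.Module):
    def __init__(self, embed_dim=512, num_layers=2):
        super().__init__()
        # Frozen CLIP backbone provides both visual and text encoders.
        self.clip_model, self.preprocess = clip.load("ViT-B/32", device="cpu")
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, 1)  # one logit per attribute token

    def forward(self, frames, attribute_sentences):
        # frames: [T, 3, 224, 224] preprocessed frames of one pedestrian tracklet.
        with torch.no_grad():
            frame_feats = self.clip_model.encode_image(frames).float()        # [T, D]
            text_tokens = clip.tokenize(attribute_sentences)
            text_feats = self.clip_model.encode_text(text_tokens).float()     # [A, D]

        # Average visual tokens over time, then concatenate with text tokens.
        visual_token = frame_feats.mean(dim=0, keepdim=True)                  # [1, D]
        tokens = torch.cat([visual_token, text_feats], dim=0).unsqueeze(0)    # [1, 1+A, D]

        # Multi-modal interactive learning via the fusion Transformer.
        fused = self.fusion(tokens)                                           # [1, 1+A, D]

        # Classification head on the enhanced attribute tokens.
        return self.head(fused[:, 1:, :]).squeeze(-1)                         # [1, A]


# Usage: attribute phrases expanded into prompted sentences (hypothetical template).
attributes = ["short hair", "wearing a backpack"]
sentences = [f"a photo of a pedestrian with {a}" for a in attributes]
model = VideoPARFusion()
video = torch.randn(8, 3, 224, 224)  # 8 dummy frames standing in for a tracklet
probs = torch.sigmoid(model(video, sentences))  # per-attribute probabilities
```

The key design point reflected here is that each attribute is carried through the fusion Transformer as its own text token, so the classification head can read an attribute-specific enhanced embedding rather than a single pooled vector.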