In this technical report, we present our solution for the Human-centric Spatio-Temporal Video Grounding (HC-STVG) track of the 4th Person in Context (PIC) workshop and challenge. Our solution builds on TubeDETR and the Mutual Matching Network (MMN). Specifically, TubeDETR uses a video-text encoder and a space-time decoder to predict the start time, the end time, and the tube of the target person. MMN detects persons in images, links the detections into tubes, extracts features from the person tubes and the text description, and predicts the similarity between them to select the most likely person tube as the grounding result. Finally, our solution refines the results by combining the spatial localization of MMN with the temporal localization of TubeDETR. In the HC-STVG track of the 4th PIC challenge, our solution achieves third place.
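The final combination step can be illustrated with a minimal sketch. The abstract does not specify how the two outputs are merged, so the following is an assumption rather than the actual method: it takes a person tube (as MMN might produce, here a mapping from frame index to box) and a temporal segment (as TubeDETR might predict, here an inclusive frame range) and keeps only the boxes inside the segment. The name `combine_tube_and_segment`, the `Box` layout, and the data format are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def combine_tube_and_segment(
    tube: Dict[int, Box], start: int, end: int
) -> Dict[int, Box]:
    """Crop a spatial tube to a temporal segment.

    Assumption: `tube` maps frame indices to boxes (spatial localization),
    and [start, end] is an inclusive frame range (temporal localization).
    Frames in the segment that have no box in the tube are simply skipped.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    # Keep only boxes whose frame index falls inside the segment,
    # ordered by frame index.
    return {
        frame: box
        for frame, box in sorted(tube.items())
        if start <= frame <= end
    }


# Illustrative values only, not taken from the report.
example_tube = {
    0: (10.0, 20.0, 50.0, 120.0),
    1: (12.0, 21.0, 52.0, 121.0),
    2: (14.0, 22.0, 54.0, 122.0),
    3: (16.0, 23.0, 56.0, 123.0),
}
example_result = combine_tube_and_segment(example_tube, start=1, end=2)
```

In this sketch the spatial boxes are left untouched and only the temporal extent changes, which reflects the division of labor described above: one model supplies where the person is, the other supplies when the described event happens.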