We propose an effective two-stage approach to the language-based Human-centric Spatio-Temporal Video Grounding (HC-STVG) task. In the first stage, we propose an Augmented 2D Temporal Adjacent Network (Augmented 2D-TAN) to temporally ground the target moment corresponding to the given description. We improve the original 2D-TAN in two respects: first, we develop a temporal context-aware Bi-LSTM Aggregation Module to aggregate clip-level representations, replacing the original max-pooling; second, we employ a Random Concatenation Augmentation (RCA) mechanism during training. In the second stage, we use a pretrained MDETR model to generate per-frame bounding boxes from the language query, and design a set of hand-crafted rules to select the best-matching bounding box output by MDETR for each frame within the grounded moment.
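As a rough illustration of the Bi-LSTM Aggregation Module described above, the sketch below contextualizes clip-level features with a bidirectional LSTM before pooling them into a moment representation, in place of a plain max-pool. The tensor shapes, hidden size, and the choice of mean-pooling the contextualized features are our own assumptions, not details taken from this report.

```python
import torch
import torch.nn as nn

class BiLSTMAggregation(nn.Module):
    """Hypothetical sketch: aggregate clip-level features with a Bi-LSTM
    instead of max-pooling, so the moment representation carries temporal
    context from neighboring clips. Shapes and sizes are assumptions."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        # Bidirectional so each clip sees both past and future context.
        self.lstm = nn.LSTM(feat_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, feat_dim)
        outputs, _ = self.lstm(clip_feats)  # (batch, num_clips, 2*hidden_dim)
        # Pool the contextualized clip features into one moment vector;
        # the original 2D-TAN would instead take clip_feats.max(dim=1).values.
        return outputs.mean(dim=1)          # (batch, 2*hidden_dim)

# Usage: aggregate 16 clips of 512-d features for a batch of 4 moments.
moment_repr = BiLSTMAggregation()(torch.randn(4, 16, 512))
```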
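The report only names the RCA mechanism here without spelling out its mechanics. Purely as an illustration of how such an augmentation might work, the sketch below concatenates the clip features of two training videos, keeps the query and moment annotation of one of them, and shifts the annotated boundaries when distractor clips are prepended; every specific (clip-level granularity, the prepend/append coin flip) is our assumption.

```python
import random

def random_concat_augment(sample_a, sample_b):
    """Hypothetical sketch of Random Concatenation Augmentation (RCA):
    join two training videos at the clip level, keep the first sample's
    query, and adjust its moment boundaries. Details are assumptions."""
    feats_a, (start, end), query = sample_a  # feats: list of clip features
    feats_b, _, _ = sample_b                 # distractor video
    if random.random() < 0.5:
        # Append distractor clips: target boundaries are unchanged.
        feats = feats_a + feats_b
    else:
        # Prepend distractor clips: shift the target moment by the offset.
        offset = len(feats_b)
        feats = feats_b + feats_a
        start, end = start + offset, end + offset
    return feats, (start, end), query
```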