Although Temporal Sentence Grounding in Videos (TSGV) has achieved impressive progress over the last few years, current TSGV models tend to capture moment annotation biases and fail to take full advantage of multi-modal inputs. Surprisingly, some extremely simple TSGV baselines, even without any training, can achieve state-of-the-art performance. In this paper, we first take a closer look at the existing evaluation protocol, and argue that both the prevailing datasets and metrics are the devils that cause unreliable benchmarking. To this end, we propose to re-organize two widely-used TSGV datasets (Charades-STA and ActivityNet Captions), deliberately \textbf{C}hanging the moment annotation \textbf{D}istribution of the test split to make it different from that of the training split; the re-organized datasets are dubbed Charades-CD and ActivityNet-CD, respectively. Meanwhile, we introduce a new evaluation metric, ``dR@$n$,IoU@$m$'', which calibrates the basic IoU scores by penalizing over-long moment predictions more heavily, thereby reducing the inflated performance caused by moment annotation biases. Under this new evaluation protocol, we conduct extensive experiments and ablation studies on eight state-of-the-art TSGV models. All the results demonstrate that the re-organized datasets and the new metric can better monitor progress in TSGV, which is still far from satisfactory. The repository of this work is at \url{https://github.com/yytzsy/grounding_changing_distribution}.
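For concreteness, below is a minimal Python sketch of how such a discounted recall could be computed for the top-1 prediction (i.e., dR@$1$,IoU@$m$). It assumes start/end timestamps normalized by video duration and boundary-distance discount factors of the form $\alpha = 1 - |p - g|$; the function names and the exact discounting scheme are illustrative assumptions, not the authors' reference implementation.

\begin{verbatim}
# Sketch of a discounted recall dR@1,IoU@m. Assumes each moment is a
# (start, end) pair normalized to [0, 1] by the video duration, and that
# the discount multiplies the recall hit by (1 - |boundary error|) for
# both the start and end timestamps (an assumption for illustration).

def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def discounted_recall_at_1(preds, gts, m=0.5):
    """preds, gts: lists of normalized (start, end) pairs, one per query."""
    total = 0.0
    for (ps, pe), (gs, ge) in zip(preds, gts):
        # Standard recall term: 1 if the top-1 prediction clears IoU >= m.
        hit = 1.0 if temporal_iou((ps, pe), (gs, ge)) >= m else 0.0
        # Discount by boundary drift: an over-long prediction can pass the
        # IoU threshold yet still be penalized for imprecise endpoints.
        alpha_s = 1.0 - abs(ps - gs)
        alpha_e = 1.0 - abs(pe - ge)
        total += hit * alpha_s * alpha_e
    return total / len(preds)
\end{verbatim}

Under this form, a trivially long prediction that swallows the whole video may still satisfy IoU $\geq m$ for short ground-truth moments, but its large start/end errors shrink $\alpha_s \alpha_e$ and thus its contribution to the score.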