Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences. Existing state-of-the-art methods typically rely on holistic video representations for score regression or ranking, which limits their ability to capture fine-grained intra-class variation. To overcome this limitation, we propose a temporal parsing transformer that decomposes the holistic feature into temporal part-level representations. Specifically, we utilize a set of learnable queries to represent the atomic temporal patterns of a specific action. Our decoding process converts the frame representations into a fixed number of temporally ordered part representations. To obtain the quality score, we adopt state-of-the-art contrastive regression on top of the part representations. Since existing AQA datasets do not provide temporal part-level labels or partitions, we propose two novel loss functions on the cross-attention responses of the decoder: a ranking loss that ensures the learnable queries satisfy the temporal order in cross attention, and a sparsity loss that encourages the part representations to be more discriminative. Extensive experiments show that our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
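To make the decoding step concrete, the sketch below shows how a fixed set of learnable part queries can cross-attend to frame-level features and produce temporally ordered part tokens. This is a minimal PyTorch illustration under our own assumptions: the module name `PartDecoder`, the hidden size, and the number of parts and layers are all hypothetical choices, not the paper's reference implementation.

```python
# Minimal sketch of a temporal parsing decoder: learnable part queries
# cross-attend to frame features and yield a fixed number of part tokens.
# Names and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

class PartDecoder(nn.Module):
    """Decodes per-frame features into a fixed number of part representations."""
    def __init__(self, d_model: int = 256, num_parts: int = 5, num_layers: int = 2):
        super().__init__()
        # One learnable query per atomic temporal pattern (part).
        self.part_queries = nn.Parameter(torch.randn(num_parts, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, d_model) holistic per-frame features.
        B = frame_feats.size(0)
        queries = self.part_queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, d)
        # Cross attention inside the decoder maps T frames onto K part tokens.
        return self.decoder(queries, frame_feats)                   # (B, K, d)
```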
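The two attention-level losses can likewise be sketched. Below is one plausible instantiation, assuming the cross-attention weights are normalized over the temporal axis: the ranking term penalizes part queries whose attention "centers" violate temporal order, and an entropy-style term stands in for the sparsity loss. The exact formulation in the paper may differ.

```python
# Hedged sketch of the two attention-based losses described in the abstract;
# one plausible instantiation, not necessarily the paper's exact formulation.
import torch
import torch.nn.functional as F

def attention_losses(attn: torch.Tensor, margin: float = 0.0):
    # attn: (B, K, T) cross-attention weights of K part queries over T frames,
    # assumed to sum to 1 along the temporal axis.
    B, K, T = attn.shape
    t = torch.arange(T, dtype=attn.dtype, device=attn.device)
    centers = (attn * t).sum(dim=-1)  # (B, K): expected time index per part
    # Ranking loss: consecutive parts should attend to temporally ordered regions,
    # i.e. centers[:, k] < centers[:, k+1]; violations incur a hinge penalty.
    rank_loss = F.relu(margin + centers[:, :-1] - centers[:, 1:]).mean()
    # Sparsity loss: low attention entropy pushes each part query to focus on
    # a few frames, making the part representations more discriminative.
    sparsity_loss = -(attn * (attn + 1e-8).log()).sum(dim=-1).mean()
    return rank_loss, sparsity_loss
```

In practice, such terms would presumably be added to the contrastive regression objective with small weights, since they act only as weak supervision on the attention maps rather than on the score itself.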