Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning

Evaluating whether a human action is performed to standard, and providing reasonable feedback to improve it, is crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what the action is and where it occurs, which does not meet this requirement. Meanwhile, most existing datasets lack labels indicating the degree of action standardization, and action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task and introduce a new, diverse dataset, CoT-AFA, which contains a large number of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process, from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which not only judges an action but also explains why and provides a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. Experimental results demonstrate that our method achieves improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy), and quality assessment (+2.1% in accuracy), revealing the great potential of CoT-AFA for future studies. Our dataset and source code are available at https://github.com/MICLAB-BUPT/EFA.
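The abstract describes the fusion step only at a high level (two parallel streams combined through a dynamic gate). As a rough illustration of what a gated fusion of this general kind can look like, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `gated_fusion`, the per-dimension sigmoid gate, and the convex-combination form are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(visual, semantic, weights, bias):
    """Fuse two same-length feature vectors with a per-dimension gate.

    Hypothetical formulation (an assumption, not taken from the paper):
        gate  = sigmoid(W @ concat(visual, semantic) + b)
        fused = gate * visual + (1 - gate) * semantic

    `weights` has one row per output dimension, each of length 2 * d,
    and `bias` has length d, where d = len(visual).
    """
    if len(visual) != len(semantic):
        raise ValueError("visual and semantic features must have equal length")
    joint = list(visual) + list(semantic)
    fused, gates = [], []
    for row, b, v, s in zip(weights, bias, visual, semantic):
        # The gate depends on both streams, so the mix changes per input.
        g = sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
        gates.append(g)
        fused.append(g * v + (1.0 - g) * s)
    return fused, gates


if __name__ == "__main__":
    visual = [1.0, 2.0]
    semantic = [3.0, 4.0]
    zero_weights = [[0.0] * 4 for _ in range(2)]
    fused, gates = gated_fusion(visual, semantic, zero_weights, [0.0, 0.0])
    print(gates, fused)
```

With all-zero weights and biases, every gate equals 0.5 and the fused vector is the element-wise average of the two streams. A large positive bias pushes the gate toward 1, so the output follows the visual stream. In a trained model the weights would be learned, which lets the balance between the streams vary with the input.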

