How can we teach a computer to recognize 10,000 different actions? Deep learning has evolved from supervised and unsupervised to self-supervised approaches. In this paper, we present a new contrastive learning-based framework for decision tree-based classification of actions, including human-human interactions (HHI) and human-object interactions (HOI). The key idea is to translate the original multi-class action recognition problem into a series of binary classification tasks on a pre-constructed decision tree. Within this contrastive learning framework, we design an interaction adjacency matrix (IAM) with skeleton graphs as the backbone for modeling action-related attributes such as periodicity and symmetry. By constructing various pretext tasks, we obtain a series of binary classification nodes on the decision tree that can be combined to support higher-level recognition tasks. We experimentally demonstrate the potential of our approach in real-world applications ranging from interaction recognition to symmetry detection. In particular, we show promising performance for video-based autism spectrum disorder (ASD) diagnosis on the CalTech interview video database.
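To make the key idea concrete, the following is a minimal sketch of how multi-class action recognition can be decomposed into binary decisions on a pre-constructed tree: each internal node holds a binary classifier (in the paper, trained via contrastive pretext tasks), and a sample is routed from the root to a leaf carrying the action label. All names and the toy splits (periodicity, then symmetry) are hypothetical illustrations, not the paper's implementation.

from dataclasses import dataclass
from typing import Callable, Optional

# A binary "node classifier" maps a feature vector to 0 (go left) or 1 (go right).
NodeClassifier = Callable[[list], int]

@dataclass
class TreeNode:
    classifier: Optional[NodeClassifier] = None   # None for leaf nodes
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None
    label: Optional[str] = None                   # action label at a leaf

def classify(node: TreeNode, features: list) -> str:
    """Route a sample down the tree using the binary decision at each node."""
    while node.label is None:
        branch = node.classifier(features)
        node = node.right if branch == 1 else node.left
    return node.label

# Toy tree: first split on "is the action periodic?", then on "is it symmetric?".
# The lambdas stand in for learned binary heads over IAM/skeleton-graph features.
tree = TreeNode(
    classifier=lambda f: int(f[0] > 0.5),
    left=TreeNode(label="handshake"),
    right=TreeNode(
        classifier=lambda f: int(f[1] > 0.5),
        left=TreeNode(label="waving"),
        right=TreeNode(label="clapping"),
    ),
)

print(classify(tree, [0.9, 0.2]))  # -> "waving"

In this decomposition, adding a new action class only requires training the binary classifiers along its root-to-leaf path, rather than retraining a monolithic multi-class head.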