How can we teach a computer to recognize 10,000 different actions? Deep learning has evolved from supervised and unsupervised learning to self-supervised approaches. In this paper, we present a new contrastive learning-based framework for decision tree-based classification of actions, including human-human interactions (HHI) and human-object interactions (HOI). The key idea is to translate the original multi-class action recognition problem into a series of binary classification tasks on a pre-constructed decision tree. Under this new contrastive learning framework, we present the design of an interaction adjacency matrix (IAM), with skeleton graphs as the backbone, for modeling various action-related attributes such as periodicity and symmetry. Through the construction of various pretext tasks, we obtain a series of binary classification nodes on the decision tree that can be combined to support higher-level recognition tasks. Experiments demonstrate the potential of our approach in real-world applications ranging from interaction recognition to symmetry detection. In particular, we have demonstrated promising performance on video-based autism spectrum disorder (ASD) diagnosis using the CalTech interview video database.
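The decomposition of multi-class recognition into binary decisions on a tree can be sketched as follows. This is a minimal illustrative sketch only: the tree structure, the attribute names (`periodic`, `symmetric`), and the action labels are hypothetical, and in the actual framework each internal node's binary classifier would be learned via a contrastive pretext task rather than hand-coded.

```python
# Illustrative sketch: multi-class action recognition decomposed into
# binary classification tasks on a pre-constructed decision tree.
# All node splits and labels below are hypothetical examples.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    label: Optional[str] = None            # set only at leaves: final action class
    classifier: Optional[Callable] = None  # binary classifier at internal nodes
    left: Optional["Node"] = None          # branch taken when classifier returns 0
    right: Optional["Node"] = None         # branch taken when classifier returns 1

def recognize(node: Node, features: dict) -> str:
    """Walk the tree, answering one binary question per internal node."""
    while node.label is None:
        node = node.right if node.classifier(features) else node.left
    return node.label

# Toy tree: first split on periodicity, then on symmetry
# (two of the action-related attributes the IAM is designed to model).
tree = Node(
    classifier=lambda f: f["periodic"],
    left=Node(label="handshake"),
    right=Node(
        classifier=lambda f: f["symmetric"],
        left=Node(label="waving"),
        right=Node(label="clapping"),
    ),
)

print(recognize(tree, {"periodic": 1, "symmetric": 1}))  # -> clapping
```

A tree of depth d distinguishes up to 2^d classes with only d binary decisions per query, which is what makes very large label spaces (e.g., 10,000 actions) tractable as a cascade of simpler tasks.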