We propose a new approach to Human Activity Evaluation (HAE) in long videos using graph-based multi-task modeling. Previous works in activity evaluation either directly compute a metric from a detected skeleton or use scene information to regress an activity score. These approaches are insufficient for accurate activity assessment: they only compute an average score over a clip and do not consider the correlations between joints or the body dynamics. Moreover, they are highly scene-dependent, which makes their generalizability questionable. We propose a novel multi-task framework for HAE that utilizes a Graph Convolutional Network (GCN) backbone to embed the interconnections between human joints in the features. In this framework, we solve Human Activity Segmentation (HAS) as an auxiliary task to improve activity assessment. The HAS head is powered by an Encoder-Decoder Temporal Convolutional Network (ED-TCN) that semantically segments long videos into distinct activity classes, whereas HAE uses a Long Short-Term Memory (LSTM)-based architecture. We evaluate our method on the UW-IOM and TUM Kitchen datasets and discuss the success and failure cases on these two datasets.
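Since the abstract names the concrete components (a shared GCN backbone feeding an ED-TCN segmentation head and an LSTM evaluation head), the following is a minimal sketch of how such a multi-task layout could be wired together in PyTorch. All layer sizes, the joint-pooling scheme, the identity adjacency in the usage example, and the class names are illustrative assumptions, not the paper's implementation.

```python
# Minimal multi-task sketch, assuming skeleton input of shape
# (batch, time, joints, channels). Hyperparameters are illustrative only.
import torch
import torch.nn as nn


class GCNBackbone(nn.Module):
    """Shared spatial graph convolution over the skeleton joints."""
    def __init__(self, in_ch, out_ch, adj):
        super().__init__()
        self.register_buffer("adj", adj)   # (J, J) normalized adjacency (assumed given)
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):                  # x: (B, T, J, C)
        x = torch.einsum("ij,btjc->btic", self.adj, x)  # mix features of neighboring joints
        return torch.relu(self.proj(x))    # (B, T, J, out_ch)


class SegmentationHead(nn.Module):
    """Encoder-decoder temporal convolution (ED-TCN style) for HAS."""
    def __init__(self, feat, n_classes):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv1d(feat, feat, 5, padding=2),
                                 nn.ReLU(), nn.MaxPool1d(2))
        self.dec = nn.Sequential(nn.Upsample(scale_factor=2),
                                 nn.Conv1d(feat, feat, 5, padding=2), nn.ReLU())
        self.out = nn.Conv1d(feat, n_classes, 1)

    def forward(self, x):                  # x: (B, T, feat), T assumed even
        h = self.dec(self.enc(x.transpose(1, 2)))
        return self.out(h).transpose(1, 2)  # per-frame class logits: (B, T, n_classes)


class EvaluationHead(nn.Module):
    """LSTM regressor producing a per-frame activity score for HAE."""
    def __init__(self, feat, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (B, T, feat)
        h, _ = self.lstm(x)
        return self.score(h).squeeze(-1)   # per-frame scores: (B, T)


class MultiTaskHAE(nn.Module):
    """GCN backbone shared by the auxiliary HAS head and the HAE head."""
    def __init__(self, adj, in_ch=3, feat=32, n_classes=10):
        super().__init__()
        n_joints = adj.shape[0]
        self.backbone = GCNBackbone(in_ch, feat, adj)
        self.pool = nn.Linear(n_joints * feat, feat)  # collapse joints to one frame vector
        self.has_head = SegmentationHead(feat, n_classes)
        self.hae_head = EvaluationHead(feat)

    def forward(self, x):                  # x: (B, T, J, in_ch)
        h = self.backbone(x)
        h = torch.relu(self.pool(h.flatten(2)))       # (B, T, feat)
        return self.has_head(h), self.hae_head(h)


# Usage on dummy data: a hypothetical 15-joint skeleton over 64 frames.
# A real adjacency would encode the body graph, not the identity.
adj = torch.eye(15)
model = MultiTaskHAE(adj)
seg_logits, scores = model(torch.randn(2, 64, 15, 3))
```

In this layout, both heads read the same GCN features, so a joint loss (segmentation cross-entropy plus a score regression term) would let the auxiliary HAS task shape the shared representation used for activity evaluation.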