We study the problem of detecting talking activities in collaborative learning videos. Our approach uses head detection and projections of the log-magnitude of optical flow vectors to reduce the problem to a simple classification of small projection images, without the need to train complex 3-D activity classification systems. The small projection images are then easily classified using a simple majority vote of standard classifiers. For talking detection, our proposed approach is shown to significantly outperform single-activity systems. Our method achieves an overall accuracy of 59%, compared to 42% for the Temporal Segment Network (TSN) and 45% for Convolutional 3D (C3D). In addition, our method is able to detect multiple talking instances from multiple speakers, while also detecting the speakers themselves.
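As a hedged sketch of the pipeline summarized above — per-frame flow log-magnitudes over a detected head region collapsed into small projection images, then labeled by a majority vote of classifiers — the following NumPy fragment illustrates the idea. The function names, the mean-based row/column projections, and the toy flow fields are illustrative assumptions, not the paper's exact implementation; a dense optical-flow estimator (e.g., OpenCV's Farneback method) is assumed to supply the flow fields upstream.

```python
import numpy as np

def log_magnitude(flow):
    # flow: H x W x 2 optical-flow field for one frame (assumed to come
    # from a dense-flow estimator). log1p avoids log(0) at still pixels.
    return np.log1p(np.linalg.norm(flow, axis=2))

def projection_images(log_mags):
    # Collapse each frame's log-magnitude map onto its rows and columns,
    # then stack over time into two small 2-D "projection images".
    rows = np.stack([m.mean(axis=1) for m in log_mags], axis=1)  # H x T
    cols = np.stack([m.mean(axis=0) for m in log_mags], axis=1)  # W x T
    return rows, cols

def majority_vote(labels):
    # Combine binary talking / not-talking votes from several classifiers.
    votes = np.asarray(labels)
    return int(votes.sum() * 2 > votes.size)

# Toy example: 8 frames of a 32x32 head crop with random motion.
rng = np.random.default_rng(0)
mags = [log_magnitude(rng.normal(size=(32, 32, 2))) for _ in range(8)]
rows, cols = projection_images(mags)
assert rows.shape == (32, 8) and cols.shape == (32, 8)
print(majority_vote([1, 0, 1]))  # → 1
```

The resulting `rows` and `cols` arrays are the small projection images; because they are tiny 2-D arrays rather than video volumes, they can be fed to ordinary image classifiers instead of a 3-D activity network.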