In this paper, a new video classification methodology is proposed that can be applied to both first-person and third-person videos. The main idea behind the proposed strategy is to capture complementary appearance and motion information efficiently by running two independent streams over the videos. The first stream aims to build long-term motion representations from short-term ones by tracking how elements of the optical flow images change over time. The optical flow images are described by networks pre-trained on large-scale image datasets, and the resulting descriptions are aligned side by side to form a set of multi-channel time series. Motion features are extracted from these time series using the PoT representation method combined with a novel pooling operator, which offers several advantages. The second stream extracts appearance features, which are vital for video classification. The proposed method has been evaluated on both first-person and third-person datasets, and the results show that it successfully reaches the state of the art.
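The temporal-pooling step described above can be illustrated with a minimal sketch. It assumes a PoT-style temporal pyramid over a multi-channel time series of per-frame descriptors, with max, sum, and positive/negative gradient pooling operators; the function name, pyramid depth, and choice of operators are illustrative, not the paper's exact formulation or its novel operator.

```python
import numpy as np

def pot_features(series, levels=2):
    """Pool a (T, D) multi-channel time series of per-frame descriptors
    over a temporal pyramid, PoT-style (illustrative sketch)."""
    feats = []
    T = series.shape[0]
    for level in range(levels):
        n_seg = 2 ** level  # 1 segment, then 2, then 4, ...
        bounds = np.linspace(0, T, n_seg + 1).astype(int)
        for s, e in zip(bounds[:-1], bounds[1:]):
            seg = series[s:e]
            # frame-to-frame changes capture short-term motion dynamics
            diffs = np.diff(seg, axis=0) if seg.shape[0] > 1 else np.zeros((1, seg.shape[1]))
            feats.append(np.max(seg, axis=0))                        # max pooling
            feats.append(np.sum(seg, axis=0))                        # sum pooling
            feats.append(np.sum(np.clip(diffs, 0, None), axis=0))    # positive-gradient pooling
            feats.append(np.sum(np.clip(-diffs, 0, None), axis=0))   # negative-gradient pooling
    return np.concatenate(feats)
```

For a clip of T frames with D-dimensional descriptors, a two-level pyramid yields 3 segments with 4 pooled vectors each, i.e. a fixed 12·D-dimensional motion feature regardless of clip length.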