Although vital to computer vision systems, few-shot action recognition remains immature despite extensive research on few-shot image classification. Popular few-shot learning algorithms extract a transferable embedding from seen classes and reuse it on unseen classes by constructing a metric-based classifier. One main obstacle to applying these algorithms to action recognition is the complex structure of videos. Some existing solutions sample frames from a video and aggregate their embeddings into a video-level representation, neglecting important temporal relations. Others perform an explicit sequence matching between two videos and define their distance as the matching cost, imposing overly strong restrictions on sequence ordering. In this paper, we propose Compromised Metric via Optimal Transport (CMOT) to combine the advantages of these two solutions. CMOT simultaneously considers semantic and temporal information in videos under the Optimal Transport framework, and is discriminative for both content-sensitive and ordering-sensitive tasks. In detail, given two videos, we sample segments from them and cast the calculation of their distance as an optimal transport problem between the two segment sequences. To preserve the inherent temporal ordering information, we additionally amend the ground cost matrix by penalizing it with the positional distance between each pair of segments. Empirical results on benchmark datasets demonstrate the superiority of CMOT.
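The distance computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the segment embeddings, the cosine-distance ground cost, the linear positional penalty with weight `lam`, and the entropic Sinkhorn solver with uniform marginals are all assumptions chosen for a self-contained example.

```python
import numpy as np

def cmot_distance(x, y, lam=0.1, eps=0.05, n_iters=200):
    """Hypothetical sketch of a CMOT-style video distance.

    x, y: segment-embedding sequences of shape [n, d] and [m, d]
    (rows assumed L2-normalized). Returns the transport cost under a
    ground cost that mixes semantic and positional distance.
    """
    n, m = x.shape[0], y.shape[0]
    # Semantic ground cost: cosine distance between segment embeddings.
    c_sem = 1.0 - x @ y.T
    # Temporal penalty: distance between normalized segment positions,
    # so that matching temporally distant segments is discouraged.
    pos_x = np.arange(n)[:, None] / max(n - 1, 1)
    pos_y = np.arange(m)[None, :] / max(m - 1, 1)
    cost = c_sem + lam * np.abs(pos_x - pos_y)
    # Entropic OT via Sinkhorn iterations with uniform marginals.
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]  # approximate transport plan
    return float((plan * cost).sum())
```

Because the positional term penalizes out-of-order matchings, two videos with identical segment content but reversed ordering yield a larger distance than two identically ordered ones, which is the intended compromise between pure content aggregation and strict sequence matching.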