Despite the recent progress of fully-supervised action segmentation techniques, the performance is still not fully satisfactory. One main challenge is the problem of spatio-temporal variations (e.g., different people may perform the same activity in various ways). Therefore, we exploit unlabeled videos to address this problem by reformulating the action segmentation task as a cross-domain problem with domain discrepancy caused by spatio-temporal variations. To reduce the discrepancy, we propose Self-Supervised Temporal Domain Adaptation (SSTDA), which contains two self-supervised auxiliary tasks (binary and sequential domain prediction) to jointly align cross-domain feature spaces embedded with local and global temporal dynamics, achieving better performance than other Domain Adaptation (DA) approaches. On three challenging benchmark datasets (GTEA, 50Salads, and Breakfast), SSTDA outperforms the current state-of-the-art method by large margins (e.g., for the F1@25 score, from 59.6% to 69.1% on Breakfast, from 73.4% to 81.5% on 50Salads, and from 83.6% to 89.1% on GTEA), and requires only 65% of the labeled training data for comparable performance, demonstrating the usefulness of adapting to unlabeled target videos across variations. The source code is available at https://github.com/cmhungsteve/SSTDA.
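To make the binary domain prediction task concrete: in the standard adversarial domain-adaptation formulation, a per-frame discriminator is trained to tell source from target features, while a gradient reversal layer flips its gradient so the feature extractor learns domain-invariant representations. The following is a minimal stdlib-only sketch of that idea; all function names are illustrative and do not reflect the paper's actual implementation.

```python
import math

def grad_reverse(grad, beta=1.0):
    # Gradient reversal: the forward pass is the identity, but the
    # backward pass flips the sign (scaled by beta), so minimizing the
    # discriminator loss *maximizes* domain confusion upstream.
    return [-beta * g for g in grad]

def domain_discriminator(feat, w, b):
    # Logistic discriminator on one frame-level feature vector:
    # outputs the predicted probability that the frame is from the
    # target domain.
    logit = sum(f * wi for f, wi in zip(feat, w)) + b
    return 1.0 / (1.0 + math.exp(-logit))

def domain_loss(p, is_target):
    # Binary cross-entropy over domain labels (0 = source, 1 = target).
    y = 1.0 if is_target else 0.0
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

# Example: an untrained discriminator (zero weights) is maximally
# uncertain, so its loss equals log(2) for either domain label.
feat = [0.2, -1.3, 0.7, 0.5]          # hypothetical frame feature
w, b = [0.0, 0.0, 0.0, 0.0], 0.0
p = domain_discriminator(feat, w, b)   # 0.5
loss = domain_loss(p, is_target=True)  # log(2) ~ 0.693
```

The sequential domain prediction task extends this idea from single frames to shuffled video segments, which is what lets SSTDA align global (clip-level) as well as local (frame-level) temporal dynamics.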