There has been increasing interest in multi-task learning for video understanding in recent years. In this work, we propose a generalized notion of multi-task learning that incorporates both auxiliary tasks, which the model should perform well on, and adversarial tasks, which the model should not perform well on. We employ Necessary Condition Analysis (NCA) as a data-driven approach for deciding which category each task should fall into. Our novel framework, Adversarial Multi-Task Neural Networks (AMT), penalizes adversarial tasks, determined by NCA to be scene recognition in the Holistic Video Understanding (HVU) dataset, to improve action recognition. This upends the common assumption in multi-task learning that the model should always be encouraged to do well on all tasks. At the same time, AMT retains all the benefits of multi-task learning as a generalization of existing methods, using object recognition as an auxiliary task to aid action recognition. We introduce two challenging Scene-Invariant test splits of HVU, in which the model is evaluated on action-scene co-occurrences not encountered during training. We show that our approach improves accuracy by ~3% and encourages the model to attend to action features rather than correlation-biasing scene features.
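The abstract does not spell out the mechanism used to penalize an adversarial task; one common way to realize this idea is a gradient reversal layer between the shared features and the adversarial head. The sketch below is a minimal PyTorch illustration under that assumption, not the authors' released code: the class names GradReverse and AMTHeads, the reversal weight lam, and the choice of linear heads are all hypothetical.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows back into the shared backbone; no gradient for lam.
        return -ctx.lam * grad_output, None


class AMTHeads(nn.Module):
    """Three classification heads on shared video features:
    action (primary task), object (auxiliary task), scene (adversarial task)."""

    def __init__(self, feat_dim, n_actions, n_objects, n_scenes, lam=1.0):
        super().__init__()
        self.lam = lam
        self.action_head = nn.Linear(feat_dim, n_actions)
        self.object_head = nn.Linear(feat_dim, n_objects)
        self.scene_head = nn.Linear(feat_dim, n_scenes)

    def forward(self, feats):
        action_logits = self.action_head(feats)
        object_logits = self.object_head(feats)
        # The scene head is trained normally, but the reversed gradient pushes the
        # shared backbone to discard scene-predictive (correlation-biasing) features.
        scene_logits = self.scene_head(GradReverse.apply(feats, self.lam))
        return action_logits, object_logits, scene_logits
```

Under this reading, the total objective is a weighted sum of per-task classification losses: the action and object terms act as in standard multi-task learning, while the scene term is minimized by its own head but, through the reversal, effectively maximized with respect to the shared representation.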