Scene understanding is crucial for autonomous systems that operate in the real world. Single-task vision networks extract information based only on certain aspects of the scene. In multi-task learning (MTL), on the other hand, these single tasks are learned jointly, providing an opportunity for tasks to share information and obtain a more comprehensive understanding. To this end, we develop UniNet, a unified scene understanding network that accurately and efficiently infers vital vision tasks including object detection, semantic segmentation, instance segmentation, monocular depth estimation, and monocular instance depth prediction. As these tasks capture different semantic and geometric information, they can either complement or conflict with each other. Understanding inter-task relationships can therefore provide useful cues for enabling complementary information sharing. We evaluate the task relationships in UniNet through the lens of adversarial attacks, based on the notion that such attacks can exploit the biases and task interactions learned by the network. Extensive experiments on the Cityscapes dataset, using untargeted and targeted attacks, reveal that semantic tasks interact strongly amongst themselves, and the same holds for geometric tasks. Additionally, we show that the relationship between semantic and geometric tasks is asymmetric and that their interaction weakens as we move towards higher-level representations.
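To make the probing methodology concrete, below is a minimal sketch of an untargeted, FGSM-style cross-task probe, assuming a hypothetical PyTorch multi-task model `uninet` whose forward pass returns a dict with `'semseg'` logits and a `'depth'` map, and images normalized to [0, 1]. None of these names or interfaces come from the paper; the sketch only illustrates the general idea of attacking one task and measuring the collateral effect on another.

```python
import torch
import torch.nn.functional as F

def fgsm_cross_task_probe(uninet, image, seg_target, depth_target, eps=4 / 255):
    """Craft an untargeted FGSM perturbation on the segmentation loss and
    report how much it degrades the depth head (hypothetical interface)."""
    image = image.clone().detach().requires_grad_(True)

    # Attack the semantic head: take one signed-gradient ascent step on its loss.
    seg_loss = F.cross_entropy(uninet(image)['semseg'], seg_target)
    seg_loss.backward()
    adv_image = (image + eps * image.grad.sign()).clamp(0, 1).detach()

    # Measure the collateral damage on the geometric head.
    with torch.no_grad():
        clean_out = uninet(image.detach())
        adv_out = uninet(adv_image)
        depth_err_clean = F.l1_loss(clean_out['depth'], depth_target)
        depth_err_adv = F.l1_loss(adv_out['depth'], depth_target)

    # A ratio well above 1 means the perturbation crafted for segmentation
    # also transfers to depth, suggesting the two tasks share representations.
    return (depth_err_adv / depth_err_clean).item()
```

Running such a probe in both directions (attacking depth and measuring segmentation, and vice versa) is what would expose the asymmetry reported above; a targeted variant would instead descend on a loss toward a chosen target output rather than ascend on the ground-truth loss.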