For artificial agents to successfully perform tasks in changing environments, they must be able to both detect and adapt to novelty. However, visual novelty detection research often evaluates only on datasets repurposed from object classification, such as CIFAR-10, where images focus on one distinct, well-centered object. New benchmarks are needed to represent the challenges of navigating the complex scenes of an open world. Our new NovelCraft dataset contains multimodal episodic data of the images and symbolic world states seen by an agent completing a pogo-stick assembly task within a modified Minecraft environment. In some episodes, we insert novel objects of varying size within the complex 3D scene that may impact gameplay. Our visual novelty detection benchmark finds that methods ranked best by popular area-under-the-curve metrics may be outperformed by simpler alternatives when controlling false positives matters most. Further multimodal novelty detection experiments suggest that methods fusing visual and symbolic information can improve both time until detection and overall discrimination. Finally, our evaluation of recent generalized category discovery methods suggests that adapting to new, imbalanced categories in complex scenes remains an exciting open problem.