Panoptic segmentation assigns a semantic label and an instance ID to every pixel of an image. Because any permutation of the instance IDs is also a valid solution, the task requires learning a high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on task-specific inductive biases. A diffusion model based on analog bits is used to model panoptic masks, with a simple, generic architecture and loss function. By simply adding past predictions as a conditioning signal, our method can model video in a streaming setting and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our generalist approach performs competitively with state-of-the-art specialist methods in similar settings.
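To illustrate the analog-bits representation mentioned above, the following minimal sketch encodes integer mask labels as binary bits cast to real values, and decodes them by thresholding at zero. It is an illustration under stated assumptions, not the authors' implementation: the function names, the bit width, the `scale` parameter, and the toy mask are all hypothetical, and the diffusion model itself is omitted.

```python
# Hypothetical helpers illustrating analog bits; not taken from the paper's code.

def int_to_analog_bits(value, num_bits, scale=1.0):
    """Encode a non-negative integer as real numbers in {-scale, +scale}."""
    if not 0 <= value < 2 ** num_bits:
        raise ValueError("value does not fit in num_bits")
    # Extract bits, most significant first.
    bits = [(value >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]
    # Map bit 0 -> -scale and bit 1 -> +scale.
    return [(2 * b - 1) * scale for b in bits]


def analog_bits_to_int(analog_bits):
    """Decode real-valued bits by thresholding each value at zero."""
    value = 0
    for x in analog_bits:
        value = (value << 1) | (1 if x > 0 else 0)
    return value


def encode_mask(mask, num_bits, scale=1.0):
    """Encode a 2D list of integer labels into per-pixel analog bits."""
    return [[int_to_analog_bits(v, num_bits, scale) for v in row] for row in mask]


def decode_mask(analog_mask):
    """Recover integer labels from a 2D list of per-pixel analog bits."""
    return [[analog_bits_to_int(p) for p in row] for row in analog_mask]


# Toy 2x2 mask of instance IDs (assumed values), encoded with 3 bits per pixel.
toy_mask = [[0, 3], [5, 5]]
encoded = encode_mask(toy_mask, num_bits=3)
recovered = decode_mask(encoded)
```

Because decoding only looks at the sign of each value, a continuous-valued prediction such as `[0.7, -0.2, 0.4]` still decodes to the label 5. This is why a model operating on continuous states can be used to produce discrete labels.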