We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which poses a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g., an onion is peeled, diced and cooked, where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, and 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding and long-term reasoning. For data, code and leaderboards: http://epic-kitchens.github.io/VISOR