Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants work without fixed instructions, and the sequences feature rich and natural variations in action ordering, mistakes, and corrections. Assembly101 is the first multi-view action dataset, with simultaneous static (8) and egocentric (4) recordings. Sequences are annotated with more than 100K coarse and 1M fine-grained action segments, and 18M 3D hand poses. We benchmark on three action understanding tasks: recognition, anticipation and temporal segmentation. Additionally, we propose a novel task of detecting mistakes. The unique recording format and rich set of annotations allow us to investigate generalization to new toys, cross-view transfer, long-tailed distributions, and pose vs. appearance. We envision that Assembly101 will serve as a new challenge to investigate various activity understanding problems.